Tuesday, December 24, 2024

ALPP 03-XX (16) -- Multiplying by Powers of Two (Shift Left)

I'm leaving this here for reference for a little while. Eventually, I plan to do a chapter on shifts, and most of this will be taken up there.

Multiplying by Powers of Two
(Shift Left)

(Title Page/Index)

 

Now that we've seen a little motivation and covered a little theory about multiplying by constants, it's time to look at multiplying by powers of two.

(But this chapter will also lean a bit more to theory than practice, even though there is practice. It also gets a little long. Please bear with me.)

Here's how to multiply a byte in accumulator B by 2 on the 6800, 6801, or 6809: 

	LSLB

And, of course,  there's LSLA. And LSL n,X (indexed mode) or LSL <address> (extended mode) allows you to quickly multiply any byte in memory by two. It's faster in an accumulator, but direct shifts on memory avoid saving and storing whatever is in the accumulators.

Unfortunately, we can't use the abbreviated direct page addressing on the 6800/6801 to save a byte and a fetch cycle, but since the direct page is the lowest 256 bytes of memory, we can still shift operands there in extended mode. 

(Incidentally, the DP mode on the 6809 actually uses more cycles than extended mode, darn it. Saves a byte and is as fast as indexed mode, anyway.) 

(As another aside, in the 6805 microcontroller, which is sort of a half a 6800, direct page mode is provided for the read-modify-write instructions -- including shifts -- instead of extended mode. And you get the speedup. This was definitely a good trade-off for the 6805.)

Why logical shift left instead of arithmetic shift left? Actually, on Motorola's 68XX and 680XX series microprocessors, ASL is a synonym of LSL We didn't see a good reason to make a distinction, and Motorola and other companies didn't. We'll talk about that more when we get to division by constants.

On the 68000, we can multiply a byte in D0 by two with

	LSL.B	#1,D0

I'll save the full addressing modes discussion for later, but the 68000 allows logical shifts of any width on all data registers D0-D7. 

However, byte-width shifts on operands in memory are not provided.

So how about a 16-bit integer?

On the 6800 and 6809, for two bytes together in the accumulators, with the high byte in A, it's

	LSLB	; less significant byte in B
	ROLA	; more significant byte in A

This works on the 6801, too, of course. But on the 6801, we also have

	LSLD	; high in A, low in B

There is no LSLD on the 6809.

On the 6800, it's only a matter of convention to put the more significant byte in A, and often the convention has been reversed in existing code. On the 6801 and 6809, you can still change or reverse the convention, but the ability to treat the accumulator pair, A:B, as the single double accumulator D means you usually want to follow the double-accumulator convention.

Now you can combine those with bytes in memory if you need to for some reason. For instance, if you need to keep a counter in A, you can keep the more significant byte on the stack and reference it by indexed mode, as below, assuming the parameter stack pointer PSP is in X:

	LSLB	; less significant byte
	ROL	0,X	; more significant byte on stack (X has PSP)

On the 68000, we can multiply sixteen bits in D0 by two with

	LSL.W	#1,D0

But I already said that, didn't I? 

However (Surprise?), the 68000 does provide single-bit shifts on 16-bit wide word only operands in memory, accessed via (most) normal indexing or absolute modes.

	LSL.W	(A6)	; shift top 16-bit word of parameter stack 1 bit left.

This was perhaps because the 68000 instruction set was originally designed for 16-bit wide memory designs. (I think they thought it was an optimization, and it probably was at the time.) Again, we'll look at this more later.

Four-byte wide shifts? Just in case it's not clear, I'll show you one way you might want to do it on the 6800, again assuming PSP in X and the more significant bytes on top of stack:

	LSLB	; least significant byte
	ROLA	; next less significant byte
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack

On the 6801, when we use the accumulators for 32-bit shifts, we probably want to use the double shift where we can, on the first shift:

	LSLD	; least significant 16 bits
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack

This allows us to grab the carry from the double shift and pass it into the following rotate instructions, which only exist in byte form on the 6801.

On the 6809, assuming we are using U for the parameter stack, and that the more significant bytes are the topmost on the stack, it would be

	LSLB	; least significant byte
	ROLA	; next less significant byte
	ROL	1,U	; next more significant byte on stack
	ROL	,U	; most significant byte on stack

And on the 68000, as I have already said, we can multiply 32 bits in D0 by two with

	LSL.L	#1,D0

32-bit CPUs are nice, aren't they?

What if you need to do shift a 32-bit wide integer on top of stack without bringing it into a register for some reason? Since there is no 32-bit version of LSL for operands in memory, you'll need to do this:

	LSL.W	2(A6)	; shift less-significant 16-bit word 1 bit left, carry to C and X.
	ROXL.W	(A6)	; rotate X carry into more-significant 16-bit word, with 1-bit shift.

For the times you need it, it can be done.

And, incidentally, we see that the 68000 has something called an eXtended carry for extending 1-bit carries. The C carry is for branching, and the X carry is for extending. It's a bit weird, but useful.

What if you need to shift 64 bits left?

On the 8-bit CPUs, you'll start with LSL on the least significant byte, then perform 7 more ROLs preceding in order from less to most significant. Because at least 6 of the shifts will be in memory, it'll take at least 12  bytes of ROL instructions plus the one (6801) or two (6800, 6809) for the accumulator bytes. 

I think I'd better show this doing it all in memory, and you can consider whether it would be worth bringing the least significant bytes into the accumulators:

	LSL	7,X	; least significant byte (byte 7)
	ROL	6,X	; next less significant byte (byte 6)
	ROL	5,X	; next more significant byte (byte 5)
	ROL	4,X	; next more significant byte (byte 4)
	ROL	3,X	; next more significant byte (byte 3)
	ROL	2,X	; next more significant byte (byte 2)
	ROL	1,X	; next more significant byte (byte 1)
	ROL	0,X	; most significant byte on (byte 0)

On the 6801, if you already have the least significant two bytes in D, you could start it with a LSLD and save maybe three bytes. Or you could do LDD 6,X; LSLD and save one byte, maybe. 

But if you do it without moving the least significant bytes into the accumulators, you can do the whole 8 bytes without touching A or B.

Let's look at trying to save bytes by making a loop. We'll assume X is already pointing to the least significant byte. 

But if this isn't the case, we'll have to adjust X, and that will likely cost at least 8 bytes of code on a 6800. (A series of 8 INX instructions is less code than saving X in a DP variable, adding 8 to it using one or both accumulators, and then loading it back to X.) Or it will take at least 3 bytes of code on a 6801 (where it can be done with a LDAB #8 and an ABX.) 

Here's the loop, for either 6800 or 6801:

* Assume X pointing to least significant byte (not likely)
	LDAA	#7	; bytes to ROL
	LSL	0,X	; least significant byte starts with LSL
SHL64L	DEX		; carry not affected
	ROL	0,X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
* Ends with X pointing to most significant byte

Is it worth saving two bytes, max, to use the loop? Or costing one to seven extra bytes for the loop in code to pre-adjust X? 

(Well, on the 6801, you may actually be able to shave a couple of INX instructions by shifting the least significant bytes in D. I'll let you calculate that out.)

The unrolled version is most likely better, although, if you need to do this kind of shift on a variable number of bytes, you have now seen a bit of code to start from.

The loop form can improve on the 6809, because of the index math the 6809 provides. But there are other approaches, and you may need to choose between code size and speed for things like this on the 6809. 

Hmm. I think I'd better show the loop and index adjustment on the 6809.

[JMR202412311146 addenda part 1:]

But first I want to show you another way, that uses accumulator offset:

[JMR202412311146 addenda part 1 end.]

* The 64 bit number to shift is on the parameter stack (U)
	LDA	#6	; bytes to ROL minus 1
	LSL	7,U	; least significant byte
SHL64L	ROL	A,U	; next more significant byte
	DECA		; count down, carry not affected
	BPL	SHL64L	; do next (want to do 0, too)

Nine bytes for the loop, including the pointer math. There is another approach that uses auto-decrement and LEA, that should be similar in cycle and byte count, but this might be worth the three bytes saved in a really tight ROM or something. 

[JMR202412311146 addenda part 2:]

This is one way the auto-decrement method could be mapped into the 6809:

* The 64 bit number to shift is on the parameter stack (U)
	LDAA	#7	; bytes to ROL
	LEAX	A,U
	LSL	,X	; least significant byte starts with LSL
SHL64L	ROL	,-X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
* Ends with X pointing to most significant byte

[JMR202412311146 addenda part 2 end.] 

For shifting 64 bits left by 1 on the 68000, it's nicely straightforward:

	LSL.L	#1,D1	; less significant long word, carry to C and X
	ROXL.L	#1,D0	; rotate X carry into more-significant long word
I mentioned the eXtend carry above, and we can use it here, too. Doing it in memory only would use the 16-bit shift and three of the 16-bit rotates.

How about multiplying 8 bits by 4?

For the 6800, 6801, and 6809:

	LSLB	; multiply by 2
	LSLB	; and again

The second time you multiply by 2, don't use ROL. That would be feeding the top bit back into the bottom, which would be a different operation. Each time you multiply by 2, you have to start with LSL, not a ROL. ROL is for catching the bits between bytes.

For the 68000, oh, fun!:

	LSL.B	#2,D0	; times 2^2

One instruction and done! WHEEEEE!!!

For multiplying by 2n, shift n times. Just be aware that the more you shift without catching bits off the top, the more you lose bits.

What about multiplying 16 bits by 4? On the 8-bit processors, you have to repeat the entire chain of shifts to avoid losing bits between the bytes. On the 6800 and 6809, this is what you usually want:

	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again
	ROLA	; catch the carry again

On the 6801,

	LSLD	; multiply by 2
	LSLD	; and again

I hope this is enough that you can see how to multiply 32-bit integers by the constant 4 using shifts on the 8-bit processors. I could show them anyway as an excuse to talk about optimization, but we need to move on.

On the 68000, you can specify the number of bits to shift, up to 8, as the immediate argument to the shift, for all widths, byte, 16-bit, and full 32-bit. That means you can multiply a full 32-bit integer in a register by any power of 2, up to 28, in a single instruction. The only cost is that each bit costs an additional couple of CPU internal cycles, but compared to the cost of fetching and encoding an additional instruction, it's not a great cost.

Just remember that if you shift a byte 8 bits, every bit in the byte gets shifted out. Thus

	LSL.B	#8,D0	; times 2^8 -- WHEEEEE!?!?!???

is an expensive way to clear the lowest byte of D0. 

Of course,

	LSL.W	#8,D0	; 16 bits times 2^8

is one way to get the lowest byte in D0 up to the next higher byte position (bits 15 to 8). (That's the difference between .B and .W here.)

How about multiplying by 16? That's 24, so it's 4 shifts. 

As we've already noted, on the 68000, it's just one instruction. LSL.B or LSL.W or LSL.L and the operands are #4,Dn to shift Dn by 4 bits left.

On the 8-bit processors, if it's a byte value, that's 4 shifts on the same operand in a row, which is not too bad. 4 instructions, 4 bytes. One more byte than a JSR in extended mode.

	LSLB	; multiply by 2
	LSLB	; again by 2 is by 4
	LSLB	; and again by 2 is by 8
	LSLB	; and again by 2 is by 16

If it's a 16-bit value, for the 6800 and 6809, it's 4 pairs of LSLB; ROLA in a row, 8 instructions in 8 bytes. That might be worth the performance hit of making it into a loop, depending on how tight memory is:

	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again, to make it by 4
	ROLA	; catch the carry again
	LSLB	; and again, to make it by 8
	ROLA	; catch the carry
	LSLB	; and again, to make it by 16
	ROLA	; catch the carry again

vs., say, on the 6809 where it's easiest to construct the loop:

MUL16W	LDB	#4
	PSHU	B
	LDD	1,U	; operand on parameter stack
MUL16WL	LSLB	; multiply by 2
	ROLA	; catch the carry
	DEC	,U
	BNE	MUL16WL
	LEAU	1,U
	STD	,U

But, as you can see, for the cost of setting up the loop, you might as well just do it.

On the other hand, that code easily turns into a general loop for shifting left n bytes, so don't forget it.

For the 6801, it's again,

	LSLD	; multiply by 2
	LSLD	; and again, making it by 4
	LSLD	; and again, making it by 8
	LSLD	; and again, making it by 16

On the 6801, it won't be worth making a loop for 16-bit integers until we try to multiply by constants greater than 28, which are pretty rare, really. And ... 

Speaking of multiplying by 28, if we are following the A:B is D convention,

	TBA	; multiply by 256
	CLRB	; don't forget to zero out low byte.

does the trick for rotating by 8 bits. And you may not even need to explicitly do this -- may be good enough to just move it in memory.

While we're thinking about this, let's look at another way to multiply by 256 on the 68000 (or to get the lowest byte of Dn into bits 15 to 8:

	MOVE.B	D0,-(A6)	; this does NOT work on A7!
	CLR.B	-(A6)
	MOVE.W	(A6)+,D0

Usually, you would not want to do it this way, because you're having to read and write memory, but there are times it can be useful.

The reason it doesn't work on A7? A7 absolutely must stay 16-bit aligned on the 68000, because of return addresses. So when you push a byte to A7 -- any MOVE.B Dn,-(A7) --  the 68000 magically double decrements A7 for you. And when you pop a byte  from A7 -- any MOVE.B (A7)+,Dn -- it automatically double increments A7 for you. 

(Don't get frustrated about this. As I have hinted before, you really don't want to store strings on the A7 stack.)

Multiplying by 32, 64, and 128 are also not nearly as common as by 2, 4, 8, and 16. It may be a better use of resources to just let them go to a general multiply routine (which, as I say, I will show you shortly). 

Except, we do actually want to look at multiplying a byte by 25, and, by implication, 26, and  27, because I can demonstrate now a little about how multiplying is dividing.

Multiply B by 128, leaving result in A:

	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; bit 0 of B now in bit 7 of A

Note that A now has the less significant byte, and B has the more significant byte.

The following also works, but takes an extra byte of code while leaving A untouched, and loses any of the more significant bits than bit 0:

	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	ANDB	#$80	; chop off the lost, double-shifted high bits

[JMR202401010106 addendum:]

The above could be useful in saturation math, or if we know the input will only be 0 or 1.

[JMR202501071039 edit:]

If we want to keep all the bits that we might be losing, we would have to store the intermediate result away first, probably using A, after all:

If we want to start with saturation math, but keep all the bits that we would otherwise be losing, we might try to store the intermediate result away in like this: 

	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bit (but ...)

But the high bits were already shifted too many times, so we have to un-shift them. Fortunately, the carry has not been altered by the TBA or AND instructions, so we can rotate the carry back in at several points. Taking the earliest point, we could do it like this:

	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ROLA		; bring the high bits back into position
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bit 

But if we back up and take the copy first,

	TBA		; make two halves
	LSRA		; bit 0 to carry, high bits
	RORB		; now to bit 7
	ANDB	#$80	; chop off the high bits

Does it seem like we did that above?

Back up and look at the code and see what's different and think about why the effect is the same -- when done correctly.

[JMR202501071039 edit end.]

[JMR202401010106 addendum end.]

It could also be interpreted as a divide, which we will talk about later.

Either way can be extended to multiply by 64 as in the following, leaving the result in A:

	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; old bit 0 of B now in bit 7 of A
	LSRB	; old bit 1 of B to carry
	RORA	; old bit 1,0 of B now in bit 7,6 of A

Not touching A to multiply by 64:

	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order
	ANDB	#$C0	; chop off the remainder

And we can see that using A in byte inversion to multiply by 32 costs more bytes and cycles than not touching A to multiply by 32:

	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order, old bit 2 to carry
	RORB		; now to bit 7,6,5 in order
	ANDB	#$E0	; chop off the remainder

This works for 16-bit integers as well, but for multiplying by 215, 214, 213, and so forth. We'll show multiplying 213:

	LSRB		; bit 0 to carry
	RORA		; now to bit 15, bit 8 to carry
	RORB		; bit 8 to bit 7, old bit 1 to carry
	RORA		; old bit 1 to bit 15 in order
	RORB		; old bit 9 to bit 7, old bit 2 to carry
	RORA		; old bit 2 to bit 15 in order
	RORB		; old bit 10 to bit 7 (old bit 3 to carry)
	ANDA	#$E0	; chop off the top bytes, ignore carry

Yes, that does lose 13 bits off the top. If you wanted them, you needed to allocate another couple of bytes to save them in.

Now, how does this play out on the 68000?

Nicely enough:

	ROR.B	#3,D0	; 8-bit rotate wraps around!
	AND.B	#$E0,D0	; mask out remainder.

ROR and ROL are register-wide rotations -- 8-, 16-, and 32-bits wide. They copy the bit that wraps to the other end into both the eXtend carry and the test/branch Carry bits.

ROXR and ROXL are register-plus-carry rotations -- 9-, 17-, and 33- bits wide. They rotate the bit that comes out of the one end into the X carry, rotating the old X carry into the other end, and copy the new X carry into the C carry.

This is worth remembering to help shave cycle counts -- not so much for 5 bit shifts, but for 6 and 7 on bytes, yes. And, definitely, 3 bits right is faster than 13 bits left:

	ROR.W	#3,D0	; 16-bit rotate wraps around!
	AND.W	#$E000,D0	; mask out remainder.

Wait. I said you can shift by immediate bit counts up to 8. 13 is more than eight. But Motorola did leave us another way, that we don't have to do

	LSL.W	#8,D0	; First shift 8
	LSL.W	#5,D0	; Then shift the rest.

Not that we wanted to shift a 16-bit word by 13 bits, but we can also do it in one shift, using variable shift counts:

	MOVEQ	#13,D1	; shift count
	LSL.W	D1,D0	; 16 bit shift by count in D1

So there are several ways to do it. You'll want to look at op-code byte counts and cycle timing when there isn't any other reason to choose between them. 

Backing up a bit to register-wide rotations, 8- and 16-bit wide rotate instructions are missing in the 6800, 6801, and 6809. You have to explicitly move the old carry bit(s) in when you need to do register-wide. The general approach is to make a copy and do two logical shifts, one in the desired direction and count, the other in the complementary count in the opposite direction, then bit-or the results together:

* accumulator-wide ROL by 3 / ROR by 5:
	LDAA	0,X	; copy
	LSL	0,X	; shift left by 3
	LSL	0,X
	LSL	0,X
	LSRA		; shift right by 5
	LSRA		; (use the faster shift accumulator)
	LSRA
	LSRA
	LSRA
	ORAA	0,X	; put results together

This works because logical shifts leave 0 bits behind in the vacated bits.

[JMR202501040801 addendum: (Shifting to 6800 on purpose here. We'll see why below.) ]

The 6800, 6801, and 6809 do not have an instruction to OR the accumulators together, but when you know the bits are set in only one result or the other, you can add them together to get the same result:

* accumulator-wide ROL by 3 / ROR by 5:
	LDAB	0,X	; copy
	TBA
	LSLA	; shift left by 3
	LSLA
	LSLA
	LSRB	; shift right by 5
	LSRB
	LSRB
	LSRB
	LSRB
	ABA	; put the results together

[JMR202501041102 correction:] 

Instead of LSRB above, I had originally written RORB, which is, of course, going to mix bits in a semi-arbitrary non-random way, thus, a bug. Erk.

[JMR202501041102 correction end.]

[JMR202501040801 addendum:] 

(6809 does not even have an ABA instruction. You have to push it to a stack and add post-increment. So you should OR post-increment, so that you remember that ADDing was a substitute for ORring.

It's easy to think it was unnecessary cost cutting, and I tend to lean toward the solution that the 6309's hidden instruction set used, of having the ADDR (and ORR) register to register instructions with arguments like TFR and EXG use. But I also recognize the awkwardness of providing an addition path between D and the index registers outside the LEA instructions. Whether or not you include a tertiary ALU, the data paths get pretty complex. With modern design tools, it's not so hard, but it was not all that easy with the tools they had in 1978.)

[JMR202501040801 addendum end.]

When it's a single bit left rotation, there's a neat trick:

ROLB8	LSLB		; 8-bit wide rotation
	ADCB	#0	; move the carry in

Single bit right doesn't fair so well, however:

ROR8B	LSRB
	BCC	ROR8BN
	ORAB	#$80
ROR8BN	NOP	; next instruction

or, using the other accumulor to get the low bit in the carry first,

ROR8B	TBA
	LSRA	; get low bit in carry first
	RORB

are the best I'm aware of.

What about 16-bit and 32-bit integer rotation? I think we have enough here to figure them out on the 6800, 6801, and 6809.

And Motorola gave us another way to optimize 32-bit shifts and rotates on the 68000:

	SWAP	D0	; Effectively a 16-bit rotate on D0!

SWAP exchanges the low and high halves of the data register. It's a 16-bit instruction and only takes four cycles, so it's a good way to avoid 2 cycles per bit for really long shifts and rotates. For instance, if we want to shift a 32-bit integer in D0 left by 20 bits, we can use SWAP to speed it up:

	SWAP	D0	; rotate by 16 bits
	CLR.W	D0	; mask out low half
	LSL.L	#4,D0	; shift the remaining 4 bits.

This is all very interesting, but it doesn't really seem to be taking us any closer to multiplying by ten? Have we just gotten lost in shifting and masking? Is this all just shifty business?

Shifting bits is not something people do every day, so I'm trying to expose you to a lot of things that you can do shifting bits before we use them for the real stuff.

If there's something you really want to check in the above, make up your own test code and check it. (I may have made a mistake? ;-) 

If you find mistakes, leave me a note in the comments, please.

Otherwise, hang on for the ride. 

It'll make more sense in the chapter after next, where we do some fast multiplying by a few small constants that are not powers of 2.

In the next chapter, you can look at some (incompletely tested) demonstration code for the 6800.

 

(Title Page/Index)

 

 

 

 

No comments:

Post a Comment