I'm leaving this here for reference for a little while. Eventually, I plan to do a chapter on shifts, and most of this will be taken up there.
Multiplying by Powers of Two
(Shift Left)
(Title Page/Index)
Now that we've seen
a little motivation and covered a little theory about multiplying by
constants, it's time to look at multiplying by powers of two.
(But this chapter will also lean a bit more to theory than practice, even though there is practice. It also gets a little long. Please bear with me.)
Here's how to multiply a byte in accumulator B by 2 on the 6800, 6801, or
6809:
LSLB
And, of course, there's LSLA. And LSL n,X (indexed mode) or LSL
<address> (extended mode) allows you to quickly multiply any byte in
memory by two. It's faster in an accumulator, but direct shifts on memory
avoid saving and storing whatever is in the accumulators.
Unfortunately, we can't use the abbreviated direct page addressing on the
6800/6801 to save a byte and a fetch cycle, but since the direct page is the
lowest 256 bytes of memory, we can still shift operands there in extended
mode.
(Incidentally, the DP mode on the 6809 actually uses more cycles than extended
mode, darn it. Saves a byte and is as fast as indexed mode, anyway.)
(As another aside, in the 6805 microcontroller, which is sort of a half a
6800, direct page mode is provided for the read-modify-write instructions --
including shifts -- instead of extended mode. And you get the speedup.
This was definitely a good trade-off for the 6805.)
Why logical shift left instead of arithmetic shift left? Actually, on
Motorola's 68XX and 680XX series microprocessors, ASL is a synonym of LSL We
didn't see a good reason to make a distinction, and Motorola and other
companies didn't. We'll talk about that more when we get to division by
constants.
On the 68000, we can multiply a byte in D0 by two with
LSL.B #1,D0
I'll save the full addressing modes discussion for later, but the 68000 allows
logical shifts of any width on all data registers D0-D7.
However, byte-width shifts on operands in memory are
not provided.
So how about a 16-bit integer?
On the 6800 and 6809, for two bytes together in the accumulators, with the
high byte in A, it's
LSLB ; less significant byte in B
ROLA ; more significant byte in A
This works on the 6801, too, of course. But on the 6801, we also have
LSLD ; high in A, low in B
There is no LSLD on the 6809.
On the 6800, it's only a matter of convention to put the more significant byte
in A, and often the convention has been reversed in existing code. On the 6801
and 6809, you can still change or reverse the convention, but the ability to
treat the accumulator pair, A:B, as the single double accumulator D means you
usually want to follow the double-accumulator convention.
Now you can combine those with bytes in memory if you need to for some reason.
For instance, if you need to keep a counter in A, you can keep the more
significant byte on the stack and reference it by indexed mode, as below,
assuming the parameter stack pointer PSP is in X:
LSLB ; less significant byte
ROL 0,X ; more significant byte on stack (X has PSP)
On the 68000, we can multiply sixteen bits in D0 by two with
LSL.W #1,D0
But I already said that, didn't I?
However (Surprise?), the 68000 does provide single-bit shifts on 16-bit
wide word only operands in memory, accessed via (most) normal indexing
or absolute modes.
LSL.W (A6) ; shift top 16-bit word of parameter stack 1 bit left.
This was perhaps because the 68000 instruction set was originally designed for
16-bit wide memory designs. (I think they thought it was an optimization, and
it probably was at the time.) Again, we'll look at this more later.
Four-byte wide shifts? Just in case it's not clear, I'll show you one way you
might want to do it on the 6800, again assuming PSP in X and the more
significant bytes on top of stack:
LSLB ; least significant byte
ROLA ; next less significant byte
ROL 1,X ; next more significant byte on stack
ROL 0,X ; most significant byte on stack
On the 6801, when we use the accumulators for 32-bit shifts, we probably want
to use the double shift where we can, on the first shift:
LSLD ; least significant 16 bits
ROL 1,X ; next more significant byte on stack
ROL 0,X ; most significant byte on stack
This allows us to grab the carry from the double shift and pass it into the
following rotate instructions, which only exist in byte form on the 6801.
On the 6809, assuming we are using U for the parameter stack, and that the
more significant bytes are the topmost on the stack, it would be
LSLB ; least significant byte
ROLA ; next less significant byte
ROL 1,U ; next more significant byte on stack
ROL ,U ; most significant byte on stack
And on the 68000, as I have already said, we can multiply 32 bits in D0 by two
with
LSL.L #1,D0
32-bit CPUs are nice, aren't they?
What if you need to do shift a 32-bit wide integer on top of stack without
bringing it into a register for some reason? Since there is no 32-bit version
of LSL for operands in memory, you'll need to do this:
LSL.W 2(A6) ; shift less-significant 16-bit word 1 bit left, carry to C and X.
ROXL.W (A6) ; rotate X carry into more-significant 16-bit word, with 1-bit shift.
For the times you need it, it can be done.
And, incidentally, we see that the 68000 has something called an eXtended
carry for extending 1-bit carries. The C carry is for branching, and the X
carry is for extending. It's a bit weird, but useful.
What if you need to shift 64 bits left?
On the 8-bit CPUs, you'll start with LSL on the least significant byte, then
perform 7 more ROLs preceding in order from less to most significant. Because
at least 6 of the shifts will be in memory, it'll take at least 12 bytes
of ROL instructions plus the one (6801) or two (6800, 6809) for the
accumulator bytes.
I think I'd better show this doing it all in memory, and you can consider
whether it would be worth bringing the least significant bytes into the
accumulators:
LSL 7,X ; least significant byte (byte 7)
ROL 6,X ; next less significant byte (byte 6)
ROL 5,X ; next more significant byte (byte 5)
ROL 4,X ; next more significant byte (byte 4)
ROL 3,X ; next more significant byte (byte 3)
ROL 2,X ; next more significant byte (byte 2)
ROL 1,X ; next more significant byte (byte 1)
ROL 0,X ; most significant byte on (byte 0)
On the 6801, if you already have the least significant two bytes in D, you
could start it with a LSLD and save maybe three bytes. Or you could do LDD
6,X; LSLD and save one byte, maybe.
But if you do it without moving the least significant bytes into the
accumulators, you can do the whole 8 bytes without touching A or B.
Let's look at trying to save bytes by making a loop. We'll assume X is already
pointing to the least significant byte.
But if this isn't the case, we'll have to adjust X, and that will likely cost
at least 8 bytes of code on a 6800. (A series of 8 INX instructions is less
code than saving X in a DP variable, adding 8 to it using one or both
accumulators, and then loading it back to X.) Or it will take at least 3 bytes
of code on a 6801 (where it can be done with a LDAB #8 and an ABX.)
Here's the loop, for either 6800 or 6801:
* Assume X pointing to least significant byte (not likely)
LDAA #7 ; bytes to ROL
LSL 0,X ; least significant byte starts with LSL
SHL64L DEX ; carry not affected
ROL 0,X ; next more significant byte
DECA ; count down, carry not affected
BNE SHL64L ; do next
* Ends with X pointing to most significant byte
Is it worth saving two bytes, max, to use the loop? Or costing one to seven
extra bytes for the loop in code to pre-adjust X?
(Well, on the 6801, you may actually be able to shave a couple of INX
instructions by shifting the least significant bytes in D. I'll let you
calculate that out.)
The unrolled version is most likely better, although, if you need to do this
kind of shift on a variable number of bytes, you have now seen a bit of code
to start from.
The loop form can improve on the 6809, because of the index math the 6809
provides. But there are other approaches, and you may need to choose between
code size and speed for things like this on the 6809.
Hmm. I think I'd better show the loop and index adjustment on the 6809.
[JMR202412311146 addenda part 1:]
But first I want to show you another way, that uses accumulator offset:
[JMR202412311146 addenda part 1 end.]
* The 64 bit number to shift is on the parameter stack (U)
LDA #6 ; bytes to ROL minus 1
LSL 7,U ; least significant byte
SHL64L ROL A,U ; next more significant byte
DECA ; count down, carry not affected
BPL SHL64L ; do next (want to do 0, too)
Nine bytes for the loop, including the pointer math. There is another approach
that uses auto-decrement and LEA, that should be similar in cycle and byte
count, but this might be worth the three bytes saved in a really tight ROM or
something.
[JMR202412311146 addenda part 2:]
This is one way the auto-decrement method could be mapped into the 6809:
* The 64 bit number to shift is on the parameter stack (U)
LDAA #7 ; bytes to ROL
LEAX A,U
LSL ,X ; least significant byte starts with LSL
SHL64L ROL ,-X ; next more significant byte
DECA ; count down, carry not affected
BNE SHL64L ; do next
* Ends with X pointing to most significant byte
[JMR202412311146 addenda part 2 end.]
For shifting 64 bits left by 1 on the 68000, it's nicely straightforward:
LSL.L #1,D1 ; less significant long word, carry to C and X
ROXL.L #1,D0 ; rotate X carry into more-significant long word
I mentioned the eXtend carry above, and we can use it here, too. Doing it in
memory only would use the 16-bit shift and three of the 16-bit rotates.
How about multiplying 8 bits by 4?
For the 6800, 6801, and 6809:
LSLB ; multiply by 2
LSLB ; and again
The second time you multiply by 2, don't use ROL. That would be feeding the
top bit back into the bottom, which would be a different operation. Each time
you multiply by 2, you have to start with LSL, not a ROL. ROL is for catching
the bits between bytes.
For the 68000, oh, fun!:
LSL.B #2,D0 ; times 2^2
One instruction and done! WHEEEEE!!!
For multiplying by 2n, shift n times. Just be aware that the more
you shift without catching bits off the top, the more you lose bits.
What about multiplying 16 bits by 4? On the 8-bit processors, you have to
repeat the entire chain of shifts to avoid losing bits between the bytes. On
the 6800 and 6809, this is what you usually want:
LSLB ; multiply by 2
ROLA ; catch the carry
LSLB ; and again
ROLA ; catch the carry again
On the 6801,
LSLD ; multiply by 2
LSLD ; and again
I hope this is enough that you can see how to multiply 32-bit integers by the
constant 4 using shifts on the 8-bit processors. I could show them anyway as
an excuse to talk about optimization, but we need to move on.
On the 68000, you can specify the number of bits to shift, up to 8, as the
immediate argument to the shift, for all widths, byte, 16-bit, and full
32-bit. That means you can multiply a full 32-bit integer in a register by any
power of 2, up to 28, in a single instruction. The only cost is
that each bit costs an additional couple of CPU internal cycles, but compared
to the cost of fetching and encoding an additional instruction, it's not a
great cost.
Just remember that if you shift a byte 8 bits, every bit in the byte gets
shifted out. Thus
LSL.B #8,D0 ; times 2^8 -- WHEEEEE!?!?!???
is an expensive way to clear the lowest byte of D0.
Of course,
LSL.W #8,D0 ; 16 bits times 2^8
is one way to get the lowest byte in D0 up to the next higher byte position
(bits 15 to 8). (That's the difference between .B and
.W here.)
How about multiplying by 16? That's 24, so it's 4 shifts.
As we've already noted, on the 68000, it's just one instruction. LSL.B or
LSL.W or LSL.L and the operands are #4,Dn to shift Dn by 4 bits left.
On the 8-bit processors, if it's a byte value, that's 4 shifts on the same
operand in a row, which is not too bad. 4 instructions, 4 bytes. One more byte
than a JSR in extended mode.
LSLB ; multiply by 2
LSLB ; again by 2 is by 4
LSLB ; and again by 2 is by 8
LSLB ; and again by 2 is by 16
If it's a 16-bit value, for the 6800 and 6809, it's 4 pairs of LSLB; ROLA in a
row, 8 instructions in 8 bytes. That might be worth the performance hit of
making it into a loop, depending on how tight memory is:
LSLB ; multiply by 2
ROLA ; catch the carry
LSLB ; and again, to make it by 4
ROLA ; catch the carry again
LSLB ; and again, to make it by 8
ROLA ; catch the carry
LSLB ; and again, to make it by 16
ROLA ; catch the carry again
vs., say, on the 6809 where it's easiest to construct the loop:
MUL16W LDB #4
PSHU B
LDD 1,U ; operand on parameter stack
MUL16WL LSLB ; multiply by 2
ROLA ; catch the carry
DEC ,U
BNE MUL16WL
LEAU 1,U
STD ,U
But, as you can see, for the cost of setting up the loop, you might as well
just do it.
On the other hand, that code easily turns into a general loop for shifting
left n bytes, so don't forget it.
For the 6801, it's again,
LSLD ; multiply by 2
LSLD ; and again, making it by 4
LSLD ; and again, making it by 8
LSLD ; and again, making it by 16
On the 6801, it won't be worth making a loop for 16-bit integers until we try
to multiply by constants greater than 28, which are pretty rare,
really. And ...
Speaking of multiplying by 28, if we are following the A:B is D
convention,
TBA ; multiply by 256
CLRB ; don't forget to zero out low byte.
does the trick for rotating by 8 bits. And you may not even need to explicitly
do this -- may be good enough to just move it in memory.
While we're thinking about this, let's look at another way to multiply by 256
on the 68000 (or to get the lowest byte of Dn into bits 15 to 8:
MOVE.B D0,-(A6) ; this does NOT work on A7!
CLR.B -(A6)
MOVE.W (A6)+,D0
Usually, you would not want to do it this way, because you're having to read
and write memory, but there are times it can be useful.
The reason it doesn't work on A7? A7 absolutely must stay 16-bit aligned on
the 68000, because of return addresses. So when you push a byte to A7 -- any
MOVE.B Dn,-(A7) -- the 68000 magically double decrements A7 for you. And
when you pop a byte from A7 -- any MOVE.B (A7)+,Dn -- it automatically
double increments A7 for you.
(Don't get frustrated about this. As I have hinted before, you really don't
want to store strings on the A7 stack.)
Multiplying by 32, 64, and 128 are also not nearly as common as by 2, 4, 8,
and 16. It may be a better use of resources to just let them go to a general
multiply routine (which, as I say, I will show you shortly).
Except, we do actually want to look at multiplying a byte by 25,
and, by implication, 26, and 27, because I can
demonstrate now a little about how multiplying is dividing.
Multiply B by 128, leaving result in A:
CLRA ; for result
LSRB ; bit 0 to carry
RORA ; bit 0 of B now in bit 7 of A
Note that A now has the less significant byte, and B has the more significant
byte.
The following also works, but takes an extra byte of code while leaving A
untouched, and loses any of the more significant bits than bit 0:
LSRB ; bit 0 to carry
RORB ; now to bit 7
ANDB #$80 ; chop off the lost, double-shifted high bits
[JMR202401010106 addendum:]
The above could be useful in saturation math, or if we know the input will
only be 0 or 1.
[JMR202501071039 edit:]
If we want to keep all the bits that we might be losing, we would have to
store the intermediate result away first, probably using A, after all:
If we want to start with saturation math, but keep all the bits that we would
otherwise be losing, we might try to store the intermediate result away in like this:
LSRB ; bit 0 to carry
RORB ; now to bit 7
TBA
ANDB #$80 ; chop off the high bits
ANDA #$7F ; chop off the low bit (but ...)
But the high bits were already shifted too many times, so we have to un-shift
them. Fortunately, the carry has not been altered by the TBA or AND
instructions, so we can rotate the carry back in at several points. Taking the
earliest point, we could do it like this:
LSRB ; bit 0 to carry
RORB ; now to bit 7
TBA
ROLA ; bring the high bits back into position
ANDB #$80 ; chop off the high bits
ANDA #$7F ; chop off the low bit
But if we back up and take the copy first,
TBA ; make two halves
LSRA ; bit 0 to carry, high bits
RORB ; now to bit 7
ANDB #$80 ; chop off the high bits
Does it seem like we did that above?
Back up and look at the code and see what's different and think about why the effect is the same -- when done correctly.
[JMR202501071039 edit end.]
[JMR202401010106 addendum end.]
It could also be interpreted as a divide, which we will talk about later.
Either way can be extended to multiply by 64 as in the following, leaving the
result in A:
CLRA ; for result
LSRB ; bit 0 to carry
RORA ; old bit 0 of B now in bit 7 of A
LSRB ; old bit 1 of B to carry
RORA ; old bit 1,0 of B now in bit 7,6 of A
Not touching A to multiply by 64:
LSRB ; bit 0 to carry
RORB ; now to bit 7, old bit 1 to carry
RORB ; now to bit 7,6 in order
ANDB #$C0 ; chop off the remainder
And we can see that using A in byte inversion to multiply by 32 costs more
bytes and cycles than not touching A to multiply by 32:
LSRB ; bit 0 to carry
RORB ; now to bit 7, old bit 1 to carry
RORB ; now to bit 7,6 in order, old bit 2 to carry
RORB ; now to bit 7,6,5 in order
ANDB #$E0 ; chop off the remainder
This works for 16-bit integers as well, but for multiplying by 215,
214, 213, and so forth. We'll show multiplying
213:
LSRB ; bit 0 to carry
RORA ; now to bit 15, bit 8 to carry
RORB ; bit 8 to bit 7, old bit 1 to carry
RORA ; old bit 1 to bit 15 in order
RORB ; old bit 9 to bit 7, old bit 2 to carry
RORA ; old bit 2 to bit 15 in order
RORB ; old bit 10 to bit 7 (old bit 3 to carry)
ANDA #$E0 ; chop off the top bytes, ignore carry
Yes, that does lose 13 bits off the top. If you wanted them, you needed to
allocate another couple of bytes to save them in.
Now, how does this play out on the 68000?
Nicely enough:
ROR.B #3,D0 ; 8-bit rotate wraps around!
AND.B #$E0,D0 ; mask out remainder.
ROR and ROL are register-wide rotations -- 8-, 16-, and 32-bits wide. They
copy the bit that wraps to the other end into both the eXtend carry and the
test/branch Carry bits.
ROXR and ROXL are register-plus-carry rotations -- 9-, 17-, and 33- bits wide.
They rotate the bit that comes out of the one end into the X carry, rotating the
old X carry into the other end, and copy the new X carry into the C carry.
This is worth remembering to help shave cycle counts -- not so much for 5 bit
shifts, but for 6 and 7 on bytes, yes. And, definitely, 3 bits right is faster
than 13 bits left:
ROR.W #3,D0 ; 16-bit rotate wraps around!
AND.W #$E000,D0 ; mask out remainder.
Wait. I said you can shift by immediate bit counts up to 8. 13 is more than
eight. But Motorola did leave us another way, that we don't have to do
LSL.W #8,D0 ; First shift 8
LSL.W #5,D0 ; Then shift the rest.
Not that we wanted to shift a 16-bit word by 13 bits, but we can also do it in
one shift, using variable shift counts:
MOVEQ #13,D1 ; shift count
LSL.W D1,D0 ; 16 bit shift by count in D1
So there are several ways to do it. You'll want to look at op-code byte counts
and cycle timing when there isn't any other reason to choose between
them.
Backing up a bit to register-wide rotations, 8- and 16-bit wide rotate
instructions are missing in the 6800, 6801, and 6809. You have to explicitly
move the old carry bit(s) in when you need to do register-wide. The general
approach is to make a copy and do two logical shifts, one in the desired
direction and count, the other in the complementary count in the opposite
direction, then bit-or the results together:
* accumulator-wide ROL by 3 / ROR by 5:
LDAA 0,X ; copy
LSL 0,X ; shift left by 3
LSL 0,X
LSL 0,X
LSRA ; shift right by 5
LSRA ; (use the faster shift accumulator)
LSRA
LSRA
LSRA
ORAA 0,X ; put results together
This works because logical shifts leave 0 bits behind in the vacated bits.
[JMR202501040801 addendum: (Shifting to 6800 on purpose here. We'll see why
below.) ]
The 6800, 6801, and 6809 do not have an instruction to OR the accumulators
together, but when you know the bits are set in only one result or the other,
you can add them together to get the same result:
* accumulator-wide ROL by 3 / ROR by 5:
LDAB 0,X ; copy
TBA
LSLA ; shift left by 3
LSLA
LSLA
LSRB ; shift right by 5
LSRB
LSRB
LSRB
LSRB
ABA ; put the results together
[JMR202501041102 correction:]
Instead of LSRB above, I had originally written RORB, which is, of course,
going to mix bits in a semi-arbitrary non-random way, thus, a bug. Erk.
[JMR202501041102 correction end.]
[JMR202501040801 addendum:]
(6809 does not even have an ABA instruction. You have to push it to a stack and
add post-increment. So you should OR post-increment, so that you remember that
ADDing was a substitute for ORring.
It's easy to think it was unnecessary cost cutting, and I tend to lean toward
the solution that the
6309's hidden instruction set
used, of having the ADDR (and ORR) register to register instructions with
arguments like TFR and EXG use. But I also recognize the awkwardness of
providing an addition path between D and the index registers outside the LEA
instructions. Whether or not you include a tertiary ALU, the data paths get
pretty complex. With modern design tools, it's not so hard, but it was not all
that easy with the tools they had in 1978.)
[JMR202501040801 addendum end.]
When it's a single bit left rotation, there's a neat trick:
ROLB8 LSLB ; 8-bit wide rotation
ADCB #0 ; move the carry in
Single bit right doesn't fair so well, however:
ROR8B LSRB
BCC ROR8BN
ORAB #$80
ROR8BN NOP ; next instruction
or, using the other accumulor to get the low bit in the carry first,
ROR8B TBA
LSRA ; get low bit in carry first
RORB
are the best I'm aware of.
What about 16-bit and 32-bit integer rotation? I think we have enough here to
figure them out on the 6800, 6801, and 6809.
And Motorola gave us another way to optimize 32-bit shifts and rotates on the
68000:
SWAP D0 ; Effectively a 16-bit rotate on D0!
SWAP exchanges the low and high halves of the data register. It's a 16-bit
instruction and only takes four cycles, so it's a good way to avoid 2 cycles
per bit for really long shifts and rotates. For instance, if we want to shift
a 32-bit integer in D0 left by 20 bits, we can use SWAP to speed it up:
SWAP D0 ; rotate by 16 bits
CLR.W D0 ; mask out low half
LSL.L #4,D0 ; shift the remaining 4 bits.
This is all very interesting, but it doesn't really seem to be taking us any
closer to multiplying by ten? Have we just gotten lost in shifting and masking? Is this all just shifty business?
Shifting bits is not something people do every day, so I'm trying to expose
you to a lot of things that you can do shifting bits before we use them for
the real stuff.
If there's something you really want to check in the above, make up your own
test code and check it. (I may have made a mistake? ;-)
If you find mistakes, leave me a note in the comments, please.
Otherwise, hang on for the ride.
It'll make more sense in the chapter after next, where we do some fast multiplying
by a few small constants that are not powers of 2.
In the next chapter, you can look at some (incompletely tested) demonstration code for the 6800.
(Title Page/Index)