I'm leaving this here for reference for a little while. Eventually, I plan to do a chapter on shifts, and most of this will be taken up there.
Multiplying by Powers of Two
(Shift Left)
Now that we've seen a little motivation and covered a little theory about multiplying by constants, it's time to look at multiplying by powers of two.
(But this chapter will also lean a bit more to theory than practice, even though there is practice. It also gets a little long. Please bear with me.)
Here's how to multiply a byte in accumulator B by 2 on the 6800, 6801, or 6809:
LSLB
And, of course, there's LSLA. And LSL n,X (indexed mode) or LSL <address> (extended mode) allows you to quickly multiply any byte in memory by two. It's faster in an accumulator, but direct shifts on memory avoid saving and restoring whatever is in the accumulators.
Unfortunately, we can't use the abbreviated direct page addressing on the 6800/6801 to save a byte and a fetch cycle, but since the direct page is the lowest 256 bytes of memory, we can still shift operands there in extended mode.
(Incidentally, going by the data sheet cycle counts, DP mode on the 6809 saves both a byte and a cycle over extended mode for these instructions, and is as fast as the simplest indexed mode.)
(As another aside, in the 6805 microcontroller, which is sort of a half a
6800, direct page mode is provided for the read-modify-write instructions --
including shifts -- instead of extended mode. And you get the speedup.
This was definitely a good trade-off for the 6805.)
Why logical shift left instead of arithmetic shift left? Actually, on
Motorola's 68XX series microprocessors, ASL is a synonym of LSL. (On the
680XX series they are separate instructions that produce the same result and
differ only in how they set the overflow flag.) We didn't see a good reason to
make a distinction in the result, and neither did Motorola and other
companies. We'll talk about that more when we get to division by constants.
On the 68000, we can multiply a byte in D0 by two with
LSL.B #1,D0
I'll save the full addressing modes discussion for later, but the 68000 allows logical shifts of any width on all data registers D0-D7.
However, byte-width shifts on operands in memory are
not provided.
So how about a 16-bit integer?
On the 6800 and 6809, for two bytes together in the accumulators, with the high byte in A, it's
LSLB ; less significant byte in B
ROLA ; more significant byte in A
This works on the 6801, too, of course. But on the 6801, we also have
LSLD ; high in A, low in B
There is no LSLD on the 6809.
On the 6800, it's only a matter of convention to put the more significant byte in A, and often the convention has been reversed in existing code. On the 6801 and 6809, you can still change or reverse the convention, but the ability to treat the accumulator pair, A:B, as the single double accumulator D means you usually want to follow the double-accumulator convention.
Now you can combine those with bytes in memory if you need to for some reason. For instance, if you need to keep a counter in A, you can keep the more significant byte on the stack and reference it by indexed mode, as below, assuming the parameter stack pointer PSP is in X:
LSLB ; less significant byte
ROL 0,X ; more significant byte on stack (X has PSP)
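If you want to check the LSL/ROL carry chain without an assembler handy, here is a minimal C sketch of it. It is only a model of the two instructions, not code for any of these CPUs, and the names (Acc, lslb, rola, shl16) are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical model: accumulators A and B plus the carry flag. */
typedef struct { uint8_t a, b; int carry; } Acc;

/* LSLB: bit 7 of B goes to carry, a 0 comes in at bit 0. */
void lslb(Acc *r) {
    r->carry = (r->b >> 7) & 1;
    r->b = (uint8_t)(r->b << 1);
}

/* ROLA: old carry comes in at bit 0 of A, bit 7 of A goes to carry. */
void rola(Acc *r) {
    int out = (r->a >> 7) & 1;
    r->a = (uint8_t)((r->a << 1) | r->carry);
    r->carry = out;
}

/* Multiply the 16-bit value A:B (high byte in A) by two. */
uint16_t shl16(uint16_t d) {
    Acc r = { (uint8_t)(d >> 8), (uint8_t)d, 0 };
    lslb(&r);   /* less significant byte */
    rola(&r);   /* more significant byte catches the carry */
    return (uint16_t)((r.a << 8) | r.b);
}
```

The carry is the only thing connecting the two bytes, which is why the order (low byte first) matters.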
On the 68000, we can multiply sixteen bits in D0 by two with
LSL.W #1,D0
But I already said that, didn't I?
However (Surprise?), the 68000 does provide single-bit shifts on operands in memory, for 16-bit words only, accessed via (most) normal indexing or absolute modes.
LSL.W (A6) ; shift top 16-bit word of parameter stack 1 bit left.
This was perhaps because the 68000 instruction set was originally designed for 16-bit wide memory designs. (I think they thought it was an optimization, and it probably was at the time.) Again, we'll look at this more later.
Four-byte wide shifts? Just in case it's not clear, I'll show you one way you might want to do it on the 6800, again assuming PSP in X and the more significant bytes on top of stack:
LSLB ; least significant byte
ROLA ; next less significant byte
ROL 1,X ; next more significant byte on stack
ROL 0,X ; most significant byte on stack
On the 6801, when we use the accumulators for 32-bit shifts, we probably want
to use the double shift where we can, on the first shift:
LSLD ; least significant 16 bits
ROL 1,X ; next more significant byte on stack
ROL 0,X ; most significant byte on stack
This allows us to grab the carry from the double shift and pass it into the
following rotate instructions, which only exist in byte form on the 6801.
On the 6809, assuming we are using U for the parameter stack, and that the
more significant bytes are the topmost on the stack, it would be
LSLB ; least significant byte
ROLA ; next less significant byte
ROL 1,U ; next more significant byte on stack
ROL ,U ; most significant byte on stack
And on the 68000, as I have already said, we can multiply 32 bits in D0 by two
with
LSL.L #1,D0
32-bit CPUs are nice, aren't they?
What if you need to shift a 32-bit wide integer on top of stack without
bringing it into a register for some reason? Since there is no 32-bit version
of LSL for operands in memory, you'll need to do this:
LSL.W 2(A6) ; shift less-significant 16-bit word 1 bit left, carry to C and X.
ROXL.W (A6) ; rotate X carry into more-significant 16-bit word, with 1-bit shift.
For the times you need it, it can be done.
And, incidentally, we see that the 68000 has something called an eXtended carry for extending 1-bit carries. The C carry is for branching, and the X carry is for extending. It's a bit weird, but useful.
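Here is a rough C model of how the X carry hands the bit from the low word to the high word in the LSL.W / ROXL.W pair. The Flags struct and the function names are hypothetical, and only the two carry flags are modeled.

```c
#include <stdint.h>

/* Hypothetical model: just the C and X flags of the 68000. */
typedef struct { int c, x; } Flags;

/* LSL.W on a memory word (1 bit): bit 15 goes to both C and X. */
uint16_t lsl_w1(uint16_t w, Flags *f) {
    f->c = f->x = (w >> 15) & 1;
    return (uint16_t)(w << 1);
}

/* ROXL.W (1 bit): X comes in at bit 0, bit 15 goes out to C and X. */
uint16_t roxl_w1(uint16_t w, Flags *f) {
    int out = (w >> 15) & 1;
    w = (uint16_t)((w << 1) | f->x);
    f->c = f->x = out;
    return w;
}

/* Shift a 32-bit value stored as two words, high word first. */
void shl32_mem(uint16_t stack[2]) {
    Flags f = { 0, 0 };
    stack[1] = lsl_w1(stack[1], &f);    /* LSL.W 2(A6)  */
    stack[0] = roxl_w1(stack[0], &f);   /* ROXL.W (A6)  */
}
```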
What if you need to shift 64 bits left?
On the 8-bit CPUs, you'll start with LSL on the least significant byte, then perform 7 more ROLs, proceeding in order from less to more significant. Because at least 6 of the shifts will be in memory, it'll take at least 12 bytes of ROL instructions plus the one (6801) or two (6800, 6809) for the accumulator bytes.
I think I'd better show this doing it all in memory, and you can consider whether it would be worth bringing the least significant bytes into the accumulators:
LSL 7,X ; least significant byte (byte 7)
ROL 6,X ; next less significant byte (byte 6)
ROL 5,X ; next more significant byte (byte 5)
ROL 4,X ; next more significant byte (byte 4)
ROL 3,X ; next more significant byte (byte 3)
ROL 2,X ; next more significant byte (byte 2)
ROL 1,X ; next more significant byte (byte 1)
ROL 0,X ; most significant byte (byte 0)
On the 6801, if you already have the least significant two bytes in D, you could start it with a LSLD and save maybe three bytes. Or you could do LDD 6,X; LSLD and save one byte, maybe.
But if you do it without moving the least significant bytes into the
accumulators, you can do the whole 8 bytes without touching A or B.
Let's look at trying to save bytes by making a loop. We'll assume X is already pointing to the least significant byte.
But if this isn't the case, we'll have to adjust X. With X pointing to the most significant byte, the least significant byte is 7 bytes up, so that will likely cost at least 7 bytes of code on a 6800. (A series of 7 INX instructions is less code than saving X in a DP variable, adding 7 to it using one or both accumulators, and then loading it back to X.) Or it will take at least 3 bytes of code on a 6801 (where it can be done with a LDAB #7 and an ABX.)
Here's the loop, for either 6800 or 6801:
* Assume X pointing to least significant byte (not likely)
LDAA #7 ; bytes to ROL
LSL 0,X ; least significant byte starts with LSL
SHL64L DEX ; carry not affected
ROL 0,X ; next more significant byte
DECA ; count down, carry not affected
BNE SHL64L ; do next
* Ends with X pointing to most significant byte
Is it worth saving two bytes, max, to use the loop? Or costing one to seven extra bytes for the loop in code to pre-adjust X?
(Well, on the 6801, you may actually be able to shave a couple of INX
instructions by shifting the least significant bytes in D. I'll let you
calculate that out.)
The unrolled version is most likely better, although, if you need to do this
kind of shift on a variable number of bytes, you have now seen a bit of code
to start from.
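For the variable-length case, the idea of the loop can be sketched in C like this. It is a model only, assuming the number is stored most significant byte first as in the listings above; shl_bytes and its parameters are names invented for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift an n-byte integer (most significant byte first) left by one
   bit, in place. Returns the bit that fell off the top, which is what
   the carry flag would hold after the last ROL. */
int shl_bytes(uint8_t *msb, size_t n) {
    int carry = 0;                     /* the LSL brings in a 0 */
    for (size_t i = n; i-- > 0; ) {    /* least significant byte first */
        int out = (msb[i] >> 7) & 1;
        msb[i] = (uint8_t)((msb[i] << 1) | carry);
        carry = out;
    }
    return carry;
}
```

Starting the carry at 0 is what makes the first pass behave like LSL and every later pass like ROL.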
The loop form can improve on the 6809, because of the index math the 6809 provides. But there are other approaches, and you may need to choose between code size and speed for things like this on the 6809.
Hmm. I think I'd better show the loop and index adjustment on the 6809.
[JMR202412311146 addenda part 1:]
But first I want to show you another way, that uses accumulator offset:[JMR202412311146 addenda part 1 end.]
* The 64 bit number to shift is on the parameter stack (U)
LDA #6 ; bytes to ROL minus 1
LSL 7,U ; least significant byte
SHL64L ROL A,U ; next more significant byte
DECA ; count down, carry not affected
BPL SHL64L ; do next (want to do 0, too)
Nine bytes for the loop, including the pointer math. There is another approach that uses auto-decrement and LEA, that should be similar in cycle and byte count, but this might be worth the three bytes saved in a really tight ROM or something.
[JMR202412311146 addenda part 2:]
This is one way the auto-decrement method could be mapped into the 6809:
* The 64 bit number to shift is on the parameter stack (U)
LDA #7 ; bytes to ROL
LEAX A,U
LSL ,X ; least significant byte starts with LSL
SHL64L ROL ,-X ; next more significant byte
DECA ; count down, carry not affected
BNE SHL64L ; do next
* Ends with X pointing to most significant byte
[JMR202412311146 addenda part 2 end.]
For shifting 64 bits left by 1 on the 68000, it's nicely straightforward:
LSL.L #1,D1 ; less significant long word, carry to C and X
ROXL.L #1,D0 ; rotate X carry into more-significant long word
I mentioned the eXtend carry above, and we can use it here, too. Doing it in
memory only would use the 16-bit shift and three of the 16-bit rotates.
How about multiplying 8 bits by 4?
For the 6800, 6801, and 6809:
LSLB ; multiply by 2
LSLB ; and again
The second time you multiply by 2, don't use ROL. That would be feeding the
top bit back into the bottom, which would be a different operation. Each time
you multiply by 2, you have to start with LSL, not a ROL. ROL is for catching
the bits between bytes.
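A small C model makes the difference easy to see. The Reg struct and the function names are hypothetical; lsl and rol are meant to behave like the 8-bit LSL and ROL (rotate through carry) described above.

```c
#include <stdint.h>

/* Hypothetical model: one 8-bit register plus the carry flag. */
typedef struct { uint8_t b; int carry; } Reg;

void lsl(Reg *r) {
    r->carry = (r->b >> 7) & 1;
    r->b = (uint8_t)(r->b << 1);
}

void rol(Reg *r) {
    int out = (r->b >> 7) & 1;
    r->b = (uint8_t)((r->b << 1) | r->carry);
    r->carry = out;
}

/* LSL, LSL: a real multiply by 4 (modulo 256). */
uint8_t times4_right(uint8_t v) {
    Reg r = { v, 0 };
    lsl(&r);
    lsl(&r);
    return r.b;
}

/* LSL, ROL: the bit the first shift pushed out leaks back into bit 0. */
uint8_t times4_wrong(uint8_t v) {
    Reg r = { v, 0 };
    lsl(&r);
    rol(&r);
    return r.b;
}
```

With $81 as input, the right version gives $04 and the wrong one gives $05. When the top bit of the input is clear, the two happen to agree, which is how this kind of bug hides.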
For the 68000, oh, fun!:
LSL.B #2,D0 ; times 2^2
One instruction and done! WHEEEEE!!!
For multiplying by 2^n, shift n times. Just be aware that the more
you shift without catching bits off the top, the more you lose bits.
What about multiplying 16 bits by 4? On the 8-bit processors, you have to
repeat the entire chain of shifts to avoid losing bits between the bytes. On
the 6800 and 6809, this is what you usually want:
LSLB ; multiply by 2
ROLA ; catch the carry
LSLB ; and again
ROLA ; catch the carry again
On the 6801,
LSLD ; multiply by 2
LSLD ; and again
I hope this is enough that you can see how to multiply 32-bit integers by the constant 4 using shifts on the 8-bit processors. I could show them anyway as an excuse to talk about optimization, but we need to move on.
On the 68000, you can specify the number of bits to shift, up to 8, as the immediate argument to the shift, for all widths, byte, 16-bit, and full 32-bit. That means you can multiply a full 32-bit integer in a register by any power of 2, up to 2^8, in a single instruction. The only cost is that each bit costs an additional couple of CPU internal cycles, but compared to the cost of fetching and decoding an additional instruction, it's not a great cost.
Just remember that if you shift a byte 8 bits, every bit in the byte gets
shifted out. Thus
LSL.B #8,D0 ; times 2^8 -- WHEEEEE!?!?!???
is an expensive way to clear the lowest byte of D0.
Of course,
LSL.W #8,D0 ; 16 bits times 2^8
is one way to get the lowest byte in D0 up to the next higher byte position
(bits 15 to 8). (That's the difference between .B and
.W here.)
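To make the .B versus .W difference concrete, here is a hedged C model of what happens to the full 32-bit register. The function names are made up, and the model assumes shift counts from 1 to 8, as with the immediate form.

```c
#include <stdint.h>

/* LSL.B #n,Dn: only bits 7..0 take part; bits 31..8 are left alone. */
uint32_t lsl_b(uint32_t d, unsigned n) {
    uint32_t low = (d & 0xFFu) << n;
    return (d & 0xFFFFFF00u) | (low & 0xFFu);
}

/* LSL.W #n,Dn: only bits 15..0 take part; bits 31..16 are left alone. */
uint32_t lsl_w(uint32_t d, unsigned n) {
    uint32_t low = (d & 0xFFFFu) << n;
    return (d & 0xFFFF0000u) | (low & 0xFFFFu);
}
```

Starting from $12345678, the byte shift by 8 leaves $12345600 and the word shift by 8 leaves $12347800.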
How about multiplying by 16? That's 2^4, so it's 4 shifts.
As we've already noted, on the 68000, it's just one instruction. LSL.B or
LSL.W or LSL.L and the operands are #4,Dn to shift Dn by 4 bits left.
On the 8-bit processors, if it's a byte value, that's 4 shifts on the same
operand in a row, which is not too bad. 4 instructions, 4 bytes. One more byte
than a JSR in extended mode.
LSLB ; multiply by 2
LSLB ; again by 2 is by 4
LSLB ; and again by 2 is by 8
LSLB ; and again by 2 is by 16
If it's a 16-bit value, for the 6800 and 6809, it's 4 pairs of LSLB; ROLA in a row, 8 instructions in 8 bytes. That might be worth the performance hit of making it into a loop, depending on how tight memory is:
LSLB ; multiply by 2
ROLA ; catch the carry
LSLB ; and again, to make it by 4
ROLA ; catch the carry again
LSLB ; and again, to make it by 8
ROLA ; catch the carry
LSLB ; and again, to make it by 16
ROLA ; catch the carry again
vs., say, on the 6809 where it's easiest to construct the loop:
MUL16W LDB #4
PSHU B
LDD 1,U ; operand on parameter stack
MUL16WL LSLB ; multiply by 2
ROLA ; catch the carry
DEC ,U
BNE MUL16WL
LEAU 1,U
STD ,U
But, as you can see, for the cost of setting up the loop, you might as well just do it.
On the other hand, that code easily turns into a general loop for shifting
left n bits, so don't forget it.
For the 6801, it's again,
LSLD ; multiply by 2
LSLD ; and again, making it by 4
LSLD ; and again, making it by 8
LSLD ; and again, making it by 16
On the 6801, it won't be worth making a loop for 16-bit integers until we try to multiply by constants greater than 2^8, which are pretty rare, really. And ...
Speaking of multiplying by 2^8, if we are following the A:B is D
convention,
TBA ; multiply by 256
CLRB ; don't forget to zero out low byte.
does the trick for shifting left by 8 bits. And you may not even need to explicitly do this -- it may be good enough to just move it in memory.
While we're thinking about this, let's look at another way to multiply by 256 on the 68000 (or to get the lowest byte of Dn into bits 15 to 8):
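CLR.B -(A6) ; zero goes at the higher address, the low byte of the word
MOVE.B D0,-(A6) ; this does NOT work on A7! Byte lands in the high byte of the word
MOVE.W (A6)+,D0 ; big-endian word: old low byte is now in bits 15 to 8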
Usually, you would not want to do it this way, because you're having to read and write memory, but there are times it can be useful.
The reason it doesn't work on A7? A7 absolutely must stay 16-bit aligned on the 68000, because of return addresses. So when you push a byte to A7 -- any MOVE.B Dn,-(A7) -- the 68000 magically double decrements A7 for you. And when you pop a byte from A7 -- any MOVE.B (A7)+,Dn -- it automatically double increments A7 for you.
(Don't get frustrated about this. As I have hinted before, you really don't
want to store strings on the A7 stack.)
Multiplying by 32, 64, and 128 is also not nearly as common as by 2, 4, 8, and 16. It may be a better use of resources to just let them go to a general multiply routine (which, as I say, I will show you shortly).
Except, we do actually want to look at multiplying a byte by 2^5, and, by implication, 2^6, and 2^7, because I can demonstrate now a little about how multiplying is dividing.
Multiply B by 128, leaving result in A:
CLRA ; for result
LSRB ; bit 0 to carry
RORA ; bit 0 of B now in bit 7 of A
Note that A now has the less significant byte, and B has the more significant
byte.
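A quick way to convince yourself that one shift right really is a multiply by 128 is to model the three instructions in C and compare against ordinary multiplication. This is only a sketch, and times128 is a name invented for it.

```c
#include <stdint.h>

/* Multiply an 8-bit value by 128 and return the 16-bit product.
   Mirrors CLRA / LSRB / RORA: B ends up holding the more significant
   byte and A the less significant byte. */
uint16_t times128(uint8_t b) {
    uint8_t a = 0;                            /* CLRA */
    int carry = b & 1;                        /* LSRB: bit 0 to carry */
    b = (uint8_t)(b >> 1);
    a = (uint8_t)((a >> 1) | (carry << 7));   /* RORA: carry into bit 7 */
    return (uint16_t)((b << 8) | a);
}
```

Shifting the 16-bit result left by 7 is the same as shifting it left by 8 (moving to the next byte) and then right by 1, which is the sense in which multiplying is dividing here.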
The following also works, but takes an extra byte of code while leaving A untouched, and loses all the bits more significant than bit 0:
LSRB ; bit 0 to carry
RORB ; now to bit 7
ANDB #$80 ; chop off the lost, double-shifted high bits
[JMR202401010106 addendum:]
The above could be useful in saturation math, or if we know the input will only be 0 or 1.
[JMR202501071039 edit:]
If we want to start with saturation math, but keep all the bits that we would
otherwise be losing, we might try to store the intermediate result away like this:
LSRB ; bit 0 to carry
RORB ; now to bit 7
TBA
ANDB #$80 ; chop off the high bits
ANDA #$7F ; chop off the low bit (but ...)
But the high bits were already shifted too many times, so we have to un-shift
them. Fortunately, the carry has not been altered by the TBA or AND
instructions, so we can rotate the carry back in at several points. Taking the
earliest point, we could do it like this:
LSRB ; bit 0 to carry
RORB ; now to bit 7
TBA
ROLA ; bring the high bits back into position
ANDB #$80 ; chop off the high bits
ANDA #$7F ; chop off the low bit
But if we back up and take the copy first,
TBA ; make two halves
LSRA ; bit 0 to carry, high bits
RORB ; now to bit 7
ANDB #$80 ; chop off the high bits
Does it seem like we did that above?
Back up and look at the code and see what's different and think about why the effect is the same -- when done correctly.
[JMR202501071039 edit end.]
[JMR202401010106 addendum end.]
It could also be interpreted as a divide, which we will talk about later.
Either way can be extended to multiply by 64 as in the following, leaving the result in A:
CLRA ; for result
LSRB ; bit 0 to carry
RORA ; old bit 0 of B now in bit 7 of A
LSRB ; old bit 1 of B to carry
RORA ; old bit 1,0 of B now in bit 7,6 of A
Not touching A to multiply by 64:
LSRB ; bit 0 to carry
RORB ; now to bit 7, old bit 1 to carry
RORB ; now to bit 7,6 in order
ANDB #$C0 ; chop off the remainder
And we can see that using A in byte inversion to multiply by 32 costs more bytes and cycles than not touching A to multiply by 32:
LSRB ; bit 0 to carry
RORB ; now to bit 7, old bit 1 to carry
RORB ; now to bit 7,6 in order, old bit 2 to carry
RORB ; now to bit 7,6,5 in order
ANDB #$E0 ; chop off the remainder
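The same sequence can be checked against a plain multiply with a small C model. The Reg struct and the names lsr, ror, and times32 are assumptions of the sketch; ror here rotates right through the carry, as RORB does.

```c
#include <stdint.h>

/* Hypothetical model: accumulator B plus the carry flag. */
typedef struct { uint8_t b; int carry; } Reg;

/* LSRB: bit 0 goes to carry, a 0 comes in at bit 7. */
void lsr(Reg *r) {
    r->carry = r->b & 1;
    r->b = (uint8_t)(r->b >> 1);
}

/* RORB: old carry comes in at bit 7, bit 0 goes to carry. */
void ror(Reg *r) {
    int out = r->b & 1;
    r->b = (uint8_t)((r->b >> 1) | (r->carry << 7));
    r->carry = out;
}

/* Multiply by 32 (modulo 256) without touching another register. */
uint8_t times32(uint8_t v) {
    Reg r = { v, 0 };
    lsr(&r);                    /* LSRB */
    ror(&r);                    /* RORB */
    ror(&r);                    /* RORB */
    ror(&r);                    /* RORB */
    return (uint8_t)(r.b & 0xE0);   /* ANDB #$E0 */
}
```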
This works for 16-bit integers as well, but for multiplying by 2^15, 2^14, 2^13, and so forth. We'll show multiplying by 2^13:
LSRB ; bit 0 to carry
RORA ; now to bit 15, bit 8 to carry
RORB ; bit 8 to bit 7, old bit 1 to carry
RORA ; old bit 1 to bit 15 in order
RORB ; old bit 9 to bit 7, old bit 2 to carry
RORA ; old bit 2 to bit 15 in order
CLRB ; low byte of the result is all zeros
ANDA #$E0 ; keep old bits 2,1,0 in bits 15,14,13, chop off the rest
Yes, that does lose 13 bits off the top. If you wanted them, you needed to
allocate another couple of bytes to save them in.
Now, how does this play out on the 68000?
Nicely enough:
ROR.B #3,D0 ; 8-bit rotate wraps around!
AND.B #$E0,D0 ; mask out remainder.
ROR and ROL are register-wide rotations -- 8, 16, and 32 bits wide. They
copy the bit that wraps to the other end into the test/branch Carry bit, but
leave the eXtend carry alone. (ROXR and ROXL are the ones that rotate through X.)
This is worth remembering to help shave cycle counts -- not so much for 5 bit
shifts, but for 6 and 7 on bytes, yes. And, definitely, 3 bits right is faster
than 13 bits left:
ROR.W #3,D0 ; 16-bit rotate wraps around!
AND.W #$E000,D0 ; mask out remainder.
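The rotate-and-mask idea is easy to model in C if you want to check it against a plain left shift. The helper names are hypothetical, and the rotates here are the wrap-around kind (no carry in the loop), like ROR.B and ROR.W.

```c
#include <stdint.h>

/* 8-bit wrap-around rotate right, like ROR.B #n. */
uint8_t ror8(uint8_t v, unsigned n) {
    n &= 7;
    return (uint8_t)((v >> n) | (v << ((8 - n) & 7)));
}

/* 16-bit wrap-around rotate right, like ROR.W #n. */
uint16_t ror16(uint16_t v, unsigned n) {
    n &= 15;
    return (uint16_t)((v >> n) | (v << ((16 - n) & 15)));
}

/* Byte times 2^5: rotate right 3, then mask. */
uint8_t times32_b(uint8_t v) {
    return (uint8_t)(ror8(v, 3) & 0xE0);
}

/* Word times 2^13: rotate right 3, then mask. */
uint16_t times8192_w(uint16_t v) {
    return (uint16_t)(ror16(v, 3) & 0xE000);
}
```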
Wait. I said you can shift by immediate bit counts up to 8. 13 is more than
eight. But Motorola did leave us another way, so that we don't have to do
LSL.W #8,D0 ; First shift 8
LSL.W #5,D0 ; Then shift the rest.
Not that we wanted to shift a 16-bit word by 13 bits, but we can also do it in
one shift, using variable shift counts:
MOVEQ #13,D1 ; shift count
LSL.W D1,D0 ; 16 bit shift by count in D1
So there are several ways to do it. You'll want to look at op-code byte counts and cycle timing when there isn't any other reason to choose between them.
Backing up a bit to register-wide rotations, 8- and 16-bit wide rotate instructions are missing in the 6800, 6801, and 6809. You have to explicitly move the old carry bit(s) in when you need to do register-wide. The general approach is to make a copy and do two logical shifts, one in the desired direction and count, the other in the complementary count in the opposite direction, then bit-or the results together:
* accumulator-wide ROL by 3 / ROR by 5:
LDAA 0,X ; copy
LSL 0,X ; shift left by 3
LSL 0,X
LSL 0,X
LSRA ; shift right by 5
LSRA ; (use the faster shift accumulator)
LSRA
LSRA
LSRA
ORAA 0,X ; put results together
This works because logical shifts leave 0 bits behind in the vacated bits.
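In C, the two-shifts-and-OR approach looks like the sketch below. It is a model of the idea rather than of any particular instruction sequence, rol8_by_shifts is an invented name, and the count is assumed to be from 1 to 7.

```c
#include <stdint.h>

/* Rotate a byte left by n bits (1..7) using only logical shifts and OR.
   Rotating left by n is the same as rotating right by 8 - n. */
uint8_t rol8_by_shifts(uint8_t v, unsigned n) {
    uint8_t left  = (uint8_t)(v << n);         /* zeros fill the low n bits */
    uint8_t right = (uint8_t)(v >> (8 - n));   /* zeros fill the high 8-n bits */
    return (uint8_t)(left | right);
}
```

For example, $B1 rotated left by 3 is $88 OR $05, which is $8D.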
[JMR202501040801 addendum: (Shifting to 6800 on purpose here. We'll see why
below.) ]
The 6800, 6801, and 6809 do not have an instruction to OR the accumulators
together, but when you know the bits are set in only one result or the other,
you can add them together to get the same result:
* accumulator-wide ROL by 3 / ROR by 5:
LDAB 0,X ; copy
TBA
LSLA ; shift left by 3
LSLA
LSLA
LSRB ; shift right by 5
LSRB
LSRB
LSRB
LSRB
ABA ; put the results together
[JMR202501041102 correction:]
Instead of LSRB above, I had originally written RORB, which is, of course,
going to mix bits in a semi-arbitrary non-random way, thus, a bug. Erk.
[JMR202501041102 correction end.]
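That adding gives the same result as ORing here can be checked over all 256 byte values with a few lines of C. This is just a model of the arithmetic, with a made-up function name.

```c
#include <stdint.h>

/* Byte-wide rotate left by 3, putting the halves together with ADD. */
uint8_t rol3_with_add(uint8_t v) {
    uint8_t a = (uint8_t)(v << 3);   /* LSLA three times: low 3 bits are 0 */
    uint8_t b = (uint8_t)(v >> 5);   /* LSRB five times: high 5 bits are 0 */
    return (uint8_t)(a + b);         /* ABA: no bit is set in both, so no carries */
}
```

Because no bit position is set in both halves, the addition never generates a carry between bits, and so it cannot differ from OR.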
[JMR202501040801 addendum:]
(The 6809 does not even have an ABA instruction. You have to push one accumulator to a stack and add with post-increment. So you should OR with post-increment, so that you remember that ADDing was a substitute for ORing. It's easy to think it was unnecessary cost cutting, and I tend to lean toward the solution that the 6309's hidden instruction set used, of having the ADDR (and ORR) register-to-register instructions with arguments like TFR and EXG use. But I also recognize the awkwardness of providing an addition path between D and the index registers outside the LEA instructions. Whether or not you include a tertiary ALU, the data paths get pretty complex. With modern design tools, it's not so hard, but it was not all that easy with the tools they had in 1978.)
[JMR202501040801 addendum end.]
When it's a single bit left rotation, there's a neat trick:
ROLB8 LSLB ; 8-bit wide rotation
ADCB #0 ; move the carry in
Single bit right doesn't fare so well, however:
ROR8B LSRB
BCC ROR8BN
ORAB #$80
ROR8BN NOP ; next instruction
or, using the other accumulator to get the low bit in the carry first,
ROR8B TBA
LSRA ; get low bit in carry first
RORB
are the best I'm aware of.
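Both single-bit tricks can be modeled in a few lines of C, which is handy for checking that one undoes the other. The function names are hypothetical and the carry flag is just a local variable here.

```c
#include <stdint.h>

/* LSLB / ADCB #0: the bit shifted out comes back in at bit 0. */
uint8_t rol8_1(uint8_t b) {
    int carry = (b >> 7) & 1;        /* LSLB: bit 7 to carry */
    b = (uint8_t)(b << 1);
    return (uint8_t)(b + carry);     /* ADCB #0: add the carry in */
}

/* TBA / LSRA / RORB: a scratch copy supplies the low bit as carry. */
uint8_t ror8_1(uint8_t b) {
    int carry = b & 1;                            /* TBA, LSRA */
    return (uint8_t)((b >> 1) | (carry << 7));    /* RORB */
}
```

The ADC trick works because bit 0 is always 0 after the LSL, so adding the carry can never ripple upward.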
What about 16-bit and 32-bit integer rotation? I think we have enough here to
figure them out on the 6800, 6801, and 6809.
And Motorola gave us another way to optimize 32-bit shifts and rotates on the 68000:
SWAP D0 ; Effectively a 16-bit rotate on D0!
SWAP exchanges the low and high halves of the data register. It's a 16-bit instruction and only takes four cycles, so it's a good way to avoid 2 cycles per bit for really long shifts and rotates. For instance, if we want to shift a 32-bit integer in D0 left by 20 bits, we can use SWAP to speed it up:
SWAP D0 ; rotate by 16 bits
CLR.W D0 ; mask out low half
LSL.L #4,D0 ; shift the remaining 4 bits.
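Here is a C model of that three-instruction sequence, in case you want to compare it against a plain 20-bit shift. The function names are invented for the sketch.

```c
#include <stdint.h>

/* SWAP Dn: exchange the high and low 16-bit halves. */
uint32_t swap_halves(uint32_t d) {
    return (d << 16) | (d >> 16);
}

/* Shift a 32-bit value left by 20 bits the SWAP way. */
uint32_t shl20(uint32_t d) {
    d = swap_halves(d);     /* SWAP D0: rotate by 16 bits */
    d &= 0xFFFF0000u;       /* CLR.W D0: drop what wrapped around */
    return d << 4;          /* LSL.L #4,D0: the remaining 4 bits */
}
```

Starting from $12345678, the steps give $56781234, then $56780000, then $67800000.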
This is all very interesting, but it doesn't really seem to be taking us any closer to multiplying by ten? Have we just gotten lost in shifting and masking? Is this all just shifty business?
Shifting bits is not something people do every day, so I'm trying to expose you to a lot of things that you can do shifting bits before we use them for the real stuff.
If there's something you really want to check in the above, make up your own test code and check it. (I may have made a mistake? ;-)
If you find mistakes, leave me a note in the comments, please.
Otherwise, hang on for the ride.
It'll make more sense in the chapter after next, where we do some fast multiplying by a few small constants that are not powers of 2.
In the next chapter, you can look at some (incompletely tested) demonstration code for the 6800.