joel's programming fun: 2024

Tuesday, December 24, 2024

ALPP 03-XX (16) -- Multiplying by Powers of Two (Shift Left)

I'm leaving this here for reference for a little while. Eventually, I plan to do a chapter on shifts, and most of this will be taken up there.

Multiplying by Powers of Two
(Shift Left)

(Title Page/Index)

Now that we've seen a little motivation and covered a little theory about multiplying by constants, it's time to look at multiplying by powers of two.

(But this chapter will also lean a bit more to theory than practice, even though there is practice. It also gets a little long. Please bear with me.)

Here's how to multiply a byte in accumulator B by 2 on the 6800, 6801, or 6809:


    	LSLB

And, of course, there's LSLA. And LSL n,X (indexed mode) or LSL <address> (extended mode) allows you to quickly multiply any byte in memory by two. It's faster in an accumulator, but direct shifts on memory avoid saving and storing whatever is in the accumulators.

Unfortunately, we can't use the abbreviated direct page addressing on the 6800/6801 to save a byte and a fetch cycle, but since the direct page is the lowest 256 bytes of memory, we can still shift operands there in extended mode.

(Incidentally, the DP mode on the 6809 actually uses more cycles than extended mode, darn it. Saves a byte and is as fast as indexed mode, anyway.)

(As another aside, in the 6805 microcontroller, which is sort of a half a 6800, direct page mode is provided for the read-modify-write instructions -- including shifts -- instead of extended mode. And you get the speedup. This was definitely a good trade-off for the 6805.)

Why logical shift left instead of arithmetic shift left? Actually, on Motorola's 68XX and 680XX series microprocessors, ASL is a synonym of LSL We didn't see a good reason to make a distinction, and Motorola and other companies didn't. We'll talk about that more when we get to division by constants.

On the 68000, we can multiply a byte in D0 by two with


    	LSL.B	#1,D0

I'll save the full addressing modes discussion for later, but the 68000 allows logical shifts of any width on all data registers D0-D7.

However, byte-width shifts on operands in memory are not provided.

So how about a 16-bit integer?

On the 6800 and 6809, for two bytes together in the accumulators, with the high byte in A, it's


    	LSLB	; less significant byte in B
	ROLA	; more significant byte in A

This works on the 6801, too, of course. But on the 6801, we also have


    	LSLD	; high in A, low in B

There is no LSLD on the 6809.

On the 6800, it's only a matter of convention to put the more significant byte in A, and often the convention has been reversed in existing code. On the 6801 and 6809, you can still change or reverse the convention, but the ability to treat the accumulator pair, A:B, as the single double accumulator D means you usually want to follow the double-accumulator convention.

Now you can combine those with bytes in memory if you need to for some reason. For instance, if you need to keep a counter in A, you can keep the more significant byte on the stack and reference it by indexed mode, as below, assuming the parameter stack pointer PSP is in X:


    	LSLB	; less significant byte
	ROL	0,X	; more significant byte on stack (X has PSP)

On the 68000, we can multiply sixteen bits in D0 by two with


    	LSL.W	#1,D0

But I already said that, didn't I?

However (Surprise?), the 68000 does provide single-bit shifts on 16-bit wide word only operands in memory, accessed via (most) normal indexing or absolute modes.


    	LSL.W	(A6)	; shift top 16-bit word of parameter stack 1 bit left.

This was perhaps because the 68000 instruction set was originally designed for 16-bit wide memory designs. (I think they thought it was an optimization, and it probably was at the time.) Again, we'll look at this more later.

Four-byte wide shifts? Just in case it's not clear, I'll show you one way you might want to do it on the 6800, again assuming PSP in X and the more significant bytes on top of stack:


    	LSLB	; least significant byte
	ROLA	; next less significant byte
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack

On the 6801, when we use the accumulators for 32-bit shifts, we probably want to use the double shift where we can, on the first shift:


    	LSLD	; least significant 16 bits
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack

This allows us to grab the carry from the double shift and pass it into the following rotate instructions, which only exist in byte form on the 6801.

On the 6809, assuming we are using U for the parameter stack, and that the more significant bytes are the topmost on the stack, it would be


    	LSLB	; least significant byte
	ROLA	; next less significant byte
	ROL	1,U	; next more significant byte on stack
	ROL	,U	; most significant byte on stack

And on the 68000, as I have already said, we can multiply 32 bits in D0 by two with


    	LSL.L	#1,D0

32-bit CPUs are nice, aren't they?

What if you need to do shift a 32-bit wide integer on top of stack without bringing it into a register for some reason? Since there is no 32-bit version of LSL for operands in memory, you'll need to do this:


    	LSL.W	2(A6)	; shift less-significant 16-bit word 1 bit left, carry to C and X.
	ROXL.W	(A6)	; rotate X carry into more-significant 16-bit word, with 1-bit shift.

For the times you need it, it can be done.

And, incidentally, we see that the 68000 has something called an eXtended carry for extending 1-bit carries. The C carry is for branching, and the X carry is for extending. It's a bit weird, but useful.

What if you need to shift 64 bits left?

On the 8-bit CPUs, you'll start with LSL on the least significant byte, then perform 7 more ROLs preceding in order from less to most significant. Because at least 6 of the shifts will be in memory, it'll take at least 12 bytes of ROL instructions plus the one (6801) or two (6800, 6809) for the accumulator bytes.

I think I'd better show this doing it all in memory, and you can consider whether it would be worth bringing the least significant bytes into the accumulators:


    	LSL	7,X	; least significant byte (byte 7)
	ROL	6,X	; next less significant byte (byte 6)
	ROL	5,X	; next more significant byte (byte 5)
	ROL	4,X	; next more significant byte (byte 4)
	ROL	3,X	; next more significant byte (byte 3)
	ROL	2,X	; next more significant byte (byte 2)
	ROL	1,X	; next more significant byte (byte 1)
	ROL	0,X	; most significant byte on (byte 0)

On the 6801, if you already have the least significant two bytes in D, you could start it with a LSLD and save maybe three bytes. Or you could do LDD 6,X; LSLD and save one byte, maybe.

But if you do it without moving the least significant bytes into the accumulators, you can do the whole 8 bytes without touching A or B.

Let's look at trying to save bytes by making a loop. We'll assume X is already pointing to the least significant byte.

But if this isn't the case, we'll have to adjust X, and that will likely cost at least 8 bytes of code on a 6800. (A series of 8 INX instructions is less code than saving X in a DP variable, adding 8 to it using one or both accumulators, and then loading it back to X.) Or it will take at least 3 bytes of code on a 6801 (where it can be done with a LDAB #8 and an ABX.)

Here's the loop, for either 6800 or 6801:


    * Assume X pointing to least significant byte (not likely)
	LDAA	#7	; bytes to ROL
	LSL	0,X	; least significant byte starts with LSL
SHL64L	DEX		; carry not affected
	ROL	0,X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
* Ends with X pointing to most significant byte

Is it worth saving two bytes, max, to use the loop? Or costing one to seven extra bytes for the loop in code to pre-adjust X?

(Well, on the 6801, you may actually be able to shave a couple of INX instructions by shifting the least significant bytes in D. I'll let you calculate that out.)

The unrolled version is most likely better, although, if you need to do this kind of shift on a variable number of bytes, you have now seen a bit of code to start from.

The loop form can improve on the 6809, because of the index math the 6809 provides. But there are other approaches, and you may need to choose between code size and speed for things like this on the 6809.

Hmm. I think I'd better show the loop and index adjustment on the 6809.

[JMR202412311146 addenda part 1:]

But first I want to show you another way, that uses accumulator offset:

[JMR202412311146 addenda part 1 end.]


    * The 64 bit number to shift is on the parameter stack (U)
	LDA	#6	; bytes to ROL minus 1
	LSL	7,U	; least significant byte
SHL64L	ROL	A,U	; next more significant byte
	DECA		; count down, carry not affected
	BPL	SHL64L	; do next (want to do 0, too)

Nine bytes for the loop, including the pointer math. There is another approach that uses auto-decrement and LEA, that should be similar in cycle and byte count, but this might be worth the three bytes saved in a really tight ROM or something.

[JMR202412311146 addenda part 2:]

This is one way the auto-decrement method could be mapped into the 6809:


    * The 64 bit number to shift is on the parameter stack (U)
	LDAA	#7	; bytes to ROL
	LEAX	A,U
	LSL	,X	; least significant byte starts with LSL
SHL64L	ROL	,-X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
* Ends with X pointing to most significant byte

[JMR202412311146 addenda part 2 end.]

For shifting 64 bits left by 1 on the 68000, it's nicely straightforward:


    	LSL.L	#1,D1	; less significant long word, carry to C and X
	ROXL.L	#1,D0	; rotate X carry into more-significant long word

I mentioned the eXtend carry above, and we can use it here, too. Doing it in memory only would use the 16-bit shift and three of the 16-bit rotates.

How about multiplying 8 bits by 4?

For the 6800, 6801, and 6809:


    	LSLB	; multiply by 2
	LSLB	; and again

The second time you multiply by 2, don't use ROL. That would be feeding the top bit back into the bottom, which would be a different operation. Each time you multiply by 2, you have to start with LSL, not a ROL. ROL is for catching the bits between bytes.

For the 68000, oh, fun!:


    	LSL.B	#2,D0	; times 2^2

One instruction and done! WHEEEEE!!!

For multiplying by 2ⁿ, shift n times. Just be aware that the more you shift without catching bits off the top, the more you lose bits.

What about multiplying 16 bits by 4? On the 8-bit processors, you have to repeat the entire chain of shifts to avoid losing bits between the bytes. On the 6800 and 6809, this is what you usually want:


    	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again
	ROLA	; catch the carry again

On the 6801,


    	LSLD	; multiply by 2
	LSLD	; and again

I hope this is enough that you can see how to multiply 32-bit integers by the constant 4 using shifts on the 8-bit processors. I could show them anyway as an excuse to talk about optimization, but we need to move on.

On the 68000, you can specify the number of bits to shift, up to 8, as the immediate argument to the shift, for all widths, byte, 16-bit, and full 32-bit. That means you can multiply a full 32-bit integer in a register by any power of 2, up to 2⁸, in a single instruction. The only cost is that each bit costs an additional couple of CPU internal cycles, but compared to the cost of fetching and encoding an additional instruction, it's not a great cost.

Just remember that if you shift a byte 8 bits, every bit in the byte gets shifted out. Thus


    	LSL.B	#8,D0	; times 2^8 -- WHEEEEE!?!?!???

is an expensive way to clear the lowest byte of D0.

Of course,


    	LSL.W	#8,D0	; 16 bits times 2^8

is one way to get the lowest byte in D0 up to the next higher byte position (bits 15 to 8). (That's the difference between .B and .W here.)

How about multiplying by 16? That's 2⁴, so it's 4 shifts.

As we've already noted, on the 68000, it's just one instruction. LSL.B or LSL.W or LSL.L and the operands are #4,Dn to shift Dn by 4 bits left.

On the 8-bit processors, if it's a byte value, that's 4 shifts on the same operand in a row, which is not too bad. 4 instructions, 4 bytes. One more byte than a JSR in extended mode.


    	LSLB	; multiply by 2
	LSLB	; again by 2 is by 4
	LSLB	; and again by 2 is by 8
	LSLB	; and again by 2 is by 16

If it's a 16-bit value, for the 6800 and 6809, it's 4 pairs of LSLB; ROLA in a row, 8 instructions in 8 bytes. That might be worth the performance hit of making it into a loop, depending on how tight memory is:


    	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again, to make it by 4
	ROLA	; catch the carry again
	LSLB	; and again, to make it by 8
	ROLA	; catch the carry
	LSLB	; and again, to make it by 16
	ROLA	; catch the carry again

vs., say, on the 6809 where it's easiest to construct the loop:


    MUL16W	LDB	#4
	PSHU	B
	LDD	1,U	; operand on parameter stack
MUL16WL	LSLB	; multiply by 2
	ROLA	; catch the carry
	DEC	,U
	BNE	MUL16WL
	LEAU	1,U
	STD	,U

But, as you can see, for the cost of setting up the loop, you might as well just do it.

On the other hand, that code easily turns into a general loop for shifting left n bytes, so don't forget it.

For the 6801, it's again,


    	LSLD	; multiply by 2
	LSLD	; and again, making it by 4
	LSLD	; and again, making it by 8
	LSLD	; and again, making it by 16

On the 6801, it won't be worth making a loop for 16-bit integers until we try to multiply by constants greater than 2⁸, which are pretty rare, really. And ...

Speaking of multiplying by 2⁸, if we are following the A:B is D convention,


    	TBA	; multiply by 256
	CLRB	; don't forget to zero out low byte.

does the trick for rotating by 8 bits. And you may not even need to explicitly do this -- may be good enough to just move it in memory.

While we're thinking about this, let's look at another way to multiply by 256 on the 68000 (or to get the lowest byte of Dn into bits 15 to 8:


    	MOVE.B	D0,-(A6)	; this does NOT work on A7!
	CLR.B	-(A6)
	MOVE.W	(A6)+,D0

Usually, you would not want to do it this way, because you're having to read and write memory, but there are times it can be useful.

The reason it doesn't work on A7? A7 absolutely must stay 16-bit aligned on the 68000, because of return addresses. So when you push a byte to A7 -- any MOVE.B Dn,-(A7) -- the 68000 magically double decrements A7 for you. And when you pop a byte from A7 -- any MOVE.B (A7)+,Dn -- it automatically double increments A7 for you.

(Don't get frustrated about this. As I have hinted before, you really don't want to store strings on the A7 stack.)

Multiplying by 32, 64, and 128 are also not nearly as common as by 2, 4, 8, and 16. It may be a better use of resources to just let them go to a general multiply routine (which, as I say, I will show you shortly).

Except, we do actually want to look at multiplying a byte by 2⁵, and, by implication, 2⁶, and 2⁷, because I can demonstrate now a little about how multiplying is dividing.

Multiply B by 128, leaving result in A:


    	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; bit 0 of B now in bit 7 of A

Note that A now has the less significant byte, and B has the more significant byte.

The following also works, but takes an extra byte of code while leaving A untouched, and loses any of the more significant bits than bit 0:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	ANDB	#$80	; chop off the lost, double-shifted high bits

[JMR202401010106 addendum:]

The above could be useful in saturation math, or if we know the input will only be 0 or 1.

[JMR202501071039 edit:]

~~If we want to keep all the bits that we might be losing, we would have to store the intermediate result away first, probably using A, after all:~~

If we want to start with saturation math, but keep all the bits that we would otherwise be losing, we might try to store the intermediate result away in like this:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bit (but ...)

But the high bits were already shifted too many times, so we have to un-shift them. Fortunately, the carry has not been altered by the TBA or AND instructions, so we can rotate the carry back in at several points. Taking the earliest point, we could do it like this:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ROLA		; bring the high bits back into position
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bit

But if we back up and take the copy first,


    	TBA		; make two halves
	LSRA		; bit 0 to carry, high bits
	RORB		; now to bit 7
	ANDB	#$80	; chop off the high bits

Does it seem like we did that above?

Back up and look at the code and see what's different and think about why the effect is the same -- when done correctly.

[JMR202501071039 edit end.]

[JMR202401010106 addendum end.]

It could also be interpreted as a divide, which we will talk about later.

Either way can be extended to multiply by 64 as in the following, leaving the result in A:


    	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; old bit 0 of B now in bit 7 of A
	LSRB	; old bit 1 of B to carry
	RORA	; old bit 1,0 of B now in bit 7,6 of A

Not touching A to multiply by 64:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order
	ANDB	#$C0	; chop off the remainder

And we can see that using A in byte inversion to multiply by 32 costs more bytes and cycles than not touching A to multiply by 32:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order, old bit 2 to carry
	RORB		; now to bit 7,6,5 in order
	ANDB	#$E0	; chop off the remainder

This works for 16-bit integers as well, but for multiplying by 2¹⁵, 2¹⁴, 2¹³, and so forth. We'll show multiplying 2¹³:


    	LSRB		; bit 0 to carry
	RORA		; now to bit 15, bit 8 to carry
	RORB		; bit 8 to bit 7, old bit 1 to carry
	RORA		; old bit 1 to bit 15 in order
	RORB		; old bit 9 to bit 7, old bit 2 to carry
	RORA		; old bit 2 to bit 15 in order
	RORB		; old bit 10 to bit 7 (old bit 3 to carry)
	ANDA	#$E0	; chop off the top bytes, ignore carry

Yes, that does lose 13 bits off the top. If you wanted them, you needed to allocate another couple of bytes to save them in.

Now, how does this play out on the 68000?

Nicely enough:


    	ROR.B	#3,D0	; 8-bit rotate wraps around!
	AND.B	#$E0,D0	; mask out remainder.

ROR and ROL are register-wide rotations -- 8-, 16-, and 32-bits wide. They copy the bit that wraps to the other end into both the eXtend carry and the test/branch Carry bits.

ROXR and ROXL are register-plus-carry rotations -- 9-, 17-, and 33- bits wide. They rotate the bit that comes out of the one end into the X carry, rotating the old X carry into the other end, and copy the new X carry into the C carry.

This is worth remembering to help shave cycle counts -- not so much for 5 bit shifts, but for 6 and 7 on bytes, yes. And, definitely, 3 bits right is faster than 13 bits left:


    	ROR.W	#3,D0	; 16-bit rotate wraps around!
	AND.W	#$E000,D0	; mask out remainder.

Wait. I said you can shift by immediate bit counts up to 8. 13 is more than eight. But Motorola did leave us another way, that we don't have to do


    	LSL.W	#8,D0	; First shift 8
	LSL.W	#5,D0	; Then shift the rest.

Not that we wanted to shift a 16-bit word by 13 bits, but we can also do it in one shift, using variable shift counts:


    	MOVEQ	#13,D1	; shift count
	LSL.W	D1,D0	; 16 bit shift by count in D1

So there are several ways to do it. You'll want to look at op-code byte counts and cycle timing when there isn't any other reason to choose between them.

Backing up a bit to register-wide rotations, 8- and 16-bit wide rotate instructions are missing in the 6800, 6801, and 6809. You have to explicitly move the old carry bit(s) in when you need to do register-wide. The general approach is to make a copy and do two logical shifts, one in the desired direction and count, the other in the complementary count in the opposite direction, then bit-or the results together:


    * accumulator-wide ROL by 3 / ROR by 5:
	LDAA	0,X	; copy
	LSL	0,X	; shift left by 3
	LSL	0,X
	LSL	0,X
	LSRA		; shift right by 5
	LSRA		; (use the faster shift accumulator)
	LSRA
	LSRA
	LSRA
	ORAA	0,X	; put results together

This works because logical shifts leave 0 bits behind in the vacated bits.

[JMR202501040801 addendum: (Shifting to 6800 on purpose here. We'll see why below.) ]

The 6800, 6801, and 6809 do not have an instruction to OR the accumulators together, but when you know the bits are set in only one result or the other, you can add them together to get the same result:


    * accumulator-wide ROL by 3 / ROR by 5:
	LDAB	0,X	; copy
	TBA
	LSLA	; shift left by 3
	LSLA
	LSLA
	LSRB	; shift right by 5
	LSRB
	LSRB
	LSRB
	LSRB
	ABA	; put the results together

[JMR202501041102 correction:]

Instead of LSRB above, I had originally written RORB, which is, of course, going to mix bits in a semi-arbitrary non-random way, thus, a bug. Erk.

[JMR202501041102 correction end.]

[JMR202501040801 addendum:]

(6809 does not even have an ABA instruction. You have to push it to a stack and add post-increment. So you should OR post-increment, so that you remember that ADDing was a substitute for ORring.

It's easy to think it was unnecessary cost cutting, and I tend to lean toward the solution that the 6309's hidden instruction set used, of having the ADDR (and ORR) register to register instructions with arguments like TFR and EXG use. But I also recognize the awkwardness of providing an addition path between D and the index registers outside the LEA instructions. Whether or not you include a tertiary ALU, the data paths get pretty complex. With modern design tools, it's not so hard, but it was not all that easy with the tools they had in 1978.)

[JMR202501040801 addendum end.]

When it's a single bit left rotation, there's a neat trick:


    ROLB8	LSLB		; 8-bit wide rotation
	ADCB	#0	; move the carry in

Single bit right doesn't fair so well, however:


    ROR8B	LSRB
	BCC	ROR8BN
	ORAB	#$80
ROR8BN	NOP	; next instruction

or, using the other accumulor to get the low bit in the carry first,


    ROR8B	TBA
	LSRA	; get low bit in carry first
	RORB

are the best I'm aware of.

What about 16-bit and 32-bit integer rotation? I think we have enough here to figure them out on the 6800, 6801, and 6809.

And Motorola gave us another way to optimize 32-bit shifts and rotates on the 68000:


    	SWAP	D0	; Effectively a 16-bit rotate on D0!

SWAP exchanges the low and high halves of the data register. It's a 16-bit instruction and only takes four cycles, so it's a good way to avoid 2 cycles per bit for really long shifts and rotates. For instance, if we want to shift a 32-bit integer in D0 left by 20 bits, we can use SWAP to speed it up:


    	SWAP	D0	; rotate by 16 bits
	CLR.W	D0	; mask out low half
	LSL.L	#4,D0	; shift the remaining 4 bits.

This is all very interesting, but it doesn't really seem to be taking us any closer to multiplying by ten? Have we just gotten lost in shifting and masking? Is this all just shifty business?

Shifting bits is not something people do every day, so I'm trying to expose you to a lot of things that you can do shifting bits before we use them for the real stuff.

If there's something you really want to check in the above, make up your own test code and check it. (I may have made a mistake? ;-)

If you find mistakes, leave me a note in the comments, please.

Otherwise, hang on for the ride.

It'll make more sense in the chapter after next, where we do some fast multiplying by a few small constants that are not powers of 2.

In the next chapter, you can look at some (incompletely tested) demonstration code for the 6800.

(Title Page/Index)

Monday, December 23, 2024

ALPP 03-XX (15) -- Numeric Output Conversion and Multiplying by Constants (Theory)

I'm leaving this here for reference, for the moment. Eventually, I intend to do a chapter including focus on shifts, and some of this will find place there.
You probably want to go here: https://joels-programming-fun.blogspot.com/2025/01/alpp-03-15-converting-numbers-output-input-multiplication-division-theory.html.

Numeric Output Conversion
and Multiplying by Constants
(Theory)

(Title Page/Index)

Now that we've debugged getting an input key from the ST's keyboard and outputting its ASCII code value in hexadecimal and binary on the 68000, a natural next step would be to learn how to parse numbers from the input.

But that will require multiplying and dividing by ten.

Why? Because we usually interact with numbers in decimal base -- radix base ten.

(Yeah, I'm not all that comfortable trying to remember the digits of π in hexadecimal or binary, either. And I'm not going to go out of my way to memorize those, particularly when I know how to get a computer to calculate them any time I need them, as in bc, using obase to set binary and hexadecimal output radix base:


    $ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
obase=2
4*a(1)
11.00100100001111110110101010001000100001011010001100001000110100101\
01
obase=16
4*a(1)
3.243F6A8885A308D2A

arctangent(1) is, of course, π/4. Yeah, if you're looking at the final digits above, the last byte that bc calculates in the scale you specify will be somewhat incorrect. The scale above is the default of 20 decimal digits when starting bc with the -l option.)

When you're working in binary, getting numbers in and out in decimal requires converting between binary and decimal, and converting between binary and radix base ten requires multiplying and dividing by ten.

To display each digit of a value in decimal going right, we have to divide by the largest power of ten we can, and convert the quotient to the ASCII code for that digit, repeating with the remainder until the remainder is less than ten.

Or we can work going to the left, by dividing by ten, converting the remainder to the ASCII code for that digit, repeating with the quotient until the quotient is less than ten.

Division, either way.

To input a decimal value from the keyboard, we get each digit in order, multiplying the accumulated value by ten before adding the digit we got, repeating until there are no more digits entered (or until the accumulated value overflows).

Or we can read all the digits first, count the number of digits, and multiply each digit by the appropriate power of ten as we go, and that also requires multiplication.

Multiplication, either way.

While we can do general multiplication and division on the 68000, we haven't really talked about it. And we haven't looked at how to synthesize multiplication and division on 8-bit CPUs that don't have them. I'll show you general routines for multiplication and division pretty soon, but I want to show why they work (and give you some clues about how to speed them up) by introducing multiplying and dividing by constants, which, by more than coincidence, can be useful in decimal input and output.

But we've already been multiplying and dividing by two and sixteen, haven't we?

Haven't we?

Let's look again at getting both binary and hexadecimal output. We need to understand what we are doing there.

When converting to binary from decimal or hexadecimal by hand, the usual approach (ignoring fractions) is


    Set the radix point (fraction/decimal point) on the right.
Do until all digits (bits) converted (until quotient is zero):
  Divide the number by 2, keeping both quotient and remainder.
  Write the remainder down as the next digit,
    going left from the radix point.
  Repeat with the quotient.

Now, even taking into account that this algorithmic description is rather loose, looking at what we were doing in the 6800 chapter, it looks different, doesn't it? (We were, in fact, converting to external binary from what we could call internal radix base 256. And that's not being absurd to say it that way, no.)

I mean, even ignoring the additional step of converting the remainder to a character for output, it's different. We were going left-to-right, and not even noticing the radix point until we were done, if then.

Let's look at the 6809 code for binary output again (since I think the 6809 code is easier to read):


    * Output a 0
OUT0	LDB	#'0
OUT01	PSHU	D
	LBSR	OUTC
	RTS
*
* Output a 1 
OUT1	LDB	#'1
	BRA	OUT01
* Rob code, shave a couple of bytes, waste a few cycles.
*
* Output the 8-bit binary (base two) number on the stack.
* For consistency, we are passing the byte in the low-order byte
* of a 16-bit word.
OUTB8	LDB	#8	; 8 bits
	STB	0,U	; Borrow the upper byte of the parameter.
OUTB8L	LSL	1,U	; Get the leftmost bit of the lower byte.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1
OUTB8D	DEC	,U
	BNE	OUTB8L	; loop if not Zero
	LEAU	2,U	; drop parameter bytes
	RTS

In modified human English, that's going to look like


    Do:
  Shift the bits left, capturing the bit carried off the top.
  Convert the captured bit to a character and
    write it down as the next digit,
    going right.
  Repeat until no bits remain to be converted.

Yep. Going the opposite direction. And the radix point just ended up where we stopped.

Completely backwards!

What's going on here?

You'll remember that I mentioned that shifting digits to the left (shifting the radix point to the right and filling with zeroes) is the same as multiplying by the radix.

You don't remember that I said that?

What did I say? Ah, here it is, in the chapter on hexadecimal output on the 6800:

... shifting is division and multiplication by powers of two. ...

a little before talking about moving the radix point in decimal numbers, which is the same as shifting decimal digits.

So, shifting bits to the left is multiplying by two. And shifting bits to the right is dividing by two.

And when we grabbed the bit that came off the high end into the carry, we were just grabbing the bits as the came off the top, right?

Here's how I want to see that. On the one hand we were multiplying by two. On the other hand, the top bit came off into the carry, and we grabbed it. So we were shifting right by 7 grabbing the quotient (from the carry).

Which is dividing by 2⁷ -- dividing by 128_ten.

This is because the byte is 8 bits, and the 8 bit register forms something mathematicians call a ring, which we aren't going to describe in detail because I don't want to put everyone to sleep.

But it's mathematics. We can rely on it once we understand it. Multiplication in a ring is division, and sometimes that is useful.

Now, we could do this:


    NUMBUF	RMB	34	; enough for 32 bits of output
*
CNVB8	TFR	DP,A	; point to the direct page
	CLRB
	TFR	D,Y
	LEAY	NUMBUF-LOCBAS,Y	; point to NUMBUF
	LEAY	9,Y	; start at the right
	CLR	,-Y	; NUL terminate it
	LDA	#8	; 8 bits
CNVB8L	LDB	#'0'	; ASCII '0'
	LSR	1,U	; Get the lowest bit into the carry
	ADCB	#0	; convert it to ASCII
	STB	,-Y	; build the string right-to-left
CNVB8D	DECA
	BNE	CNVB8L	; loop until counted out
	STY	,U	; return the address of the buffer
	RTS		; (this ought to work, anyway)

And that would be in the order of working right-to-left, and then we could take the address that CNVB8 returns and pass it off to OUTS, and print the number as a string.

But that would require an intermediate buffer (NUMBUF above), and I want to be able to output binary and hexadecimal without intermediate buffers. (And without explicit multiply and divide instructions.)

The intermediate buffer is where we set the radix point on the right so we can chop the less significant digits off first and write them down going to the left.

Intermediate buffers make debugging more difficult.

You can see that I would want to be able to output decimal numbers without intermediate buffers, too, right? Maybe?

Can it be done?

(If you aren't really following me, well, stick around for the ride. It does eventually begin to make sense. I think.)

We've seen that it can be done with hexadecimal. In the 6809 code we had


    ASC0	EQU	'0	; Some assemblers won't handle 'c constants well.
ASC9	EQU	'9
ASCA	EQU	'A
ASCXGAP	EQU	ASCA-ASC9-1	; Gap between '9' and 'A' for hexadecimal
*
* Mask off and convert the nybble in B to ASCII numeric,
* including hexadecimals
OUTH4	ANDB	#$0F	; mask it off
	ADDB	#ASC0	; Add the ASCII for '0'
	CMPB	#ASC9	; Greater than '9'?
	BLS	OUTH4D	; no, output as is.
	ADDB	#ASCXGAP	; Adjust it to 'A' - 'F'
OUTH4D	CLRA
	STD	,--U
	LBSR	OUTC
	RTS
*
* Output an 8-bit byte in hexadecimal,
* byte as a 16-bit parameter on PSP.
OUTHX8	LDB	1,U	; get the byte
	LSRB
	LSRB
	LSRB
	LSRB
	BSR	OUTH4
	LDB	1,U
	BSR	OUTH4
	LEAU	NATWID,U
	RTS

You can see we were capturing the high nybble (four bits) by shifting left four times, right?

Shifting right four is the same as shifting left four, capturing as we go, ~~right~~ correct? (Sorry about that.)

And, rather than shifting the low four back into place, we can grab the top bits from a copy, and then grab the bottom bits from the original, masking the top bits off.

Multiplying by sixteen in an 8-bit ring is dividing by sixteen in the 8-bit ring (within the conditions of the ring).

With a little thought, we could figure out how to output the character code in radix base 4 (quarternary) or 8 (octal) by this method, as well. Any power of 2 would be just a matter of shifting the bits appropriately and capturing and adjusting the resultant output value to a symbol that represents the output value.

One problem with octal base -- radix base eight -- is that 8 is 2³, which requires 3 bits per octal digit. And, where we can fit 2 hex digits exactly in an 8-bit byte, or four quaternary digits exactly, octal digits end up with a bit left over. Two bits, I mean. (Sorry. ;)

Which means the left-most digit when converting a byte has only two bits, and can only range between 0 and 3 instead of 0 and 7. And you have to account for that as you convert.

Which means that, converting a byte from left-to-right, when you start by shifting a digit off the top, you have to shift only 2 bits for the first digit. Instead of multiplying by 8 (which is dividing by 2⁵, or dividing by 32) the first time, you have to multipy by 4 (divide by 2⁶, or 64) to get that first octal digit on the left.

Thinking about this, 64 is the maximum power of 8 that fits in a byte. 2⁹, 512, does not fit in a byte. Keep this in mind.

And if you're converting a byte to octal from right to left, you still have to remember to only shift twice on the last shift.

How about base ten, then? Can we do something like this with base ten?

Octal cuts binary integers up three bits per octal digit. Hexadecimal cuts it up into four bits per hex digit. How many bits per decimal digit? It's clearly not a whole number of bits, and we don't know how to shift by anything but whole numbers of bits, so it doesn't look hopeful.

As a digression (8-o), say you encode decimal in four bits per decimal digit. Could this work?

Let's see.

In hexadecimal, we can record a digit from 0_sixteen to F_sixteen in four bits. So, what if we decide to only record digit values 0 through 9? It's a little bit wasteful, but it's enough to encode a decimal digit in four bits.

Let's see it:

Yep, it can be encoded.

This is binary coded decimal, or BCD.

But, 10011001_two ($99 => 99_sixteen) is (128 + 16 + 8 + 1), which is equal to 153_ten.

Where 10011001_BCD (also $99) is 99_ten.

Eaaaoooooohhh confusion!

It turns out you can add and subtract directly in BCD, although it takes an extra step or so to handle carries correctly. And, of course, you can multiply and divide, and the algorithms look like what you'd do by hand. And of course shifting left one BCD digit (four bits) at a time does work out to multiplying by ten, and shifting right one BCD digit at a time works out to dividing by ten.

But it's a bit wasteful of bits.

In BCD, you can encode numbers from 0 to 99,999,999 in four bytes (eight nybbles), or 32 bits.

0 to 99,999,999 in binary (00000000000000 to 101111101011110000011111111) requires only 27 bits, which fits in just less than six and a half bytes (one bit short of seven nybbles).

Okay, it's not really all that wasteful. (In fact, if I understand correctly, my favorite multi-precision command-line *nix tool, bc, demonstrated again above, operates in BCD.)

But now we have issues when we want to convert BCD to binary, so that we can use numbers as addresses and such. It just shifts the problems around. (In bc, we usually aren't working with addresses, by the way.)

Yes, we'll have to talk about BCD at some point.

The purpose of the digression was to try to give you a little more space and perspective before we dig into multiplying by shifts and adds.

This theoretical stuff is getting long. Let's look at some actual code to multiply by constant powers of 2.

(Title Page/Index)

Wednesday, December 18, 2024

ALPP 03-XX -- Radix Output

False start, kept for reference.

Radix Output

(Title Page/Index)

Now that we've debugged getting a key from the ST's keyboard and outputting its ASCII code value in hexadecimal and binary on the 68000, a natural next step would be to learn how to parse numbers from the input. But that will require multiplying and dividing by ten.

Why? Because we usually interact with numbers in decimal base -- radix base ten.

While we can do that on the 68000, we haven't really talked about it, and we haven't looked at how to synthesize multiplication and division on 8-bit CPUs that don't have them.

So, instead of going directly to parsing numbers, I want to look at multiplication and division, at least enough to be able to multiply and divide by ten.

But we've already been multiplying and dividing by two and sixteen, haven't we?

Haven't we?

Let's look again at getting both binary and hexadecimal output. We need to understand what we are doing there.

When converting to binary from base ten by hand, the usual approach (ignoring fractions) is


    Set the radix point (fraction/decimal point) on the right.
Do until all digits (bits) converted (until quotient is zero):
  Divide the number by 2, keeping both quotient and remainder.
  Convert the remainder to a character and 
    write it down as the next digit,
    going left from the radix point.
  Repeat with the quotient.

Now, even taking into account that this algorithmic description is rather loose, looking at what we were doing in the 6800 chapter, it looks different, doesn't it?

We were going left-to-right, and not even noticing the radix point until we were done, if then.

Let's look at the 6809 code again (since I think the 6809 code is easier to read):


    * Output a 0
OUT0	LDB	#'0
OUT01	PSHU	D
	LBSR	OUTC
	RTS
*
* Output a 1 
OUT1	LDB	#'1
	BRA	OUT01
* Rob code, shave a couple of bytes, waste a few cycles.
*
* Output the 8-bit binary (base two) number on the stack.
* For consistency, we are passing the byte in the low-order byte
* of a 16-bit word.
OUTB8	LDB	#8	; 8 bits
	STB	0,U	; Borrow the upper byte of the parameter.
OUTB8L	LSL	1,U	; Get the leftmost bit of the lower byte.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1
OUTB8D	DEC	,U
	BNE	OUTB8L	; loop if not Zero
	LEAU	2,U	; drop parameter bytes
	RTS

In human language, that's going to look like


    Do:
  Shift the bits left, capturing the bit off the top.
  Convert the captured bit to a character and
    write it down as the next digit,
    going right.
  Repeat until no bits remain to be converted.

Yep. Going the opposite direction. And the radix point just ended up where we stopped.

Completely backwards!

What's going on here?

You'll remember that I mentioned that shifting digits to the left (shifting the radix point to the right and filling with zeroes) is the same as multiplying by the radix.

You don't remember that I said that?

What did I say? Ah, here it is, in the chapter on hexadecimal output on the 6800:

... shifting is division and multiplication by powers of two. ...

a little before talking about moving the radix point in decimal numbers, which is the same as shifting decimal digits.

So, shifting bits to the left is multiplying by two. And shifting bits to the right is dividing by two.

And when we grabbed the bit that came off the high end into the carry, we were just grabbing the bits as the came off, right?

Here's how I want to see that. On the one hand we were multiplying by two. On the other hand, the top bit came off into the carry, and we grabbed it. So we were shifting left bey7.

Which is dividing by 2⁷, dividing by 128.

This is because the byte is 8 bits, and the 8 bits form something mathematicians call a ring, which we aren't going to describe in detail because I don't want to put everyone to sleep.

But it's mathematics. We can rely on it. Multiplication in a ring is division, and sometimes that is useful.

Now, we could do this:


    NUMBUF	RMB	34	; enough for 32 bits of output
*
CNVB8	TFR	DP,A	; point to the direct page
	CLRB
	TFR	D,Y
	LEAY	NUMBUF-LOCBAS,Y	; point to NUMBUF
	LEAY	9,Y	; start at the right
	CLR	,-Y	; NUL terminate it
	LDA	#8	; 8 bits
CNVB8L	LDB	#'0'	; ASCII '0'
	LSR	1,U	; Get the lowest bit into the carry
	ADCB	#0	; convert it to ASCII
	STB	,-Y	; build the string right-to-left
CNVB8D	DECA
	BNE	CNVB8L	; loop until counted out
	STY	,U	; return the address of the buffer
	RTS		; (this ought to work, anyway)

Now we can take the address that CNVB8 returns and pass it off to OUTS, and print the number as a string.

*********

With a little thought, we could figure out how to output the character code in base four or eight, as well. Any power of 2 would be just a matter of shifting the bits appropriately and adjusting the resultant value to a symbol that represents the value.

For any base ten or less, the adjustment is really straightforward in ASCII-based characters -- just adding the value to the ASCII-based character code value of '0'. (This works for UNICODE, too, but we won't be going there.)

For any base up to sixteen, if the resultant symbol exceeds the ASCII value of '9', we further add one less than the difference between the ASCII for 'A' and the ASCII for '9'. Or we can test the value first and add the appropriate adjustment in one step:

ASCII for '0' if less than or equal to 9
and ASCII for ten less than 'A' if greater than 9.

We saw the former method in the 6800 code for hexadecimal output:


    ASC0	EQU	'0	; Some assemblers won't handle 'c constants well.
ASC9	EQU	'9
ASCA	EQU	'A
ASCXGAP	EQU	ASCA-ASC9-1	; Gap between '9' and 'A' for hexadecimal
*
* Mask off and convert the nybble in B to ASCII numeric,
* including hexadecimals
OUTH4	ANDB	#$0F	; mask it off
	ADDB	#ASC0	; Add the ASCII for '0'
	CMPB	#ASC9	; Greater than '9'?
	BLS	OUTH4D	; no, output as is.
	ADDB	#ASCXGAP	; Adjust it to 'A' - 'F'
	...

This would also work as it is for a radix higher than sixteen, if we accept the approach usually taken in radix eleven through sixteen and continue with it, up to base 36 (highest valued digit 'Z').

There are reasons we may not want to do that, but it could be done.

Anyway, we know we can handle the adjustment in the cases that interest us most.

Now, let's look again at how we got each digit.

For binary, it was easy. Shift a bit off the left (high-order) end of the binary integer and convert it to ASCII '0' or '1':


    * Output a 0
OUT0	LDAB	#'0
OUT01	JSR	PPSHD
	JSR	OUTC
	RTS
*
* Output a 1 
OUT1	LDAB	#'1
	BRA	OUT01
* :::
OUTB8L	LSL	1,X	; Get the leftmost bit.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1

For hexadecimal, it may not be quite as clear that was what we were doing -- shifting a digit's worth of bits off the left, capturing them, and converting them to ASCII:


    	LDAB	1,X	; get the byte
	LSRB		; move the hexadecimal digit into place
	LSRB
	LSRB
	LSRB
	BSR	OUTRAD	; convert to ASCII and output
	LDX	PSP
	LDAB	1,X
	ANDB	#$0F	; mask the high digit off
	BSR	OUTRAD	; convert to ASCII and output

Say WHAT?!?!? Those are right shifts! And then no shifts! just a bit-AND to mask off the ...

Yeah, it would have been a little bit more plain like this:


    	CLRB		; ready to capture high four bits
	LSL	1,X	; get high bit off top 
	RORB		; capture it
	LSL	1,X	; get next bit off top 
	RORB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	RORB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	RORB		; shift over and capture it
	BSR	OUTRAD
	CLRB		; ready to capture next four bits
	LSL	1,X	; get next bit off top 
	RORB		; capture it
	LSL	1,X	; get next bit off top 
	RORB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	RORB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	RORB		; shift over and capture it
	BSR	OUTRAD

If you can't see from just reading the code that the result in B and the output is the same, go ahead and substitute these lines of code into the code for chapter 03-05 and trace through it, watching the bits shift around.

In either method, we shift the high four bits of the byte we're putting out in order, into the low four bits of B.

Then, in the method above, we shift the low four bits back into the low four bits where they came from. But in the method of chapter 03-05 way we just leave them there and mask the high bits off.

If you think of the byte register as a ring of 8 bits, you might be able to see the bits coming back around.

There's another way of looking at it. In chapter 03-05, we noted that shifting digits left one column is the same as multiplying by the radix.

Shifting a decimal number left (by adding a zero to the right and moving the decimal fraction point to the right of the added zero) is the same as multiplying by ten.

Shifting a binary number left one bit is the same as multiplying it by two.

Shifting a hexadecimal digit left by one digit is the same as multiplying by sixteen. Or, shifting a binary number left by four bits is the same as multiplying by sixteen.

How about shifting right? It's the same as dividing by the radix. We'll look at that in a bit.

So, back to thinking about output in base 4.

If we want to output in base four, we can shift two bits left, capturing and outputting each pair as we go. Do it four times and we've got the byte output in quaternary base.

How about octal? If we shift three bits, then three more, we've only got two left, and that doesn't work. So what we should have done is recognize that we only had 8 bits to shift and only shifted two bits to start, then shifted three and three.

Why?

It's helpful to note that or FF_sixteen (all bits 1, 255_ten) is 377_eight. That's how you wright the maximum value of an 8-bit byte in base eight. And the high digit of that is 3, which only takes two bits in binary. So it makes sense that you would only shift off two bits for the first digit.

Now, if you were doing two bytes at once, that would be five sets of three bits and one bit for the most significant digit. 177777_eight. is how you write FFFF_sixteen (65535_ten), the maximum value of a sixteen-bit number, for octal. For binary, it's sixteen digits: 11111111_two. For quaternary, it's eight digits: 33333333_four.

So we are getting some ideas how output in base four or base eight would work, and how to output sixteen bit values in any radix base that is a power of two. It's just shifting.

But base ten doesn't work like this.

Why?

Let me take you on a short detour through something called binary-coded decimal (BCD).

In hexadecimal, we can record a digit from 0_sixteen to F_sixteen in four bits, right?

Well, what if we decide to only record digit values 0 through 9? It's a little bit wasteful, but it's enough to encode a decimal digit in four bits.

Let's see it:

Yep, it can be done.

But, 10011001_two (99_sixteen) is (128 + 16 + 8 + 1) equal to 153_ten.

Where 10011001_BCD is 99_ten. Eaaaoooooohhh confusion!

But maybe we can see that shifting a BCD number four bits to the left is multiplying by ten? Maybe?

It is. We can play with that later. Let's set BCD aside for a moment.

The point is that, where we can divide binary numbers into fields of n bits for any radix base 2ⁿ, and we can even do something like that for binary coded decimal, trying to divide a straight binary number into fields of radix base ten is going to have us trying to use fractions of bits.

And we don't now how to do that.

I don't think anyone knows a good, simple way to do it, other than repeatedly dividing by ten, which isn't very simple in binary (which is why this chapter is so long).

Dividing is shifting right. Right? (Sorry.)

It is.

I pointed out that shifting left 1 bit is the same as shifting right 7? Well, if you capture the bits correctly, anyway.

I'm going to use 6809 code for this example instead of 6800, because we can focus a bit better on what we are doing without having a lot of DEX instructions getting in the way.

Here's what we did for the 6809 binary output in chapter 03-03:


    OUTB8	LDB	#8	; 8 bits
	STB	0,U	; Borrow the upper byte of the parameter.
OUTB8L	LSL	1,U	; Get the leftmost bit of the lower byte.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1
OUTB8D	DEC	,U
	BNE	OUTB8L	; loop if not Zero

Instead of shifting out the top and capturing the carry (multiplying by two and capturing the overflow), and writing the number left-to-right, let's divide by two and build the output string for the number right-to-left:


    NUMBUF	RMB	34	; enough for 32 bits of output
*
CNVB8	TFR	DP,A	; point to the direct page
	CLRB
	TFR	D,Y
	LEAY	NUMBUF-LOCBAS,Y	; point to NUMBUF
	LEAY	9,Y	; start at the right
	CLR	,-Y	; NUL terminate it
	LDA	#8	; 8 bits
CNVB8L	LDB	#'0'	; ASCII '0'
	LSR	1,U	; Get the lowest bit into the carry
	ADCB	#0	; convert it to ASCII
	STB	,-Y	; build the string right-to-left
CNVB8D	DECA
	BNE	CNVB8L	; loop until counted out
	STY	,U	; return the address of the buffer
	RTS		; (this ought to work, anyway)

Now we can take the address that CNVB8 returns and pass it off to OUTS, and print the number as a string.

And we could take the same approach with the hexadecimal conversion, dividing by sixteen -- shifting four bits right and capturing them in order -- and converting and storing right-to-left.

(But we would actually make a copy, mask the high bits out, convert and store, then divide by sixteen for the next digit, because it's quicker that way. But we will ignore the optimization.)

If we think about it, when we convert from decimal to binary or hexadecimal by hand, that's the way we do it. We divide by the base we are converting to, capture the remainder and write that, writing from right-to-left. And it works from decimal to binary or hexadecimal. Or any radix base to any radix base.

Why not do that in the first place?

Several reasons. One is that it's useful to be able to get numbers in and out without sending them through a conversion string buffer. Another is that shifting bits in registers and memory is one of the more useful things you can learn about, especially for assembly language. Yet another is, well, ...

Now you know that dividing and multiplying by powers of two is easy, right?

On the 68000, we have general multiply and divide, at least for 16 bits.

On the 6809 and 6801, we have byte multiply to 16 bits. No divide. (Multiply is much easier than divide.)

On the 6800 we have no multiply and no divide.

We are going to have to synthesize some multiplication and division. Also, even on the 68000, multiplication and division cost more than shifts in CPU cycle counts.

It would be nice to be have a quick way to multiply and divide by constants other than powers of 2, wouldn't it? Especially by ten?

Why, yes, it would. Let's do it. Multiplication is easier. Let's do some middle-school algebra:

10X == 2(5X)
5X == (4 + 1)X => 10X == 2((4+1)X)
(4 + 1)X == 4X + X => 10X == 2(4X + X)

4X == 2(2X) => 10X == 2(2(2X)+X)

Let's build that up from adds and shifts:


    *
MUL10	LDD	,U	; X
	CLR	,-U	; for overflow (parameter 1 off)
	CLR	,-U	; 16 bits (parameter 2 off)
	ASLB		; 2X
	ROLA
	ROL	1,U
	ASLB		; 2(2X)
	ROLA
	ROL	1,U
	ADDD	2,U	; 2(2X)+X
	BCC	MUL10N
	INC	1,U
MUL10N	ASLB		; 2(2(2X)+X) == 10X
	ROLA
	ROL	1,U
	STD	2,U
	RTS

(Title Page/Index)

Sunday, December 15, 2024

ALPP 02-35 -- Tentative Op-code Map of RK0801 CPU (Extension of M6801)

One final bit of treasure from the bottom of the pool.

Tentative Op-code Map of
RK0801 CPU
(Extension of 6801)

(Title Page/Index)

This is a tentative op-code map of extensions to the 6801 CPU that I think would make it significantly more efficient without blowing the semiconductor real estate (gates) budget for an 8-bit CPU core, from some older ideas I've had for a while (direct page unaries and SBX) and some new ideas suggested by the addressing math and stack frame examples:

New in this map:

SBX: Subtract B from X corollary to existing ABX. This optimizes small-to-medium allocations where size is not known at compile/assemble time, also helps when following relative links around.
(Adding an op-code to add D to X might be another possibility, but would require sign-extending into A.)
Add signed Immediate byte to X and S, ADIX/ADIS. This optimizes small-to-medium stack and other allocations where size is known at compile/assemble time.
(Add and Subtract unsigned byte immediate is an option, but requires more op-codes in the very tight primary op-code table. Add 16-bit immediate is yet another option, but is less efficient with code size, enough so as to make the most common case, add plus or minus 2, meaningless.)
(Considered dropping INX/DEX and INS/DES, but that de-optimizes byte string operations.)
Direct-page versions of unary/read-modify-write byte instructions,
- NEG (NEGate byte),
- COM (bit COMplement byte),
- LSR (Logical Shift Right byte),
- ROR (ROtate Right byte through carry)
- ASR (Arithmetic Shift Right byte, copying sign),
- ROL (ROtate Left byte through carry),
- DEC (DECrement byte),
- INC (INCrement byte),
- TST (TeST byte),
- CLR (CLeaR byte).
(These are, really, more appropriate in direct-page mode than in extended mode, to provide effective pseudo-registers.)
(Also, it might be useful to provide address function code outputs that distinguish between direct page and extended mode, providing an effective separate address space for pseudo-registers and I/O, with all addressing modes enabled on it.)
16-bit read/modify/write instructions:
- DINC, (Double-byte INCrement)
  (including INCD, INCrement Double accumulator),
- DDEC (Double-byte DECrement)
  (including DECD, DECrement Double accumulator),
- DASL (Double-byte Arithmetic Shift Left)
  (including ASLD, Arithmetic Shift Left Double accumulator),
- DLSR (Double-byte Logical Shift Right)
  (including LSRD, Logical Shift Right Double accumulator).
(DASL and DLSR are moved from their position in the 6801 map to the corresponding position in the new 16-bit ranks.)
(16-bit increment and decrement in the direct page will be especially helpful for software stacks.)
JMP to direct-page target (not in 6801 op-codes).

Adding the FDIV and IDIV instructions that the 68HC11 has would be fun, but would likely shoot the gates budget. Likewise, adding the 68HC11's bit testing and manipulation instructions or an additional stack register would require using pre-bytes, and I don't want to do that, either.

Instead of moving the op-codes around, the missing op-codes could be squeezed into empty codes in the 6801 map, but that would require gates that could be used for something else.

Using a pre-byte and putting the direct page op-codes in a second op-code map would partially erase the advantage of direct-page op-codes.

Left half of the op-code table:

Mnemonic
	UNARY		BRANCH		UNARY
	**ACCA	**INH	REL	INH	**ACCB	*Dir	Ind	Ext
	0	1	2	3	4	5	6	7
0	NEG	***CBA	BRA	TSX	NEG	NEG
1		NOP	BRN	[INS]	INCD	*DINC
2		***SBA	BHI	PULA	DECD	*DDEC
3	COM	***ABA	BLS	PULB	COM	COM
4	LSR	***TAB	BCC	[DES]	LSR	LSR
5		***TBA	BCS	TXS	**ASLD	*DASL
6	ROR	TAP	BNE	PSHA	ROR	ROR
7	ASR	TPA	BEQ	PSHB	ASR	ASR
8	ASL	[INX]	BVC	PULX	ASL	ASL
9	ROL	[DEX]	BVS	RTS	ROL	ROL
A	DEC	CLV	BPL	ABX	DEC	DEC
B		SEV	BMI	RTI	***LSRD	*DLSR
C	INC	CLC	BGE	PSHX	INC	INC
D	TST	SEC	BLT	MUL	TST	TST
E	***DAA	CLI	BGT	WAI	*SBX	*JMP
F	CLR	SEI	BLE	SWI	CLR	CLR
	*Not in 6801					*No JMP dp in 6801
	**Moved in 2801
	***Both row and column moved.

Right half of the op-code table:

Mnemonic
	BINARY
	ACCA				ACCB
	Imm	Dir	Ind	Ext	Imm	Dir	Ind	Ext
	8	9	A	B	C	D	E	F
0	SUB
1	CMP
2	SBC
3	SUBD				ADDD
4	AND
5	BIT
6	LDA
7		STA				STA
8	EOR
9	ADC
A	ORA
B	ADD
C	CPX				LDD
D	BSR	JSR				STD
E	LDS				LDX
F	*~ADIS	STS			*~ADIX	STX
	*Not in 6801
	*~ADIS and ADIX are signed byte constant

Expanding the address map via segment registers or widened address registers is tempting, but I'm thinking to simply be satisfied with two additional address function outputs to allow distinction between

code space (PC relative),
return address stack space (S relative),
direct page space (DP mode),
general data (everything else).

Four address spaces won't really even double available address space because of issues in indexing and hard space separation, but it will make it possible to reach or somewhat exceed full 64 K addressing.

On the other hand, it would not be hard to give the '0801 widened X and PC and maybe S, or segment registers for two, three or all four of the above address spaces or something similar. If segment registers, I would want to use either full-width segment registers, or have the segment registers offset a full byte. None of that 4-bit offset wamby-pamby.

Further extensions, such as a second stack and widened address registers, and the Y register, bit operators, and IDIV and FDIV from the 'HC11, would warrant another part number, say 2801, the "2" indicating two stacks.

Or a second 16-bit accumulator, such as the 6309 has, would make it a 16-bit CPU, so maybe 21601. But borrowing from the 6309 tends to point to the idea that, beyond a certain point, we'd want to move up to an extended derivative of the 6809.

Well, I don't think I have anything more for this rabbit hole at this moment, so you can return to the irregularly scheduled assembly language tutorial, continuing with getting numbers output.

(Title Page/Index)

Sunday, December 8, 2024

ALPP 02-34 -- Ascending the Right Island -- Frameless Examples (Single- & Split-stack): 68000

This should be the final bits of treasure I have to drag up from the bottom of the pool before I get back to I/O.

Ascending the Right Island --
Frameless Examples (Single- & Split-stack):
68000

(Title Page/Index)

Having worked through the 6809 versions of the stack frame example code with the stack frame code stripped out, let's continue full circle and look at the 68000 version with the stack frame code stripped out.

Again, there is not a whole lot to say here that can't be seen fairly easily in the code. Let's start with the single-stack no-frame code, comparing it with the single-stack stack frame code for the 68000 and the same for the 6809 in a new window. You'll want to assemble this and trace through it, stopping appropriately to look at the registers, stack, and memory.


    	OPT LIST,SYMTAB	; Options we want for the stand-alone assembler.
	MACHINE MC68000	; because there are a lot the assembler can do.
	OPT DEBUG	; We want labels for debugging.
	OUTPUT
***********************************************************************
*
* 16-bit addition as example of single-stack no-frame discipline on 68000
* with test code
* Joel Matthew Rees, October, November 2024
*
NATWID	EQU	4	; 4 bytes in the CPU's natural integer
HLFNAT	EQU	2	; half natural integer
*
*
	EVEN
LB_ADDR	EQU	*
ENTRY	BRA.W	START
	NOP		; A little buffer zone.
	NOP
A4SAVE	DS.L	1	; a place to keep A4 to A7 so we can return clean
A5SAVE	DS.L	1	; using it as pseudo-DP
A6SAVE	DS.L	1	; 
A7SAVE	DS.L	1	; SP
	DS.L	2	; gap
HPPTR	DS.L	1	; heap pointer (not yet managed)
HPALL	DS.L	1	; heap allocation pointer
	DS.L	2	; gap
FINAL	DS.L	1	; unused statically allocated variable
GAP1	DS.L	51	; gap, make it an even 256 bytes.
*
	DS.L	2	; a little bumper space
SSTKLIM	DS.L	16*8	; 16 levels of call, with room for stack frames
* 			; 68000 is pre-dec (pre-store-decrement) push
SSTKBAS	DS.L	4	; for canary return
	DS.L	2	; bumper
HBASE	DS.L	$1000	; heap space (not yet managing it)
HLIM	DS.L	2	; bumper
*
*
	EVEN
INISTKS	MOVEM.L	(A7)+,A0	; get the return address from the BIOS-provided stack
	LEA	LB_ADDR(PC),A3	; point to our process-local area
	MOVEM.L	A4-A7,A4SAVE-LB_ADDR(A3)	; Store away what the BIOS gives us.
	MOVE.L	A3,A5	; set up our local base (pseudo-DP) in A5
	LEA	SSTKBAS+4*NATWID-LB_ADDR(A5),A7	; set up our return stack
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	LEA	HBASE-LB_ADDR(A5),A4	; as if we actually had a heap
	MOVE.L	A4,HPPTR-LB_ADDR(A5)
	MOVE.L	A4,HPALL-LB_ADDR(A5)
	JMP	(A0)		; return via A0
*

***
* Stack after entry when functions are called by MAIN
* with two parameters
* We will return result in D0:D1
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [--------]
* [--------] 
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] <= SP 
*

* Signed 16 bit add to 32 bit result
* Why do this? Stack cell is 32-bit, parameters are 16.
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right in 32-bit cell on stack
* output parameter:
*   17-bit sum in 32-bit D1
ADD16S	MOVE.W	NATWID+HLFNAT(A7),D0	; right (16-bit only)
	EXT.L	D0
	MOVE.W	2*NATWID+HLFNAT(A7),D1	; add to left (16-bit only)
	EXT.L	D1
	ADD.L	D0,D1			; 32-bit result
	RTS		; return, *** all flags valid!! ***
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit cell on stack
* output parameter:
*   17-bit sum in 32-bit D1
ADD16U	CLR.L	D0
	MOVE.W	NATWID+HLFNAT(A7),D0	; right (16-bit only)
	CLR.L	D1
	MOVE.W	2*NATWID+HLFNAT(A7),D1	; add to left (16-bit only)
	ADD.L	D0,D1			; 32-bit result
	RTS		; return, *** all flags valid!! ***
*
* Etc.
*

***
* Stack after entry when functions are called by MAIN
* with two parameters (pointer and addend)
* We will return result in D0:D1
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [VAR1_1--]
* [VAR1_2--] <= PARAM2_1
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] <= SP
* To show how to access caller's local variables through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to 32 bit caller's 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	MOVE.W	NATWID+HLFNAT(A7),D1	; skip over return address
	EXT.L	D1
	MOVE.L	2*NATWID(A7),A0		; get pointer to (internal) variable
	ADD.L	D1,(A0)			; add to variable pointed to
	RTS		; return, *** all flags valid!! ***
*
*
***
* Stack on entry
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= SP
*
MAIN	CLR.L	-(A7)	; 2 variables
	CLR.L	-(A7)
	MOVE.L	#$1234,-(A7)
	MOVE.L	#$CDEF,-(A7)
	BSR.W	ADD16U	; result in D1 should be $E023
	LEA	2*NATWID(A7),A7	; could reuse, instead
	MOVE.L	D1,-(A7)
	MOVE.L	#$8765,-(A7)
	BSR.W	ADD16S	; result in D1 should be $FFFF6788 (and carry set)
	LEA	2*NATWID(A7),A7	; drop the parameters
	MOVE.L	D1,(A7)	; store result in 2nd local variable
	PEA	(A7)
	MOVE.L	#$A5A5,-(a7)
	BSR.W	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	MOVE.L	2*NATWID(A7),FINAL-LB_ADDR(A5)	; store the result
	LEA	4*NATWID(A7),A7	; drop the parameters and locals
	RTS
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP (A7)
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= SP
***
*
START	BSR.W	INISTKS
*
	NOP
*
	BSR.W	MAIN
*
	NOP
*
DONE	NOP
ERROR	NOP		; stack underflow and ERROR skip DONE
STKUNDR	NOP
	MOVEM.L	A4SAVE-LB_ADDR(A5),A4-A7	; restore the monitor's A4-A7
	NOP
	NOP		; landing pad
	NOP
	NOP
* One way to return to the OS or other calling program
	CLR.W	-(A7)	; there should be enough room on the caller's stack
	TRAP	#1	;	quick exit
*

I have stepped through the code myself. It runs, puts the correct results where they are supposed to go, and restores the stack as it should.

Now let's look at it with a split-stack frameless discipline, comparing it with both the above in a new browser window and the split-stack stack frame version for the 68000, and with the split-stack version for the 6809. Assemble this one, as well, and step through it, watching memory and registers.


    	OPT LIST,SYMTAB	; Options we want for the stand-alone assembler.
	MACHINE MC68000	; because there are a lot the assembler can do.
	OPT DEBUG	; We want labels for debugging.
	OUTPUT
***********************************************************************
*
* 16-bit addition as example of split-stack, frameless discipline on 68000
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	4	; 4 bytes in the CPU's natural integer
HLFNAT	EQU	2	; half natural integer
*
*
	EVEN
LB_ADDR	EQU	*
ENTRY	BRA.W	START
	NOP		; A little buffer zone.
	NOP
A4SAVE	DS.L	1	; a place to keep A4 to A7 so we can return clean
A5SAVE	DS.L	1	; using it as pseudo-DP
A6SAVE	DS.L	1	; 
A7SAVE	DS.L	1	; SP
	DS.L	2	; gap
HPPTR	DS.L	1	; heap pointer (not yet managed)
HPALL	DS.L	1	; heap allocation pointer
	DS.L	2	; gap
FINAL	DS.L	1	; unused statically allocated variable
GAP1	DS.L	51	; gap, make it an even 256 bytes.
*
	DS.L	2	; a little bumper space
SSTKLIM	DS.L	16*2	; 16 levels of call, with room for frame pointers
* 			; 68000 is pre-dec (pre-store-decrement) push
SSTKBAS	DS.L	2	; for canary return
SSTKBMP	DS.L	2	; bumper
PSTKLIM	DS.L	16*4	; roughly 16 levels of call
PSTKBAS	DS.L	2	; bumper
HBASE	DS.L	$1000	; heap space (not yet managing it)
HLIM	DS.L	2	; bumper
*
*
	EVEN
INISTKS	MOVE.L	(A7)+,A0	; get the return address from the other (BIOS) stack
	LEA	LB_ADDR(PC),A3
	MOVEM.L	A4-A7,A4SAVE-LB_ADDR(A3)	; Store away what the BIOS gives us.
	MOVE.L	A3,A5	; set up our local base (pseudo-DP)
	LEA	SSTKBMP-LB_ADDR(A5),A7	; set up our return stack
	LEA	PSTKBAS-LB_ADDR(A5),A6	; set up our parameter stack
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	LEA	HBASE-LB_ADDR(A5),A4	; as if we actually had a heap
	MOVE.L	A4,HPPTR-LB_ADDR(A5)
	MOVE.L	A4,HPALL-LB_ADDR(A5)
	JMP	(A0)		; return via A0
*

***
* Return stack when functions are called by MAIN
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [RETADR1 ] <= RSP
*
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*

* Signed 16 bit add with 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right in 16-bit cells
* output parameter:
*   17-bit sum in 32-bit cell
ADD16S	MOVEM.W	(A6)+,D0/D1	; D0 lowest, but 16-bit sign extends!
*	EXT.L	D0		; right
*	EXT.L	D1		; left
	ADD.L	D0,D1		; add right to left
	MOVE.L	D1,-(A6)
	RTS		; return, *** all flags valid!! ***
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 16-bit cells
* output parameter:
*   17-bit sum in 32-bit cell
ADD16U	CLR.L	D0
	CLR.L	D1
*	MOVEM.W	(A6)+,D0/D1	; D0 lowest, but 16-bit sign extends!
	MOVE.W	(A6)+,D0	; right
	MOVE.W	(A6)+,D1	; left
	ADD.L	D0,D1		; add right to left
	MOVE.L	D1,-(A6)
	RTS		; return, *** all flags valid!! ***
*
* Etc.
*

***
* Parameter stack when called by MAIN
* with two parameters, 32-bit pointer and 16-bit addend
* [32:VAR1_1--]
* [32:VAR1_2--] <= PARAM2_1 
* [32:PARAM2_1]
* [16:PARAM2_2] <= PSP

* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to caller's 2nd 32-bit internal variable.
* input parameter:
*   32-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	MOVE.W	(A6)+,D1		; addend
	EXT.L	D1
	MOVE.L	(A6)+,A0		; get caller's internal variable pointer
	ADD.L	D1,(A0)	; add to caller's 2nd variable
	RTS		; return, *** all flags valid!! ***
*
*
***
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	CLR.L	-(A6)		; allocate and initialize
	CLR.L	-(A6)		; allocate and initialize
	MOVE.W	#$1234,-(A6)
	MOVE.W	#$CDEF,-(A6)
	BSR.W	ADD16U	; result on parameter stack should be $E023
	LEA	HLFNAT(A6),A6	; adjust to 16 bit, could be optimized out
	MOVE.W	#$8765,-(A6)
	BSR.W	ADD16S	; result on parameter stack should be $FFFF6788 (and carry set)
	MOVE.L	(A6),NATWID(A6)	; save in local
	LEA	NATWID(A6),A0
	MOVE.L	A0,(A6)		; save the pointer.
	MOVE.W	#$A5A5,-(a6)	; push the addend.
	BSR.W	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	MOVE.L	(A6),FINAL-LB_ADDR(A5)	; store the result
	RTS
*
***
* Stack at START:
* (what BIOS/OS gave us) <= RSP (A7)
***
* (who knows?) <= PSP (A6)
***
*
***
* Return stack will always be in pairs:
* [RETADRNN  ]
* [CALLERFMNN]
*
* Return stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= RSP
*
* Parameter stack after initialization, mark:
* [<unknown] <= PSP,FP==<EMPTYP>
*
START	BSR.W	INISTKS
*
	NOP
*
	BSR.W	MAIN
*
	NOP
*
DONE	NOP
	NOP		; landing pad
ERROR	NOP
STKUNDR	NOP
	MOVE.L	(A7)+,A4
	MOVEM.L	A4SAVE-LB_ADDR(A5),A4-A7	; restore the monitor's A4-A7
	NOP
	NOP		; landing pad
	NOP
	NOP
* One way to return to the OS or other calling program
	CLR.W	-(A7)	; there should be enough room on the caller's stack
	TRAP	#1	;	quick exit
*

Man, I'm worn out. On the other hand, I think this detour will help me focus as we progress through the I/O examples, starting with binary numeric output.

(It definitely helped me figure out some daydreams about extending the 6801, if you're interested.)

(Title Page/Index)

Wednesday, December 4, 2024

ALPP 02-33 -- Ascending the Right Island -- Frameless Examples (Single- & Split-stack): 6809

Yet another couple of useful bits, from the bottom of the pool.

Ascending the Right Island --
Frameless Examples (Single- & Split-stack):
6809

(Title Page/Index)

Now that we have worked through both the single-stack and split-stack frameless examples for the 6801, we can finally get back to the code that started this detour (6809 version) and strip out the code for maintaining the stack frames.

On higher-level architectures like the 6809, the stack frame maintenance code can be so non-intrusive that it can be easy to fail to notice it.

But it can still get in the way. So I'm going ahead and showing the code without it here, in single-stack no-frame and split-stack frameless discipline.

Frameless does mean we have to keep track of what's on the stack(s).

And there's really not much more left to talk about, although we want to remember that, because we are making specific use of the direct page, the entry address is $2000 instead $80.

First the single-stack version. You'll want to compare with the single-stack stack frame version for the 6809 to get a better feel for what is happening with stack frames, as well as with the single-stack no frame version for the 6801 to see how the 6809's addressing modes make things easier.


    * 16-bit addition as example of single-stack no frame discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$2000	; MDOS says this is a good place for usr stuff.
*	SETDP	$20	; for some other assemblers
	SETDP	$2000	; for EXORsim
*
ENTRY	LBRA	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP
SSAVE	RMB	2	; a place to keep S so we can return clean
SSAVEX	EQU	6	; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE	RMB	2	; just for kicks, save U, too
USAVEX	EQU	SSAVEX+2
DPSAVE	RMB	2	; a place to keep DP so we can return clean
DPSAVEX	EQU	USAVEX+2
	RMB	4	; bumper
XWORK	RMB	2	; For saving an index register temporarily
XWORKX	EQU	DPSAVEX+6
HPPTR	RMB	2	; heap pointer (not yet managed)
HPPTRX	EQU	XWORKX+2
HPALL	RMB	2	; heap allocation pointer
HPALLX	EQU	HPPTRX+2
	RMB	4	; bumper
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	HPALLX+6
GAP1	RMB	2	; Mark the bottom of the gap
GAP1X	EQU	FINALX+4
*
LB_ADDR	EQU	ENTRY
*
*
	SETDP	0	; Not yet set up
	ORG	$2100	; Give the DP room.
	RMB	4	; a little bumper space
SSTKLIM	RMB	96	; roughly 16 levels of call
SSTKLIMX	EQU	$104
* 			; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS	RMB	4	; for canary return
SSTKBASX	EQU	SSTKLIMX+96
SSTKBMP	RMB	4	; a little bumper space
SSTKBMPX	EQU	SSTKBASX+4
*
HBASE	RMB	$1024		; Not using or managing heap yet.
HBASEX	EQU	SSTKBMPX+4
HLIM	RMB	4	; bumper
HLIMX	EQU	HBASEX+$1024
*
*
INISTK	TFR	DP,A
	CLRB
	TFR	D,X		; save old DP base for a moment
	LEAY	ENTRY,PCR	; Set up new DP base
	TFR	Y,D
	TFR	A,DP		; Now we can access DP variables correctly.
*	SETDP	$20	; some other assemblers
	SETDP	$2000	; EXORsim
	STX	<DPSAVE		; technically only need to save high byte
	STU	<USAVE
	PULS	X		; get return address
	STS	<SSAVE		; Save what the monitor gave us.
	LEAS	SSTKBMPX,Y	; Move to our own stack
	LEAY	STKUNDR,PCR	; fake return to stack underflow handler
	PSHS	Y		; 
	PSHS	Y		; one more fake return to handler
	CLRB			; A still has run-time DP
	ADDD	#HBASEX		; calculat EA
	TFR	D,Y		; as if we actually had a heap
	STY	<HPPTR
	STY	<HPALL
	JMP	,X	; return via X
*
***
* Stack after call when fuctions are called by MAIN
* with two parameters
* (#0 means no local variables)
* We will return result in D:X
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [--------]
* [--------] 
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit D:X D high, X low
ADD16S	LDX	#-1	; sign extend right
	TST	2,S	; sign bit, anyway
	BMI	ADD16SR
	LEAX	1,X	; 0
ADD16SR	PSHS	X	; push right extension (parameters 4 offset)
	LDX	#-1	; negative
	LDD	6,S	; left
	BMI	ADD16SL
	LEAX	1,X	; 0
ADD16SL	PSHS	X	; push left extension (parameters 6 offset)
	ADDD	6,S	; add right
	TFR	D,X	; save low
	PULS	D	; get left sign extension (parameters 4 offset)
	ADCB	1,S	; carry is still safe
	ADCA	,S	; high word complete
	LEAS	2,S	; drop temporary
	RTS		; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit D:X D high
ADD16U	LDD	4,S	; left
	ADDD	2,S	; add right
	TFR	D,X	; save low
	LDD	#0	; extend
	ADCB	#0	; extend Carry unsigned (could ROL in)
	RTS		; C, N valid, Z not valid
*
* Etc.
*
***
* Stack at entry when called by MAIN
* (#0 means no local variables)
* We will return result in D0:D1
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [VAR1_1--]
* [VAR1_2--] <= PARAM2_1
* [PARAM2_1] (pointer to VAR1_2)
* [PARAM2_2]
* [RETADR1 ] 
*
* To show how to access caller's local variables through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to 32 bit caller's 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDX	#-1	; sign extend 1st parameter
	TST	2,S
	BMI	ADD16SIP
	LEAX	1,X
ADD16SIP	PSHS	X	; parameters now 4 offset
	LDX	6,S	; pointer -- LDD [6,X] gets the high half
	LDD	2,X	; caller's 2nd variable, low
	ADDD	4,S	; 1st parameter
	STD	2,X	; update low half
	LDD	,X	; caller's 2nd variable, high
	ADCB	1,S	; sign extension
	ADCA	,S	; high byte 
	STD	,X	; update
	LEAS	2,S	; drop temporary
	RTS		; C, N valid, Z not valid
*
*
***
* Stack after allocating local variables
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] <= SP
*
MAIN	LDD	#0	; allocate and initialize
	TFR	D,X
	PSHS	D,X
	PSHS	D,X
*
	LDX	#$1234
	LDD	#$CDEF
	PSHS	D,X
	LBSR	ADD16U	; result in D:X should be $E023
	STX	2,S
	LDD	#$8765
	STD	0,S
	LBSR	ADD16S	; result in D:X should be $FFFF6788 (and carry set)
	STX	6,S	; result in 2nd local variable
	STD	4,S
	LEAX	4,S	; calculate address of 2nd variable to pass in
	STX	2,S
	LDD	#$A5A5
	STD	,S	
	LBSR	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	LDD	4,S
	STD	<FINAL
	LDD	6,S
	STD	<FINAL+2
	LEAS	12,S	; drop both the used parameters and the local variables together
	RTS		; C, N still valid, Z still not
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= SP
***
*
START	NOP
	LBSR	INISTK
	NOP
*
*
	LBSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	<SSAVE	; restore the monitor stack pointer
	LDU	<USAVE	; restore U
	LDD	<DPSAVE	; restore the monitor DP last
	TFR	A,DP
	SETDP	0	; For lack of a better way to set it.
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	JMP	[$FFFE]	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

Again, not much to say about the split-stack code. other than that you'll want to compare it with the split-stack stack frame version for the 6809 and the split-stack stack frame version for the 6801, for the same reasons as mentioned above. to get a better feel of the differences.


    * 16-bit addition as example of split-stack frame-free discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$2000	; MDOS says this is a good place for usr stuff.
*	SETDP	$20	; for lwasm and some other assemblers
	SETDP	$2000	; for EXORsim and some other assemblers
*
ENTRY	LBRA	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP
SSAVE	RMB	2	; a place to keep S so we can return clean
SSAVEX	EQU	4	; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE	RMB	2	; just for kicks, save U, too
USAVEX	EQU	SSAVEX+2
DPSAVE	RMB	2	; a place to keep DP so we can return clean
DPSAVEX	EQU	USAVEX+2
	RMB	4	; bumper
XWORK	RMB	2	; For saving an index register temporarily
XWORKX	EQU	DPSAVEX+6
HPPTR	RMB	2	; heap pointer (not yet managed)
HPPTRX	EQU	XWORKX+2
HPALL	RMB	2	; heap allocation pointer
HPALLX	EQU	HPPTRX+2
	RMB	4	; bumper
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	HPALLX+6
GAP1	RMB	2	; Mark the bottom of the gap
GAP1X	EQU	FINALX+4
*
LB_ADDR	EQU	ENTRY
*
*
	SETDP	0	; Not yet set up
	ORG	$2100	; Give the DP room.
	RMB	4	; a little bumper space
SSTKLIM	RMB	32	; 16 levels of call
SSTKLIMX	EQU	$104	; Skip over the DP page.
* 			; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS	RMB	4	; for canary return
SSTKBASX	EQU	SSTKLIMX+32
SSTKBMP	RMB	4	; a little bumper space
SSTKBMPX	EQU	SSTKBASX+4
PSTKLIM	RMB	64	; about 16 levels of call at two parameters per call
PSTKLIMX	EQU	SSTKBMPX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBASX	EQU	PSTKLIMX+64
*
HBASE	RMB	$1024		; Not using or managing heap yet.
HBASEX	EQU	PSTKBASX+4
HLIM	RMB	4	; bumper
HLIMX	EQU	HBASEX+$1024
*
*
* Calculate DP because we don't have DP relative in index postbyte:
INISTKS	TFR	DP,A
	CLRB
	TFR	D,X		; save old DP base for a moment
	LEAY	ENTRY,PCR	; Set up new DP base
	TFR	Y,D
	TFR	A,DP		; Now we can access DP variables correctly.
*	SETDP	$20	; some other assemblers
	SETDP	$2000	; EXORsim
	STX	<DPSAVE		; technically only need to save high byte
	STU	<USAVE
	PULS	X		; get return address
	STS	<SSAVE		; Save what the monitor gave us.
	LEAS	SSTKBMPX,Y	; Move to our own return stack
	LEAU	PSTKBASX,Y	; and our own parameter stack
	LEAY	STKUNDR,PCR	; fake return to stack underflow handler
	PSHS	Y
	PSHS	Y		; one more fake return to stack underflow handler
	CLRB			; A still has run-time DP
	ADDD	#HBASEX		; calculat EA
	TFR	D,Y		; as if we actually had a heap
	STY	<HPPTR
	STY	<HPALL
	JMP	,X	; return via X
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [RETADR1 ]
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	LDX	#-1	; sign extend right
	TST	,U	; sign bit, anyway (Use Y to show it can be used.)
	BMI	ADD16SR
	LEAX	1,X	; 0
ADD16SR	PSHU	X	; push right extension (parameters 2 offset)
	LDX	#-1	; negative
	LDD	4,U	; left
	BMI	ADD16SL
	LEAX	1,X	; 0
ADD16SL	PSHU	X	; push left extension (parameters 4 offset)
	ADDD	4,U	; add right
	STD	6,U	; save low
	PULU	D	; get left sign extension (parameters 2 offset)
	ADCB	1,U	; carry is still safe
	ADCA	,U++	; high word complete, tricky postinc (parameters 0 offset)
	STD	,U
	RTS		; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit
* output parameter:
*   17-bit sum in 32-bit 
ADD16U	LDD	2,U	; left
	ADDD	,U	; add right
	STD	2,U	; save low
	LDD	#0	; extend
	ROLB		; extend Carry unsigned (could ADC #0)
	STD	,U
	RTS		; C, N valid, Z not valid
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* [32:VAR1_1--]
* [32:VAR1_2--] <= PARAM2_1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDD	#-1	; sign extend addend parameter
	TST	,U
	BMI	ADD16SIP
	LDD	#0
ADD16SIP	PSHU	D	; save sign extension (parameters 2 offset)
	LDX	4,U	; get pointer to variable
	LDD	2,X	; caller's 2nd variable, low
	ADDD	2,U	; addend parameter
	STD	2,X	; update low half
	LDD	,X	; caller's 2nd variable, high
	ADCB	1,U	; sign extension low byte
	ADCA	,U	; high byte
	STD	,X	; store result
	LEAU	6,U	; drop temporary and parameters -- no return parameter
	RTS		; C, N valid, Z not valid
*
*
***
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	LDD	#0	; allocate and initialize
	TFR	D,X
	PSHU	D,X
	PSHU	D,X
	LDX	#$1234
	LDD	#$CDEF
	PSHU	D,X	; 8 bytes local, 4 bytes parameter, 12 bytes offset
	LBSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LEAU	2,U	; drop high part (could be optimized out).
	LDD	#$8765
	PSHU	D
	LBSR	ADD16S	; result on parameter stack should be $FFFF6788 (and carry set)
	PULU	D,X	; 4 bytes of used parameters removed from stack (local variables on top)
	STX	2,U	; low half, store in local variable
	STD	,U	; high half
	LEAX	,U	; point to 2nd variable
	LDD	#$A5A5
	PSHU	D,X	; X pushed first
	LBSR	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	LDD	2,U
	STD	<FINAL+2
	LDD	,U
	STD	<FINAL
	LEAU	8,U
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= RSP (S)
***
* (who knows?) <= PSP (U)
***
*
***
* Return stack will be just the return addresses:
* [RETADRNN  ]
*
* Return stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= RSP
*
*
* Parameter stack after initialization, mark:
* [<unknown] <= PSP
*
START	LBSR	INISTKS
*
	LBSR	MAIN
*
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	<SSAVE	; restore the monitor stack pointer
	LDU	<USAVE	; restore U
	LDD	<DPSAVE	; restore the monitor DP last
	TFR	A,DP
	SETDP	0	; For lack of a better way to set it.
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	JMP	[$FFFE]	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

As always, I have stepped through the code and made sure it does what I say it does.

If reading through it and comparing it with other version brings up questions that stepping through the code doesn't answer, go ahead and leave me a comment.

From here, you can either go ahead to digging into outputting binary numbers, or (when I get it ready) you can look at one more set of examples for frameless discipline, on the 68000.

Ah, more squirrels to chase. I mean, more daydreams.

With the stack split up, we might be able to see how a simple hysteric spill-fill cache could significantly optimize calls and returns.

Calls and returns cost, in addition to the code and cycles to load the new PC, cycles to save and restore the old. With the combined stack, they also tend to incur code and cycle costs in moving parameters into place and saving and restoring registers.

With a cache attached to the return stack pointer, saves and restores can happen in parallel with fetching, decoding, and executing instructions, effectively hiding the basic call/return overhead.

Here's what I mean by hysteric spill/fill:

Say the cache has sixteen entries (32 bytes).

When pushing a new return address crosses the boundary between the 12th and 13th entry -- 3/4 ful, the cache controller starts pushing saved addresses off the other end into main RAM, to make more room. It watches the bus so that it can do so when the bus is not busy with instruction fetches or data or DMA accesses, unless it the cache completely fills, in which case it gets the bus at higher priority than instruction fetches.

It will keep pushing addresses out until the cache is half-empty again, or until a return cancels the fill.

Returns will work in reverse. As long as it is nested more than four calls deep, it will try to keep at least four return addresses in the cache, schedule reads to bring addresses back in from RAM when the boundary between the 5th and 4th entries is crossed.

It uses a cache base and limit register to maintain position in the stack address space, and a stack base and limit register to tell the controller when to come to a hard stop, and when to initiate stack overflow or underflow interupt/exception processing.

I assume that you have noticed that splitting the stack helps relieve the costs of moving parameters into place, and can even be of some relief relative to saving and restoring registers.

The parameter stack is not as regularly structured as the return address stack, but it could profitably be cached in a similar manner, with a larger cache, either double or quadruple size.

Both of these caches should be paired, to enable fast context switching. Or maybe done in sets of four, but I'm not sure the 6809 would benefit from four of each. One for the current process and one to be writing back to RAM after a process switch should be enough.

And I guess, since I've commented after the 6801 examples about how the direct page should be a bank of memory to use as pseudo-registers, I should mention the concept of a cache for the direct page here. This would also be paired, with the switch activated when the DP is set. There would need to be several different strategies for filling the new cache and writing back the dirty entries from the old cache, plus a way of setting priority for differnt regions of the direcgt page.

Caching the direct page would conflict with using it for I/O devices, so I'm thinking the 6809 wants a second direct page (specifiable in the index post-byte) just for I/O.

Heh. Daydreams, indeed. This is just an 8/16-bit processor with a 16-bit address space. Too greedy. Unless we had a true 16/32-bit descendant of the 6809.

Ah. Sorry for the further distractions.

The 68000 frameless examples.

Or outputting binary numbers.

(Title Page/Index)

Tuesday, December 24, 2024

Multiplying by Powers of Two(Shift Left)

Monday, December 23, 2024

Numeric Output Conversionand Multiplying by Constants(Theory)

Wednesday, December 18, 2024

Radix Output

Sunday, December 15, 2024

Tentative Op-code Map ofRK0801 CPU(Extension of 6801)

Sunday, December 8, 2024

Ascending the Right Island -- Frameless Examples (Single- & Split-stack):68000

Wednesday, December 4, 2024

Ascending the Right Island -- Frameless Examples (Single- & Split-stack):6809

Multiplying by Powers of Two
(Shift Left)

Numeric Output Conversion
and Multiplying by Constants
(Theory)

Tentative Op-code Map of
RK0801 CPU
(Extension of 6801)

Ascending the Right Island --
Frameless Examples (Single- & Split-stack):
68000

Ascending the Right Island --
Frameless Examples (Single- & Split-stack):
6809