joel's programming fun

Sunday, April 26, 2026

Which Is More Compact on the Tandy Color Computer -- Machine Code or BASIC?

A perennial question on one of the Tandy/Radio Shack Color Computer Facebook groups:

Is machine code or BASIC more compact?

The answer is not simple, so I'm going to blog it.

The first question is, which BASIC?

BASIC compilers convert the BASIC program source to machine code (or assembly language, which is pretty much functionally equivalent to machine code). So comparing the compactness of BASIC compilers to machine code is basically just analyzing the ability of the compiler to produce efficient code.

Some BASIC interpreters directly interpret from the program text. This tends not to be very compact, since the stored program is just the source text. But it may well be more compact than the machine code produced by a simplistic compiler or a programmer who isn't very fluent in the processor and its machine/assembly language.

Other BASIC interpreters parse the source text and compile it to high-level byte codes defined by the interpreter, storing the stream of byte codes, usually with the comments. Whether the stored byte-code stream is compact or not depends to a large extent on how well the byte code is designed.

Compared to most CPU machine codes, the byte-code streams can be rather compact.

But the comments that get stored with the byte-code streams can result in a stored program that is not compact. In fact, the better the comments, the fatter the stored program.

But there's more! (heh)

The 6809 in the Tandy/Radio Shack Color Computer has a very advanced machine code which is essentially equivalent in expressiveness to a well-designed byte code.

So 6809 machine code tends to be compact.

Aaaand the BASIC interpreter in the Color Computer is one of Microsoft's. Microsoft developed a method of mechanically projecting their 8080 BASIC on other CPU architectures, which resulted in quickly done implementations that basically ignored many of the special features of other processors and wasted a lot of code and cycles emulating the BASIC run-time that Bill and Paul hacked together for the 8080.

So, even though the BASIC in the Color Computer is a byte-code interpreter, it's not necessarily all that compact.

So the general conclusion is that you really don't want to ask the question. Just learn both as tools.

But, there's MORE!

The Color Computer does not have a separate BIOS like the IBM PC 5150 does.

It has some routines that can be used more-or-less independently of BASIC, specifically the Input Character and Output Character routines. If you have Disk Extended Color BASIC plugged in (via the floppy controller cartridge), the disk input and output routines can also be used, with some care, somewhat independently from BASIC.

That's code that you don't have to write yourself, and it (usually) exists in ROM in the Color Computers 1 & 2, and doesn't take up space in RAM.

That's not compact code, but it might save space.

If you are willing to let BASIC do a lot of the lifting, you can mix machine language code with BASIC code by reserving space for it via BASIC commands and reading the machine language code in from data statements, or reading the machine language parts in from disk or tape.

If you don't co-exist with BASIC, you have to write your own fundamental (especially, I/O) routines and include them with your programs -- unless you target the OS-9/6809 operating system (or the community rewrite of OS-9/6809, NitrOS-9), or Flex09, or one of the other operating systems available for the 6809.

And all the code you write yourself takes up space.

Really, targeting OS-9/6809 is probably the better idea, but I don't want to take the time to take that topic up in this post.

A few things you may want to know relative to this question: BASIC takes up 8K to 24K in the upper half of the 6809's 64K address space in the Color Computers 1 & 2. Your BASIC program and your machine code will have to share the lower 32K, or as much as is physically installed -- except in the 64K mode, which I will explain a little bit after explaining what I just said.

Color BASIC lives in 8K of ROM inside the computer, in the upper half of the address space -- so it can boot the computer up on power on. Color BASIC is the boot ROM.
Extended Color BASIC (ECB) is obtained by adding an 8K extension ROM internally. (Or, in some later models that come with ECB installed, there is only one 16K ROM.) The 8K of ECB extensions exist just below the boot ROM.
Disk Extended Color BASIC (DECB) adds another 8K of ROM in the controller ROMpack, as I mentioned above. The ROM for DECB, when the floppy disk controller pack is installed, exists just above the boot ROM.

It looks like this:

(Double apologies for having incorrect order in the description above and an incorrect table here for the first couple of hours live, and still having the positions of the ROMs swapped above and below, for the first week this was live. The described position of the interrupt vectors in the boot ROM was also 8K too low. Guess I've been working too hard.)

65504~65535	reserved, and interrupt vectors
65280~65503	I/O, control
57344~65279	(gap)
49152~57343	DECB ROM (if present)
40960~49151	Color BASIC/Boot/interrupt code
32768~40959	ECB (if present)
0~32767	RAM

(For those wondering, the addressing circuitry ghosts the top 32 bytes of the boot ROM {from 49120 to 49151} up 16K {up to 65504 to 65535} so that the vectors in the ROM that starts at 40960 can be read in response to interrupts.)
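The arithmetic in the table and the note above is easy to get wrong (as I have demonstrated), so here is a small Python sketch that checks it. This is only a model of the address map as described in this post; the names REGIONS, region_of, GHOST_SOURCE and GHOST_OFFSET are mine for illustration, not anything from Tandy's documentation.

```python
# Address map of the Color Computer 1 & 2, copied from the table above.
# (low address, high address, label) -- all names here are illustrative.
REGIONS = [
    (0, 32767, "RAM"),
    (32768, 40959, "ECB (if present)"),
    (40960, 49151, "Color BASIC/Boot/interrupt code"),
    (49152, 57343, "DECB ROM (if present)"),
    (57344, 65279, "(gap)"),
    (65280, 65503, "I/O, control"),
    (65504, 65535, "reserved, and interrupt vectors"),
]


def region_of(address):
    """Return the table label for a 16-bit address."""
    for low, high, label in REGIONS:
        if low <= address <= high:
            return label
    raise ValueError("not a 16-bit address")


# The top 32 bytes of the boot ROM are ghosted up 16K.
GHOST_SOURCE = range(49120, 49152)
GHOST_OFFSET = 16 * 1024

# Print the map in hexadecimal with the size of each region.
for low, high, label in REGIONS:
    print(f"${low:04X}-${high:04X}  {high - low + 1:6d} bytes  {label}")

ghost_low = GHOST_SOURCE[0] + GHOST_OFFSET
ghost_high = GHOST_SOURCE[-1] + GHOST_OFFSET
print(f"vectors ghosted to ${ghost_low:04X}-${ghost_high:04X}")
```

Seeing the ranges in hexadecimal makes the 8K ROM boundaries a lot more obvious than they are in decimal.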

The boot process points the video controller somewhere in the RAM, and 512 bytes (one half kilobyte) is used for the video buffer (text mode). BASIC reserves some of the RAM for strings and variables and some for storing code, does some other initialization, and shows you the copyright notice and OK prompt.

You can check how much physical RAM is installed by using the BASIC PRINT command, printing the value of the MEM pseudo-variable right after power-on:

PRINT MEM

What it should show you for various sizes of physical RAM installed:

4K RAM: PRINT MEM shows less than 4K, around 2K after boot, in Color BASIC only. ECB and DECB will not boot in 4K of RAM.
16K RAM: PRINT MEM shows between 4K and 16K, close to 16K for Color BASIC, a bit more than 8K for ECB, and a bit more than 6K for DECB.
32K RAM: PRINT MEM shows between 16K and 32K, close to 31K for Color BASIC, a bit over 24K for ECB, and a bit over 22K for DECB.

These reservations of RAM can be adjusted by BASIC commands in your program (or just at the keyboard, when you're testing things).

Use the FILES command to adjust the number of file buffers.

Use the CLEAR command to specify how much string variable/space you want, and what you want BASIC to respect as the top of RAM. For instance, if you want 2000 bytes for variables and strings, and you want to put your machine language routines at 30000 and above, use

CLEAR 2000, 30000

(That's assuming you have 32K RAM or more.)

PRINT MEM after that, and you should get just over 18K, which BASIC will use to store the program in.

(I think that's the way it worked.)

Now, if you want to leave BASIC completely behind, using your own I/O routines and all, you don't really have to care how much memory BASIC thinks is left over. You only have to pay attention to where the video buffer is. But then you have to include your own I/O routines with the code, and they do take space (and time to write).

(But you also need to pay attention to the interrupt time operations of the machine.)

You can get information on how to determine where the video buffer is and such from EDTASM+ or DISK EDTASM manuals and other sources.

Now, I mentioned the 64K RAM mode. You can install 64K of RAM in a Color Computer 1 or 2 (with some physical modifications on the mainboard; instructions are available in various places if yours doesn't have the expansions already installed). But Microsoft's BASIC will only use 32K of it.

How you get access to the full 64K is to use OS-9/6809 -- or NitrOS-9. And leave Microsoft BASIC behind.

Or, you can load a (machine language) routine to hit the Page Switch bit to map the physically high half of RAM low, copy the ROMs to the RAM, flip the page switch back so that the code from the ROMs is in place for the next interrupt, and you're in business, running the ROM code out of RAM.

Unfortunately, Microsoft BASIC still doesn't want to let you set the limit to BASIC above (a hundred bytes or so below) 32K. But you do end up with some RAM in the upper half that can be used for machine code.

There are patches, modified BASICs and 3rd party BASICs that let you get around the 32K limit. Finding them is an exercise for the reader.

Then there is the Color Computer 3. The smallest RAM configuration for the Color Computer 3 is 128K. That's bigger than the 6809 can address by itself, of course. The Color Computer 3 has a replacement for the SAM (and VDG) called the GIME, which provides the means of bank switching the RAM around in 8K chunks, and that's how it can access more memory.

The Color Computer 3 boots from the ROMs, copies the ROM to RAM, and still leaves you with only 32K for BASIC. Except that the new version of BASIC provides for wider text screens, higher resolutions, and more colors, managing the video memory for you. All of this uses the expanded memory, and leaves the conventional memory available for your program. So it's definitely worth the upgrade.

On the other hand, OS-9/6809 level 2 or NitrOS-9 will take care of the bank switching for you, if you use that, letting you use more of the expanded RAM space for your own code.

Other OSes will take advantage of the bank switching for you -- I believe UniFlex can, and Fuzix should be able to, as well.

Or you can write your own, for fun. :)

I really should do a more complete version of this post, I suppose, with diagrams and coding examples, and an explanation of how to run OS-9 and NitrOS-9 and maybe some of the others, but I really don't feel like I can afford the time this week or next. I have to take care of some government paperwork.

So, the conclusion --

If you are needing to squeeze functionality into a tight space, the question isn't really machine code vs. BASIC.

It's figuring out what you want to do with which -- and the fun in doing that is why you are doing retro in the first place, isn't it?

And then when you've had fun with the BASIC environment, figuring out when you want to leave it behind for a more complete OS, and which path forward.

Monday, January 13, 2025

ALPP 03-15 -- Converting Numbers for Output and Input with Multiplication and Division (Theory)

Converting Numbers for Output and Input
with Multiplication and Division

(Title Page/Index)

Now that we've debugged getting an input key from the ST's keyboard and outputting its ASCII code value in hexadecimal and binary on the 68000, a natural next step would be to learn how to parse numbers from the input.

But that will require multiplying and dividing by ten, because we usually interact with numbers in decimal base -- radix base ten.

(Yeah, I'm not all that comfortable trying to remember the digits of π in hexadecimal or binary, either. And I'm not going to go out of my way to memorize those, particularly when I know how to get a computer to calculate them any time I need them, as in bc, using obase to set binary and hexadecimal output radix base:


$ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
obase=2
4*a(1)
11.00100100001111110110101010001000100001011010001100001000110100101\
01
obase=16
4*a(1)
3.243F6A8885A308D2A

arctangent(1) is, of course, π/4. Yeah, if you're looking at the final digits above, the last byte that bc calculates in the scale you specify will be somewhat incorrect. The scale above is the default of 20 decimal digits when starting bc with the -l option.)

When you're working in binary, getting numbers in and out in decimal requires converting between binary and decimal.

We call it decimal ("dec" is ten) because it is radix base ten, and we have become used to writing our numbers in (radix) columns using that base. Something to do, as we suppose, with the number of fingers we have.

Converting between binary (radix base two) and decimal (radix base ten) requires multiplying and dividing by two and ten.

Output, Working Left-to-right

One approach to display each digit of a numeric value in decimal is to proceed from left (most significant in our traditional column order) to right (least significant).

We start by finding the largest power of ten that is not larger than the number and divide the number by that value. The quotient will be the first digit on the left.

Then we repeat with the remainder and the next lower power of ten, continuing down through the ones column so that trailing zeros are not lost.

The advantage of this approach is that we can start writing digits down where we are.

The disadvantage is that we have to find the largest power of ten not larger than the number before we can start.

One way to make that easier is to have a pre-calculated array of powers of ten to compare the number to.
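Here is a rough sketch of the left-to-right approach with a pre-calculated table, in Python rather than assembler so the idea isn't buried in register juggling. The 16-bit limit, the table POWERS_OF_TEN and the function name are my assumptions for the illustration, and Python's built-in divmod stands in for whatever division routine the real code would use.

```python
# Pre-calculated powers of ten for a 16-bit unsigned value, largest first.
POWERS_OF_TEN = [10000, 1000, 100, 10, 1]


def to_decimal_left_to_right(value):
    """Convert 0..65535 to a decimal string, most significant digit first."""
    if not 0 <= value <= 65535:
        raise ValueError("expects a 16-bit unsigned value")
    digits = []
    started = False
    for power in POWERS_OF_TEN:
        # The quotient is this column's digit; the remainder carries on
        # to the next lower power of ten.
        digit, value = divmod(value, power)
        # Skip leading zeros, but always emit the ones column.
        if digit != 0 or started or power == 1:
            digits.append(chr(ord("0") + digit))
            started = True
    return "".join(digits)


print(to_decimal_left_to_right(16723))
```

Walking the whole table and suppressing leading zeros is one way to avoid the separate search for the starting power. It also keeps zeros in the middle and at the end of the number from getting dropped.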

Output, Working Right-to-left

Another approach is to guess or calculate how much space we need and start from the right (least significant column) and proceed to the left (towards the most significant). Divide the number by ten itself. The remainder is the right-most digit.

Then we repeat with the quotient until the quotient goes to zero.

The advantage of this approach is that we will always be dividing by ten. There is no need to find a starting power of ten first.

The disadvantage is that we have to guess how much space to leave -- or calculate it.

But we can avoid either guessing or calculating the amount of space by doing our initial work in a temporary buffer somewhere, then copying the buffer to output.
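A minimal sketch of the right-to-left approach with a temporary buffer, again as a Python model rather than real assembler. The buffer width of 5 (enough for a 16-bit value) and the function name are assumptions I'm making for the example.

```python
def to_decimal_right_to_left(value, width=5):
    """Convert a non-negative integer by repeated division by ten,
    filling a temporary buffer from the right."""
    if value < 0:
        raise ValueError("expects a non-negative value")
    buffer = [" "] * width      # the temporary conversion buffer
    position = width            # one past the right-most column
    while True:
        # The remainder is the next digit, right to left.
        value, remainder = divmod(value, 10)
        position -= 1
        if position < 0:
            raise OverflowError("conversion buffer too small")
        buffer[position] = chr(ord("0") + remainder)
        if value == 0:          # quotient went to zero, we're done
            break
    # Copy only the part of the buffer we filled to the output.
    return "".join(buffer[position:])


print(to_decimal_right_to_left(16723))
```

The loop tests the quotient after storing the digit, so a value of zero still produces a single "0".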

Efficiency

Which is more efficient depends on a lot of things, but, in many cases, the code for the former can be organized so that it is as if only one actual complete division is performed, the iteration for each column producing one digit. By comparison, the latter approach requires a division for each column, and the division is by a small number, which is the sort of division that takes the most processor cycles.

But if we are just trying to get output going, we may find it easier to allocate the conversion buffer as a process global variable and use the latter method.

Input, Working Left-to-Right

To input a decimal value from the keyboard, we can get each digit in order from left to right, multiplying the accumulated value by ten before adding the digit we got, repeating until there are no more digits entered (or perhaps until the accumulated value overflows).
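As a sketch of that loop, modeled in Python: the string of keys stands in for the keyboard, and the 16-bit limit and the function name are my assumptions, not part of the framework we have been building.

```python
def parse_decimal_left_to_right(keys, limit=0xFFFF):
    """Accumulate decimal digits from left to right.

    Stops at the first non-digit. Returns (value, digits consumed).
    Raises OverflowError if the value would exceed `limit`.
    """
    value = 0
    consumed = 0
    for key in keys:
        if not ("0" <= key <= "9"):
            break               # no more digits entered
        # Multiply the accumulated value by ten, then add the new digit.
        candidate = value * 10 + (ord(key) - ord("0"))
        if candidate > limit:
            raise OverflowError("value does not fit")
        value = candidate
        consumed += 1
    return value, consumed


print(parse_decimal_left_to_right("16723\r"))
```

Returning the count of digits consumed lets the caller tell "0" apart from no digits at all.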

Input, Working Right-to-Left

Or we can read all the digits first into a global conversion buffer, count the number of digits, and multiply each digit by the appropriate power of ten, adding up the products as we go. And that also requires multiplication.
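The buffered version might look something like this in Python. It assumes the digits have already been collected into a string, and the function name is just mine for the illustration.

```python
def parse_decimal_right_to_left(buffer):
    """Convert a buffer of decimal digits, starting from the ones column."""
    if not buffer or not all("0" <= ch <= "9" for ch in buffer):
        raise ValueError("buffer must contain only decimal digits")
    value = 0
    power = 1                   # power of ten for the current column
    for ch in reversed(buffer):
        # Multiply the digit by its column's power of ten and accumulate.
        value += (ord(ch) - ord("0")) * power
        power *= 10             # next column to the left
    return value


print(parse_decimal_right_to_left("16723"))
```

Note that there are two multiplications per column here, one for the digit and one to step the power of ten, where the left-to-right loop needs only one.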

Multiplication, either way.

Efficiency

Again, efficiency appears to be more on the side of the left-to-right approach. But, again, we may find it easier (more efficient use of programmers' time) to declare the buffer, copy input into the buffer until we get a non-numeric input, and parse/convert the number in the buffer instead of directly from the keyboard.

But thinking about efficiency too early in the planning stages is a mistake, unless you are actually not thinking about efficiency so much as trying to understand the problem.

Approaching Implementation

I don't know about you, but I find multiplication to be easier than division.

Why?

Memorizing the multiplication tables is fairly easy, and once we have the table memorized, we can look at each pair of digits from the multiplier and multiplicand and directly produce a digit in the product, with possible carry.

It's a straight-forward input-driven process.

Trying to memorize division tables means memorizing lots of possible products and the factors used to produce them, and there are so many possibilities we don't usually get motivated to do that. (There are certain patterns that we can memorize that help, though.) And then we use what we remember to look at the dividend and guess which product of which pair of factors is applicable.

Even when we do that for each digit in the dividend, we often have to guess and then we have to check our guesses and, if our guess is correct, only then can we reduce the dividend, and count and record each digit of the quotient.

Essentially, we look at the divisor and the dividend and go searching for the quotient.

Aaaaaaannnddd ---

Checking whether we have found the correct digit of the quotient at each step requires multiplication.
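To make that concrete, here is a small Python model of pencil-and-paper long division in decimal, where each quotient digit is found by trying candidates and checking each one by multiplication. This is only a sketch of the guess-and-check idea, not the routine we will build for any of the processors, and the function names are mine.

```python
def quotient_digit(partial_dividend, divisor):
    """Search for the largest digit whose product with the divisor
    still fits in the partial dividend."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    digit = 0
    # Each candidate digit is checked by multiplying.
    while digit < 9 and (digit + 1) * divisor <= partial_dividend:
        digit += 1
    return digit


def long_divide(dividend, divisor):
    """Long division, one decimal column at a time.
    Returns (quotient, remainder)."""
    if dividend < 0:
        raise ValueError("dividend must be non-negative")
    quotient = 0
    remainder = 0
    for ch in str(dividend):
        # Bring down the next digit of the dividend.
        remainder = remainder * 10 + int(ch)
        digit = quotient_digit(remainder, divisor)
        remainder -= digit * divisor    # reduce the dividend
        quotient = quotient * 10 + digit
    return quotient, remainder


print(long_divide(16723, 10))
```

Every step of the search multiplies, which is the point: we can't get away from multiplication by going straight to division.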

Erk. Does it feel like we're being corralled into understanding multiplication?

Let's look at multiplication.


Monday, January 6, 2025

ALPP 03-XX (17) -- Demonstrating Left Shift -- 6800

I'm putting this here for reference. Eventually, I plan to do a chapter on shifts, and most of this will be demonstrated there. I've only tested part of the code.

Demonstrating Left Shift --
6800

(Title Page/Index)

I've shown you some theoretical background on bit shifting left and multiplying by powers of 2, and I want to move ahead because we can't print out the results in decimal yet.

But I talked it over with God. Oh, some people will understand, some people won't. I could call it a hunch -- a strong hunch, strong enough to keep me from proceeding -- if you prefer.

And the result? There's a lot of code in these. Read through the 6809 and 68000 versions, scan the other two, and test one or more if you are inclined. Come back for reference if things get murky when we talk about synthesizing the multiplication and division routines.

I didn't want to consume four posts for this, to show the rigging and the test code for each processor, but it's going to be four posts even without the rigging framework.

But you've already seen everything in the rigging framework anyway, in the single character input chapters. I'm going to let you move the new stuff into the rigging framework yourself this time.

Starting with the 6800 code for character input, open the file rt_rig03_6800.asm up in a text editor and save it as rt_rig04_6800.asm. Keep it open and open inkey_6800.asm (or whatever you saved it as) up and save it as shftst_6800.asm (or whatever). Change the inclusion (EXP) line to include rt_rig04_6800.asm instead of rt_rig03_6800.asm.

Now cut INCHAR and INCHNE out of shftst_6800.asm, from the comments on AECHO to the hook to INCHV, and move them into someplace appropriate in rt_rig04_6800.asm. You might want to move them in two or three separate pieces, or you might want to move those lines all together at once to the same place. Your choice.

Also grab PDUP for the 6800 and 6801.

Save the two files and make sure they still assemble and run as in chapter 3-10.

Now, if you want to do this part yourself, cut the test code out of multest_6800.asm and replace it with appropriate 6800 code from the last chapter. Yes, that does mean you'll need to convert the 6809 code to 6800 code. It's not hard, just tedious, and instructive.

If you don't want to do the conversion yourself, or if you want to see how I'd do it, the following is some demonstration code I produced. A little less than two thirds of the way down, I realized I was heading the wrong direction and quit testing it. Some of the remaining code is known not to do what I intended, and the rest of the remaining code is not tested, but I'm leaving it here for reference.
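If you want something to check the multi-byte shifts against, here is a small Python model of the LSL-then-ROL chain. It is not 6800 code and I am not claiming it mirrors the listing line for line; the function name and the big-endian list of bytes are my own conventions for the sketch. The 16-bit and 32-bit test values are the ones noted in the comments of the 6800 listing.

```python
def lsl_multibyte(data):
    """Shift a big-endian list of bytes left by one bit.

    Models LSL on the least significant byte followed by ROL on each
    more significant byte. Returns (new bytes, carry out of the top).
    """
    result = list(data)
    carry = 0                           # LSL shifts a zero into bit 0
    for index in range(len(result) - 1, -1, -1):
        shifted = (result[index] << 1) | carry
        carry = shifted >> 8            # bit pushed out the top of this byte
        result[index] = shifted & 0xFF
    return result, carry


# 16-bit: $A55A shifted left once
print([f"${b:02X}" for b in lsl_multibyte([0xA5, 0x5A])[0]])

# 32-bit: $87654321 shifted left four times
value = [0x87, 0x65, 0x43, 0x21]
for _ in range(4):
    value, _carry = lsl_multibyte(value)
print([f"${b:02X}" for b in value])
```

The carry between bytes is the only state that has to survive from one byte to the next, which is why the order of the instructions matters so much in the assembler version.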

For the 6800, assuming PSP in X and the bytes to test on the parameter stack:


* test shifts and multiplies for 6800 (EXORsim)
* using parameter stack,
* with test frame
* Copyright Joel Matthew Rees, December 2024
*
	EXP	rt_rig04_6800.asm
****************
* Program code:
*
*
INX8	INX
INX7	INX
INX6	INX
	INX
INX4	INX
	INX
	INX
	INX
	RTS
*
DEX8	DEX
	DEX
DEX6	DEX
	DEX
DEX4	DEX
	DEX
	DEX
	DEX
	RTS
*
* Unrolled 64-bit integer shift left 1 bit:
LSL64	LSL	7,X	; least significant byte (byte 7)
	ROL	6,X	; next more significant byte (byte 6)
	ROL	5,X	; next more significant byte (byte 5)
	ROL	4,X	; next more significant byte (byte 4)
	ROL	3,X	; next more significant byte (byte 3)
	ROL	2,X	; next more significant byte (byte 2)
	ROL	1,X	; next more significant byte (byte 1)
	ROL	0,X	; most significant byte (byte 0)
	RTS
*
* 64-bit integer shift left 1 bit in a loop:
LSL64LP	LDX	PSP	; but do not update!
	JSR	INX7	; point to last byte
	LDAA	#7	; bytes to ROL
	LSL	0,X	; least significant byte starts with LSL
SHL64L	DEX		; carry not affected
	ROL	0,X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
	RTS
* Ends with X pointing to most significant byte
*
PGSTRT	LDAB	#$5A	; a common bit pattern to watch move
	ROLB		; $B4	9-bit rotate left
	ROLB		; $68
	ROLB		; $D1
	ROLB		; $A2
	ROLB		; $45
	ROLB		; $8B
	ROLB		; $16
	ROLB		; $2D
	ROLB		; $5A
	NOP		; Pause for a look.
*
	ROLB
	ADCB	#0	; $B4	8-bit rotate left
	ROLB
	ADCB	#0	; $69
	ROLB
	ADCB	#0	; $D2
	ROLB
	ADCB	#0	; $A5
	TBA		; is another common bit pattern to watch move
	ROLB
	ADCB	#0	; $4B
	ROLB
	ADCB	#0	; $96
	ROLB
	ADCB	#0	; $2D
	ROLB
	ADCB	#0	; $5A
	NOP
*
	LSLB		; 1st: $A55A	16-bit shift left
	ROLA		; $4AB4
	LSLB		; 2nd
	ROLA		; $9568
	LSLB		; 3rd 
	ROLA		; $2AD0
	LSLB		; 4th 
	ROLA		; $55A0
	NOP
	LSLB		; 5th 
	ROLA		; $AB40
	LSLB		; 6th 
	ROLA		; $5680
	LSLB		; 7th 
	ROLA		; $AD00
	LSLB		; 8th 
	ROLA		; $5A00
	NOP
*
	LDX	PSP	; 32-bit shift left mixed stack/register
	DEX		; allocate two bytes
	DEX
	STX	PSP
	LDAA	#$87
	LDAB	#$65
	STAB	1,X
	STAA	0,X	; $8765 on stack
	LDAA	#$43
	LDAB	#$21	; $4321 in D
	LSLB		; least significant byte
	ROLA		; next more significant byte
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack
	LSLB		; 2nd time
	ROLA
	ROL	1,X
	ROL	0,X
	LSLB		; 3rd time
	ROLA
	ROL	1,X
	ROL	0,X
	LSLB		; 4th time
	ROLA
	ROL	1,X
	ROL	0,X	; result -- 7654:3210
	NOP
*
	LDAB	#$10	; set up test data
	STAB	1,X	; X still has PSP
	ADDB	#$22
	LDAA	#7
T64LP	STX	PSP	; allocate before store
	STAB	0,X
	ADDB	#$22
	DEX
	DECA
	BNE	T64LP
	NOP
* Check the contents of the parameter stack when done.
	LDX	PSP	; sync X and PSP
	JSR	LSL64	; unrolled loop
	JSR	LSL64LP	; loop
	JSR	LSL64
	JSR	LSL64LP
	NOP
* Should be shifted left one hexadecimal digit.
	JSR	LSL64
	JSR	LSL64LP
	JSR	LSL64
	JSR	LSL64LP
	NOP
* Should be shifted left another hexadecimal digit,
* which is a full byte!
* But that's a hard way to shift left 8 bits.
* Let's try an easier, quicker way:
	LDAA	#7
LS8BITL	LDAB	1,X
	STAB	0,X
	INX
	DECA
	BNE	LS8BITL
* Wasn't that fast?
	NOP
	LDX	PSP
	JSR	INX8	; drop them all
	STX	PSP	
	NOP
* Multiply 8 bits by 4
	LDAB	#65	; $41
	LSLB	; multiply by 2
	LSLB	; ignore carry and multiply by 2
	NOP
* Multiply 16 bits by 4
	LDAB	#65
	CLRA
	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again
	ROLA	; catch the carry again
	NOP
* Multiply 16 bits by 16
	LDAB	#65
	CLRA
	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again
	ROLA	; catch the carry again
	LSLB	; and again
	ROLA	; catch the carry again
	LSLB	; and again
	ROLA	; catch the carry again
	NOP
* Multiply 16 bits by 16, using loop
	LDAB	#65
	CLRA
	JSR	PPSHD	; PSP in X on return
	LDAB	#4
	JSR	PPSHD
	LDAA	2,X	; operand on parameter stack
	LDAB	3,X
MUL16WL	LSLB	; multiply by 2
	ROLA	; catch the carry
	DEC	1,X
	BNE	MUL16WL
	NOP
	INX		; drop count
	INX
	STX	PSP
	STAA	0,X	; save result
	STAB	1,X
	NOP		; re-use 2 bytes of allocation
* Other powers of 2: 2^7 == 128
	LDAB	#83	; $53 X 128
	CLRA		; for the high bits
	LSLB		; 1st
	ROLA
	LSLB		; 2nd
	ROLA
	LSLB		; 3rd
	ROLA
	LSLB		; 4th
	ROLA
	LSLB		; 5th
	ROLA
	LSLB		; 6th
	ROLA
	LSLB		; 7th
	ROLA
	STAA	0,X	; $2980 == 10624
	STAB	1,X
	NOP
* compare going the other direction,
* ends with high byte in B, low byte in A:
	LDAB	#83	; $53 X 128
	CLRA	; for result
	LSRB	; bit 0 to carry, B becomes high byte
	RORA	; bit 0 of B now in bit 7 of A
	NOP
* Saturation math:
	LDAB	#83	; $53 X 128
	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	ANDB	#$80	; chop off the lost, double-shifted high bits
	NOP
* Extend the saturation math with recovery (de-optimization):
	LDAB	#83	; $53 X 128 (9 bits => 7 to left is 2 to right)
	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ROLA		; bring the high bits back into position
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bits
	NOP
* and again, more efficiently, but not most efficiently
	LDAB	#83	; $53 X 128 (8 bits => 7 to left is 1 to right)
	TBA		; make two halves
	LSRA		; bit 0 to carry, high bits
	RORB		; now to bit 7 (bit 0 to carry)
	ANDB	#$80	; chop off the high bits
	NOP
* 2^6 == 64
	LDAB	#83	; $53 X 64
	CLRA		; for the high bits
	LSLB		; 1st
	ROLA
	LSLB		; 2nd
	ROLA
	LSLB		; 3rd
	ROLA
	LSLB		; 4th
	ROLA
	LSLB		; 5th
	ROLA
	LSLB		; 6th
	ROLA
	STAA	0,X
	STAB	1,X
	NOP
* compare going the other direction,
* ends with high byte in B, low byte in A:
	LDAB	#83	; $53 X 64
	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; old bit 0 of B now in bit 7 of A
	LSRB	; old bit 1 of B to carry
	RORA	; old bit 1,0 of B now in bit 7,6 of A
	NOP
* Saturation math:
	LDAB	#83	; $53 X 64
	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order
	ANDB	#$C0	; chop off the remainder
	NOP
* Extend the saturation math with recovery (de-optimization):
	LDAB	#83	; $53 X 64 (9 bits => 6 to left is 3 to right)
	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order
	TBA
	ROLA		; recover high bits including last carry
	ANDB	#$C0	; chop off the high bits
	ANDA	#$3F	; chop off the low bits
	NOP
* and again, copying first
	LDAB	#83	; $53 X 64 (8 bits => 6 to left is 2 to right)
	TBA		; make two halves
	LSRA		; bit 0 to carry, high bits
	RORB		; now to bit 7 (bit 0 to carry)
	LSRA		; bring the high bits into place (bit 1 to C)
	RORB		; now to bit 7,6 in order
	ANDB	#$C0	; chop off the high bits
	NOP
* 2^5 == 32
	LDAB	#83	; $53 X 32
	CLRA		; for the high bits
	LSLB		; 1st
	ROLA
	LSLB		; 2nd
	ROLA
	LSLB		; 3rd
	ROLA
	LSLB		; 4th
	ROLA
	LSLB		; 5th
	ROLA
	STAA	0,X
	STAB	1,X
	NOP
* compare going the other direction,
* ends with high byte in B, low byte in A:
	LDAB	#83	; $53 X 32
	CLRA		; for result
	LSRB	; bit 0 to carry
	RORA	; old bit 0 of B now in bit 7 of A
	LSRB	; old bit 1 of B to carry
	RORA	; old bit 1,0 of B now in bit 7,6 of A
	LSRB	; old bit 2 of B to carry
	RORA	; old bit 2,1,0 of B now in bit 7,6,5 of A
	NOP
* Saturation math:
	LDAB	#83	; $53 X 32
	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order, old bit 2 to carry
	RORB		; now to bit 7,6,5 in order
	ANDB	#$E0	; chop off the high bits
	NOP
* Extend the saturation math with recovery:
	LDAB	#83	; $53 X 32 (9 bits => 5 to left is 4 to right)
	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order, old bit 2 to carry
	RORB		; now to bit 7,6,5 in order
	TBA
	ROLA		; recover last carry into high bits
	ANDB	#$E0	; chop off the remainder
	ANDA	#$1F	; chop off the low bits
	NOP
* and again, copying first
	LDAB	#83	; $53 X 32 (8 bits => 5 to left is 3 to right)
	TBA		; make two halves
	LSRA		; bit 0 to carry, high bits
	RORB		; now to bit 7 (bit 0 to carry)
	LSRA		; bring the high bits into place (bit 1 to C)
	RORB		; now to bit 7,6 in order
	LSRA		; once more (old bit 2 to C)
	RORB		; now to bit 7,6,5 in order
	ANDB	#$E0	; chop off the high bits
	NOP
* balance the stack
	INX
	INX
	STX	PSP	; clear stack
	NOP
*
* From here down either hasn't been tested
* or doesn't function as I intended.
*
* shift left by 13 by right rotation => multiply by 8192, lose high bits
	LDAA	#$41	; $4153 == 16723
	LDAB	#$53
	LSRB		; bit 0 to carry
	RORA		; now to bit 15, bit 8 to carry
	RORB		; bit 8 to bit 7, old bit 1 to carry
	RORA		; old bit 1 to bit 15 in order
	RORB		; old bit 9 to bit 7, old bit 2 to carry
	RORA		; old bit 2 to bit 15 in order
	RORB		; old bit 10 to bit 7 (old bit 3 to carry)
	ANDA	#$E0	; chop off the top bytes, ignore carry
	CLRB
	NOP
* shift left by 13 => multiply by 8192, capture all bits:
	LDAA	#$41	; $4153 == 16723
	LDAB	#$53
	JSR	DEX4	; pre-allocate
	STX	PSP
	LSRB		; bit 0 to carry
	RORA		; now to bit 15, bit 8 to carry
	RORB		; bit 8 to bit 7, old bit 1 to carry
	RORA		; old bit 1 to bit 15 in order
	RORB		; old bit 9 to bit 7, old bit 2 to carry
	RORA		; old bit 2 to bit 15 in order
	RORB		; old bit 10 to bit 7 (old bit 3 to carry)
	CLR	3,X	; least significant, ignore carry
	CLR	2,X	; save a place for lower middle byte
	STAB	1,X	; save next more significant byte
	TAB		; copy to split out high and low
	ANDB	#$E0	; chop off the high bits
	STAB	2,X	; save the lower middle byte
	ANDA	#$1F	; chop off low bits
	STAA	0,X	; save high bits
	NOP
* Check the results before continuing.
	NOP
* shift left directly by 13 => multiply by 8192, capture all bits:
	DEX		; 2 placeholders, for middle upper byte	
	DEX		; and high byte
	STX	PSP
	LDAA	#$41	; $4153 == 16723
	LDAB	#$53
	LSLB
	ROLA
	ROL	1,X	; catch 1
	LSLB
	ROLA
	ROL	1,X	; catch 2
	LSLB
	ROLA
	ROL	1,X	; catch 3
	LSLB
	ROLA
	ROL	1,X	; catch 4
	LSLB
	ROLA
	ROL	1,X	; catch 5
	LSLB
	ROLA
	ROL	1,X	; catch 6
	LSLB
	ROLA
	ROL	1,X	; catch 7
	LSLB
	ROLA
	ROL	1,X	; catch 8
	DEX		; for high byte final resting place
	CLR	0,X	; not completely filled
	LSLB
	ROLA
	ROL	1,X	; catch 9
	ROL	0,X	; catch 1
	LSLB
	ROLA
	ROL	1,X	; catch 10
	ROL	0,X	; catch 2
	LSLB
	ROLA
	ROL	1,X	; catch 11
	ROL	0,X	; catch 3
	LSLB
	ROLA
	ROL	1,X	; catch 12
	ROL	0,X	; catch 4
	LSLB
	ROLA
	ROL	1,X	; catch 13
	ROL	0,X	; catch 5
	NOP
* stop to check
	NOP
	JSR	INX6	; drop all the above
	STX	PSP
	NOP
* 8-bit-wide rotation
* accumulator-wide ROL by 3 / ROR by 5 using the stack:
	LDX	PSP	; just to be sure, and remind ourselves
	DEX		; temp
	STX	PSP
	LDAA	#83	; $53
	STAA	0,X	; copy to stack
	LSL	0,X	; shift left by 3
	LSL	0,X
	LSL	0,X
	LSRA		; shift right by 5
	LSRA		; (more shifts, use the faster shift accumulator)
	LSRA
	LSRA
	LSRA
	ORAA	0,X	; put results together	
	NOP
* accumulator-wide ROL by 3 / ROR by 5 using ABA:
	LDAB	#83	; $53
	TBA	; copy
	LSLA	; shift left by 3
	LSLA
	LSLA
	LSRB	; shift right by 5
	LSRB
	LSRB
	LSRB
	LSRB
	ABA	; put the results together
	NOP
* accumulator-wide ROL by 3 / ROR by 5 using ADC #0 trick:
	LDAB	#83	; $53
	LSLB
	ADCB	#0
	LSLB
	ADCB	#0
	LSLB
	ADCB	#0
	NOP
* ugly accumulator-wide ROR by 5 / ROL by 3 using branch and set:
	LDAB	#83	; $53
	LSRB		; clears bit 7
	BCC	RR8BN1
	ORAB	#$80	; set it for the carry
RR8BN1	LSRB
	BCC	RR8BN2
	ORAB	#$80
RR8BN2	LSRB
	BCC	RR8BN3
	ORAB	#$80
RR8BN3	LSRB
	BCC	RR8BN4
	ORAB	#$80
RR8BN4	LSRB
	BCC	RR8BN5
	ORAB	#$80
RR8BN5	NOP		; next instruction
* Not as ugly accumulator-wide ROR by 5 / ROL by 3,
* but uses both accumulators to avoid branches:
	LDAB	#83	; $53
	TBA
	LSRA	; get lowest bit in carry first
	RORB
	LSRA	; get 2nd bit in carry first
	RORB
	LSRA	; get 3rd bit in carry first
	RORB
	LSRA	; get 4th bit in carry first
	RORB
	LSRA	; get 5th bit in carry first
	RORB
	NOP
* Compare result before dropping
	INX		; drop temp
	STX	PSP
	NOP
* 16-bit integer rotate left 3 / right 13  on 6800:
	LDAA	#$41	; $4153 == 16723
	LDAB	#$53
	LSLB		; clear bottom bit on shifting left
	ROLA
	ADCB	#0	; push the top carry in (16-bit rotation)	
	LSLB
	ROLA
	ADCB	#0	; push the top carry in (16-bit rotation)	
	LSLB
	ROLA
	ADCB	#0	; push the top carry in (16-bit rotation)
	NOP
* 16-bit integer rotate right 3 / left 13  on 6800:
	DEX		; temp to grab bit with
	STX	PSP
	STAB	0,X	; copy
	LSR	0,X	; get bottom bit
	RORA		; rotate it into top byte
	RORB		; 1 bit complete
	LSR	0,X	; get next bottom bit
	RORA
	RORB		; 2nd bit complete
	LSR	0,X	; get next bottom bit
	RORA
	RORB		; 3rd bit complete
	NOP		; Should be back to $4153
	INX
	STX	PSP
*
	RTS
*
	END	ENTRY
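
If you don't have a 6800 simulator handy, a rough Python model can at least confirm what the LSLB / ROLA / ADCB #0 sequence in the listing above ought to leave in A:B. This is only a sketch for checking the arithmetic; the function name is mine and is not part of the listing.

```python
def rol16_by_shift_rotate_adc(a, b, count):
    """Rotate the 16-bit value A:B left by `count` bits, one bit at a time."""
    for _ in range(count):
        carry = (b >> 7) & 1            # LSLB: top bit of B goes to carry
        b = (b << 1) & 0xFF             #       and bit 0 of B becomes 0
        new_carry = (a >> 7) & 1        # ROLA: top bit of A goes to carry
        a = ((a << 1) | carry) & 0xFF   #       old carry comes in at bit 0
        b = (b + new_carry) & 0xFF      # ADCB #0: bit 0 of B was 0, so this sets it
    return a, b


a, b = rol16_by_shift_rotate_adc(0x41, 0x53, 3)
print(f"${a:02X}{b:02X}")
```

Rotating $4153 left by 3 in this model gives $0A9A, which is the value to look for in A:B at the NOP after the third ADCB.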

Now let's look at multiplying by some small constants that aren't powers of two.

(Title Page/Index)

Saturday, January 4, 2025

ALPP 03-XX (18) -- Multiplying by Small Constants (Shift Left and Add)

I'm putting this chapter here for reference. I discovered I was heading too deep, too early, but I want to keep this chapter handy. I haven't checked the code or the explanations carefully, so be careful if you read or try to use what I have here.

Multiplying by Small Constants
(Shift Left and Add)

We've worked out multiplying by some constant powers of 2, hopefully enough to have some confidence in using bit shifts for multiplying.

(Note that we are not using Trachtenberg methods.)

So, I said that multiplying by small constants that are not powers of two would still be easy. Considering I said that multiplying by powers of two would be easy, you may be doubting me.

Fair enough.

But let's look at multiplying by constant 3.

We can look at it several different ways:

3X == X + X + X

Multiplying something by three is adding it to itself three times. That's what the algebra says, and when I say it in somewhat ordinary English, it makes sense. Sort of. But if we are very careful about how we say this, we have to say adding the number to zero three times.

Adding it to itself three times could actually mean something else, something we just talked about in the last chapter, and are still talking about here.

But multiplying something by two -- doubling -- is adding it to zero twice, or adding it to itself (once). Anyway,

3X == 2X + X

Multiplying something by three is multiplying by two and then adding it again -- or left-shifting in binary once and then adding it to the product again.

For a byte on the 6809, catching carries:


    	CLRA	; for carries
	LDB	,U	; get the byte
	LSLB		; 2X
	ROLA		; grab any high bit
	ADDB	,U	; 2X+X==3X
	ADCA	#0	; get any carry
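
For anyone who wants to check the carry handling without an assembler, here is a small Python model of the six instructions above. It is a sketch only; the function name is hypothetical, and each line is commented with the instruction I believe it corresponds to.

```python
def times3_byte(x):
    """Model of the 6809 byte-times-3 sequence; returns the 16-bit result A:B."""
    a = 0                               # CLRA
    b = x & 0xFF                        # LDB ,U
    carry = b >> 7                      # LSLB: top bit out to carry
    b = (b << 1) & 0xFF                 #       B is now the low byte of 2X
    a = ((a << 1) | carry) & 0xFF       # ROLA: grab the high bit
    total = b + (x & 0xFF)              # ADDB ,U: 2X + X
    b = total & 0xFF
    a = (a + (total >> 8)) & 0xFF       # ADCA #0: pick up the carry from the add
    return (a << 8) | b


print(hex(times3_byte(0xFF)))
```

The worst case, 3 × 255 = 765 = $02FD, shows both carries being caught: one from the shift and one from the add.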

Let's do that to a 16-bit integer, as a subroutine. And capture the carries.


    MUL3	LDD	,U
	CLR	,-U	; for carries
	LSLB		; 2X
	ROLA
	ROL	,U	; get carries
	ADDD	1,U	; + X makes 3X
	BCC	MUL3NC
	INC	,U	; get carries
MUL3NC	STD	1,U
	RTS

If we weren't capturing carries and setting it up as a subroutine, it would be really short.

Speaking of capturing carries, you will note that this subroutine accepts a 16-bit integer as input, but returns a 24-bit integer with the carries in the added most significant byte.

Let's not tell anyone, but, rather than using shifts, we might have simply loaded and then added twice, also:


    MUL3ADD	LDD	,U	; 1X
	CLR	,-U	; push a zero for carries
	ADDD	1,U	; 2X
	ROL	,U	; get first carry
	ADDD	1,U	; 3X
	BCC	MUL3ANC
	INC	,U	; get 2nd carry
MUL3ANC	STD	1,U
	RTS

The first load is the same as adding it to zero, so that's really just multiplying by 3 the hard way, adding it up three times.

Adding a number to itself is another way to shift it left by one bit, by the way.

How about another way that works on the 6809 and 6801? Both have the MUL A × B instruction that leaves the product in D:


    MUL3MUL	LDD	,U
	PSHU	A	; save high byte
	LDA	#3
	MUL		; multiply low byte in B
	STD	1,U	; save 16-bit result
	LDB	,U	; get high byte back
	CLR	,U	; for result addition
	LDA	#3
	MUL		; multiply high byte in B
	ADDD	,U	; add in the high byte of the low product
	STD	,U
	RTS

MUL takes 10 cycles on the 6801 and 11 on the 6809, so it would take a little longer than the shift method, but not much, and with a similar byte count.

But this routine using MUL could be generalized into an 8-bit by 16-bit routine that could be fairly quick for both small variables and small constants:


    * 16 by 8 to 24-bit multiply
* 16-bit multiplicand in 2nd and 3rd bytes,
* 8-bit multiplier in 8 bits on top:
MUL16X8	LDB	2,U	; low byte
	LDA	,U	; multiplier
	MUL
	STB	2,U	; store result low byte
	PSHU	A	; save result middle byte temp
	LDD	1,U	; multiplier in A, high byte in B
	MUL
	ADDB	,U+	; add result middle byte, pop it
	ADCA	#0	; add the carry into the high byte
	STD	,U	; save middle and high bytes of result
	RTS
* 
MUL3CAL	LDB	#3
MUL3ROB	PSHU	B
	BRA	MUL16X8	; rob the RTS
MUL5CAL	LDB	#5
	BRA	MUL3ROB	; rob the PSHU
MUL10CAL	LDB	#10
	BRA	MUL3ROB	; rob the PSHU

Cool enough?
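
If you want to convince yourself that the two partial products in MUL16X8 go together correctly, this Python sketch follows the same steps. The function name and the variable names are mine, chosen for illustration; the comments name the 6809 instruction each step is meant to model.

```python
def mul16x8(multiplicand, multiplier):
    """Model of the MUL16X8 steps: 16-bit by 8-bit giving a 24-bit product."""
    low = multiplicand & 0xFF
    high = (multiplicand >> 8) & 0xFF
    p_low = multiplier * low                    # first MUL: multiplier x low byte
    result_low = p_low & 0xFF                   # STB 2,U
    middle_temp = p_low >> 8                    # PSHU A
    p_high = multiplier * high                  # second MUL: multiplier x high byte
    b = (p_high & 0xFF) + middle_temp           # ADDB ,U+
    a = ((p_high >> 8) + (b >> 8)) & 0xFF       # ADCA #0
    b &= 0xFF
    return (a << 16) | (b << 8) | result_low    # STD ,U plus the stored low byte


print(mul16x8(16723, 10))
```

The only subtle part is the carry out of the middle byte, which is what the ADCA #0 is there to catch.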

How about multiplying by 5 via shift-and-add, just to compare? Five is four plus one:


    MUL5	LDD	,U
	CLR	,-U	; for carries
	LSLB		; 2X
	ROLA
	ROL	,U	; carries
	LSLB		; 2X again makes 4X
	ROLA
	ROL	,U	; carries
	ADDD	1,U	; + X makes 5X
	BCC	MUL5NC
	INC	,U	; get carries
MUL5NC	STD	1,U
	RTS

Similar length to the MUL method, but just a little faster.

How about 10?

Ten is five times 2.

Shifting left and then calling MUL5 would lose us a carry. But we can call MUL5 and then shift once more left. Or use the shift-and-add MUL5 code and hang another shift on the end:


    * Multiply a 16-bit integer by 10,
* return as a 24 bit integer, to keep the carries.
* For the 6809
MUL10	LDD	,U
	CLR	,-U	; push a zero for overflow
	LSLB		; 2X
	ROLA
	ROL	,U	; catch the overflow
	LSLB		; 2X again makes 4X
	ROLA
	ROL	,U	; catch the overflow
	ADDD	1,U	; + X makes 5X
	BCC	MUL10NC
	INC	,U	; carry in the addition
MUL10NC		LSLB		; 2X again makes 10X
	ROLA
	ROL	,U	; catch the overflow
	STD	1,U
	RTS
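
Here is a Python model of the same shift, shift, add, shift sequence, in case you want to check that no carries get dropped. It is a sketch with hypothetical names, and it checks every 16-bit input against ordinary multiplication.

```python
def mul10_shift_add(x):
    """Model of MUL10: 16-bit input, 24-bit result, via 2X, 4X, 5X, 10X."""
    x &= 0xFFFF
    overflow = 0                                # CLR ,-U

    def shift_left(d, overflow):
        carry = d >> 15                         # LSLB / ROLA: bit 15 falls off the top
        return (d << 1) & 0xFFFF, ((overflow << 1) | carry) & 0xFF  # ROL ,U

    d, overflow = shift_left(x, overflow)       # 2X
    d, overflow = shift_left(d, overflow)       # 4X
    total = d + x                               # ADDD 1,U: 4X + X
    d = total & 0xFFFF
    overflow += total >> 16                     # BCC / INC ,U
    d, overflow = shift_left(d, overflow)       # 10X
    return (overflow << 16) | d


print(mul10_shift_add(65535))
```

The largest result, 655350, is $09FFF6, so the extra byte on top is enough to hold everything.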

But if we have the MUL16X8 routine, it's probably going to be about as fast to call that.

That's the 6809. How about actual code on the other processors?

The conversion is straightforward. I recommend trying it.

For the 6801, you will be loading X from PSP and updating PSP appropriately. Also, you will be able to replace LSLB ; ROLA pairs with LSLD.

For the 6800, in addition to loading and updating PSP, you will need to split the double accumulator operators into individually working on A and B.

As a rough guess, given how fast the MUL is on 6801 and 6809, it would be reasonable to just use MUL and ADD on both. Shift-and-add will be useful on the 6800, however.

For the 68000, don't try to be too literal. Give 32-bit results instead of 24.

Oh, okay, let's look at code for the 68000:


    MUL3_16:	; guessing it'll be about 48 processor cycles
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	LSL.L	#1,D7
	ADD.L	(A6),D7	; 2X + X
	MOVE.L	D7,(A6)	; return in 32 bits
	RTS

MUL3_16_ADD:
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	ADD.L	(A6),D7		; Or maybe do it in-register?
	ADD.L	(A6),D7
	MOVE.L	D7,(A6)
	RTS

* No 8-bit MUL in 68000
* MULU takes 38 processor clocks plus 2 for each 1 bit in the source operand.
MUL3_16_MUL:	; MULU by #3 takes about 46 processor cycles, about 12 memory clocks.
	MOVE.W	(A6)+,D7	; pop it
	MULU.W	#3,D7	; ignore source high, 32-bit result in D7
	MOVE.L	D7,-(A6)
	RTS

* 16 by 16 to 32 bit multiply
* 16-bit multiplicand in 3rd and 4th bytes,
* 16-bit multiplier in 16 bits on top:
MUL16X16:
	MOVE.W	2(A6),D7
	MULU.W	(A6),D7	; ignore source high, 32-bit result in D7
	MOVE.L	D7,(A6)
	RTS
* 
MUL3CAL_16:
	MOVE.W	#3,-(A6)
	BRA.S	MUL16X16	; rob the code
MUL5CAL_16:
	MOVE.W	#5,-(A6)
	BRA.S	MUL16X16	; rob the code
MUL10CAL_16:
	MOVE.W	#10,-(A6)
	BRA.S	MUL16X16	; rob the code

MUL5_16:
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	LSL.L	#2,D7	; 4X
	ADD.L	(A6),D7	; 4X+X
	MOVE.L	D7,(A6)	; return in 32 bits
	RTS

MUL10_16:
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	LSL.L	#2,D7	; 4X
	ADD.L	(A6),D7	; 4X+X
	LSL.L	#1,D7	; 2(4X+X)
	MOVE.L	D7,(A6)	; return in 32 bits
	RTS

And 16 bit integers ought to be big enough for anyone, right? I mean, who would ever do anything with an integer bigger than 30,000?

(Which is part of where the old "Who needs more than 64K of memory?" excuse for corner-cutting in the early-to-mid 1980s came from.)

However, we often do want to do this with 32-bit operands, and may even want to catch the carries from that. So let's look at giving 32-bit results with carries in 40, I mean, 48 bits. (We shouldn't want to do math in odd numbers of bytes on the 68000.):

Draw your own conclusions about optimization.

Recap

And now we can do part of what is necessary for decimal input and output. That was quick.

Dividing from the top.

multiply byte by ten, remainder from divide by 100?

***********

Tuesday, December 24, 2024

ALPP 03-XX (16) -- Multiplying by Powers of Two (Shift Left)

I'm leaving this here for reference for a little while. Eventually, I plan to do a chapter on shifts, and most of this will be taken up there.

Multiplying by Powers of Two
(Shift Left)

Now that we've seen a little motivation and covered a little theory about multiplying by constants, it's time to look at multiplying by powers of two.

(But this chapter will also lean a bit more to theory than practice, even though there is practice. It also gets a little long. Please bear with me.)

Here's how to multiply a byte in accumulator B by 2 on the 6800, 6801, or 6809:


    	LSLB

And, of course, there's LSLA. And LSL n,X (indexed mode) or LSL <address> (extended mode) allows you to quickly multiply any byte in memory by two. It's faster in an accumulator, but direct shifts on memory avoid saving and restoring whatever is in the accumulators.

Unfortunately, we can't use the abbreviated direct page addressing on the 6800/6801 to save a byte and a fetch cycle, but since the direct page is the lowest 256 bytes of memory, we can still shift operands there in extended mode.

(Incidentally, the DP mode on the 6809 only saves one cycle over extended mode for these shifts, darn it. It saves a byte and is as fast as zero-offset indexed mode, anyway.)

(As another aside, in the 6805 microcontroller, which is sort of a half a 6800, direct page mode is provided for the read-modify-write instructions -- including shifts -- instead of extended mode. And you get the speedup. This was definitely a good trade-off for the 6805.)

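Why logical shift left instead of arithmetic shift left? Actually, on Motorola's 68XX series microprocessors, ASL is a synonym of LSL; both names assemble to the same op-code. On the 680XX, ASL and LSL are separate instructions, but they leave the same result bits and differ only in how they set the overflow flag. Motorola and other companies didn't see a good reason to make a distinction in the result, and neither do we. We'll talk about that more when we get to division by constants.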

On the 68000, we can multiply a byte in D0 by two with


    	LSL.B	#1,D0

I'll save the full addressing modes discussion for later, but the 68000 allows logical shifts of any width on all data registers D0-D7.

However, byte-width shifts on operands in memory are not provided.

So how about a 16-bit integer?

On the 6800 and 6809, for two bytes together in the accumulators, with the high byte in A, it's


    	LSLB	; less significant byte in B
	ROLA	; more significant byte in A

This works on the 6801, too, of course. But on the 6801, we also have


    	LSLD	; high in A, low in B

There is no LSLD on the 6809.

On the 6800, it's only a matter of convention to put the more significant byte in A, and often the convention has been reversed in existing code. On the 6801 and 6809, you can still change or reverse the convention, but the ability to treat the accumulator pair, A:B, as the single double accumulator D means you usually want to follow the double-accumulator convention.

Now you can combine those with bytes in memory if you need to for some reason. For instance, if you need to keep a counter in A, you can keep the more significant byte on the stack and reference it by indexed mode, as below, assuming the parameter stack pointer PSP is in X:


    	LSLB	; less significant byte
	ROL	0,X	; more significant byte on stack (X has PSP)

On the 68000, we can multiply sixteen bits in D0 by two with


    	LSL.W	#1,D0

But I already said that, didn't I?

However (Surprise?), the 68000 does provide single-bit shifts on 16-bit-wide (word only) operands in memory, accessed via (most) normal indexing or absolute modes.


    	LSL.W	(A6)	; shift top 16-bit word of parameter stack 1 bit left.

This was perhaps because the 68000 instruction set was originally designed for 16-bit wide memory designs. (I think they thought it was an optimization, and it probably was at the time.) Again, we'll look at this more later.

Four-byte wide shifts? Just in case it's not clear, I'll show you one way you might want to do it on the 6800, again assuming PSP in X and the more significant bytes on top of stack:


    	LSLB	; least significant byte
	ROLA	; next less significant byte
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack

On the 6801, when we use the accumulators for 32-bit shifts, we probably want to use the double shift where we can, on the first shift:


    	LSLD	; least significant 16 bits
	ROL	1,X	; next more significant byte on stack
	ROL	0,X	; most significant byte on stack

This allows us to grab the carry from the double shift and pass it into the following rotate instructions, which only exist in byte form on the 6801.

On the 6809, assuming we are using U for the parameter stack, and that the more significant bytes are the topmost on the stack, it would be


    	LSLB	; least significant byte
	ROLA	; next less significant byte
	ROL	1,U	; next more significant byte on stack
	ROL	,U	; most significant byte on stack

And on the 68000, as I have already said, we can multiply 32 bits in D0 by two with


    	LSL.L	#1,D0

32-bit CPUs are nice, aren't they?

What if you need to shift a 32-bit wide integer on top of stack without bringing it into a register for some reason? Since there is no 32-bit version of LSL for operands in memory, you'll need to do this:


    	LSL.W	2(A6)	; shift less-significant 16-bit word 1 bit left, carry to C and X.
	ROXL.W	(A6)	; rotate X carry into more-significant 16-bit word, with 1-bit shift.

For the times you need it, it can be done.

And, incidentally, we see that the 68000 has something called an eXtended carry for extending 1-bit carries. The C carry is for branching, and the X carry is for extending. It's a bit weird, but useful.

What if you need to shift 64 bits left?

On the 8-bit CPUs, you'll start with LSL on the least significant byte, then perform 7 more ROLs, proceeding in order from less to most significant. Because at least 6 of the shifts will be in memory, it'll take at least 12 bytes of ROL instructions plus the one (6801) or two (6800, 6809) for the accumulator bytes.

I think I'd better show this doing it all in memory, and you can consider whether it would be worth bringing the least significant bytes into the accumulators:


    	LSL	7,X	; least significant byte (byte 7)
	ROL	6,X	; next less significant byte (byte 6)
	ROL	5,X	; next more significant byte (byte 5)
	ROL	4,X	; next more significant byte (byte 4)
	ROL	3,X	; next more significant byte (byte 3)
	ROL	2,X	; next more significant byte (byte 2)
	ROL	1,X	; next more significant byte (byte 1)
	ROL	0,X	; most significant byte (byte 0)

On the 6801, if you already have the least significant two bytes in D, you could start it with a LSLD and save maybe three bytes. Or you could do LDD 6,X; LSLD and save one byte, maybe.

But if you do it without moving the least significant bytes into the accumulators, you can do the whole 8 bytes without touching A or B.

Let's look at trying to save bytes by making a loop. We'll assume X is already pointing to the least significant byte.

But if this isn't the case, we'll have to adjust X, and that will likely cost at least 8 bytes of code on a 6800. (A series of 8 INX instructions is less code than saving X in a DP variable, adding 8 to it using one or both accumulators, and then loading it back to X.) Or it will take at least 3 bytes of code on a 6801 (where it can be done with a LDAB #8 and an ABX.)

Here's the loop, for either 6800 or 6801:


    * Assume X pointing to least significant byte (not likely)
	LDAA	#7	; bytes to ROL
	LSL	0,X	; least significant byte starts with LSL
SHL64L	DEX		; carry not affected
	ROL	0,X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
* Ends with X pointing to most significant byte

Is it worth saving two bytes, max, to use the loop? Or costing one to seven extra bytes for the loop in code to pre-adjust X?

(Well, on the 6801, you may actually be able to shave a couple of INX instructions by shifting the least significant bytes in D. I'll let you calculate that out.)

The unrolled version is most likely better, although, if you need to do this kind of shift on a variable number of bytes, you have now seen a bit of code to start from.
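
Whether you unroll it or loop it, the carry chain is the same. A quick Python sketch of it (my own naming, with the number held as eight bytes, most significant first, the way it sits on the stack above) looks like this:

```python
def shl64_bytes(data):
    """Shift an 8-byte big-endian number left by one bit, in place.

    Returns the bit that falls off the top (the carry after the last ROL).
    """
    carry = 0                           # the opening LSL shifts in a 0
    for i in range(7, -1, -1):          # byte 7 is least significant
        shifted = (data[i] << 1) | carry
        data[i] = shifted & 0xFF
        carry = shifted >> 8            # handed on to the next ROL
    return carry


value = 0x8123456789ABCDEF
buffer = bytearray(value.to_bytes(8, "big"))
top_bit = shl64_bytes(buffer)
print(buffer.hex(), top_bit)
```

Changing the range lets the same loop handle a variable number of bytes, which is the case where the assembly loop earns its keep.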

The loop form can improve on the 6809, because of the index math the 6809 provides. But there are other approaches, and you may need to choose between code size and speed for things like this on the 6809.

Hmm. I think I'd better show the loop and index adjustment on the 6809.

[JMR202412311146 addenda part 1:]

But first I want to show you another way, that uses accumulator offset:

[JMR202412311146 addenda part 1 end.]


    * The 64 bit number to shift is on the parameter stack (U)
	LDA	#6	; bytes to ROL minus 1
	LSL	7,U	; least significant byte
SHL64L	ROL	A,U	; next more significant byte
	DECA		; count down, carry not affected
	BPL	SHL64L	; do next (want to do 0, too)

Nine bytes for the loop, including the pointer math. There is another approach that uses auto-decrement and LEA, that should be similar in cycle and byte count, but this might be worth the three bytes saved in a really tight ROM or something.

[JMR202412311146 addenda part 2:]

This is one way the auto-decrement method could be mapped into the 6809:


    * The 64 bit number to shift is on the parameter stack (U)
	LDA	#7	; bytes to ROL
	LEAX	A,U
	LSL	,X	; least significant byte starts with LSL
SHL64L	ROL	,-X	; next more significant byte
	DECA		; count down, carry not affected
	BNE	SHL64L	; do next
* Ends with X pointing to most significant byte

[JMR202412311146 addenda part 2 end.]

For shifting 64 bits left by 1 on the 68000, it's nicely straightforward:


    	LSL.L	#1,D1	; less significant long word, carry to C and X
	ROXL.L	#1,D0	; rotate X carry into more-significant long word

I mentioned the eXtend carry above, and we can use it here, too. Doing it in memory only would use the 16-bit shift and three of the 16-bit rotates.
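
As a sanity check on how the X carry links the two registers, here is a Python sketch of the LSL.L / ROXL.L pair. The names are hypothetical, and D0 and D1 are modeled as plain 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def shl64_two_registers(d0, d1):
    """Model of LSL.L #1,D1 followed by ROXL.L #1,D0 (D0 is the high half)."""
    x = (d1 >> 31) & 1                  # LSL.L: top bit of D1 goes to C and X
    d1 = (d1 << 1) & MASK32
    new_x = (d0 >> 31) & 1              # ROXL.L: top bit of D0 goes out to X
    d0 = ((d0 << 1) | x) & MASK32       #         old X comes in at bit 0
    return d0, d1, new_x


d0, d1, x = shl64_two_registers(0x00000001, 0x80000000)
print(f"{d0:08X} {d1:08X} X={x}")
```

With the top bit of the low half set, that bit shows up as bit 0 of the high half, which is all the X carry is for here.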

How about multiplying 8 bits by 4?

For the 6800, 6801, and 6809:


    	LSLB	; multiply by 2
	LSLB	; and again

The second time you multiply by 2, don't use ROL. That would be feeding the top bit back into the bottom, which would be a different operation. Each time you multiply by 2, you have to start with LSL, not a ROL. ROL is for catching the bits between bytes.
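
To see the difference concretely, here is a tiny Python sketch (hypothetical helper names) comparing two LSLs against an LSL followed by a ROL on the same byte:

```python
def lsl8(value):
    """Logical shift left: returns (result, carry out)."""
    return (value << 1) & 0xFF, value >> 7


def rol8_through_carry(value, carry):
    """Rotate left through carry: returns (result, carry out)."""
    return ((value << 1) | carry) & 0xFF, value >> 7


def times4_correct(b):
    b, _ = lsl8(b)
    b, _ = lsl8(b)
    return b


def times4_wrong(b):
    b, carry = lsl8(b)
    b, _ = rol8_through_carry(b, carry)   # feeds the lost top bit back in
    return b


print(hex(times4_correct(0xC1)), hex(times4_wrong(0xC1)))
```

For $C1 the correct sequence gives $04, while the LSL / ROL sequence gives $05, because the bit shifted out the top comes back in at the bottom.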

For the 68000, oh, fun!:


    	LSL.B	#2,D0	; times 2^2

One instruction and done! WHEEEEE!!!

For multiplying by 2ⁿ, shift n times. Just be aware that the more you shift without catching bits off the top, the more you lose bits.

What about multiplying 16 bits by 4? On the 8-bit processors, you have to repeat the entire chain of shifts to avoid losing bits between the bytes. On the 6800 and 6809, this is what you usually want:


    	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again
	ROLA	; catch the carry again

On the 6801,


    	LSLD	; multiply by 2
	LSLD	; and again

I hope this is enough that you can see how to multiply 32-bit integers by the constant 4 using shifts on the 8-bit processors. I could show them anyway as an excuse to talk about optimization, but we need to move on.

On the 68000, you can specify the number of bits to shift, up to 8, as the immediate argument to the shift, for all widths, byte, 16-bit, and full 32-bit. That means you can multiply a full 32-bit integer in a register by any power of 2, up to 2⁸, in a single instruction. The only cost is that each bit costs an additional couple of CPU internal cycles, but compared to the cost of fetching and decoding an additional instruction, it's not a great cost.

Just remember that if you shift a byte 8 bits, every bit in the byte gets shifted out. Thus


    	LSL.B	#8,D0	; times 2^8 -- WHEEEEE!?!?!???

is an expensive way to clear the lowest byte of D0.

Of course,


    	LSL.W	#8,D0	; 16 bits times 2^8

is one way to get the lowest byte in D0 up to the next higher byte position (bits 15 to 8). (That's the difference between .B and .W here.)
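
Here is a Python sketch of what the size suffix does, under the assumption that byte and word operations on a data register leave the upper bits of the register alone. The function name and the `width` parameter are mine.

```python
def lsl_sized(register, count, width):
    """Shift only the low `width` bits of a 32-bit register left by `count`."""
    mask = (1 << width) - 1
    field = ((register & mask) << count) & mask     # bits shifted past the width are lost
    return (register & ~mask & 0xFFFFFFFF) | field  # upper part of the register untouched


d0 = 0x12345678
print(f"{lsl_sized(d0, 8, 8):08X}")    # LSL.B #8,D0
print(f"{lsl_sized(d0, 8, 16):08X}")   # LSL.W #8,D0
print(f"{lsl_sized(d0, 8, 32):08X}")   # LSL.L #8,D0
```

The .B case just zeroes the low byte, the .W case moves $78 up into bits 15 to 8, and the .L case moves the whole register up by a byte.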

How about multiplying by 16? That's 2⁴, so it's 4 shifts.

As we've already noted, on the 68000, it's just one instruction. LSL.B or LSL.W or LSL.L and the operands are #4,Dn to shift Dn by 4 bits left.

On the 8-bit processors, if it's a byte value, that's 4 shifts on the same operand in a row, which is not too bad. 4 instructions, 4 bytes. One more byte than a JSR in extended mode.


    	LSLB	; multiply by 2
	LSLB	; again by 2 is by 4
	LSLB	; and again by 2 is by 8
	LSLB	; and again by 2 is by 16

If it's a 16-bit value, for the 6800 and 6809, it's 4 pairs of LSLB; ROLA in a row, 8 instructions in 8 bytes. That might be worth the performance hit of making it into a loop, depending on how tight memory is:


    	LSLB	; multiply by 2
	ROLA	; catch the carry
	LSLB	; and again, to make it by 4
	ROLA	; catch the carry again
	LSLB	; and again, to make it by 8
	ROLA	; catch the carry
	LSLB	; and again, to make it by 16
	ROLA	; catch the carry again

vs., say, on the 6809 where it's easiest to construct the loop:


    MUL16W	LDB	#4
	PSHU	B
	LDD	1,U	; operand on parameter stack
MUL16WL	LSLB	; multiply by 2
	ROLA	; catch the carry
	DEC	,U
	BNE	MUL16WL
	LEAU	1,U
	STD	,U

But, as you can see, for the cost of setting up the loop, you might as well just do it.

On the other hand, that code easily turns into a general loop for shifting left n bits, so don't forget it.

For the 6801, it's again,


    	LSLD	; multiply by 2
	LSLD	; and again, making it by 4
	LSLD	; and again, making it by 8
	LSLD	; and again, making it by 16

On the 6801, it won't be worth making a loop for 16-bit integers until we try to multiply by constants greater than 2⁸, which are pretty rare, really. And ...

Speaking of multiplying by 2⁸, if we are following the A:B is D convention,


    	TBA	; multiply by 256
	CLRB	; don't forget to zero out low byte.

does the trick for shifting left by 8 bits. And you may not even need to explicitly do this -- it may be good enough to just move it in memory.

While we're thinking about this, let's look at another way to multiply by 256 on the 68000 (or to get the lowest byte of Dn into bits 15 to 8):


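    	CLR.B	-(A6)		; zero for the low byte -- this does NOT work on A7!
	MOVE.B	D0,-(A6)	; pushed last, so it lands at the lower address: the high byte
	MOVE.W	(A6)+,D0	; pop both bytes as one 16-bit word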

Usually, you would not want to do it this way, because you're having to read and write memory, but there are times it can be useful.

The reason it doesn't work on A7? A7 absolutely must stay 16-bit aligned on the 68000, because of return addresses. So when you push a byte to A7 -- any MOVE.B Dn,-(A7) -- the 68000 magically double decrements A7 for you. And when you pop a byte from A7 -- any MOVE.B (A7)+,Dn -- it automatically double increments A7 for you.

(Don't get frustrated about this. As I have hinted before, you really don't want to store strings on the A7 stack.)

Multiplying by 32, 64, and 128 are also not nearly as common as by 2, 4, 8, and 16. It may be a better use of resources to just let them go to a general multiply routine (which, as I say, I will show you shortly).

Except, we do actually want to look at multiplying a byte by 2⁵, and, by implication, 2⁶, and 2⁷, because I can demonstrate now a little about how multiplying is dividing.

Multiply B by 128, leaving result in A:


    	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; bit 0 of B now in bit 7 of A

Note that A now has the less significant byte, and B has the more significant byte.

The following also works, leaving A untouched, but it takes an extra byte of code and loses every bit more significant than bit 0:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	ANDB	#$80	; chop off the lost, double-shifted high bits

[JMR202401010106 addendum:]

The above could be useful in saturation math, or if we know the input will only be 0 or 1.

[JMR202501071039 edit:]

~~If we want to keep all the bits that we might be losing, we would have to store the intermediate result away first, probably using A, after all:~~

If we want to start with saturation math, but keep all the bits that we would otherwise be losing, we might try to store the intermediate result away like this:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bit (but ...)

But the high bits were already shifted too many times, so we have to un-shift them. Fortunately, the carry has not been altered by the TBA or AND instructions, so we can rotate the carry back in at several points. Taking the earliest point, we could do it like this:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7
	TBA
	ROLA		; bring the high bits back into position
	ANDB	#$80	; chop off the high bits
	ANDA	#$7F	; chop off the low bit

But if we back up and take the copy first,


    	TBA		; make two halves
	LSRA		; bit 0 to carry, high bits
	RORB		; now to bit 7
	ANDB	#$80	; chop off the high bits

Does it seem like we did that above?

Back up and look at the code and see what's different and think about why the effect is the same -- when done correctly.

[JMR202501071039 edit end.]

[JMR202401010106 addendum end.]

It could also be interpreted as a divide, which we will talk about later.

Either way can be extended to multiply by 64 as in the following, leaving the result in A:


    	CLRA	; for result
	LSRB	; bit 0 to carry
	RORA	; old bit 0 of B now in bit 7 of A
	LSRB	; old bit 1 of B to carry
	RORA	; old bit 1,0 of B now in bit 7,6 of A

Not touching A to multiply by 64:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order
	ANDB	#$C0	; chop off the remainder

And we can see that using A in byte inversion to multiply by 32 costs more bytes and cycles than not touching A to multiply by 32:


    	LSRB		; bit 0 to carry
	RORB		; now to bit 7, old bit 1 to carry
	RORB		; now to bit 7,6 in order, old bit 2 to carry
	RORB		; now to bit 7,6,5 in order
	ANDB	#$E0	; chop off the remainder
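
If rotating right to multiply still feels backwards, this Python sketch of the LSRB / RORB / RORB / RORB / ANDB #$E0 sequence may help. The function name is hypothetical; the test at the bottom compares it with an ordinary multiply by 32, truncated to a byte.

```python
def times32_by_rotating_right(b):
    """Model of LSRB, three RORBs, and ANDB #$E0 on a byte."""
    carry = b & 1                       # LSRB: bit 0 to carry, 0 into bit 7
    b >>= 1
    for _ in range(3):                  # RORB x 3: carry in at bit 7, bit 0 out
        new_carry = b & 1
        b = (b >> 1) | (carry << 7)
        carry = new_carry
    return b & 0xE0                     # ANDB #$E0: chop off the remainder


print(hex(times32_by_rotating_right(0x53)))
```

Four instructions going right land the low three bits in the same place that five shifts going left would.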

This works for 16-bit integers as well, but for multiplying by 2¹⁵, 2¹⁴, 2¹³, and so forth. We'll show multiplying by 2¹³:


    	LSRB		; bit 0 to carry
	RORA		; now to bit 15, bit 8 to carry
	RORB		; bit 8 to bit 7, old bit 1 to carry
	RORA		; old bit 1 to bit 15 in order
	RORB		; old bit 9 to bit 7, old bit 2 to carry
	RORA		; old bit 2 to bit 15 in order
	ANDA	#$E0	; keep only the three bits rotated in, ignore carry
	CLRB		; the low byte of the product is all zeros

Yes, that does lose 13 bits off the top. If you wanted them, you needed to allocate another couple of bytes to save them in.

Now, how does this play out on the 68000?

Nicely enough:


    	ROR.B	#3,D0	; 8-bit rotate wraps around!
	AND.B	#$E0,D0	; mask out remainder.

ROR and ROL are register-wide rotations -- 8-, 16-, and 32-bits wide. They copy the bit that wraps to the other end into the test/branch Carry bit, but leave the eXtend carry untouched.

ROXR and ROXL are register-plus-carry rotations -- 9-, 17-, and 33- bits wide. They rotate the bit that comes out of the one end into the X carry, rotating the old X carry into the other end, and copy the new X carry into the C carry.
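
The difference between the 8-bit and 9-bit rotations is easier to see with numbers. This is a Python sketch of the byte-width versions only, with hypothetical function names; `ror_b` reports the last bit that wrapped around as the carry, and `roxr_b` takes and returns the X bit.

```python
def ror_b(value, count):
    """8-bit rotate right: returns (result, carry)."""
    carry = 0
    for _ in range(count):
        carry = value & 1                       # bit leaving the bottom
        value = (value >> 1) | (carry << 7)     # wraps straight into bit 7
    return value, carry


def roxr_b(value, x, count):
    """9-bit rotate right through X: returns (result, new X)."""
    for _ in range(count):
        new_x = value & 1                       # bit leaving the bottom goes to X
        value = (value >> 1) | (x << 7)         # old X comes in at bit 7
        x = new_x
    return value, x


print(ror_b(0x53, 3), roxr_b(0x53, 0, 3))
```

Rotating $53 right by 3 gives $6A with ROR.B, but $CA with ROXR.B starting from X clear, because the X bit sits in the loop as a ninth bit. It takes 8 steps for ROR.B to come back around and 9 for ROXR.B.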

This is worth remembering to help shave cycle counts -- not so much for 5 bit shifts, but for 6 and 7 on bytes, yes. And, definitely, 3 bits right is faster than 13 bits left:


    	ROR.W	#3,D0	; 16-bit rotate wraps around!
	AND.W	#$E000,D0	; mask out remainder.

Wait. I said you can shift by immediate bit counts up to 8. 13 is more than eight. But Motorola did leave us another way, so that we don't have to do


    	LSL.W	#8,D0	; First shift 8
	LSL.W	#5,D0	; Then shift the rest.

Not that we wanted to shift a 16-bit word by 13 bits, but we can also do it in one shift, using variable shift counts:


    	MOVEQ	#13,D1	; shift count
	LSL.W	D1,D0	; 16 bit shift by count in D1

So there are several ways to do it. You'll want to look at op-code byte counts and cycle timing when there isn't any other reason to choose between them.

Backing up a bit to register-wide rotations, 8- and 16-bit wide rotate instructions are missing in the 6800, 6801, and 6809. You have to explicitly move the old carry bit(s) in when you need to do register-wide. The general approach is to make a copy and do two logical shifts, one in the desired direction and count, the other in the complementary count in the opposite direction, then bit-or the results together:


    * accumulator-wide ROL by 3 / ROR by 5:
	LDAA	0,X	; copy
	LSL	0,X	; shift left by 3
	LSL	0,X
	LSL	0,X
	LSRA		; shift right by 5
	LSRA		; (use the faster shift accumulator)
	LSRA
	LSRA
	LSRA
	ORAA	0,X	; put results together

This works because logical shifts leave 0 bits behind in the vacated bits.

[JMR202501040801 addendum: (Shifting to 6800 on purpose here. We'll see why below.) ]

The 6800, 6801, and 6809 do not have an instruction to OR the accumulators together, but when you know the bits are set in only one result or the other, you can add them together to get the same result:


    * accumulator-wide ROL by 3 / ROR by 5:
	LDAB	0,X	; copy
	TBA
	LSLA	; shift left by 3
	LSLA
	LSLA
	LSRB	; shift right by 5
	LSRB
	LSRB
	LSRB
	LSRB
	ABA	; put the results together

[JMR202501041102 correction:]

Instead of LSRB above, I had originally written RORB, which is, of course, going to mix bits in a semi-arbitrary non-random way, thus, a bug. Erk.

[JMR202501041102 correction end.]
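
The two ROL by 3 / ROR by 5 sequences above can be checked against each other in a few lines of Python. This is just a sketch with a made-up function name; the internal assert is there to show why adding can stand in for ORing.

```python
def rol8(value, count):
    """8-bit rotate left built from two logical shifts (count from 1 to 7)."""
    left = (value << count) & 0xFF      # LSL x count: zeros come in at the bottom
    right = value >> (8 - count)        # LSR x (8 - count): zeros come in at the top
    assert left & right == 0            # no bit is set in both halves ...
    assert left + right == left | right # ... so ABA gives the same byte as an OR would
    return left | right


print(hex(rol8(0x53, 3)))
```

For the $53 used in the test code, both approaches should give $9A.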

[JMR202501040801 addendum:]

(6809 does not even have an ABA instruction. You have to push it to a stack and add post-increment. So you should OR post-increment, so that you remember that ADDing was a substitute for ORring.

It's easy to think it was unnecessary cost cutting, and I tend to lean toward the solution that the 6309's hidden instruction set used, of having the ADDR (and ORR) register to register instructions with arguments like TFR and EXG use. But I also recognize the awkwardness of providing an addition path between D and the index registers outside the LEA instructions. Whether or not you include a tertiary ALU, the data paths get pretty complex. With modern design tools, it's not so hard, but it was not all that easy with the tools they had in 1978.)

[JMR202501040801 addendum end.]

When it's a single bit left rotation, there's a neat trick:


    ROLB8	LSLB		; 8-bit wide rotation
	ADCB	#0	; move the carry in

Single bit right doesn't fare so well, however:


    ROR8B	LSRB
	BCC	ROR8BN
	ORAB	#$80
ROR8BN	NOP	; next instruction

or, using the other accumulator to get the low bit in the carry first,


    ROR8B	TBA
	LSRA	; get low bit in carry first
	RORB

are the best I'm aware of.
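
Here are Python models of those three single-bit rotations, as a sketch for checking them against each other. The function names are mine, chosen to echo the labels above.

```python
def rol8_adc(b):
    """LSLB then ADCB #0."""
    carry = b >> 7                      # LSLB: top bit to carry, bit 0 becomes 0
    b = (b << 1) & 0xFF
    return (b + carry) & 0xFF           # ADCB #0: adding the carry just sets bit 0


def ror8_branch(b):
    """LSRB, then ORAB #$80 only if the carry was set."""
    carry = b & 1                       # LSRB: bit 0 to carry, bit 7 becomes 0
    b >>= 1
    if carry:                           # BCC skips the ORAB when carry is clear
        b |= 0x80
    return b


def ror8_two_accumulators(b):
    """TBA, LSRA, RORB."""
    a = b                               # TBA: copy
    carry = a & 1                       # LSRA: low bit into carry
    return (b >> 1) | (carry << 7)      # RORB: carry comes in at bit 7


print(hex(rol8_adc(0x53)), hex(ror8_branch(0x53)), hex(ror8_two_accumulators(0x53)))
```

Both right-rotate versions agree for every byte, and rotating left and then right gets back the original value.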

What about 16-bit and 32-bit integer rotation? I think we have enough here to figure them out on the 6800, 6801, and 6809.

And Motorola gave us another way to optimize 32-bit shifts and rotates on the 68000:


    	SWAP	D0	; Effectively a 16-bit rotate on D0!

SWAP exchanges the low and high halves of the data register. It's a 16-bit instruction and only takes four cycles, so it's a good way to avoid 2 cycles per bit for really long shifts and rotates. For instance, if we want to shift a 32-bit integer in D0 left by 20 bits, we can use SWAP to speed it up:


    	SWAP	D0	; rotate by 16 bits
	CLR.W	D0	; mask out low half
	LSL.L	#4,D0	; shift the remaining 4 bits.
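
A quick Python sketch of the SWAP / CLR.W / LSL.L sequence, with a hypothetical function name and D0 modeled as a 32-bit integer, confirms that it matches a plain 20-bit shift:

```python
MASK32 = 0xFFFFFFFF


def shl20_with_swap(d0):
    """Model of SWAP D0, CLR.W D0, LSL.L #4,D0."""
    d0 = ((d0 << 16) | (d0 >> 16)) & MASK32     # SWAP: exchange the two halves
    d0 &= 0xFFFF0000                            # CLR.W: zero the low half
    return (d0 << 4) & MASK32                   # LSL.L #4: the remaining 4 bits


print(f"{shl20_with_swap(0x12345678):08X}")
```

Note that the CLR.W is what turns the 16-bit rotate into a 16-bit shift; leave it out and the old high half comes along for the ride.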

This is all very interesting, but it doesn't really seem to be taking us any closer to multiplying by ten? Have we just gotten lost in shifting and masking? Is this all just shifty business?

Shifting bits is not something people do every day, so I'm trying to expose you to a lot of things that you can do shifting bits before we use them for the real stuff.

If there's something you really want to check in the above, make up your own test code and check it. (I may have made a mistake? ;-)

If you find mistakes, leave me a note in the comments, please.

Otherwise, hang on for the ride.

It'll make more sense in the chapter after next, where we do some fast multiplying by a few small constants that are not powers of 2.

In the next chapter, you can look at some (incompletely tested) demonstration code for the 6800.

(Title Page/Index)

Monday, December 23, 2024

ALPP 03-XX (15) -- Numeric Output Conversion and Multiplying by Constants (Theory)

I'm leaving this here for reference, for the moment. Eventually, I intend to do a chapter that focuses on shifts, and some of this will find a place there.
You probably want to go here: https://joels-programming-fun.blogspot.com/2025/01/alpp-03-15-converting-numbers-output-input-multiplication-division-theory.html.

Numeric Output Conversion
and Multiplying by Constants
(Theory)

(Title Page/Index)

But that will require multiplying and dividing by ten.

Why? Because we usually interact with numbers in decimal base -- radix base ten.


    $ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
obase=2
4*a(1)
11.00100100001111110110101010001000100001011010001100001000110100101\
01
obase=16
4*a(1)
3.243F6A8885A308D2A

When you're working in binary, getting numbers in and out in decimal requires converting between binary and decimal, and converting between binary and radix base ten requires multiplying and dividing by ten.

To display each digit of a value in decimal going right, we have to divide by the largest power of ten we can, and convert the quotient to the ASCII code for that digit, repeating with the remainder until the remainder is less than ten.

Or we can work going to the left, by dividing by ten, converting the remainder to the ASCII code for that digit, repeating with the quotient until the quotient is less than ten.

Division, either way.
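
Both directions can be sketched in a few lines of Python, using `divmod` to stand in for the division we haven't built yet. The function names are only for illustration. Note that the left-to-right version keeps stepping down through every power of ten, so that zeros in the middle of a number still get written.

```python
def digits_left_to_right(n):
    """Divide by decreasing powers of ten; each quotient is the next digit."""
    power = 1
    while power * 10 <= n:            # find the largest power of ten that fits
        power *= 10
    out = ""
    while power > 0:
        quotient, n = divmod(n, power)
        out += chr(ord("0") + quotient)   # convert the quotient to ASCII
        power //= 10
    return out


def digits_right_to_left(n):
    """Divide by ten; each remainder is the next digit, going left."""
    out = ""
    while True:
        n, remainder = divmod(n, 10)
        out = chr(ord("0") + remainder) + out
        if n == 0:
            return out


print(digits_left_to_right(65535), digits_right_to_left(65535))
```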

To input a decimal value from the keyboard, we get each digit in order, multiplying the accumulated value by ten before adding the digit we got, repeating until there are no more digits entered (or until the accumulated value overflows).

Or we can read all the digits first, count the number of digits, and multiply each digit by the appropriate power of ten as we go, and that also requires multiplication.

Multiplication, either way.
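
Here is a rough Python sketch of both input methods. The names, the 16-bit default limit, and the choice to stop at the first non-digit are my assumptions for the example, not anything the routines in this tutorial are committed to.

```python
def parse_decimal(text, bits=16):
    """Accumulate left to right: value = value * 10 + digit."""
    limit = (1 << bits) - 1
    value = 0
    for ch in text:
        if not "0" <= ch <= "9":
            break                          # stop at the first non-digit
        value = value * 10 + (ord(ch) - ord("0"))
        if value > limit:
            raise OverflowError(f"{text!r} does not fit in {bits} bits")
    return value


def parse_decimal_by_place(text):
    """Read all the digits first, then multiply each by its power of ten.

    Assumes text contains only the characters '0' through '9'.
    """
    digits = [ord(ch) - ord("0") for ch in text]
    count = len(digits)
    return sum(d * 10 ** (count - 1 - i) for i, d in enumerate(digits))


print(parse_decimal("65535"), parse_decimal_by_place("907"))
```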

While we can do general multiplication and division on the 68000, we haven't really talked about it. And we haven't looked at how to synthesize multiplication and division on 8-bit CPUs that don't have them. I'll show you general routines for multiplication and division pretty soon, but I want to show why they work (and give you some clues about how to speed them up) by introducing multiplying and dividing by constants, which, by more than coincidence, can be useful in decimal input and output.

But we've already been multiplying and dividing by two and sixteen, haven't we?

Haven't we?

Let's look again at getting both binary and hexadecimal output. We need to understand what we are doing there.

When converting to binary from decimal or hexadecimal by hand, the usual approach (ignoring fractions) is


    Set the radix point (fraction/decimal point) on the right.
Do until all digits (bits) converted (until quotient is zero):
  Divide the number by 2, keeping both quotient and remainder.
  Write the remainder down as the next digit,
    going left from the radix point.
  Repeat with the quotient.

Now, even taking into account that this algorithmic description is rather loose, looking at what we were doing in the 6800 chapter, it looks different, doesn't it? (We were, in fact, converting to external binary from what we could call internal radix base 256. And that's not being absurd to say it that way, no.)

I mean, even ignoring the additional step of converting the remainder to a character for output, it's different. We were going left-to-right, and not even noticing the radix point until we were done, if then.

Let's look at the 6809 code for binary output again (since I think the 6809 code is easier to read):


    * Output a 0
OUT0	LDB	#'0
OUT01	PSHU	D
	LBSR	OUTC
	RTS
*
* Output a 1 
OUT1	LDB	#'1
	BRA	OUT01
* Rob code, shave a couple of bytes, waste a few cycles.
*
* Output the 8-bit binary (base two) number on the stack.
* For consistency, we are passing the byte in the low-order byte
* of a 16-bit word.
OUTB8	LDB	#8	; 8 bits
	STB	0,U	; Borrow the upper byte of the parameter.
OUTB8L	LSL	1,U	; Get the leftmost bit of the lower byte.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1
OUTB8D	DEC	,U
	BNE	OUTB8L	; loop if not Zero
	LEAU	2,U	; drop parameter bytes
	RTS

In modified human English, that's going to look like


    Do:
  Shift the bits left, capturing the bit carried off the top.
  Convert the captured bit to a character and
    write it down as the next digit,
    going right.
  Repeat until no bits remain to be converted.
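
The same loop, modeled in Python with the carry as a plain variable (a sketch only; `outb8` here returns a string instead of calling OUTC):

```python
def outb8(byte):
    """Model of OUTB8: shift left, emit the bit that falls into the carry."""
    out = ""
    for _ in range(8):                   # LDB #8 ... DEC ,U / BNE OUTB8L
        carry = (byte >> 7) & 1          # LSL 1,U puts bit 7 in the carry
        byte = (byte << 1) & 0xFF
        out += "1" if carry else "0"     # BCS picks OUT1 or OUT0
    return out


print(outb8(0x41))
```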

Yep. Going the opposite direction. And the radix point just ended up where we stopped.

Completely backwards!

What's going on here?

You'll remember that I mentioned that shifting digits to the left (shifting the radix point to the right and filling with zeroes) is the same as multiplying by the radix.

You don't remember that I said that?

What did I say? Ah, here it is, in the chapter on hexadecimal output on the 6800:

... shifting is division and multiplication by powers of two. ...

a little before talking about moving the radix point in decimal numbers, which is the same as shifting decimal digits.

So, shifting bits to the left is multiplying by two. And shifting bits to the right is dividing by two.

And when we grabbed the bit that came off the high end into the carry, we were just grabbing the bits as they came off the top, right?

Here's how I want to see that. On the one hand we were multiplying by two. On the other hand, the top bit came off into the carry, and we grabbed it. So we were shifting right by 7, grabbing the quotient (from the carry).

Which is dividing by 2⁷ -- dividing by 128_ten.

This is because the byte is 8 bits, and the 8 bit register forms something mathematicians call a ring, which we aren't going to describe in detail because I don't want to put everyone to sleep.

But it's mathematics. We can rely on it once we understand it. Multiplication in a ring is division, and sometimes that is useful.
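
If you would rather not take that on faith, it is easy to check all 256 byte values. The sketch below (Python, with a made-up helper `lsl8`) doubles the byte, splits the result into what overflows and what stays in the register, and compares those against dividing by 128.

```python
def lsl8(byte):
    """One 8-bit left shift: return (carry, new register contents)."""
    doubled = byte * 2                     # shifting left is multiplying by two
    return doubled >> 8, doubled & 0xFF    # what overflows, what stays behind


for byte in range(256):
    carry, register = lsl8(byte)
    assert carry == byte // 128            # the carry is the quotient of byte / 2**7
    assert register == (byte % 128) * 2    # the remainder, moved up one place
print("checked all 256 byte values")
```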

Now, we could do this:


    NUMBUF	RMB	34	; enough for 32 bits of output
*
CNVB8	TFR	DP,A	; point to the direct page
	CLRB
	TFR	D,Y
	LEAY	NUMBUF-LOCBAS,Y	; point to NUMBUF
	LEAY	9,Y	; start at the right
	CLR	,-Y	; NUL terminate it
	LDA	#8	; 8 bits
CNVB8L	LDB	#'0'	; ASCII '0'
	LSR	1,U	; Get the lowest bit into the carry
	ADCB	#0	; convert it to ASCII
	STB	,-Y	; build the string right-to-left
CNVB8D	DECA
	BNE	CNVB8L	; loop until counted out
	STY	,U	; return the address of the buffer
	RTS		; (this ought to work, anyway)

And that would be in the order of working right-to-left, and then we could take the address that CNVB8 returns and pass it off to OUTS, and print the number as a string.

But that would require an intermediate buffer (NUMBUF above), and I want to be able to output binary and hexadecimal without intermediate buffers. (And without explicit multiply and divide instructions.)

The intermediate buffer is where we set the radix point on the right so we can chop the less significant digits off first and write them down going to the left.

Intermediate buffers make debugging more difficult.

You can see that I would want to be able to output decimal numbers without intermediate buffers, too, right? Maybe?

Can it be done?

(If you aren't really following me, well, stick around for the ride. It does eventually begin to make sense. I think.)

We've seen that it can be done with hexadecimal. In the 6809 code we had


    ASC0	EQU	'0	; Some assemblers won't handle 'c constants well.
ASC9	EQU	'9
ASCA	EQU	'A
ASCXGAP	EQU	ASCA-ASC9-1	; Gap between '9' and 'A' for hexadecimal
*
* Mask off and convert the nybble in B to ASCII numeric,
* including hexadecimals
OUTH4	ANDB	#$0F	; mask it off
	ADDB	#ASC0	; Add the ASCII for '0'
	CMPB	#ASC9	; Greater than '9'?
	BLS	OUTH4D	; no, output as is.
	ADDB	#ASCXGAP	; Adjust it to 'A' - 'F'
OUTH4D	CLRA
	STD	,--U
	LBSR	OUTC
	RTS
*
* Output an 8-bit byte in hexadecimal,
* byte as a 16-bit parameter on PSP.
OUTHX8	LDB	1,U	; get the byte
	LSRB
	LSRB
	LSRB
	LSRB
	BSR	OUTH4
	LDB	1,U
	BSR	OUTH4
	LEAU	NATWID,U
	RTS

You can see we were capturing the high nybble (four bits) by shifting right four times, right?

Shifting right four is the same as shifting left four, capturing as we go, ~~right~~ correct? (Sorry about that.)

And, rather than shifting the low four back into place, we can grab the top bits from a copy, and then grab the bottom bits from the original, masking the top bits off.

Multiplying by sixteen in an 8-bit ring is dividing by sixteen in the 8-bit ring (within the conditions of the ring).
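
A small Python model makes the equivalence concrete. These helpers are hypothetical, named only for this sketch: one shifts a copy right four times, one shifts left four times while catching each carry in a second register, and one masks the original for the low nybble.

```python
def high_by_right_shift(byte):
    """What OUTHX8 does: four LSRBs on a copy."""
    return (byte & 0xFF) >> 4


def high_by_left_capture(byte):
    """Shift left four times, catching each carry in a second register."""
    captured = 0
    for _ in range(4):
        carry = (byte >> 7) & 1
        byte = (byte << 1) & 0xFF
        captured = (captured << 1) | carry
    return captured


def low_by_mask(byte):
    """ANDB #$0F on the untouched original."""
    return byte & 0x0F


for byte in (0x4A, 0xF0, 0x0F):
    print(f"{byte:02X}: high {high_by_right_shift(byte):X}, "
          f"{high_by_left_capture(byte):X}; low {low_by_mask(byte):X}")
```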

With a little thought, we could figure out how to output the character code in radix base 4 (quaternary) or 8 (octal) by this method, as well. Any power of 2 would be just a matter of shifting the bits appropriately and capturing and adjusting the resultant output value to a symbol that represents the output value.

One problem with octal base -- radix base eight -- is that 8 is 2³, which requires 3 bits per octal digit. And, where we can fit 2 hex digits exactly in an 8-bit byte, or four quaternary digits exactly, octal digits end up with a bit left over. Two bits, I mean. (Sorry. ;)

Which means the left-most digit when converting a byte has only two bits, and can only range between 0 and 3 instead of 0 and 7. And you have to account for that as you convert.

Which means that, converting a byte from left-to-right, when you start by shifting a digit off the top, you have to shift only 2 bits for the first digit. Instead of multiplying by 8 (which is dividing by 2⁵, or dividing by 32) the first time, you have to multiply by 4 (divide by 2⁶, or 64) to get that first octal digit on the left.

Thinking about this, 64 (8², or 2⁶) is the largest power of 8 that fits in a byte. The next one, 8³ (2⁹, or 512), does not fit in a byte. Keep this in mind.

And if you're converting a byte to octal from right to left, you still have to remember to only shift twice on the last shift.
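
Going left-to-right, the two-three-three split looks something like this in Python. It is a sketch, not a translation of any routine in this tutorial, and `out_octal8` is a name I picked for the example.

```python
def out_octal8(byte):
    """Emit a byte as three octal digits, left to right, in fields of 2-3-3 bits."""
    out = ""
    for width in (2, 3, 3):               # the first digit only gets two bits
        captured = 0
        for _ in range(width):
            carry = (byte >> 7) & 1       # shift left, catch the carry
            byte = (byte << 1) & 0xFF
            captured = (captured << 1) | carry
        out += chr(ord("0") + captured)
    return out


print(out_octal8(0xFF), out_octal8(0o123))
```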

How about base ten, then? Can we do something like this with base ten?

Octal cuts binary integers up three bits per octal digit. Hexadecimal cuts it up into four bits per hex digit. How many bits per decimal digit? It's clearly not a whole number of bits, and we don't know how to shift by anything but whole numbers of bits, so it doesn't look hopeful.

As a digression (8-o), say you encode decimal in four bits per decimal digit. Could this work?

Let's see.

In hexadecimal, we can record a digit from 0_sixteen to F_sixteen in four bits. So, what if we decide to only record digit values 0 through 9? It's a little bit wasteful, but it's enough to encode a decimal digit in four bits.

Let's see it:
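
One way to see it is to list the ten digit values next to their four-bit patterns. A minimal Python sketch (the helper name `bcd_table` is only for illustration):

```python
def bcd_table():
    """Return each decimal digit paired with its four-bit pattern."""
    return [(digit, f"{digit:04b}") for digit in range(10)]


for digit, bits in bcd_table():
    print(digit, bits)
```

The six patterns from 1010 through 1111 simply go unused.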

Yep, it can be encoded.

This is binary coded decimal, or BCD.

But, 10011001_two ($99 => 99_sixteen) is (128 + 16 + 8 + 1), which is equal to 153_ten.

Where 10011001_BCD (also $99) is 99_ten.

Eaaaoooooohhh confusion!

It turns out you can add and subtract directly in BCD, although it takes an extra step or so to handle carries correctly. And, of course, you can multiply and divide, and the algorithms look like what you'd do by hand. And of course shifting left one BCD digit (four bits) at a time does work out to multiplying by ten, and shifting right one BCD digit at a time works out to dividing by ten.
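
Here is a rough Python sketch of those claims, with helper names I made up for the purpose. `to_bcd` and `from_bcd` pack and unpack one decimal digit per nybble, and `bcd_add8` adds two packed two-digit BCD bytes with the extra adjustment step. The adjustment follows the usual decimal-adjust idea, and is my assumption about how you would do it rather than code from this tutorial.

```python
def to_bcd(n):
    """Pack a non-negative integer, one decimal digit per nybble."""
    bcd, shift = 0, 0
    while True:
        n, digit = divmod(n, 10)
        bcd |= digit << shift
        shift += 4
        if n == 0:
            return bcd


def from_bcd(bcd):
    """Unpack a BCD value back into an ordinary integer."""
    value, place = 0, 1
    while bcd:
        value += (bcd & 0x0F) * place
        bcd >>= 4
        place *= 10
    return value


def bcd_add8(a, b):
    """Add two packed BCD bytes; return (result byte, decimal carry)."""
    total = a + b
    if (a & 0x0F) + (b & 0x0F) > 9:    # low digit went past 9
        total += 0x06
    if total > 0x99:                   # high digit went past 9
        total += 0x60
    return total & 0xFF, total > 0xFF


print(hex(to_bcd(1234) << 4), hex(to_bcd(1234) >> 4), bcd_add8(0x19, 0x28))
```

Shifting the packed value left four bits gives the BCD for 12340, and shifting it right four bits gives the BCD for 123.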

But it's a bit wasteful of bits.

In BCD, you can encode numbers from 0 to 99,999,999 in four bytes (eight nybbles), or 32 bits.

0 to 99,999,999 in binary (00000000000000 to 101111101011110000011111111) requires only 27 bits, which fits in just less than three and a half bytes (one bit short of seven nybbles).

Okay, it's not really all that wasteful. (In fact, if I understand correctly, my favorite multi-precision command-line *nix tool, bc, demonstrated again above, operates in BCD.)

But now we have issues when we want to convert BCD to binary, so that we can use numbers as addresses and such. It just shifts the problems around. (In bc, we usually aren't working with addresses, by the way.)

Yes, we'll have to talk about BCD at some point.

The purpose of the digression was to try to give you a little more space and perspective before we dig into multiplying by shifts and adds.

This theoretical stuff is getting long. Let's look at some actual code to multiply by constant powers of 2.

(Title Page/Index)

Wednesday, December 18, 2024

ALPP 03-XX -- Radix Output

False start, kept for reference.

Radix Output

(Title Page/Index)

Now that we've debugged getting a key from the ST's keyboard and outputting its ASCII code value in hexadecimal and binary on the 68000, a natural next step would be to learn how to parse numbers from the input. But that will require multiplying and dividing by ten.

Why? Because we usually interact with numbers in decimal base -- radix base ten.

While we can do that on the 68000, we haven't really talked about it, and we haven't looked at how to synthesize multiplication and division on 8-bit CPUs that don't have them.

So, instead of going directly to parsing numbers, I want to look at multiplication and division, at least enough to be able to multiply and divide by ten.

But we've already been multiplying and dividing by two and sixteen, haven't we?

Haven't we?

Let's look again at getting both binary and hexadecimal output. We need to understand what we are doing there.

When converting to binary from base ten by hand, the usual approach (ignoring fractions) is


    Set the radix point (fraction/decimal point) on the right.
Do until all digits (bits) converted (until quotient is zero):
  Divide the number by 2, keeping both quotient and remainder.
  Convert the remainder to a character and 
    write it down as the next digit,
    going left from the radix point.
  Repeat with the quotient.

Now, even taking into account that this algorithmic description is rather loose, looking at what we were doing in the 6800 chapter, it looks different, doesn't it?

We were going left-to-right, and not even noticing the radix point until we were done, if then.

Let's look at the 6809 code again (since I think the 6809 code is easier to read):


    * Output a 0
OUT0	LDB	#'0
OUT01	PSHU	D
	LBSR	OUTC
	RTS
*
* Output a 1 
OUT1	LDB	#'1
	BRA	OUT01
* Rob code, shave a couple of bytes, waste a few cycles.
*
* Output the 8-bit binary (base two) number on the stack.
* For consistency, we are passing the byte in the low-order byte
* of a 16-bit word.
OUTB8	LDB	#8	; 8 bits
	STB	0,U	; Borrow the upper byte of the parameter.
OUTB8L	LSL	1,U	; Get the leftmost bit of the lower byte.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1
OUTB8D	DEC	,U
	BNE	OUTB8L	; loop if not Zero
	LEAU	2,U	; drop parameter bytes
	RTS

In human language, that's going to look like


    Do:
  Shift the bits left, capturing the bit off the top.
  Convert the captured bit to a character and
    write it down as the next digit,
    going right.
  Repeat until no bits remain to be converted.

Yep. Going the opposite direction. And the radix point just ended up where we stopped.

Completely backwards!

What's going on here?

You'll remember that I mentioned that shifting digits to the left (shifting the radix point to the right and filling with zeroes) is the same as multiplying by the radix.

You don't remember that I said that?

What did I say? Ah, here it is, in the chapter on hexadecimal output on the 6800:

... shifting is division and multiplication by powers of two. ...

a little before talking about moving the radix point in decimal numbers, which is the same as shifting decimal digits.

So, shifting bits to the left is multiplying by two. And shifting bits to the right is dividing by two.

And when we grabbed the bit that came off the high end into the carry, we were just grabbing the bits as they came off, right?

Here's how I want to see that. On the one hand we were multiplying by two. On the other hand, the top bit came off into the carry, and we grabbed it. So we were shifting right by 7.

Which is dividing by 2⁷, dividing by 128.

This is because the byte is 8 bits, and the 8 bits form something mathematicians call a ring, which we aren't going to describe in detail because I don't want to put everyone to sleep.

But it's mathematics. We can rely on it. Multiplication in a ring is division, and sometimes that is useful.

Now, we could do this:


    NUMBUF	RMB	34	; enough for 32 bits of output
*
CNVB8	TFR	DP,A	; point to the direct page
	CLRB
	TFR	D,Y
	LEAY	NUMBUF-LOCBAS,Y	; point to NUMBUF
	LEAY	9,Y	; start at the right
	CLR	,-Y	; NUL terminate it
	LDA	#8	; 8 bits
CNVB8L	LDB	#'0'	; ASCII '0'
	LSR	1,U	; Get the lowest bit into the carry
	ADCB	#0	; convert it to ASCII
	STB	,-Y	; build the string right-to-left
CNVB8D	DECA
	BNE	CNVB8L	; loop until counted out
	STY	,U	; return the address of the buffer
	RTS		; (this ought to work, anyway)

Now we can take the address that CNVB8 returns and pass it off to OUTS, and print the number as a string.

*********

With a little thought, we could figure out how to output the character code in base four or eight, as well. Any power of 2 would be just a matter of shifting the bits appropriately and adjusting the resultant value to a symbol that represents the value.

For any base ten or less, the adjustment is really straightforward in ASCII-based characters -- just adding the value to the ASCII-based character code value of '0'. (This works for UNICODE, too, but we won't be going there.)

For any base up to sixteen, if the resultant symbol exceeds the ASCII value of '9', we further add one less than the difference between the ASCII for 'A' and the ASCII for '9'. Or we can test the value first and add the appropriate adjustment in one step:

ASCII for '0' if less than or equal to 9
and ASCII for ten less than 'A' if greater than 9.

We saw the former method in the 6800 code for hexadecimal output:


    ASC0	EQU	'0	; Some assemblers won't handle 'c constants well.
ASC9	EQU	'9
ASCA	EQU	'A
ASCXGAP	EQU	ASCA-ASC9-1	; Gap between '9' and 'A' for hexadecimal
*
* Mask off and convert the nybble in B to ASCII numeric,
* including hexadecimals
OUTH4	ANDB	#$0F	; mask it off
	ADDB	#ASC0	; Add the ASCII for '0'
	CMPB	#ASC9	; Greater than '9'?
	BLS	OUTH4D	; no, output as is.
	ADDB	#ASCXGAP	; Adjust it to 'A' - 'F'
	...

This would also work as it is for a radix higher than sixteen, if we accept the approach usually taken in radix eleven through sixteen and continue with it, up to base 36 (highest valued digit 'Z').

There are reasons we may not want to do that, but it could be done.
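
Both adjustment methods are easy to model and compare in Python. This is only a sketch, with function names of my own choosing. The constants mirror the ASC0, ASC9, ASCA and ASCXGAP equates, and the loop checks that the two methods agree for every digit value up to base 36.

```python
ASC0, ASC9, ASCA = ord("0"), ord("9"), ord("A")
ASCXGAP = ASCA - ASC9 - 1          # gap between '9' and 'A', 7 in ASCII


def digit_adjust_after(value):
    """Add '0' first, then add the gap if the result went past '9'."""
    code = value + ASC0
    if code > ASC9:
        code += ASCXGAP
    return chr(code)


def digit_test_first(value):
    """Test the value first, then add a single adjustment."""
    if value <= 9:
        return chr(value + ASC0)
    return chr(value + ASCA - 10)


for value in range(36):
    assert digit_adjust_after(value) == digit_test_first(value)
print("".join(digit_adjust_after(v) for v in range(36)))
```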

Anyway, we know we can handle the adjustment in the cases that interest us most.

Now, let's look again at how we got each digit.

For binary, it was easy. Shift a bit off the left (high-order) end of the binary integer and convert it to ASCII '0' or '1':


    * Output a 0
OUT0	LDAB	#'0
OUT01	JSR	PPSHD
	JSR	OUTC
	RTS
*
* Output a 1 
OUT1	LDAB	#'1
	BRA	OUT01
* :::
OUTB8L	LSL	1,X	; Get the leftmost bit.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1

For hexadecimal, it may not be quite as clear that was what we were doing -- shifting a digit's worth of bits off the left, capturing them, and converting them to ASCII:


    	LDAB	1,X	; get the byte
	LSRB		; move the hexadecimal digit into place
	LSRB
	LSRB
	LSRB
	BSR	OUTRAD	; convert to ASCII and output
	LDX	PSP
	LDAB	1,X
	ANDB	#$0F	; mask the high digit off
	BSR	OUTRAD	; convert to ASCII and output

Say WHAT?!?!? Those are right shifts! And then no shifts! just a bit-AND to mask off the ...

Yeah, it would have been a little bit more plain like this:


    	CLRB		; ready to capture high four bits
	LSL	1,X	; get high bit off top 
	ROLB		; capture it
	LSL	1,X	; get next bit off top 
	ROLB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	ROLB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	ROLB		; shift over and capture it
	BSR	OUTRAD
	CLRB		; ready to capture next four bits
	LSL	1,X	; get next bit off top 
	ROLB		; capture it
	LSL	1,X	; get next bit off top 
	ROLB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	ROLB		; shift over and capture it
	LSL	1,X	; get next bit off top 
	ROLB		; shift over and capture it
	BSR	OUTRAD

If you can't see from just reading the code that the result in B and the output is the same, go ahead and substitute these lines of code into the code for chapter 03-05 and trace through it, watching the bits shift around.

In either method, we shift the high four bits of the byte we're putting out in order, into the low four bits of B.

Then, in the method above, we shift the low four bits back into the low four bits where they came from. But in the method of chapter 03-05, we just leave them there and mask the high bits off.

If you think of the byte register as a ring of 8 bits, you might be able to see the bits coming back around.

There's another way of looking at it. In chapter 03-05, we noted that shifting digits left one column is the same as multiplying by the radix.

Shifting a decimal number left (by adding a zero to the right and moving the decimal fraction point to the right of the added zero) is the same as multiplying by ten.

Shifting a binary number left one bit is the same as multiplying it by two.

Shifting a hexadecimal digit left by one digit is the same as multiplying by sixteen. Or, shifting a binary number left by four bits is the same as multiplying by sixteen.

How about shifting right? It's the same as dividing by the radix. We'll look at that in a bit.

So, back to thinking about output in base 4.

If we want to output in base four, we can shift two bits left, capturing and outputting each pair as we go. Do it four times and we've got the byte output in quaternary base.

How about octal? If we shift three bits, then three more, we've only got two left, and that doesn't work. So what we should have done is recognize that we only had 8 bits to shift and only shifted two bits to start, then shifted three and three.

Why?

It's helpful to note that FF_sixteen (all bits 1, 255_ten) is 377_eight. That's how you write the maximum value of an 8-bit byte in base eight. And the high digit of that is 3, which only takes two bits in binary. So it makes sense that you would only shift off two bits for the first digit.

Now, if you were doing two bytes at once, that would be five sets of three bits and one bit for the most significant digit. 177777_eight is how you write FFFF_sixteen (65535_ten), the maximum value of a sixteen-bit number, in octal. For binary, it's sixteen digits: 1111111111111111_two. For quaternary, it's eight digits: 33333333_four.

So we are getting some ideas how output in base four or base eight would work, and how to output sixteen bit values in any radix base that is a power of two. It's just shifting.
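
To check the digit counts above, here is a Python sketch that cuts a value into fields for any power-of-two radix, giving the leftmost field whatever bits are left over. It picks the fields out with a right shift and a mask instead of shifting left and catching carries, which gives the same digits. `out_pow2` and its parameters are names I invented for the example.

```python
DIGITS = "0123456789ABCDEF"


def out_pow2(value, bits, bits_per_digit):
    """Emit value (bits wide) left to right in radix 2**bits_per_digit."""
    first = bits % bits_per_digit or bits_per_digit    # short leading field
    widths = [first] + [bits_per_digit] * ((bits - first) // bits_per_digit)
    out = ""
    remaining = bits
    for width in widths:
        remaining -= width
        field = (value >> remaining) & ((1 << width) - 1)
        out += DIGITS[field]
    return out


for bits_per_digit in (1, 2, 3, 4):
    print(bits_per_digit, out_pow2(0xFFFF, 16, bits_per_digit))
```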

But base ten doesn't work like this.

Why?

Let me take you on a short detour through something called binary-coded decimal (BCD).

In hexadecimal, we can record a digit from 0_sixteen to F_sixteen in four bits, right?

Well, what if we decide to only record digit values 0 through 9? It's a little bit wasteful, but it's enough to encode a decimal digit in four bits.

Let's see it:

Yep, it can be done.

But, 10011001_two (99_sixteen) is (128 + 16 + 8 + 1) equal to 153_ten.

Where 10011001_BCD is 99_ten. Eaaaoooooohhh confusion!

But maybe we can see that shifting a BCD number four bits to the left is multiplying by ten? Maybe?

It is. We can play with that later. Let's set BCD aside for a moment.

The point is that, where we can divide binary numbers into fields of n bits for any radix base 2ⁿ, and we can even do something like that for binary coded decimal, trying to divide a straight binary number into fields of radix base ten is going to have us trying to use fractions of bits.

And we don't know how to do that.

I don't think anyone knows a good, simple way to do it, other than repeatedly dividing by ten, which isn't very simple in binary (which is why this chapter is so long).

Dividing is shifting right. Right? (Sorry.)

It is.

I pointed out that shifting left 1 bit is the same as shifting right 7? Well, if you capture the bits correctly, anyway.

I'm going to use 6809 code for this example instead of 6800, because we can focus a bit better on what we are doing without having a lot of DEX instructions getting in the way.

Here's what we did for the 6809 binary output in chapter 03-03:


    OUTB8	LDB	#8	; 8 bits
	STB	0,U	; Borrow the upper byte of the parameter.
OUTB8L	LSL	1,U	; Get the leftmost bit of the lower byte.
	BCS	OUTB81
OUTB80	BSR	OUT0
	BRA	OUTB8D
OUTB81	BSR	OUT1
OUTB8D	DEC	,U
	BNE	OUTB8L	; loop if not Zero

Instead of shifting out the top and capturing the carry (multiplying by two and capturing the overflow), and writing the number left-to-right, let's divide by two and build the output string for the number right-to-left:


    NUMBUF	RMB	34	; enough for 32 bits of output
*
CNVB8	TFR	DP,A	; point to the direct page
	CLRB
	TFR	D,Y
	LEAY	NUMBUF-LOCBAS,Y	; point to NUMBUF
	LEAY	9,Y	; start at the right
	CLR	,-Y	; NUL terminate it
	LDA	#8	; 8 bits
CNVB8L	LDB	#'0'	; ASCII '0'
	LSR	1,U	; Get the lowest bit into the carry
	ADCB	#0	; convert it to ASCII
	STB	,-Y	; build the string right-to-left
CNVB8D	DECA
	BNE	CNVB8L	; loop until counted out
	STY	,U	; return the address of the buffer
	RTS		; (this ought to work, anyway)

Now we can take the address that CNVB8 returns and pass it off to OUTS, and print the number as a string.

And we could take the same approach with the hexadecimal conversion, dividing by sixteen -- shifting four bits right and capturing them in order -- and converting and storing right-to-left.

(But we would actually make a copy, mask the high bits out, convert and store, then divide by sixteen for the next digit, because it's quicker that way. But we will ignore the optimization.)
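
In Python, the mask-then-shift version of that right-to-left hexadecimal conversion might look like the sketch below. The list `buffer` stands in for NUMBUF, and `cnvhx` is a hypothetical name, not a routine from this tutorial.

```python
HEXDIGITS = "0123456789ABCDEF"


def cnvhx(value, digits):
    """Fill a buffer right to left: mask off a nybble, store it, divide by 16."""
    buffer = ["0"] * digits
    position = digits                 # start just past the right end
    while position > 0:
        position -= 1
        buffer[position] = HEXDIGITS[value & 0x0F]   # copy, mask, convert, store
        value >>= 4                                   # divide by sixteen
    return "".join(buffer)


print(cnvhx(0x4A, 2), cnvhx(0xBEEF, 4))
```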

If we think about it, when we convert from decimal to binary or hexadecimal by hand, that's the way we do it. We divide by the base we are converting to, capture the remainder and write that, writing from right-to-left. And it works from decimal to binary or hexadecimal. Or any radix base to any radix base.

Why not do that in the first place?

Several reasons. One is that it's useful to be able to get numbers in and out without sending them through a conversion string buffer. Another is that shifting bits in registers and memory is one of the more useful things you can learn about, especially for assembly language. Yet another is, well, ...

Now you know that dividing and multiplying by powers of two is easy, right?

On the 68000, we have general multiply and divide, at least for 16 bits.

On the 6809 and 6801, we have byte multiply to 16 bits. No divide. (Multiply is much easier than divide.)

On the 6800 we have no multiply and no divide.

We are going to have to synthesize some multiplication and division. Also, even on the 68000, multiplication and division cost more than shifts in CPU cycle counts.

It would be nice to have a quick way to multiply and divide by constants other than powers of 2, wouldn't it? Especially by ten?

Why, yes, it would. Let's do it. Multiplication is easier. Let's do some middle-school algebra:

10X == 2(5X)
5X == (4 + 1)X => 10X == 2((4+1)X)
(4 + 1)X == 4X + X => 10X == 2(4X + X)

4X == 2(2X) => 10X == 2(2(2X)+X)

Let's build that up from adds and shifts:


    *
MUL10	LDD	,U	; X
	CLR	,-U	; for overflow (parameter 1 off)
	CLR	,-U	; 16 bits (parameter 2 off)
	ASLB		; 2X
	ROLA
	ROL	1,U
	ASLB		; 2(2X)
	ROLA
	ROL	1,U
	ADDD	2,U	; 2(2X)+X
	BCC	MUL10N
	INC	1,U
MUL10N	ASLB		; 2(2(2X)+X) == 10X
	ROLA
	ROL	1,U
	STD	2,U
	RTS
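
Since that routine is untested, a Python model of the same steps is one way to gain some confidence in the arithmetic. This is a sketch under my own assumptions: D is modeled as a 16-bit integer, the overflow byte at 1,U as an 8-bit integer, and the helper names are invented for the model.

```python
def mul10(x):
    """Model of MUL10: 16-bit D plus an overflow byte, 10X = 2(2(2X)+X)."""
    x &= 0xFFFF
    d, overflow = x, 0

    def shift_left(d, overflow):
        carry = d >> 15                                   # ASLB / ROLA carry out of D
        return (d << 1) & 0xFFFF, ((overflow << 1) | carry) & 0xFF   # ROL 1,U

    d, overflow = shift_left(d, overflow)     # 2X
    d, overflow = shift_left(d, overflow)     # 2(2X)
    total = d + x                             # ADDD 2,U
    d = total & 0xFFFF
    if total > 0xFFFF:                        # carry set, so BCC falls through
        overflow = (overflow + 1) & 0xFF      # INC 1,U
    d, overflow = shift_left(d, overflow)     # 2(2(2X)+X) == 10X
    return overflow, d


overflow, d = mul10(0xFFFF)
print(f"{overflow:02X}{d:04X}")
```

Checking every 16-bit input against plain multiplication by ten is quick on a modern machine, and the overflow byte never needs more than four bits.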

(Title Page/Index)
