Friday, March 1, 2024

ALPP 01-01 -- Accumulating Integer Results in the M6809, 6800, 6801, and M68000

Accumulating Integer Results
on the M6809, 6800, 6801, and M68000

(Title Page/Index)

What is an accumulator?

When you're working through a problem of some kind, you generally need some place to keep a record of your current results. As an example, think of adding up a total:

8 + 5 + 2 + 7 + 4 = 26 (if I didn't mess up the math)

If you can do that in your head, you probably don't even think about it, but you probably do keep a running total:

8
+ 5 = 13
+ 2 = 15
+ 7 = 22
+ 4 = 26 (Okay, I didn't mess up the math)
If you do that on a piece of paper the way I just did it here, your "accumulator area" is the space to the right of the = sign next to the current number you're adding at each step. 

But if you do it in you're head, unless you've trained yourself (or have natural photographic memory or something), you only keep the running total (the accumulated value) and the number you are currently adding. That's your effective accumulator in your head. 

It's the natural current focus point of calculations.

In the 6800/6801 and the 6809, you have two small dedicated accumulators, usually called A and B. (In the 6801 and 6809, you can concatenate them together for a single, medium-sized Double accumulator D, for some operations, more later.)

In the 68000, you have eight large data registers which can be used as accumulators, usually called D0 to D7. (Usually. Some assemblers use different conventions.)

As a totally contrived sequence of operations, let's do the above in 6800 assembly language:

* Adding a sequence of constant values:
	LDAA	#8	; Gotta start somewhere.
	ADDA	#5
	ADDA	#2
	ADDA	#7
	ADDA	#4
	NOP		; landing pad for breakpoint 

(6801 is the same, or can be.) 

  • LDAA is the mnemonic for the "Load accumulator A" (LoaD Accumulator A) instruction (operation code).
  • The # hash mark indicates that the numeric value is immediate.
  • ADDA is for "ADD to accumulator A". 
  • NOP is a no-operation op code, in this case, as noted, for a landing pad where I can set a breakpoint.
  • And the notes after the semicolon are comments for human readers (mostly), and the computer (mostly) ignores them.

Now, this entire sequence is known in advance, and should usually be optimized to a single instruction

	LDAA	#26	; optimized version

But if we optimize away the operations now, we can't see what they are doing. So we won't do that.  Not yet.

If you are wondering where the result is supposed to show up, good. 

Let's break out a debugger and trace it through:

% t on
% b 100a
Breakpoint set at 100A
% c 1000

          0 A=00 B=00 X=0000 SP=00FF ------          1000: 86 08    LDA #08   EA=1001 D=08
          1 A=08 B=00 X=0000 SP=00FF ------          1002: 8B 05    ADDA #05  EA=1003 D=05
          2 A=0D B=00 X=0000 SP=00FF ------          1004: 8B 02    ADDA #02  EA=1005 D=02
          3 A=0F B=00 X=0000 SP=00FF ------          1006: 8B 07    ADDA #07  EA=1007 D=07
          4 A=16 B=00 X=0000 SP=00FF H-----          1008: 8B 04    ADDA #04  EA=1009 D=04

Breakpoint!
>         5 A=1A B=00 X=0000 SP=FF8A ------          100A: 01       NOP                      

Looks good, right? Nothing happening in B, X, and SP. The H half-carry flag gets set and then goes away, we see the PC counting through the code, and the machine code is there with the assembler mnemonics. 

(Above, EA is something we call the effective address of the operand, and D is the data operand referenced at the effective address, more on that later.)  

But, wait, ... since when does 8 + 5 = what? D? And we don't see 26 in there at the end, or anywhere?

What went wrong? Computers are never wrong! (cough.) 

Must be something we did. 

Let's try a different sequence of numbers. Here's a well-known one:

1 + 2 + 3 + 4 + 5 + 6 = 21

Yep. ( 7 × 6 ) ÷ 2 is 21. Code:

	LDAA	#0	; Clear A out.
	ADDA	#1	; Add 1 to A.
	ADDA	#2	; Add 2.
	ADDA	#3	; Etc.
	ADDA	#4
	ADDA	#5
	ADDA	#6
	NOP		; landing pad
That ought to work. Trace it in the debugger:
% b 100e
Breakpoint set at 100E
% t on
% c 1000
          0 A=00 B=00 X=0000 SP=FF8A ------          1000: 86 00    LDA #00   EA=1001 D=00
          1 A=00 B=00 X=0000 SP=FF8A ---Z--          1002: 8B 01    ADDA #01  EA=1003 D=01
          2 A=01 B=00 X=0000 SP=FF8A ------          1004: 8B 02    ADDA #02  EA=1005 D=02
          3 A=03 B=00 X=0000 SP=FF8A ------          1006: 8B 03    ADDA #03  EA=1007 D=03
          4 A=06 B=00 X=0000 SP=FF8A ------          1008: 8B 04    ADDA #04  EA=1009 D=04
          5 A=0A B=00 X=0000 SP=FF8A ------          100A: 8B 05    ADDA #05  EA=100B D=05
          6 A=0F B=00 X=0000 SP=FF8A ------          100C: 8B 06    ADDA #06  EA=100D D=06

Breakpoint!
>         7 A=15 B=00 X=0000 SP=FF8A H-----          100E: 01       NOP                      

Much better. Load in a 0 and the result in A is 0. Add 1 to get 1, add 2 to get 3, and 3 to get 6, add 4 to get A ...

Erk. Giving 8 + 5 = D a D grade is understandable. What grade do we give to 6 + 4 = A?

Heh. The computer is not (this time) wrong. 

So let's talk about

Hexadecimal!

Simple debuggers like the one I'm using here display many results in hexadecimal instead of decimal, to save time, space on the screen, writing code, and just to make things more meaningful. (You doubt that last one? Trust me on this.)

For the time being, and maybe a quick refresher, here's a quick hexadecimal-to-decimal conversion chart, up to 31ten:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9 A B C D
E F
 
16
17 18
19
20
21
22
23
24
25
26 27
28
29
30
31
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F

So, 8 + 5 does, in fact, equal Dsixteen (D base sixteen) in the usual (modern) way of writing hexadecimal.

And 6 + 4 does equal Asixteen (A base sixteen).

Yes, it's inconvenient when you've spent so much of your life getting used to decimal, but we'll try to take time to make sure we don't get lost in the hexadecimal. 

So, from the conversion chart, 22ten is 16sixteen. Maybe that's confusing, but let's look at why.

Remember that each column in base ten is a power of ten, increasing to the left. (That's most-significant column first.) 

22ten is 2 times ten plus 2 times one: 

twenty plus two, or 20ten + 2ten.

And each column in base sixteen is a power of sixteen, increasing to the left.

16sixteen is 1 times sixteen plus 6 times one: 

sixteen plus six, or 10sixteen + 6sixteen

(To help us not get confused, we can read that as "one-zero base sixteen plus six base sixteen".)

So, working in hexadecimal (and checking in decimal),

8sixteen
+ 5sixteen = Dsixteen (13ten)
+ 2sixteen = Fsixteen (15ten)
+ 7sixteen = 16sixteen (16ten + 6ten = 22ten)
+ 4sixteen = 1Asixteen (16ten + 10ten = 26ten)

There's our 26!

And, 

0sixteen
+ 1sixteen = 1sixteen (1ten)
+ 2sixteen = 3sixteen (3ten)
+ 3sixteen = 6sixteen (6ten)
+ 4sixteen = Asixteen (10ten)
+ 5sixteen = Fsixteen (15ten)
+ 6sixteen = 15sixteen (16ten + 5ten = 21ten)

And that's enough of non-decimal numeric bases for the moment.  

For comparison, the 6809 version of the first sequence is almost the same in assembler:

	LDA	#8	; 6809 mnemonic: LoaD accumulator A
	ADDA	#5	; Mnemonic shared with 6800/6801.
	ADDA	#2
	ADDA	#7
	ADDA	#4 

Looking at that in a 6809 simulator:

% b 100a
Breakpoint set at 100A
% t on
% c 1000

          0 A=00 B=00 X=0000 Y=0000 U=0000 S=00FF P=00 --------            1000: 86 08        LDA #$08
          1 A=08 B=00 X=0000 Y=0000 U=0000 S=00FF P=00 --------            1002: 8B 05        ADDA #$05
          2 A=0D B=00 X=0000 Y=0000 U=0000 S=00FF P=00 --------            1004: 8B 02        ADDA #$02
          3 A=0F B=00 X=0000 Y=0000 U=0000 S=00FF P=00 --------            1006: 8B 07        ADDA #$07
          4 A=16 B=00 X=0000 Y=0000 U=0000 S=00FF P=00 --H-----            1008: 8B 04        ADDA #$04

Breakpoint!
>         5 A=1A B=00 X=0000 Y=0000 U=0000 S=00FF P=00 --------            100A: 12           NOP                        

That looks a lot like the 6800 emulation session, which is no real surprise. An extra index register (Y), an extra stack register (U), the direct page register (P). But it's a long line already, so the effective address and data are left out. And the accumulated sums are the same, as they should be.

And the 68000 version looks like this, using default register width instructions (which we will need to talk about later):

	MOVEQ	#8,D0	; MOVEQ lets us ignore register width.
	ADDQ	#5,D0	; Default register width is okay.
	ADDQ	#2,D0	; D0 is equivalent to 6800 A? Not quite?
	ADDQ	#7,D0
	ADDQ	#4,D0

MOVEQ is for "MOVE Quick"; ADDQ is for "ADD Quick". Both are somewhat optimized versions of the move and add instructions. MOVEQ affects the entire register, so any register size for the following ADD instructions is okay for the range of numbers we are using here.  (MOVEQ operands can be from -128 to 127; ADDQ operands can be from 1 to 8. Again, more later.)

With all the registers in the 68000, doing a one-line complete register dump becomes rather awkward. You'd need a screen able to show something like 300 characters per line, and then you'd wear your eyes out looking for what changed on each line. So you don't do that.

Here's what the top part of the screen looks like in the devpak debugger on the Hatari Atari ST emulator screen:

d0 = 00000000  ....    a0 = 00000000 602E 0206 00E0 0030 0001  ..........
d1 = 00000000  ....    a1 = 00000000 602E 0206 00E0 0030 0001  ..........
d2 = 00000000  ....    a2 = 00000000 602E 0206 00E0 0030 0001  ..........
d3 = 00000000  ....    a3 = 00000000 602E 0206 00E0 0030 0001  ..........
d4 = 00000000  ....    a4 = 00000000 602E 0206 00E0 0030 0001  ..........
d5 = 00000000  ....    a5 = 00000000 602E 0206 00E0 0030 0001  ..........
d6 = 00000000  ....    a6 = 00000000 602E 0206 00E0 0030 0001  ..........
d7 = 00000000  ....    a7 = 00000000 602E 0206 00E0 0030 0001  ..........
SR:0000    U          ssp = 00000000 602E 0206 00E0 0030 0001  ..........
PC:00E00030  bra $E0004E

After the data register on each line is the ASCII interpretation of the contents of the data register. Then, after the address register on  each line is the contents of ten bytes of  memory in hexadecimal, then the same ten bytes shown as if they were ASCII characters. SR is the status register, ssp is the system stack pointer, PC is the program counter. And that's a lot to take in.

Even without the extra interpretations and memory content, you can see the problem, right?

If the debugger would just show you the registers that changed, that would be great. But I don't have one of those kinds of debuggers for the 68000. 

I can pretend to be one, and I guess that will have to do. Here's what it might look like:

% t on
% b 1800a
Breakpoint set at 0001800A
% c 18000

  0 D0=00000000 A7=00017FF8 ----- 00018000: 7008 MOVEQ   #8,D0
  1 D0=00000008 A7=00017FF8 ----- 00018002: 5A40 ADDQ    #5,D0
  2 D0=0000000D A7=00017FF8 ----- 00018004: 5440 ADDQ    #2,D0
  3 D0=0000000F A7=00017FF8 ----- 00018006: 5E40 ADDQ    #7,D0
  4 D0=00000016 A7=00017FF8 ----- 00018008: 5840 ADDQ    #4,D0

Breakpoint!
> 5 D0=0000001A A7=00017FF8 ----- 0001800A: 4E71 NOP

Now, you're probably wanting to try some of this fun stuff yourself.

I plan to show how to build a hexadecimal calculator as part of the tutorial, but that's a bit down the road, yet.

Until then, hexadecimal math is supported by many desktop calculators on personal and workstation computers. Look for such things as programmer mode, and you'll probably find binary and octal modes as well within the programmer mode. 

Or, if you have a Unix or Linux OS workstation, you probably also have the Unix command-line arbitrary precision basic calculator utility, bc. That allows you to set the input and output conversion bases to arbitrary positive bases (greater than one), and can be fun and instructive to play with:

obase=16
ibase=16
8 + 8
10
8 + 5 + 2 + 7 + 4
1A

Or, the programming language Forth has a similar conversion base, just one for both input and output:

16 BASE !  ok
8 5 + 2 + 7 + 4 + . 1A  ok

(You might find post-fix math confusing, I suppose.) 

Many versions of Forth are available, I tend to use gforth, because it is available in the Debian and Ubuntu repositories.

Playing with hexadecimal math is fun, but what about to using a debugger like I have shown above to learn assembler? 

There are, of course, many ways to get hardware and/or software to practice on. 

Hardware is good, but I don't have room for hardware. So I am using Joe H. Allen's emulator EXORsim for the 6800, 6801, and 6809, and the Atari ST emulator Hatari for the 68000. Since it will be useful for the reader to follow along, I'll explain at least one way to get each of these, but this chapter is long enough already.

What about other CPUs, like the 8080, Z-80, 6502, 8086, ARM, RISC-V, Power PC, ... ? 

This is a fairly time-intensive project. Doing just the four I've already described above is plenty to fill my plate. And add to that describing how to set up the environment to run the code. 

But when I get to a good pause point, I do intend to come back and add chapters to demonstrate code for some of those processors.

So, how to get the emulators ...

(Title Page/Index


ALPP -- Assembly Language Programming Primer -- Title Page

Assembly Language Primer

Joel Matthew Rees

Amagasaki, Japan

Copyright 2024 Joel Matthew Rees


Preface


Basics

  1. Accumulating Integer Results in the M6809, 6800, 6801, and M68000
  2.  














Sunday, February 25, 2024

Optimizing Direct-call Forth in 6809. 68000, and 6800/6801

(This is a little doodling inspired by a post suggesting stackless Forth in the Minimalist Computing Facebook group. Doing this for 6809, 68000, and 6800.) 

Start with a bit of fig Forth to practice optimizing. All words called by COUNT are leaf words, making them easy to in-line. 

The fig source code ai am working with, borrowed from my transcriptions of the 6800 model, with a few comments on what I am doing:

* From Dave Lion, et. al, fig Forth model for 6800:
* Converting to direct call, in-lining, and optimizing.
* This is low-hanging fruit, but easy to see how it could work.
* Should be possible to do this all mechanically, I think.
* I'm going to pretend I know what the optimizer that
* does not yet exist will do for 6809, 68000, 6801 and 6800.
figCOUNT:
	FDB	DOCOL,DUP,ONEP,SWAP,CAT
	FDB	SEMIS
* 
figCOUNTlinearsource:
	FDB	DOCOL
	FDB	DUP
	FDB	ONEP
	FDB	SWAP
	FDB	CAT
	FDB	SEMIS

I'll start with the 6809 because it should be easy and straightforward. It probably isn't important, but, to emphasize that I'm doing away with the VM overhead and just working within a run-time that projects the VM onto the 6809 IPA, I'm showing the direct calls in the 6809 assembler, along with the functions called.

Below the in-line source, I'm showing the results of each theoretical optimization pass, starting with removing the easy extraneous data movement on the stack and in registers, and proceeding with moving things around and combining operations. Then I show the final source, along with an optional version that may not be easy to reach along mechanical paths:

******************************
* First, 6809
6809COUNTcall:
*	LBSR	DOCOL	; direct call
	LBSR	DUP
	LBSR	ONEP
	LBSR	SWAP
	LBSR	CAT
*	LBSR	SEMIS
	RTS

* Leaf routines defined as
6809DUP
	LDD	,U
	STD	,--U
	RTS

6809ONEP
*	INC	1,U	; harder to optimize, keep it simple.
*	BNE	ONEPNC
*	INC	,U
*ONEPNC	RTS
	LDD	,U
	ADDD	#1
	STD	,U
	RTS

6809SWAP
*	PULU	D,X	; harder to optimize, keep it simple.
*	EXG	D,X	; could also do separate stores, etc.
*	PSHU	D,X
*	RTS
*	
	LDD	,U
	LDX	2,U
	STD	2,U
	STX	,U
	RTS

6809CAT
	CLRA
	LDB	[,U]
	STD	,U
	RTS

* bringing the calls in-line:
6809COUNTinline:
	LDD	,U	; DUP
	STD	,--U
*
	LDD	,U	; 1+
	ADDD	#1
	STD	,U
*
	LDD	,U	; SWAP
	LDX	2,U
	STD	2,U
	STX	,U
*
	CLRA		; C@
	LDB	[,U]
	STD	,U
*
	RTS

* Vacuum out the data motion on the stack:
6809COUNTpass1
	LDD	,U	; DUP
*	STD	,--U
	LEAU	-2,U
*
*	LDD	,U	; 1+
	ADDD	#1
*	STD	,U
*
*	LDD	,U	; SWAP
	LDX	2,U
	STD	2,U
	STX	,U
*
	CLRA		; C@
	LDB	[,U]
	STD	,U
	RTS

* Combine and simplify:
6809COUNTpass2
	LDD	,U	; DUP
	LEAU	-2,U
*
	ADDD	#1	; 1+
*
	LDX	2,U	; SWAP
	STD	2,U	; Misordering possible.
*	STX	,U
*
	CLRA		; C@
*	LDB	[,U]
	LDB	,X
	STD	,U
	RTS

* Postpone stack operations:
6809COUNTpass3
	LDD	,U	; DUP
*	LEAU	-2,U
*
	ADDD	#1	; 1+
*
*	LDX	2,U	; SWAP
	LDX	,U	; SWAP
*	STD	2,U
	STD	,U
*
	CLRA		; C@
	LDB	,X
*	STD	,U
	STD	,--U
*
	RTS

6809COUNTrearrange
*	LDD	,U	; DUP
	LDX	,U
*
*	ADDD	#1	; 1+
*
*	LDX	,U	; SWAP
*	STD	,U
*
	CLRA		; C@
*	LDB	,X
	LDB	,X+
	STX	,U
*
	STD	,--U
	RTS
*
6809COUNTfinal
	LDX	,U
	CLRA		; C@
	LDB	,X+
	STX	,U
	STD	,--U
	RTS
*

* compare (Could this be done mechanically, too?):
6809COUNTmaybe:
	LDX	,U
	CLRA
	LDB	,X+
	STD	,--U
	STX	2,U
	RTS

The 68000 code follows the 6809 code's optimization paths rather closely, since they both support high-level run-time models quite well, and in similar ways.

******************************
* Now 68000:
68KCOUNTcall:
*	BSR.W	DOCOL	; direct call
	BSR.W	DUP
	BSR.W	ONEP
	BSR.W	SWAP
	BSR.W	CAT
*	BSR.W	SEMIS
	RTS

* Leaf routines defined as
68KDUP
*	MOVE.L	(A6),-(A6)	; Harder to optimize, keep it simple.
*	RTS
	MOVE.L	(A6),D0
	MOVE.L	D0,-(A6)
	RTS

68KONEP
*	ADD.L	#1,(A6)	; Harder to optimize, keep it simple.
*	RTS
	MOVE.L	(A6),D0	; Keep it simple
	ADD.L	#1,D0
	MOVE.L	D0,(A6)
	RTS

68KSWAP
*	MOVEM.L	(A6),D0/D1	; Harder to optimize, keep it simple
*	EXG	D0,D1
*	MOVEM.L	D0/D1,(A6)
*	RTS
	MOVE.L	(A6),D0
	MOVE.L	2(A6),D1
	MOVE.L	D0,2(A6)
	MOVE.L	D1,(A6)
	RTS

68KCAT
	CLR.L	D0	; zero-extend
	MOVE.L	(A6),A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
	RTS

* in-line:
68KCOUNTinline:
	MOVE.L	(A6),D0	; DUP
	MOVE.L	D0,-(A6)
*
	MOVE.L	(A6),D0	; 1+
	ADD.L	#1,D0
	MOVE.L	D0,(A6)
*
	MOVE.L	(A6),D0	; SWAP
	MOVE.L	2(A6),D1
	MOVE.L	D0,2(A6)
	MOVE.L	D1,(A6)
*
	CLR.L	D0	; C@
	MOVE.L	(A6),A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Vacuum out the data motion on the stack:
68KCOUNTpass1
	MOVE.L	(A6),D0	; DUP
*	MOVE.L	D0,-(A6)
	LEA	-4(A6),A6
*
*	MOVE.L	(A6),D0	; 1+
	ADD.L	#1,D0
*	MOVE.L	D0,(A6)
*
*	MOVE.L	(A6),D0	; SWAP
	MOVE.L	2(A6),D1	; Misordering possible.
	MOVE.L	D0,2(A6)
*	MOVE.L	D1,(A6)
*
	CLR.L	D0	; C@
*	MOVE.L	(A6),A0
	MOVE.L	D1,A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Combine and simplify:
68KCOUNTpass2
	MOVE.L	(A6),D0	; DUP
	LEA	-4(A6),A6
*
	ADD.L	#1,D0	; 1+
*
*	MOVE.L	4(A6),D1	; SWAP
	MOVE.L	4(A6),A0	; SWAP
	MOVE.L	D0,4(A6)
*
	CLR.L	D0	; C@
*	MOVE.L	D1,A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Postpone stack operations:
68KCOUNTpass3
	MOVE.L	(A6),D0	; DUP
*	LEA	-4(A6),A6
*
	ADD.L	#1,D0	; 1+
*
*	MOVE.L	4(A6),A0	; SWAP
	MOVE.L	(A6),A0	; SWAP
*	MOVE.L	D0,4(A6)
	MOVE.L	D0,(A6)
*
	CLR.L	D0	; C@
	MOVE.B	(A0),D0
*	MOVE.L	D0,(A6)
	MOVE.L	D0,-(A6)
*
	RTS

68KCOUNTrearrange
*	MOVE.L	(A6),D0	; DUP
	MOVE.L	(A6),A0
*
*	ADD.L	#1,D0	; 1+
*
*	MOVE.L	(A6),A0	; SWAP
*	MOVE.L	D0,(A6)
*
	CLR.L	D0	; C@
*	MOVE.B	(A0),D0
	MOVE.B	(A0)+,D0
	MOVE.L	A0,(A6)
	MOVE.L	D0,-(A6)
*
	RTS

68KCOUNTfinal
	MOVE.L	(A6),A0
	CLR.L	D0	; C@
	MOVE.B	(A0)+,D0
	MOVE.L	A0,(A6)
	MOVE.L	D0,-(A6)
	RTS


* compare (Could this be done, too?):
68KCOUNTmaybe:
	MOVE.L	(A6),A0
	CLR.L	D0
	MOVE.B	(A0)+,D0
	MOVE.L	D0,-(A6)
	MOVE.L	A0,4(A6)
	RTS

The 6801's 16-bit support, with more primitive resources, induces a different path:

******************************
* Next, 6801

* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP	RMB	2	; parameter stack pointer
* DTEMPA	RMB	2	; temp for SWAP, ...

6801COUNTcall:
*	JSR	DOCOL	; direct call
	JSR	DUP
	JSR	ONEP
	JSR	SWAP
	JSR	CAT
*	JSR	SEMIS
	RTS

* Leaf routines defined as
6801DUP
	LDX	PSP
	LDD	0,X
	DEX
	DEX
	STD	0,X
	STX	PSP
	RTS

6801ONEP
*	LDX	PSP
*	INC	1,X	; harder to optimize, keep it simple.
*	BNE	ONEPNC
*	INC	0,X
*ONEPNC	STX	PSP
*	RTS
	LDX	PSP
	LDD	0,X
	ADDD	#1
	STD	0,X
	RTS

6801SWAP
*	LDX	PSP	; this uses no static local variable,
*	LDAA	0,X	; but it will be harder to optimize
*	LDAB	2,X
*	STAA	2,X
*	STAB	0,X
*	LDAA	1,X
*	LDAB	3,X
*	STAA	3,X
*	STAB	1,X
*	RTS
	LDX	PSP
	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULB
	PULA
	STD	2,X
	RTS
 
6801CAT
	LDX	PSP
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
	RTS

* in-line:
6801COUNTinline:
	LDX	PSP	; DUP
	LDD	0,X
	DEX
	DEX
	STD	0,X
	STX	PSP
*
	LDX	PSP	; 1+
	LDD	0,X
	ADDD	#1
	STD	0,X
*
	LDX	PSP	; SWAP
	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULB
	PULA
	STD	2,X
*
	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
*
	RTS

* Vacuum out the data motion on the stack:
6801COUNTpass1
	LDX	PSP	; DUP
	LDD	0,X
	DEX
	DEX
*	STD	0,X
	STX	PSP
*
*	LDX	PSP	; 1+
*	LDD	0,X
	ADDD	#1
*	STD	0,X
*
*	LDX	PSP	; SWAP
*	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULB
	PULA
	STD	2,X
*
*	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
*
	RTS

* Combine and simplify is stuck at this point,
6801COUNTrearrange
	LDX	PSP	; DUP
*	LDD	0,X
	DEX
	DEX
	STX	PSP
*
*	ADDD	#1	; 1+
*
**	STD	DTEMPA	; SWAP
*	PULB
*	PULA
*	LDD	2,X
*	STD	0,X
**	LDD	DTEMPA
*	PULB
*	PULA
*	STD	2,X
	LDD	2,X	; SWAP
	STD	0,X
*
	LDX	0,X	; C@
	CLRA
	LDB	0,X
*
	LDX	PSP
	STD	0,X
*
	LDD	2,X
	ADDD	#1	; 1+
	STD	2,X
*
	RTS

* Postponing stack operations is already done:
6801COUNTfinal
	LDX	PSP
	DEX
	DEX
	STX	PSP
	LDD	2,X	; DUP {SWAP}
	STD	0,X
	LDX	0,X	; C@
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
	LDD	2,X	; 1+
	ADDD	#1
	STD	2,X
	RTS

*6801COUNTmaybe	; No obvious alternate paths.

The 6800's lack of 16-bit support induces yet different paths, which are (surprisingly?) similar to the 6809's and 68000's paths:

******************************
* And, 6800

* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP	RMB	2	; parameter stack pointer
DTEMPA	RMB	2	; temp for SWAP, ...

6800COUNTcall:
*	JSR	DOCOL	; direct call
	JSR	DUP
	JSR	ONEP
	JSR	SWAP
	JSR	CAT
*	JSR	SEMIS
	RTS

* Leaf routines defined as
6800DUP
	LDX	PSP	; will doing this one byte at a time
	LDAA	0,X	; be easier to optimize or harder?
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
	RTS

6800ONEP
	LDX	PSP
	INC	1,X	; Have to add byte at a time anyway.
	BNE	ONEPNC
	INC	0,X
ONEPNC	STX	PSP
	RTS
*	LDX	PSP
*	LDAA	0,X
*	LDAB	1,X
*	ADDB	#1
*	ADCA	#0
*	STAA	0,X
*	STAB	1,X
*	RTS

6800SWAP
*	LDX	PSP	; Use accumulaters for intermediates
*	LDAA	0,X	; Requires special case to recognize what's where.
*	LDAB	2,X
*	STAA	2,X
*	STAB	0,X
*	LDAA	1,X
*	LDAB	3,X
*	STAA	3,X
*	STAB	1,X
*	RTS
	LDX	PSP	; Should be easier to optimize.
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULB
	PULA
	STAA	2,X
	STAB	3,X
	RTS
 
6800CAT
	LDX	PSP
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
	RTS

* in-line:
6800COUNTinline:
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
*
	LDX	PSP	; 1+
	INC	1,X
	BNE	ONEPNC
	INC	0,X
ONEPNC
	STX	PSP
*
	LDX	PSP	; SWAP
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULB
	PULA
	STAA	2,X
	STAB	3,X
* 
	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Vacuum out the easy data motion:
6800COUNTpass1
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
*	LDX	PSP	; 1+
	INC	1,X
	BNE	ONEPNC
	INC	0,X
ONEPNC
*	STX	PSP
*
*	LDX	PSP	; SWAP
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULB
	PULA
	STAA	2,X
	STAB	3,X
* 
*	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Some easy combinations and data movement tracking:
6800COUNTpass1_1
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
*	INC	1,X	; 1+
	INCB		; 1+
	BNE	ONEPNC
*	INC	0,X
	INCA
ONEPNC
*
*	LDAA	0,X	; SWAP should almost always optimize out.
*	LDAB	1,X
*	PSHB		; avoid opportunities to make interrupt-time issues
*	PSHA
*	LDAA	2,X
*	LDAB	3,X
*	STAA	0,X
*	STAB	1,X
*	PULB
*	PULA
	STAA	2,X	; SWAP
	STAB	3,X
* 
	LDX	0,X	; C@
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Combine one more,
6800COUNTrearrange
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
	INCB		; 1+
	BNE	ONEPNC
	INCA
ONEPNC
	STAA	2,X	; SWAP
	STAB	3,X
* 
	LDX	0,X	; C@
*	CLRA
	LDB	0,X
	LDX	PSP
*	STAA	0,X
	CLR	0,X
	STAB	1,X
*
	RTS

* Final
6800COUNTfinal
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
	INCB		; 1+
	BNE	ONEPNC
	INCA
ONEPNC
	STAA	2,X	; SWAP
	STAB	3,X
	LDX	0,X	; C@
	LDB	0,X
	LDX	PSP
	CLR	0,X
	STAB	1,X
	RTS

*6800COUNTmaybe	; No obvious alternate paths.

JFTR, this code has not been particuly tested.

Sunday, February 11, 2024

Assembly Language Programming Primer -- Preface

Preface

Lance Leventhal wrote a series of books on assembly language programming that many regard as the Assembly Language Programming Bible for their favorite microprocessor.

I theoretically own, somewhere, if the rain hasn't destroyed them yet, a copy of two of his books, the one on the 6809 and the one on the 68000. Thirty years or so ago, I kept trying to use them to help me with a certain project that I have given way too much of my life to, until I realized two things:

The first was that I already understood everything in both of those books, just from studying Motorola's programmers' manuals. Motorola had some really amazing documents at the time. Not perfect, but amazing.

The second was that the approach he had taken was leading me away from my goals in that project in subtle ways. He wasn't teaching assembly language programming, he was teaching a specific discipline that was being taken up by many in the industry as intellectual meta-infrastructure.

Not incidentally, I was a fan of an internal combustion engine that I read about in Popular Science that a guy named Turner designed, that a friend once looked at and said, "That's a U-joint turned into an engine." I don't remember if it was Turner or the author of the Popular Science article on the engine who called it a rotary V engine -- as in rotary V-8, which makes it sound like a contradiction in terms. 

Another friend doubted that the torsional stresses could be properly taken care of. I asked him about camshafts, and he said, because we've already done that with camshafts.

Maybe I'm a hopeless fan of underdogs.

Turning that engine into a product for ordinary consumers would have required creating a support infrastructure -- mechanics and repair shops trained in subtle technical differences that would not just know how to work on it, but how it works. In addition, there would be parts suppliers, and sales networks and such. But the ordinary internal combustion engine support infrastructure is different, and somehow gets in the way of building support structures for odd-ball engines like that Turner rotary V, and the Wankel engine that Mazda finally gave up on, among many others. 

Infrastructure for electric vehicles has been significantly helped by a certain billionaire's involvement. (And rotary engines seem to being brought along, in a comeback as range extenders.)

Established infrastructure gets in the way of many interesting things in our society.

Somebody famous has said, 

The perfect is the enemy of the good.

Many somebodies have parroted this truistic aphorism. 

We can logically invert this. 

The mediocre good is the enemy of perfection, and therefore the enemy of progress.

Now, mind you, human ideals can never be realized. Ideal is not real. God's ways are not our ways, and his thoughts are not our thoughts (Isaiah 58: 8, 9). But that is not the same thing as saying we should prioritize the mediocre good because perfection is hard to achieve. 

Nor is it saying that we can get along without ideals. Ideals are necessary, just not the ultimate end. Ideals keep us from becoming mediocre.

I'm rambling.

What does this side ramble have to do with assembly language primers?

I'm trying to find a step forward on making something real of at least parts of that project that has eaten too much of my life. (Am I a glutton for punishment?) 

Some FB friends who like Leventhal's book on the 6809 asked why I said it's not my favorite, and have challenged me to write a better primer. 

So I'm thinking maybe such a set of tutorials would be a workable step forward on that project.

That's kind of what I have in mind for this series of blog posts.

But I need to lay out a warning up front that my approach to assembly language is subtly incompatible with most of the existing programming tools. It looks really good until you realize that, if you use my approach, you will sooner or later find yourself trying to, as it were, walk on air, unless you take the time to understand the differences and are willing to develop some of your own tools and share them.

I don't have the means to develop all the missing support tools by myself.

The CPUs aren't quite there, either.

My favorite existing processor, Motorola's 6809, isn't really quite up to my approach, because it's missing an indexing mode that would allow using the Load Effective Address instructions to take the address of a variable in the run-time direct page. We can work around this, but it consumes instructions and processor cycles.

The 68000 isn't quite up to it, either. Not because the 68020 is too complex and the original 68000 doesn't have 32-bit constant offset index modes and it's hard to get Motorola's CPU32 without a lot of SOC stuff that you don't want to mess with initializing, etc. The problem is that the 68000 has all these index registers, but using them as segment registers gets in the way of using them as index registers. LoL. Not really a problem. We can work around this, as well.

Speaking of segment registers, Intel's x86 could have been a bit more workable -- if still clunky, but BP and SP are stuck in the same segment unless you use segment override. And you really don't want to do that because it gets in the way -- BP designed as frame base pointer instead of parameter stack pointer is definitely one of the things we would be fighting against on x86. There are workarounds, such as ignoring that the two stacks are in the same segment. (The Extra Segment doesn't work for this, last time I tried it. That is, you really don't want to be losing access to the parameter stack whenever you want to move strings.) I think I'll let someone else play with that. At least, I'm not interested in doing so again right now. I've seen what happens when that siren song takes its toll.

Existing CPUs that implement the primitives of Forth-like languages or LISP kernels don't quite do the job either. At least, I have never found one such that included support for such things as virtualizing address spaces. Real processors are all designed to work within the existing meta-infrastructure.

Motorola's 6800 and 6801 actually get closer than you might expect (ignoring the lack of direct-page opcodes for the unary operators), if you define the use of a few direct page variables as virtual registers -- software parameter stack pointer and such -- and define support routines for the virtual registers.

RISC processors can be similarly supported, but need macros instead of routines because of the cost of branches in deeply pipelined architectures. 

Of course, that's also true of any processor that doesn't include the necessary registers and operations as part of the native CPU architecture, but it consumes cycles and memory resources. And there is always that siren song of false optimizations that lures you away, into the existing meta-infrastructure.

I keep hearing from people rediscovering "stackless". That's one of those false optimizations. Why it's a false optimization takes a lot of explanation, though. I expect I'll at least partly cover that in the process here.

As we go, I'll demonstrate the necessary virtual registers and support routines for the 6800 and 6801. Once we see what we can do with them and what walls we run into without, we can talk about why and how to apply it other CPUs.

But tutorials need to start simply. Explaining why one needs a parameter stack while you are teaching how to build a software stack while you are explaining what LDAA 0,X means gets kind of unwieldy.

And if you are looking for a primer, all this talk of esoteric problems isn't getting you started writing assembly language programs.

Anyway, be warned:

If you proceed with this series, I'm going to teach you things that will make you dissatisfied with the tools you have available, and I don't have the means of creating the missing tools by myself. I hardly have the time to write the tutorial chapters.

So, what does the example I just gave of the 6800/6801 indexed mode load accumulator A

    LDAA 0,X

mean? And the 6809 equivalent, 

    LDA    ,X

And the 68000 near-equivalent, 

    MOVE.B (A0),D0

and the difference between that and, say,

    MOVE.L (A0),D0

on the 68000?

Ick. Jumping in too deep already, perhaps. Definitely needs context and underlying concepts. Maybe start with accumulators and data registers.

(Title Page/Index)