Sunday, February 25, 2024

Optimizing Direct-call Forth in 6809, 68000, and 6800/6801

(This is a little doodling inspired by a post suggesting stackless Forth in the Minimalist Computing Facebook group. Doing this for 6809, 68000, 6801, and 6800.)

Start with a bit of fig Forth to practice optimizing. All words called by COUNT are leaf words, making them easy to in-line. 

The fig source code I am working with, borrowed from my transcriptions of the 6800 model, with a few comments on what I am doing:

* From Dave Lion, et al., fig Forth model for 6800:
* Converting to direct call, in-lining, and optimizing.
* This is low-hanging fruit, but easy to see how it could work.
* Should be possible to do this all mechanically, I think.
* I'm going to pretend I know what the optimizer that
* does not yet exist will do for 6809, 68000, 6801 and 6800.
figCOUNT:
	FDB	DOCOL,DUP,ONEP,SWAP,CAT
	FDB	SEMIS
* 
figCOUNTlinearsource:
	FDB	DOCOL
	FDB	DUP
	FDB	ONEP
	FDB	SWAP
	FDB	CAT
	FDB	SEMIS
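
For reference, every version of COUNT below has to preserve the same stack effect, ( addr -- addr+1 byte ). Here is a minimal sketch of that behavior in Python rather than assembly. It is not part of the fig model; the function names, the list used as a stack (top at the end), and the 64-byte memory are all assumptions made for illustration.

```python
# Hypothetical reference model of fig-Forth COUNT: DUP 1+ SWAP C@
# The parameter stack is a Python list with the top of stack at the end.

def dup(stack, mem):
    stack.append(stack[-1])

def one_plus(stack, mem):
    stack[-1] = (stack[-1] + 1) & 0xFFFF   # 16-bit cell, as on the 6809/6800

def swap(stack, mem):
    stack[-1], stack[-2] = stack[-2], stack[-1]

def c_fetch(stack, mem):
    stack[-1] = mem[stack[-1]]             # byte fetch, zero-extended

COUNT = (dup, one_plus, swap, c_fetch)     # the word list, minus DOCOL/SEMIS

def run(word, stack, mem):
    for leaf in word:                      # every word called is a leaf
        leaf(stack, mem)
    return stack

mem = bytearray(64)
mem[16:22] = b"\x05HELLO"                  # counted string at address 16
print(run(COUNT, [16], mem))               # [17, 5]
```

The result is the address of the first character with the length byte on top, which is what each optimized sequence should also leave behind.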

I'll start with the 6809 because it should be easy and straightforward. It probably isn't important, but, to emphasize that I'm doing away with the VM overhead and just working within a run-time that projects the VM onto the 6809 ISA, I'm showing the direct calls in 6809 assembler, along with the functions called.

Below the in-line source, I'm showing the results of each theoretical optimization pass, starting with removing the easy extraneous data movement on the stack and in registers, and proceeding with moving things around and combining operations. Then I show the final source, along with an optional version that may not be easy to reach along mechanical paths:

******************************
* First, 6809
6809COUNTcall:
*	LBSR	DOCOL	; direct call
	LBSR	DUP
	LBSR	ONEP
	LBSR	SWAP
	LBSR	CAT
*	LBSR	SEMIS
	RTS

* Leaf routines defined as
6809DUP
	LDD	,U
	STD	,--U
	RTS

6809ONEP
*	INC	1,U	; harder to optimize, keep it simple.
*	BNE	ONEPNC
*	INC	,U
*ONEPNC	RTS
	LDD	,U
	ADDD	#1
	STD	,U
	RTS

6809SWAP
*	PULU	D,X	; harder to optimize, keep it simple.
*	EXG	D,X	; could also do separate stores, etc.
*	PSHU	D,X
*	RTS
*	
	LDD	,U
	LDX	2,U
	STD	2,U
	STX	,U
	RTS

6809CAT
	CLRA
	LDB	[,U]
	STD	,U
	RTS

* bringing the calls in-line:
6809COUNTinline:
	LDD	,U	; DUP
	STD	,--U
*
	LDD	,U	; 1+
	ADDD	#1
	STD	,U
*
	LDD	,U	; SWAP
	LDX	2,U
	STD	2,U
	STX	,U
*
	CLRA		; C@
	LDB	[,U]
	STD	,U
*
	RTS

* Vacuum out the data motion on the stack:
6809COUNTpass1
	LDD	,U	; DUP
*	STD	,--U
	LEAU	-2,U
*
*	LDD	,U	; 1+
	ADDD	#1
*	STD	,U
*
*	LDD	,U	; SWAP
	LDX	2,U
	STD	2,U
	STX	,U
*
	CLRA		; C@
	LDB	[,U]
	STD	,U
	RTS

* Combine and simplify:
6809COUNTpass2
	LDD	,U	; DUP
	LEAU	-2,U
*
	ADDD	#1	; 1+
*
	LDX	2,U	; SWAP
	STD	2,U	; Misordering possible.
*	STX	,U
*
	CLRA		; C@
*	LDB	[,U]
	LDB	,X
	STD	,U
	RTS

* Postpone stack operations:
6809COUNTpass3
	LDD	,U	; DUP
*	LEAU	-2,U
*
	ADDD	#1	; 1+
*
*	LDX	2,U	; SWAP
	LDX	,U	; SWAP
*	STD	2,U
	STD	,U
*
	CLRA		; C@
	LDB	,X
*	STD	,U
	STD	,--U
*
	RTS

6809COUNTrearrange
*	LDD	,U	; DUP
	LDX	,U
*
*	ADDD	#1	; 1+
*
*	LDX	,U	; SWAP
*	STD	,U
*
	CLRA		; C@
*	LDB	,X
	LDB	,X+
	STX	,U
*
	STD	,--U
	RTS
*
6809COUNTfinal
	LDX	,U
	CLRA		; C@
	LDB	,X+
	STX	,U
	STD	,--U
	RTS
*

* compare (Could this be done mechanically, too?):
6809COUNTmaybe:
	LDX	,U
	CLRA
	LDB	,X+
	STD	,--U
	STX	2,U
	RTS
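
Part of the "vacuum out the data motion" pass looks mechanical enough to sketch. The following is a hypothetical peephole rule in Python, not the optimizer itself (which does not exist yet): it drops a load of D from the top of the U stack when the instruction just before it stored D to that same cell. It assumes straight-line code with no label between the two instructions, and it handles only the redundant loads, not the dead stores that the pass above also removes.

```python
# Hypothetical peephole rule: "LDD ,U" is redundant right after
# "STD ,U" or "STD ,--U", because D already holds that cell's value.
# Assumes straight-line code (no branch target between the two).

INLINE = [
    ("LDD", ",U"), ("STD", ",--U"),                                # DUP
    ("LDD", ",U"), ("ADDD", "#1"), ("STD", ",U"),                  # 1+
    ("LDD", ",U"), ("LDX", "2,U"), ("STD", "2,U"), ("STX", ",U"),  # SWAP
    ("CLRA", ""), ("LDB", "[,U]"), ("STD", ",U"),                  # C@
    ("RTS", ""),
]

def drop_redundant_loads(code):
    out = []
    for op, arg in code:
        prev = out[-1] if out else None
        if (op, arg) == ("LDD", ",U") and prev in (("STD", ",U"), ("STD", ",--U")):
            continue            # skip the reload; D is already current
        out.append((op, arg))
    return out

for op, arg in drop_redundant_loads(INLINE):
    print(f"\t{op}\t{arg}".rstrip())
```

Run on the in-line sequence, this removes the two reloads at the start of 1+ and SWAP and leaves the other eleven instructions alone.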

The 68000 code follows the 6809 code's optimization paths rather closely, since they both support high-level run-time models quite well, and in similar ways.

******************************
* Now 68000:
68KCOUNTcall:
*	BSR.W	DOCOL	; direct call
	BSR.W	DUP
	BSR.W	ONEP
	BSR.W	SWAP
	BSR.W	CAT
*	BSR.W	SEMIS
	RTS

* Leaf routines defined as
68KDUP
*	MOVE.L	(A6),-(A6)	; Harder to optimize, keep it simple.
*	RTS
	MOVE.L	(A6),D0
	MOVE.L	D0,-(A6)
	RTS

68KONEP
*	ADD.L	#1,(A6)	; Harder to optimize, keep it simple.
*	RTS
	MOVE.L	(A6),D0	; Keep it simple
	ADD.L	#1,D0
	MOVE.L	D0,(A6)
	RTS

68KSWAP
*	MOVEM.L	(A6),D0/D1	; Harder to optimize, keep it simple
*	EXG	D0,D1
*	MOVEM.L	D0/D1,(A6)
*	RTS
	MOVE.L	(A6),D0
	MOVE.L	4(A6),D1
	MOVE.L	D0,4(A6)
	MOVE.L	D1,(A6)
	RTS

68KCAT
	CLR.L	D0	; zero-extend
	MOVE.L	(A6),A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
	RTS

* in-line:
68KCOUNTinline:
	MOVE.L	(A6),D0	; DUP
	MOVE.L	D0,-(A6)
*
	MOVE.L	(A6),D0	; 1+
	ADD.L	#1,D0
	MOVE.L	D0,(A6)
*
	MOVE.L	(A6),D0	; SWAP
	MOVE.L	4(A6),D1
	MOVE.L	D0,4(A6)
	MOVE.L	D1,(A6)
*
	CLR.L	D0	; C@
	MOVE.L	(A6),A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Vacuum out the data motion on the stack:
68KCOUNTpass1
	MOVE.L	(A6),D0	; DUP
*	MOVE.L	D0,-(A6)
	LEA	-4(A6),A6
*
*	MOVE.L	(A6),D0	; 1+
	ADD.L	#1,D0
*	MOVE.L	D0,(A6)
*
*	MOVE.L	(A6),D0	; SWAP
	MOVE.L	4(A6),D1	; Misordering possible.
	MOVE.L	D0,4(A6)
*	MOVE.L	D1,(A6)
*
	CLR.L	D0	; C@
*	MOVE.L	(A6),A0
	MOVE.L	D1,A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Combine and simplify:
68KCOUNTpass2
	MOVE.L	(A6),D0	; DUP
	LEA	-4(A6),A6
*
	ADD.L	#1,D0	; 1+
*
*	MOVE.L	4(A6),D1	; SWAP
	MOVE.L	4(A6),A0	; SWAP
	MOVE.L	D0,4(A6)
*
	CLR.L	D0	; C@
*	MOVE.L	D1,A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Postpone stack operations:
68KCOUNTpass3
	MOVE.L	(A6),D0	; DUP
*	LEA	-4(A6),A6
*
	ADD.L	#1,D0	; 1+
*
*	MOVE.L	4(A6),A0	; SWAP
	MOVE.L	(A6),A0	; SWAP
*	MOVE.L	D0,4(A6)
	MOVE.L	D0,(A6)
*
	CLR.L	D0	; C@
	MOVE.B	(A0),D0
*	MOVE.L	D0,(A6)
	MOVE.L	D0,-(A6)
*
	RTS

68KCOUNTrearrange
*	MOVE.L	(A6),D0	; DUP
	MOVE.L	(A6),A0
*
*	ADD.L	#1,D0	; 1+
*
*	MOVE.L	(A6),A0	; SWAP
*	MOVE.L	D0,(A6)
*
	CLR.L	D0	; C@
*	MOVE.B	(A0),D0
	MOVE.B	(A0)+,D0
	MOVE.L	A0,(A6)
	MOVE.L	D0,-(A6)
*
	RTS

68KCOUNTfinal
	MOVE.L	(A6),A0
	CLR.L	D0	; C@
	MOVE.B	(A0)+,D0
	MOVE.L	A0,(A6)
	MOVE.L	D0,-(A6)
	RTS


* compare (Could this be done, too?):
68KCOUNTmaybe:
	MOVE.L	(A6),A0
	CLR.L	D0
	MOVE.B	(A0)+,D0
	MOVE.L	D0,-(A6)
	MOVE.L	A0,4(A6)
	RTS
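
One detail worth checking when carrying the 6809 path over to the 68000: cells here are 32 bits, so the second cell on the parameter stack sits 4 bytes above the top, not 2. Here is a small sketch in Python to make the offsets concrete. The RAM size, the starting value of A6, and the helper names are assumptions for illustration only.

```python
import struct

CELL = 4                       # bytes per cell in this 68000 model (32-bit)
ram = bytearray(32)
a6 = 16                        # hypothetical parameter stack pointer value

def push(value):
    global a6
    a6 -= CELL                                  # MOVE.L D0,-(A6)
    struct.pack_into(">L", ram, a6, value)      # big-endian long

def cell(offset):
    # MOVE.L offset(A6),Dn
    return struct.unpack_from(">L", ram, a6 + offset)[0]

push(0x00012345)               # ends up second on stack
push(0x000ABCDE)               # top of stack
print(hex(cell(0)), hex(cell(CELL)), hex(cell(2)))
# 0xabcde 0x12345 0xbcde0001
```

Reading at offset 2 straddles the two cells, picking up the low half of the top cell and the high half of the one below it.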

The 6801's 16-bit support, with more primitive resources, induces a different path:

******************************
* Next, 6801

* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP	RMB	2	; parameter stack pointer
* DTEMPA	RMB	2	; temp for SWAP, ...

6801COUNTcall:
*	JSR	DOCOL	; direct call
	JSR	DUP
	JSR	ONEP
	JSR	SWAP
	JSR	CAT
*	JSR	SEMIS
	RTS

* Leaf routines defined as
6801DUP
	LDX	PSP
	LDD	0,X
	DEX
	DEX
	STD	0,X
	STX	PSP
	RTS

6801ONEP
*	LDX	PSP
*	INC	1,X	; harder to optimize, keep it simple.
*	BNE	ONEPNC
*	INC	0,X
*ONEPNC	STX	PSP
*	RTS
	LDX	PSP
	LDD	0,X
	ADDD	#1
	STD	0,X
	RTS

6801SWAP
*	LDX	PSP	; this uses no static local variable,
*	LDAA	0,X	; but it will be harder to optimize
*	LDAB	2,X
*	STAA	2,X
*	STAB	0,X
*	LDAA	1,X
*	LDAB	3,X
*	STAA	3,X
*	STAB	1,X
*	RTS
	LDX	PSP
	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULA		; pull in reverse of push order
	PULB
	STD	2,X
	RTS
 
6801CAT
	LDX	PSP
	LDX	0,X
	CLRA
	LDAB	0,X
	LDX	PSP
	STD	0,X
	RTS

* in-line:
6801COUNTinline:
	LDX	PSP	; DUP
	LDD	0,X
	DEX
	DEX
	STD	0,X
	STX	PSP
*
	LDX	PSP	; 1+
	LDD	0,X
	ADDD	#1
	STD	0,X
*
	LDX	PSP	; SWAP
	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULA		; pull in reverse of push order
	PULB
	STD	2,X
*
	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDAB	0,X
	LDX	PSP
	STD	0,X
*
	RTS

* Vacuum out the data motion on the stack:
6801COUNTpass1
	LDX	PSP	; DUP
	LDD	0,X
	DEX
	DEX
*	STD	0,X
	STX	PSP
*
*	LDX	PSP	; 1+
*	LDD	0,X
	ADDD	#1
*	STD	0,X
*
*	LDX	PSP	; SWAP
*	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULA		; pull in reverse of push order
	PULB
	STD	2,X
*
*	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDAB	0,X
	LDX	PSP
	STD	0,X
*
	RTS

* Combine and simplify is stuck at this point,
6801COUNTrearrange
	LDX	PSP	; DUP
*	LDD	0,X
	DEX
	DEX
	STX	PSP
*
*	ADDD	#1	; 1+
*
**	STD	DTEMPA	; SWAP
*	PSHB
*	PSHA
*	LDD	2,X
*	STD	0,X
**	LDD	DTEMPA
*	PULA
*	PULB
*	STD	2,X
	LDD	2,X	; SWAP
	STD	0,X
*
	LDX	0,X	; C@
	CLRA
	LDAB	0,X
*
	LDX	PSP
	STD	0,X
*
	LDD	2,X
	ADDD	#1	; 1+
	STD	2,X
*
	RTS

* Postponing stack operations is already done:
6801COUNTfinal
	LDX	PSP
	DEX
	DEX
	STX	PSP
	LDD	2,X	; DUP {SWAP}
	STD	0,X
	LDX	0,X	; C@
	CLRA
	LDAB	0,X
	LDX	PSP
	STD	0,X
	LDD	2,X	; 1+
	ADDD	#1
	STD	2,X
	RTS

*6801COUNTmaybe	; No obvious alternate paths.
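
The SWAP leaf above parks D on the hardware stack instead of in DTEMPA. Since S is last-in, first-out, the pulls have to mirror the pushes: PSHB, PSHA is undone by PULA, PULB. Here is a small model in Python, assuming the usual 6800/6801 behavior (PSH stores at S, then decrements; PUL increments S, then loads). The RAM size and starting S are made up for illustration.

```python
# Hypothetical model of the 6800/6801 hardware stack.
ram = bytearray(16)
s = 15                         # S points at the next free byte

def psh(byte):
    global s
    ram[s] = byte              # store at S ...
    s -= 1                     # ... then decrement

def pul():
    global s
    s += 1                     # increment ...
    return ram[s]              # ... then load

a, b = 0x12, 0x34              # D = $1234
psh(b); psh(a)                 # PSHB, PSHA
a2 = pul(); b2 = pul()         # PULA, PULB restores D
print(hex(a2 << 8 | b2))       # 0x1234

psh(b); psh(a)                 # PSHB, PSHA
b3 = pul(); a3 = pul()         # PULB, PULA swaps the bytes instead
print(hex(a3 << 8 | b3))       # 0x3412
```

Pulling in the same order as the pushes exchanges A and B, which would quietly byte-swap the address being SWAPped.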

The 6800's lack of 16-bit support induces yet different paths, which are (surprisingly?) similar to the 6809's and 68000's paths:

******************************
* And, 6800

* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP	RMB	2	; parameter stack pointer
DTEMPA	RMB	2	; temp for SWAP, ...

6800COUNTcall:
*	JSR	DOCOL	; direct call
	JSR	DUP
	JSR	ONEP
	JSR	SWAP
	JSR	CAT
*	JSR	SEMIS
	RTS

* Leaf routines defined as
6800DUP
	LDX	PSP	; will doing this one byte at a time
	LDAA	0,X	; be easier to optimize or harder?
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
	RTS

6800ONEP
	LDX	PSP
	INC	1,X	; Have to add byte at a time anyway.
	BNE	ONEPNC
	INC	0,X
ONEPNC	STX	PSP
	RTS
*	LDX	PSP
*	LDAA	0,X
*	LDAB	1,X
*	ADDB	#1
*	ADCA	#0
*	STAA	0,X
*	STAB	1,X
*	RTS

6800SWAP
*	LDX	PSP	; Use accumulators for intermediates
*	LDAA	0,X	; Requires special case to recognize what's where.
*	LDAB	2,X
*	STAA	2,X
*	STAB	0,X
*	LDAA	1,X
*	LDAB	3,X
*	STAA	3,X
*	STAB	1,X
*	RTS
	LDX	PSP	; Should be easier to optimize.
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULA		; pull in reverse of push order
	PULB
	STAA	2,X
	STAB	3,X
	RTS
 
6800CAT
	LDX	PSP
	LDX	0,X
	CLRA
	LDAB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
	RTS

* in-line:
6800COUNTinline:
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
*
	LDX	PSP	; 1+
	INC	1,X
	BNE	ONEPNC
	INC	0,X
ONEPNC
	STX	PSP
*
	LDX	PSP	; SWAP
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULA		; pull in reverse of push order
	PULB
	STAA	2,X
	STAB	3,X
* 
	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDAB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Vacuum out the easy data motion:
6800COUNTpass1
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
*	LDX	PSP	; 1+
	INC	1,X
	BNE	ONEPNC
	INC	0,X
ONEPNC
*	STX	PSP
*
*	LDX	PSP	; SWAP
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULA		; pull in reverse of push order
	PULB
	STAA	2,X
	STAB	3,X
* 
*	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDAB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Some easy combinations and data movement tracking:
6800COUNTpass1_1
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
*	INC	1,X	; 1+
	INCB		; 1+
	BNE	ONEPNC
*	INC	0,X
	INCA
ONEPNC
*
*	LDAA	0,X	; SWAP should almost always optimize out.
*	LDAB	1,X
*	PSHB		; avoid opportunities to make interrupt-time issues
*	PSHA
*	LDAA	2,X
*	LDAB	3,X
*	STAA	0,X
*	STAB	1,X
*	PULA
*	PULB
	STAA	2,X	; SWAP
	STAB	3,X
* 
	LDX	0,X	; C@
	CLRA
	LDAB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Combine one more,
6800COUNTrearrange
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
	INCB		; 1+
	BNE	ONEPNC
	INCA
ONEPNC
	STAA	2,X	; SWAP
	STAB	3,X
* 
	LDX	0,X	; C@
*	CLRA
	LDAB	0,X
	LDX	PSP
*	STAA	0,X
	CLR	0,X
	STAB	1,X
*
	RTS

* Final
6800COUNTfinal
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
	INCB		; 1+
	BNE	ONEPNC
	INCA
ONEPNC
	STAA	2,X	; SWAP
	STAB	3,X
	LDX	0,X	; C@
	LDAB	0,X
	LDX	PSP
	CLR	0,X
	STAB	1,X
	RTS

*6800COUNTmaybe	; No obvious alternate paths.

JFTR, this code has not been particularly tested.
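
One cheap way to test the idea, if not the assembly itself, is to model both ends of the optimization and compare them over every address. The sketch below is in Python, with made-up memory contents and a 16-bit cell assumed. `count_naive` follows the original word list and `count_optimized` follows the shape of the 6809 final version. It checks the reasoning only, not any assembler output.

```python
# Hypothetical check: the naive word list and the optimized shape
# (fetch with post-increment, write the pointer back, push the byte)
# should leave identical stacks for every 16-bit address.

def count_naive(stack, mem):
    stack.append(stack[-1])                      # DUP
    stack[-1] = (stack[-1] + 1) & 0xFFFF         # 1+
    stack[-1], stack[-2] = stack[-2], stack[-1]  # SWAP
    stack[-1] = mem[stack[-1]]                   # C@

def count_optimized(stack, mem):
    x = stack[-1]                # LDX ,U
    d = mem[x]                   # CLRA / LDB ,X+
    x = (x + 1) & 0xFFFF         #   (the post-increment)
    stack[-1] = x                # STX ,U
    stack.append(d)              # STD ,--U

mem = bytes((i * 7 + 3) & 0xFF for i in range(0x10000))
mismatches = 0
for addr in range(0x10000):
    s1, s2 = [0xBEEF, addr], [0xBEEF, addr]      # 0xBEEF: untouched cell below
    count_naive(s1, mem)
    count_optimized(s2, mem)
    if s1 != s2:
        mismatches += 1
print(mismatches)                # 0
```

The extra 0xBEEF cell under the argument is there to catch a version that accidentally disturbs the stack below its own parameter.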

Sunday, February 11, 2024

ALPP -- Assembly Language Programming Primer -- Preface

Preface

(Title Page/Index)

Lance Leventhal wrote a series of books on assembly language programming that many regard as the Assembly Language Programming Bible for their favorite microprocessor.

I theoretically own, somewhere, if the rain hasn't destroyed them yet, a copy of two of his books, the one on the 6809 and the one on the 68000. Thirty years or so ago, I kept trying to use them to help me with a certain project that I have given way too much of my life to, until I realized two things:

The first was that I already understood everything in both of those books, just from studying Motorola's programmers' manuals. Motorola had some really amazing documents at the time. Not perfect, but amazing.

The second was that the approach he had taken was leading me away from my goals in that project in subtle ways. He wasn't teaching assembly language programming, he was teaching a specific discipline that was being taken up by many in the industry as intellectual meta-infrastructure.

Not incidentally, I was a fan of an internal combustion engine that I read about in Popular Science that a guy named Turner designed, that a friend once looked at and said, "That's a U-joint turned into an engine." I don't remember if it was Turner or the author of the Popular Science article on the engine or someone else (Bricklin?) who called it a rotary V engine -- as in rotary V-8, which made it sound like a contradiction in terms. 

Another friend doubted that the torsional stresses could be properly taken care of. I asked him why torsion in camshafts wasn't a problem, and he said it was because we've already taken care of that with camshafts. (In other words, the engineering had already been done on camshafts; who was going to pay for the engineering on this new way to transfer power?)

Maybe I'm a hopeless fan of underdogs.

Turning that engine into a product for ordinary consumers would have required creating a support infrastructure -- mechanics and repair shops trained in subtle technical differences that would not just know how to work on it, but how it works. In addition, there would be parts suppliers, and sales networks and such. But the ordinary internal combustion engine support infrastructure is different, and somehow gets in the way of building support structures for odd-ball engines like that Turner rotary V, and the Wankel engine that Mazda finally (mostly) gave up on (for cars), among many others. 

With no infrastructure, trying to commercialize that engine was a little like walking on air.

Infrastructure for electric vehicles has been significantly helped by a certain billionaire's involvement. (And rotary engines seem to be making a comeback as range extenders.)

Established infrastructure gets in the way of many interesting things in our society.

Somebody famous has said, 

The perfect is the enemy of the good.

Many somebodies have parroted this truistic aphorism. 

We can logically invert this. 

The mediocre good is the enemy of perfection, and therefore the enemy of progress.

Now, mind you, human ideals can never be realized. Ideal is not real. God's ways are not our ways, and his thoughts are not our thoughts (Isaiah 55: 8, 9). But that is not the same thing as saying we should prioritize the mediocre good because perfection is hard to achieve.

Nor is it saying that we can get along without ideals. Ideals are necessary, just not the ultimate end. Ideals keep us from becoming mediocre.

I'm rambling.

What does this side ramble have to do with assembly language primers?

I'm trying to find a step forward on making something real of at least parts of that project that has eaten too much of my life. (Am I a glutton for punishment?) 

Some FB friends who like Leventhal's book on the 6809 asked why I said it's not my favorite, and have challenged me to write a better primer. 

So I'm thinking maybe such a set of tutorials would be a workable step forward on that project.

That's kind of what I have in mind for this series of blog posts -- a public record of my work, but trying to make it accessible to people who try to read it.

So I need to lay out a warning up front that my approach to assembly language is subtly incompatible with most of the existing programming tools. It looks really good until you realize that, if you use my approach, you will sooner or later find yourself trying to, as it were, walk on air (no infrastructure), unless you take the time to understand the differences and are willing to develop some of your own tools and share them.

I don't have the means to develop all the missing support tools by myself.

The existing CPUs aren't quite there, either.

My favorite existing processor, Motorola's venerable 6809, isn't really quite up to my approach, because it's missing an indexing mode that would allow using the Load Effective Address instructions to take the address of a variable in the run-time direct page. We can work around this, but it consumes instructions and processor cycles.

(Speaking of the 6809, it's surprising how many professional engineers pick it up and see the U index register as a frame pointer rather than a second stack pointer.)

The 68000 isn't quite up to it, either. It is true that the 68020 is too complex, that the original 68000 doesn't have 32-bit constant-offset index modes, and that it's hard to get Motorola's CPU32 without a lot of SOC stuff that you don't want to mess with initializing, but none of that is the real problem. The problem is that the 68000 has all these index registers, but using them as segment registers gets in the way of using them as index registers. LoL. Not really a problem, but a problem. We can work around this, as well.

Speaking of segment registers, Intel's x86 could have been a bit more workable -- if still clunky -- but BP and SP are stuck in the same segment unless you use segment override. And you really don't want to do that because it's dangerous and gets in the way -- BP designed as frame base pointer instead of parameter stack pointer is definitely one of the things we would be fighting against on x86. There are workarounds, such as ignoring that the two stacks are in the same segment. (The Extra Segment didn't work for this the last time I tried it. That is, you really don't want to be losing access to the parameter stack every time you want to move strings.) I think I'll let someone else play with that. At least, I'm not interested in doing so again right now. I've seen what happens when that siren song takes its toll.

Existing CPUs that implement the primitives of Forth-like languages or LISP kernels don't quite do the job either. At least, I have never found one that included support for such things as virtualizing address spaces.

Real processors are all designed to work within the existing meta-infrastructure.

Motorola's 6800 and 6801 actually get closer than you might expect, if you define the use of a few direct page variables as virtual registers -- software parameter stack pointer and such -- and define support routines for the virtual registers. They would be even closer if Motorola hadn't taken a false optimization in the design and left the direct-page addressing mode out of the unary (read-modify-write) group of op-codes. (Hindsight is twenty/twenty, particularly from the passenger seat.)

RISC processors can be similarly supported, but need macros instead of routines for missing operators because of the cost of branches in deeply pipelined architectures. Modified RISC, like ARM and PowerPC, support building the new infrastructure rather well, but will still need macros for certain common operations.

Of course, that's also true of any processor that doesn't include the necessary registers and operations as part of the native CPU architecture, but it consumes cycles and memory resources. And there is always that siren song of false optimizations that lures you away, into the existing meta-infrastructure.

I keep hearing from people rediscovering "stackless". That's one of those false optimizations. Why it's a false optimization takes a lot of explanation, though. I expect I'll at least partly cover that -- implicitly -- in the process here.

As we go, I'll demonstrate the necessary virtual registers and support routines for the 6800 and 6801. Once we see what we can do with them and what walls we run into without them, we can talk about why and how to apply the approach to other CPUs.

But tutorials need to start simply. Explaining why one needs a parameter stack while you are teaching how to build a software stack while you are explaining what LDAA 0,X means gets kind of unwieldy.

And if you are looking for a primer, all this talk of esoteric problems isn't getting you started writing assembly language programs.

Anyway, be warned:

If you proceed with this series, I'm going to teach you things that will make you dissatisfied with the tools you have available, and I don't have the means of creating the missing tools by myself. I hardly have the time to write the tutorial chapters.

So, what does the example I just gave of the 6800/6801 indexed mode load accumulator A

    LDAA 0,X

mean? And the 6809 equivalent, 

    LDA    ,X

And the 68000 near-equivalent, 

    MOVE.B (A0),D0

and the difference between that and, say,

    MOVE.L (A0),D0

on the 68000?

Ick. Jumping in too deep already, perhaps. Definitely needs context and underlying concepts. 

Maybe start with accumulators and data registers.
