(This is a little doodling inspired by a post suggesting stackless Forth in the Minimalist Computing Facebook group. Doing this for 6809, 68000, and 6800.)
Start with a bit of fig Forth to practice optimizing. All words called by COUNT are leaf words, making them easy to in-line.
The fig source code ai am working with, borrowed from
my transcriptions of the 6800 model, with a few comments on what I am doing:
* From Dave Lion, et. al, fig Forth model for 6800:
* Converting to direct call, in-lining, and optimizing.
* This is low-hanging fruit, but easy to see how it could work.
* Should be possible to do this all mechanically, I think.
* I'm going to pretend I know what the optimizer that
* does not yet exist will do for 6809, 68000, 6801 and 6800.
figCOUNT:
FDB DOCOL,DUP,ONEP,SWAP,CAT
FDB SEMIS
*
figCOUNTlinearsource:
FDB DOCOL
FDB DUP
FDB ONEP
FDB SWAP
FDB CAT
FDB SEMIS
I'll start with the 6809 because it should be easy and straightforward. It probably isn't important, but, to emphasize that I'm doing away with the VM overhead and just working within a run-time that projects the VM onto the 6809 IPA, I'm showing the direct calls in the 6809 assembler, along with the functions called.
Below the in-line source, I'm showing the results of each theoretical
optimization pass, starting with removing the easy extraneous data movement on
the stack and in registers, and proceeding with moving things around and
combining operations. Then I show the final source, along with an optional
version that may not be easy to reach along mechanical paths:
******************************
* First, 6809
6809COUNTcall:
* LBSR DOCOL ; direct call
LBSR DUP
LBSR ONEP
LBSR SWAP
LBSR CAT
* LBSR SEMIS
RTS
* Leaf routines defined as
6809DUP
LDD ,U
STD ,--U
RTS
6809ONEP
* INC 1,U ; harder to optimize, keep it simple.
* BNE ONEPNC
* INC ,U
*ONEPNC RTS
LDD ,U
ADDD #1
STD ,U
RTS
6809SWAP
* PULU D,X ; harder to optimize, keep it simple.
* EXG D,X ; could also do separate stores, etc.
* PSHU D,X
* RTS
*
LDD ,U
LDX 2,U
STD 2,U
STX ,U
RTS
6809CAT
CLRA
LDB [,U]
STD ,U
RTS
* bringing the calls in-line:
6809COUNTinline:
LDD ,U ; DUP
STD ,--U
*
LDD ,U ; 1+
ADDD #1
STD ,U
*
LDD ,U ; SWAP
LDX 2,U
STD 2,U
STX ,U
*
CLRA ; C@
LDB [,U]
STD ,U
*
RTS
* Vacuum out the data motion on the stack:
6809COUNTpass1
LDD ,U ; DUP
* STD ,--U
LEAU -2,U
*
* LDD ,U ; 1+
ADDD #1
* STD ,U
*
* LDD ,U ; SWAP
LDX 2,U
STD 2,U
STX ,U
*
CLRA ; C@
LDB [,U]
STD ,U
RTS
* Combine and simplify:
6809COUNTpass2
LDD ,U ; DUP
LEAU -2,U
*
ADDD #1 ; 1+
*
LDX 2,U ; SWAP
STD 2,U ; Misordering possible.
* STX ,U
*
CLRA ; C@
* LDB [,U]
LDB ,X
STD ,U
RTS
* Postpone stack operations:
6809COUNTpass3
LDD ,U ; DUP
* LEAU -2,U
*
ADDD #1 ; 1+
*
* LDX 2,U ; SWAP
LDX ,U ; SWAP
* STD 2,U
STD ,U
*
CLRA ; C@
LDB ,X
* STD ,U
STD ,--U
*
RTS
6809COUNTrearrange
* LDD ,U ; DUP
LDX ,U
*
* ADDD #1 ; 1+
*
* LDX ,U ; SWAP
* STD ,U
*
CLRA ; C@
* LDB ,X
LDB ,X+
STX ,U
*
STD ,--U
RTS
*
6809COUNTfinal
LDX ,U
CLRA ; C@
LDB ,X+
STX ,U
STD ,--U
RTS
*
* compare (Could this be done mechanically, too?):
6809COUNTmaybe:
LDX ,U
CLRA
LDB ,X+
STD ,--U
STX 2,U
RTS
The 68000 code follows the 6809 code's optimization paths rather closely, since they both support high-level run-time models quite well, and in similar ways.
******************************
* Now 68000:
68KCOUNTcall:
* BSR.W DOCOL ; direct call
BSR.W DUP
BSR.W ONEP
BSR.W SWAP
BSR.W CAT
* BSR.W SEMIS
RTS
* Leaf routines defined as
68KDUP
* MOVE.L (A6),-(A6) ; Harder to optimize, keep it simple.
* RTS
MOVE.L (A6),D0
MOVE.L D0,-(A6)
RTS
68KONEP
* ADD.L #1,(A6) ; Harder to optimize, keep it simple.
* RTS
MOVE.L (A6),D0 ; Keep it simple
ADD.L #1,D0
MOVE.L D0,(A6)
RTS
68KSWAP
* MOVEM.L (A6),D0/D1 ; Harder to optimize, keep it simple
* EXG D0,D1
* MOVEM.L D0/D1,(A6)
* RTS
MOVE.L (A6),D0
MOVE.L 2(A6),D1
MOVE.L D0,2(A6)
MOVE.L D1,(A6)
RTS
68KCAT
CLR.L D0 ; zero-extend
MOVE.L (A6),A0
MOVE.B (A0),D0
MOVE.L D0,(A6)
RTS
* in-line:
68KCOUNTinline:
MOVE.L (A6),D0 ; DUP
MOVE.L D0,-(A6)
*
MOVE.L (A6),D0 ; 1+
ADD.L #1,D0
MOVE.L D0,(A6)
*
MOVE.L (A6),D0 ; SWAP
MOVE.L 2(A6),D1
MOVE.L D0,2(A6)
MOVE.L D1,(A6)
*
CLR.L D0 ; C@
MOVE.L (A6),A0
MOVE.B (A0),D0
MOVE.L D0,(A6)
*
RTS
* Vacuum out the data motion on the stack:
68KCOUNTpass1
MOVE.L (A6),D0 ; DUP
* MOVE.L D0,-(A6)
LEA -4(A6),A6
*
* MOVE.L (A6),D0 ; 1+
ADD.L #1,D0
* MOVE.L D0,(A6)
*
* MOVE.L (A6),D0 ; SWAP
MOVE.L 2(A6),D1 ; Misordering possible.
MOVE.L D0,2(A6)
* MOVE.L D1,(A6)
*
CLR.L D0 ; C@
* MOVE.L (A6),A0
MOVE.L D1,A0
MOVE.B (A0),D0
MOVE.L D0,(A6)
*
RTS
* Combine and simplify:
68KCOUNTpass2
MOVE.L (A6),D0 ; DUP
LEA -4(A6),A6
*
ADD.L #1,D0 ; 1+
*
* MOVE.L 4(A6),D1 ; SWAP
MOVE.L 4(A6),A0 ; SWAP
MOVE.L D0,4(A6)
*
CLR.L D0 ; C@
* MOVE.L D1,A0
MOVE.B (A0),D0
MOVE.L D0,(A6)
*
RTS
* Postpone stack operations:
68KCOUNTpass3
MOVE.L (A6),D0 ; DUP
* LEA -4(A6),A6
*
ADD.L #1,D0 ; 1+
*
* MOVE.L 4(A6),A0 ; SWAP
MOVE.L (A6),A0 ; SWAP
* MOVE.L D0,4(A6)
MOVE.L D0,(A6)
*
CLR.L D0 ; C@
MOVE.B (A0),D0
* MOVE.L D0,(A6)
MOVE.L D0,-(A6)
*
RTS
68KCOUNTrearrange
* MOVE.L (A6),D0 ; DUP
MOVE.L (A6),A0
*
* ADD.L #1,D0 ; 1+
*
* MOVE.L (A6),A0 ; SWAP
* MOVE.L D0,(A6)
*
CLR.L D0 ; C@
* MOVE.B (A0),D0
MOVE.B (A0)+,D0
MOVE.L A0,(A6)
MOVE.L D0,-(A6)
*
RTS
68KCOUNTfinal
MOVE.L (A6),A0
CLR.L D0 ; C@
MOVE.B (A0)+,D0
MOVE.L A0,(A6)
MOVE.L D0,-(A6)
RTS
* compare (Could this be done, too?):
68KCOUNTmaybe:
MOVE.L (A6),A0
CLR.L D0
MOVE.B (A0)+,D0
MOVE.L D0,-(A6)
MOVE.L A0,4(A6)
RTS
The 6801's 16-bit support, with more primitive resources, induces a different path:
******************************
* Next, 6801
* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP RMB 2 ; parameter stack pointer
* DTEMPA RMB 2 ; temp for SWAP, ...
6801COUNTcall:
* JSR DOCOL ; direct call
JSR DUP
JSR ONEP
JSR SWAP
JSR CAT
* JSR SEMIS
RTS
* Leaf routines defined as
6801DUP
LDX PSP
LDD 0,X
DEX
DEX
STD 0,X
STX PSP
RTS
6801ONEP
* LDX PSP
* INC 1,X ; harder to optimize, keep it simple.
* BNE ONEPNC
* INC 0,X
*ONEPNC STX PSP
* RTS
LDX PSP
LDD 0,X
ADDD #1
STD 0,X
RTS
6801SWAP
* LDX PSP ; this uses no static local variable,
* LDAA 0,X ; but it will be harder to optimize
* LDAB 2,X
* STAA 2,X
* STAB 0,X
* LDAA 1,X
* LDAB 3,X
* STAA 3,X
* STAB 1,X
* RTS
LDX PSP
LDD 0,X
* STD DTEMPA ; Faster, but uses statically allocated variable
PSHB ; avoid opportunities to make interrupt-time issues
PSHA
LDD 2,X
STD 0,X
* LDD DTEMPA
PULB
PULA
STD 2,X
RTS
6801CAT
LDX PSP
LDX 0,X
CLRA
LDB 0,X
LDX PSP
STD 0,X
RTS
* in-line:
6801COUNTinline:
LDX PSP ; DUP
LDD 0,X
DEX
DEX
STD 0,X
STX PSP
*
LDX PSP ; 1+
LDD 0,X
ADDD #1
STD 0,X
*
LDX PSP ; SWAP
LDD 0,X
* STD DTEMPA ; Faster, but uses statically allocated variable
PSHB ; avoid opportunities to make interrupt-time issues
PSHA
LDD 2,X
STD 0,X
* LDD DTEMPA
PULB
PULA
STD 2,X
*
LDX PSP ; C@
LDX 0,X
CLRA
LDB 0,X
LDX PSP
STD 0,X
*
RTS
* Vacuum out the data motion on the stack:
6801COUNTpass1
LDX PSP ; DUP
LDD 0,X
DEX
DEX
* STD 0,X
STX PSP
*
* LDX PSP ; 1+
* LDD 0,X
ADDD #1
* STD 0,X
*
* LDX PSP ; SWAP
* LDD 0,X
* STD DTEMPA ; Faster, but uses statically allocated variable
PSHB ; avoid opportunities to make interrupt-time issues
PSHA
LDD 2,X
STD 0,X
* LDD DTEMPA
PULB
PULA
STD 2,X
*
* LDX PSP ; C@
LDX 0,X
CLRA
LDB 0,X
LDX PSP
STD 0,X
*
RTS
* Combine and simplify is stuck at this point,
6801COUNTrearrange
LDX PSP ; DUP
* LDD 0,X
DEX
DEX
STX PSP
*
* ADDD #1 ; 1+
*
** STD DTEMPA ; SWAP
* PULB
* PULA
* LDD 2,X
* STD 0,X
** LDD DTEMPA
* PULB
* PULA
* STD 2,X
LDD 2,X ; SWAP
STD 0,X
*
LDX 0,X ; C@
CLRA
LDB 0,X
*
LDX PSP
STD 0,X
*
LDD 2,X
ADDD #1 ; 1+
STD 2,X
*
RTS
* Postponing stack operations is already done:
6801COUNTfinal
LDX PSP
DEX
DEX
STX PSP
LDD 2,X ; DUP {SWAP}
STD 0,X
LDX 0,X ; C@
CLRA
LDB 0,X
LDX PSP
STD 0,X
LDD 2,X ; 1+
ADDD #1
STD 2,X
RTS
*6801COUNTmaybe ; No obvious alternate paths.
The 6800's lack of 16-bit support induces yet different paths, which are (surprisingly?) similar to the 6809's and 68000's paths:
******************************
* And, 6800
* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP RMB 2 ; parameter stack pointer
DTEMPA RMB 2 ; temp for SWAP, ...
6800COUNTcall:
* JSR DOCOL ; direct call
JSR DUP
JSR ONEP
JSR SWAP
JSR CAT
* JSR SEMIS
RTS
* Leaf routines defined as
6800DUP
LDX PSP ; will doing this one byte at a time
LDAA 0,X ; be easier to optimize or harder?
LDAB 1,X
DEX
DEX
STAA 0,X
STAB 1,X
STX PSP
RTS
6800ONEP
LDX PSP
INC 1,X ; Have to add byte at a time anyway.
BNE ONEPNC
INC 0,X
ONEPNC STX PSP
RTS
* LDX PSP
* LDAA 0,X
* LDAB 1,X
* ADDB #1
* ADCA #0
* STAA 0,X
* STAB 1,X
* RTS
6800SWAP
* LDX PSP ; Use accumulaters for intermediates
* LDAA 0,X ; Requires special case to recognize what's where.
* LDAB 2,X
* STAA 2,X
* STAB 0,X
* LDAA 1,X
* LDAB 3,X
* STAA 3,X
* STAB 1,X
* RTS
LDX PSP ; Should be easier to optimize.
LDAA 0,X ; SWAP should almost always optimize out.
LDAB 1,X
PSHB ; avoid opportunities to make interrupt-time issues
PSHA
LDAA 2,X
LDAB 3,X
STAA 0,X
STAB 1,X
PULB
PULA
STAA 2,X
STAB 3,X
RTS
6800CAT
LDX PSP
LDX 0,X
CLRA
LDB 0,X
LDX PSP
STAA 0,X
STAB 1,X
RTS
* in-line:
6800COUNTinline:
LDX PSP ; DUP
LDAA 0,X
LDAB 1,X
DEX
DEX
STAA 0,X
STAB 1,X
STX PSP
*
LDX PSP ; 1+
INC 1,X
BNE ONEPNC
INC 0,X
ONEPNC
STX PSP
*
LDX PSP ; SWAP
LDAA 0,X ; SWAP should almost always optimize out.
LDAB 1,X
PSHB ; avoid opportunities to make interrupt-time issues
PSHA
LDAA 2,X
LDAB 3,X
STAA 0,X
STAB 1,X
PULB
PULA
STAA 2,X
STAB 3,X
*
LDX PSP ; C@
LDX 0,X
CLRA
LDB 0,X
LDX PSP
STAA 0,X
STAB 1,X
*
RTS
* Vacuum out the easy data motion:
6800COUNTpass1
LDX PSP ; DUP
LDAA 0,X
LDAB 1,X
DEX
DEX
STAA 0,X
STAB 1,X
STX PSP ; Make the push permanent.
*
* LDX PSP ; 1+
INC 1,X
BNE ONEPNC
INC 0,X
ONEPNC
* STX PSP
*
* LDX PSP ; SWAP
LDAA 0,X ; SWAP should almost always optimize out.
LDAB 1,X
PSHB ; avoid opportunities to make interrupt-time issues
PSHA
LDAA 2,X
LDAB 3,X
STAA 0,X
STAB 1,X
PULB
PULA
STAA 2,X
STAB 3,X
*
* LDX PSP ; C@
LDX 0,X
CLRA
LDB 0,X
LDX PSP
STAA 0,X
STAB 1,X
*
RTS
* Some easy combinations and data movement tracking:
6800COUNTpass1_1
LDX PSP ; DUP
LDAA 0,X
LDAB 1,X
DEX
DEX
STAA 0,X
STAB 1,X
STX PSP ; Make the push permanent.
*
* INC 1,X ; 1+
INCB ; 1+
BNE ONEPNC
* INC 0,X
INCA
ONEPNC
*
* LDAA 0,X ; SWAP should almost always optimize out.
* LDAB 1,X
* PSHB ; avoid opportunities to make interrupt-time issues
* PSHA
* LDAA 2,X
* LDAB 3,X
* STAA 0,X
* STAB 1,X
* PULB
* PULA
STAA 2,X ; SWAP
STAB 3,X
*
LDX 0,X ; C@
CLRA
LDB 0,X
LDX PSP
STAA 0,X
STAB 1,X
*
RTS
* Combine one more,
6800COUNTrearrange
LDX PSP ; DUP
LDAA 0,X
LDAB 1,X
DEX
DEX
STAA 0,X
STAB 1,X
STX PSP ; Make the push permanent.
*
INCB ; 1+
BNE ONEPNC
INCA
ONEPNC
STAA 2,X ; SWAP
STAB 3,X
*
LDX 0,X ; C@
* CLRA
LDB 0,X
LDX PSP
* STAA 0,X
CLR 0,X
STAB 1,X
*
RTS
* Final
6800COUNTfinal
LDX PSP ; DUP
LDAA 0,X
LDAB 1,X
DEX
DEX
STAA 0,X
STAB 1,X
STX PSP ; Make the push permanent.
INCB ; 1+
BNE ONEPNC
INCA
ONEPNC
STAA 2,X ; SWAP
STAB 3,X
LDX 0,X ; C@
LDB 0,X
LDX PSP
CLR 0,X
STAB 1,X
RTS
*6800COUNTmaybe ; No obvious alternate paths.
JFTR, this code has not been particuly tested.