Sunday, February 25, 2024

Optimizing Direct-call Forth in 6809. 68000, and 6800/6801

(This is a little doodling inspired by a post suggesting stackless Forth in the Minimalist Computing Facebook group. Doing this for 6809, 68000, and 6800.) 

Start with a bit of fig Forth to practice optimizing. All words called by COUNT are leaf words, making them easy to in-line. 

The fig source code ai am working with, borrowed from my transcriptions of the 6800 model, with a few comments on what I am doing:

* From Dave Lion, et. al, fig Forth model for 6800:
* Converting to direct call, in-lining, and optimizing.
* This is low-hanging fruit, but easy to see how it could work.
* Should be possible to do this all mechanically, I think.
* I'm going to pretend I know what the optimizer that
* does not yet exist will do for 6809, 68000, 6801 and 6800.
figCOUNT:
	FDB	DOCOL,DUP,ONEP,SWAP,CAT
	FDB	SEMIS
* 
figCOUNTlinearsource:
	FDB	DOCOL
	FDB	DUP
	FDB	ONEP
	FDB	SWAP
	FDB	CAT
	FDB	SEMIS

I'll start with the 6809 because it should be easy and straightforward. It probably isn't important, but, to emphasize that I'm doing away with the VM overhead and just working within a run-time that projects the VM onto the 6809 IPA, I'm showing the direct calls in the 6809 assembler, along with the functions called.

Below the in-line source, I'm showing the results of each theoretical optimization pass, starting with removing the easy extraneous data movement on the stack and in registers, and proceeding with moving things around and combining operations. Then I show the final source, along with an optional version that may not be easy to reach along mechanical paths:

******************************
* First, 6809
6809COUNTcall:
*	LBSR	DOCOL	; direct call
	LBSR	DUP
	LBSR	ONEP
	LBSR	SWAP
	LBSR	CAT
*	LBSR	SEMIS
	RTS

* Leaf routines defined as
6809DUP
	LDD	,U
	STD	,--U
	RTS

6809ONEP
*	INC	1,U	; harder to optimize, keep it simple.
*	BNE	ONEPNC
*	INC	,U
*ONEPNC	RTS
	LDD	,U
	ADDD	#1
	STD	,U
	RTS

6809SWAP
*	PULU	D,X	; harder to optimize, keep it simple.
*	EXG	D,X	; could also do separate stores, etc.
*	PSHU	D,X
*	RTS
*	
	LDD	,U
	LDX	2,U
	STD	2,U
	STX	,U
	RTS

6809CAT
	CLRA
	LDB	[,U]
	STD	,U
	RTS

* bringing the calls in-line:
6809COUNTinline:
	LDD	,U	; DUP
	STD	,--U
*
	LDD	,U	; 1+
	ADDD	#1
	STD	,U
*
	LDD	,U	; SWAP
	LDX	2,U
	STD	2,U
	STX	,U
*
	CLRA		; C@
	LDB	[,U]
	STD	,U
*
	RTS

* Vacuum out the data motion on the stack:
6809COUNTpass1
	LDD	,U	; DUP
*	STD	,--U
	LEAU	-2,U
*
*	LDD	,U	; 1+
	ADDD	#1
*	STD	,U
*
*	LDD	,U	; SWAP
	LDX	2,U
	STD	2,U
	STX	,U
*
	CLRA		; C@
	LDB	[,U]
	STD	,U
	RTS

* Combine and simplify:
6809COUNTpass2
	LDD	,U	; DUP
	LEAU	-2,U
*
	ADDD	#1	; 1+
*
	LDX	2,U	; SWAP
	STD	2,U	; Misordering possible.
*	STX	,U
*
	CLRA		; C@
*	LDB	[,U]
	LDB	,X
	STD	,U
	RTS

* Postpone stack operations:
6809COUNTpass3
	LDD	,U	; DUP
*	LEAU	-2,U
*
	ADDD	#1	; 1+
*
*	LDX	2,U	; SWAP
	LDX	,U	; SWAP
*	STD	2,U
	STD	,U
*
	CLRA		; C@
	LDB	,X
*	STD	,U
	STD	,--U
*
	RTS

6809COUNTrearrange
*	LDD	,U	; DUP
	LDX	,U
*
*	ADDD	#1	; 1+
*
*	LDX	,U	; SWAP
*	STD	,U
*
	CLRA		; C@
*	LDB	,X
	LDB	,X+
	STX	,U
*
	STD	,--U
	RTS
*
6809COUNTfinal
	LDX	,U
	CLRA		; C@
	LDB	,X+
	STX	,U
	STD	,--U
	RTS
*

* compare (Could this be done mechanically, too?):
6809COUNTmaybe:
	LDX	,U
	CLRA
	LDB	,X+
	STD	,--U
	STX	2,U
	RTS

The 68000 code follows the 6809 code's optimization paths rather closely, since they both support high-level run-time models quite well, and in similar ways.

******************************
* Now 68000:
68KCOUNTcall:
*	BSR.W	DOCOL	; direct call
	BSR.W	DUP
	BSR.W	ONEP
	BSR.W	SWAP
	BSR.W	CAT
*	BSR.W	SEMIS
	RTS

* Leaf routines defined as
68KDUP
*	MOVE.L	(A6),-(A6)	; Harder to optimize, keep it simple.
*	RTS
	MOVE.L	(A6),D0
	MOVE.L	D0,-(A6)
	RTS

68KONEP
*	ADD.L	#1,(A6)	; Harder to optimize, keep it simple.
*	RTS
	MOVE.L	(A6),D0	; Keep it simple
	ADD.L	#1,D0
	MOVE.L	D0,(A6)
	RTS

68KSWAP
*	MOVEM.L	(A6),D0/D1	; Harder to optimize, keep it simple
*	EXG	D0,D1
*	MOVEM.L	D0/D1,(A6)
*	RTS
	MOVE.L	(A6),D0
	MOVE.L	2(A6),D1
	MOVE.L	D0,2(A6)
	MOVE.L	D1,(A6)
	RTS

68KCAT
	CLR.L	D0	; zero-extend
	MOVE.L	(A6),A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
	RTS

* in-line:
68KCOUNTinline:
	MOVE.L	(A6),D0	; DUP
	MOVE.L	D0,-(A6)
*
	MOVE.L	(A6),D0	; 1+
	ADD.L	#1,D0
	MOVE.L	D0,(A6)
*
	MOVE.L	(A6),D0	; SWAP
	MOVE.L	2(A6),D1
	MOVE.L	D0,2(A6)
	MOVE.L	D1,(A6)
*
	CLR.L	D0	; C@
	MOVE.L	(A6),A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Vacuum out the data motion on the stack:
68KCOUNTpass1
	MOVE.L	(A6),D0	; DUP
*	MOVE.L	D0,-(A6)
	LEA	-4(A6),A6
*
*	MOVE.L	(A6),D0	; 1+
	ADD.L	#1,D0
*	MOVE.L	D0,(A6)
*
*	MOVE.L	(A6),D0	; SWAP
	MOVE.L	2(A6),D1	; Misordering possible.
	MOVE.L	D0,2(A6)
*	MOVE.L	D1,(A6)
*
	CLR.L	D0	; C@
*	MOVE.L	(A6),A0
	MOVE.L	D1,A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Combine and simplify:
68KCOUNTpass2
	MOVE.L	(A6),D0	; DUP
	LEA	-4(A6),A6
*
	ADD.L	#1,D0	; 1+
*
*	MOVE.L	4(A6),D1	; SWAP
	MOVE.L	4(A6),A0	; SWAP
	MOVE.L	D0,4(A6)
*
	CLR.L	D0	; C@
*	MOVE.L	D1,A0
	MOVE.B	(A0),D0
	MOVE.L	D0,(A6)
*
	RTS

* Postpone stack operations:
68KCOUNTpass3
	MOVE.L	(A6),D0	; DUP
*	LEA	-4(A6),A6
*
	ADD.L	#1,D0	; 1+
*
*	MOVE.L	4(A6),A0	; SWAP
	MOVE.L	(A6),A0	; SWAP
*	MOVE.L	D0,4(A6)
	MOVE.L	D0,(A6)
*
	CLR.L	D0	; C@
	MOVE.B	(A0),D0
*	MOVE.L	D0,(A6)
	MOVE.L	D0,-(A6)
*
	RTS

68KCOUNTrearrange
*	MOVE.L	(A6),D0	; DUP
	MOVE.L	(A6),A0
*
*	ADD.L	#1,D0	; 1+
*
*	MOVE.L	(A6),A0	; SWAP
*	MOVE.L	D0,(A6)
*
	CLR.L	D0	; C@
*	MOVE.B	(A0),D0
	MOVE.B	(A0)+,D0
	MOVE.L	A0,(A6)
	MOVE.L	D0,-(A6)
*
	RTS

68KCOUNTfinal
	MOVE.L	(A6),A0
	CLR.L	D0	; C@
	MOVE.B	(A0)+,D0
	MOVE.L	A0,(A6)
	MOVE.L	D0,-(A6)
	RTS


* compare (Could this be done, too?):
68KCOUNTmaybe:
	MOVE.L	(A6),A0
	CLR.L	D0
	MOVE.B	(A0)+,D0
	MOVE.L	D0,-(A6)
	MOVE.L	A0,4(A6)
	RTS

The 6801's 16-bit support, with more primitive resources, induces a different path:

******************************
* Next, 6801

* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP	RMB	2	; parameter stack pointer
* DTEMPA	RMB	2	; temp for SWAP, ...

6801COUNTcall:
*	JSR	DOCOL	; direct call
	JSR	DUP
	JSR	ONEP
	JSR	SWAP
	JSR	CAT
*	JSR	SEMIS
	RTS

* Leaf routines defined as
6801DUP
	LDX	PSP
	LDD	0,X
	DEX
	DEX
	STD	0,X
	STX	PSP
	RTS

6801ONEP
*	LDX	PSP
*	INC	1,X	; harder to optimize, keep it simple.
*	BNE	ONEPNC
*	INC	0,X
*ONEPNC	STX	PSP
*	RTS
	LDX	PSP
	LDD	0,X
	ADDD	#1
	STD	0,X
	RTS

6801SWAP
*	LDX	PSP	; this uses no static local variable,
*	LDAA	0,X	; but it will be harder to optimize
*	LDAB	2,X
*	STAA	2,X
*	STAB	0,X
*	LDAA	1,X
*	LDAB	3,X
*	STAA	3,X
*	STAB	1,X
*	RTS
	LDX	PSP
	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULB
	PULA
	STD	2,X
	RTS
 
6801CAT
	LDX	PSP
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
	RTS

* in-line:
6801COUNTinline:
	LDX	PSP	; DUP
	LDD	0,X
	DEX
	DEX
	STD	0,X
	STX	PSP
*
	LDX	PSP	; 1+
	LDD	0,X
	ADDD	#1
	STD	0,X
*
	LDX	PSP	; SWAP
	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULB
	PULA
	STD	2,X
*
	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
*
	RTS

* Vacuum out the data motion on the stack:
6801COUNTpass1
	LDX	PSP	; DUP
	LDD	0,X
	DEX
	DEX
*	STD	0,X
	STX	PSP
*
*	LDX	PSP	; 1+
*	LDD	0,X
	ADDD	#1
*	STD	0,X
*
*	LDX	PSP	; SWAP
*	LDD	0,X
*	STD	DTEMPA	; Faster, but uses statically allocated variable
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDD	2,X
	STD	0,X
*	LDD	DTEMPA
	PULB
	PULA
	STD	2,X
*
*	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
*
	RTS

* Combine and simplify is stuck at this point,
6801COUNTrearrange
	LDX	PSP	; DUP
*	LDD	0,X
	DEX
	DEX
	STX	PSP
*
*	ADDD	#1	; 1+
*
**	STD	DTEMPA	; SWAP
*	PULB
*	PULA
*	LDD	2,X
*	STD	0,X
**	LDD	DTEMPA
*	PULB
*	PULA
*	STD	2,X
	LDD	2,X	; SWAP
	STD	0,X
*
	LDX	0,X	; C@
	CLRA
	LDB	0,X
*
	LDX	PSP
	STD	0,X
*
	LDD	2,X
	ADDD	#1	; 1+
	STD	2,X
*
	RTS

* Postponing stack operations is already done:
6801COUNTfinal
	LDX	PSP
	DEX
	DEX
	STX	PSP
	LDD	2,X	; DUP {SWAP}
	STD	0,X
	LDX	0,X	; C@
	CLRA
	LDB	0,X
	LDX	PSP
	STD	0,X
	LDD	2,X	; 1+
	ADDD	#1
	STD	2,X
	RTS

*6801COUNTmaybe	; No obvious alternate paths.

The 6800's lack of 16-bit support induces yet different paths, which are (surprisingly?) similar to the 6809's and 68000's paths:

******************************
* And, 6800

* Somewhere, preferably in the direct page,
* Must NOT be used by interrupt-time routines!
* -- Either save it or have interrupts use another PSP
PSP	RMB	2	; parameter stack pointer
DTEMPA	RMB	2	; temp for SWAP, ...

6800COUNTcall:
*	JSR	DOCOL	; direct call
	JSR	DUP
	JSR	ONEP
	JSR	SWAP
	JSR	CAT
*	JSR	SEMIS
	RTS

* Leaf routines defined as
6800DUP
	LDX	PSP	; will doing this one byte at a time
	LDAA	0,X	; be easier to optimize or harder?
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
	RTS

6800ONEP
	LDX	PSP
	INC	1,X	; Have to add byte at a time anyway.
	BNE	ONEPNC
	INC	0,X
ONEPNC	STX	PSP
	RTS
*	LDX	PSP
*	LDAA	0,X
*	LDAB	1,X
*	ADDB	#1
*	ADCA	#0
*	STAA	0,X
*	STAB	1,X
*	RTS

6800SWAP
*	LDX	PSP	; Use accumulaters for intermediates
*	LDAA	0,X	; Requires special case to recognize what's where.
*	LDAB	2,X
*	STAA	2,X
*	STAB	0,X
*	LDAA	1,X
*	LDAB	3,X
*	STAA	3,X
*	STAB	1,X
*	RTS
	LDX	PSP	; Should be easier to optimize.
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULB
	PULA
	STAA	2,X
	STAB	3,X
	RTS
 
6800CAT
	LDX	PSP
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
	RTS

* in-line:
6800COUNTinline:
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
*
	LDX	PSP	; 1+
	INC	1,X
	BNE	ONEPNC
	INC	0,X
ONEPNC
	STX	PSP
*
	LDX	PSP	; SWAP
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULB
	PULA
	STAA	2,X
	STAB	3,X
* 
	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Vacuum out the easy data motion:
6800COUNTpass1
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
*	LDX	PSP	; 1+
	INC	1,X
	BNE	ONEPNC
	INC	0,X
ONEPNC
*	STX	PSP
*
*	LDX	PSP	; SWAP
	LDAA	0,X	; SWAP should almost always optimize out.
	LDAB	1,X
	PSHB		; avoid opportunities to make interrupt-time issues
	PSHA
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	PULB
	PULA
	STAA	2,X
	STAB	3,X
* 
*	LDX	PSP	; C@
	LDX	0,X
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Some easy combinations and data movement tracking:
6800COUNTpass1_1
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
*	INC	1,X	; 1+
	INCB		; 1+
	BNE	ONEPNC
*	INC	0,X
	INCA
ONEPNC
*
*	LDAA	0,X	; SWAP should almost always optimize out.
*	LDAB	1,X
*	PSHB		; avoid opportunities to make interrupt-time issues
*	PSHA
*	LDAA	2,X
*	LDAB	3,X
*	STAA	0,X
*	STAB	1,X
*	PULB
*	PULA
	STAA	2,X	; SWAP
	STAB	3,X
* 
	LDX	0,X	; C@
	CLRA
	LDB	0,X
	LDX	PSP
	STAA	0,X
	STAB	1,X
*
	RTS

* Combine one more,
6800COUNTrearrange
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
*
	INCB		; 1+
	BNE	ONEPNC
	INCA
ONEPNC
	STAA	2,X	; SWAP
	STAB	3,X
* 
	LDX	0,X	; C@
*	CLRA
	LDB	0,X
	LDX	PSP
*	STAA	0,X
	CLR	0,X
	STAB	1,X
*
	RTS

* Final
6800COUNTfinal
	LDX	PSP	; DUP
	LDAA	0,X
	LDAB	1,X
	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP	; Make the push permanent.
	INCB		; 1+
	BNE	ONEPNC
	INCA
ONEPNC
	STAA	2,X	; SWAP
	STAB	3,X
	LDX	0,X	; C@
	LDB	0,X
	LDX	PSP
	CLR	0,X
	STAB	1,X
	RTS

*6800COUNTmaybe	; No obvious alternate paths.

JFTR, this code has not been particuly tested.

No comments:

Post a Comment