Wednesday, December 4, 2024

ALPP 02-33 -- Ascending the Right Island -- Frameless Examples (Single- & Split-stack): 6809

Yet another couple of useful bits, from the bottom of the pool.

Ascending the Right Island --
 Frameless Examples (Single- & Split-stack):
6809

(Title Page/Index)

 

Now that we have worked through both the single-stack and split-stack frameless examples for the 6801, we can finally get back to the code that started this detour (6809 version) and strip out the code for maintaining the stack frames. 

On higher-level architectures like the 6809, the stack frame maintenance code can be so unobtrusive that it is easy to overlook. 

But it can still get in the way. So I'm going ahead and showing the code without it here, in single-stack no-frame and split-stack frameless discipline. 

Frameless does mean we have to keep track of what's on the stack(s).
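
To make that bookkeeping concrete, here is a minimal sketch in Python (not 6809 code) of a pre-decrement byte stack. The class and method names are hypothetical, and the $2345 return address is just a placeholder. It shows why a parameter found at 2,S on entry has to be addressed at 4,S once a 2-byte temporary has been pushed, which is what the "(parameters 4 offset)" comments in the listings below are tracking.

```python
class ByteStack:
    """Toy model of a pre-decrement stack such as the 6809's S or U."""

    def __init__(self, size=64):
        self.mem = bytearray(size)
        self.sp = size  # empty stack: pointer sits just past the top

    def push16(self, value):
        # Decrement first, then store big-endian, as a 16-bit push does.
        self.sp -= 2
        self.mem[self.sp] = (value >> 8) & 0xFF
        self.mem[self.sp + 1] = value & 0xFF

    def peek16(self, offset):
        # Rough equivalent of LDD offset,S
        hi = self.mem[self.sp + offset]
        lo = self.mem[self.sp + offset + 1]
        return (hi << 8) | lo


s = ByteStack()
s.push16(0x1234)  # left parameter, pushed first
s.push16(0xCDEF)  # right parameter, pushed second
s.push16(0x2345)  # placeholder for the return address pushed by the call
print(hex(s.peek16(2)), hex(s.peek16(4)))  # right at 2,S, left at 4,S

s.push16(0xFFFF)  # a 2-byte temporary, e.g. a sign extension
print(hex(s.peek16(4)), hex(s.peek16(6)))  # same parameters, now 2 bytes further away
```

With a frame pointer the offsets would stay fixed; without one, every push or pull inside a routine shifts them, and the comments have to keep up.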

And there's really not much more left to talk about, although we want to remember that, because we are making specific use of the direct page, the entry address is $2000 instead of $80.

First the single-stack version. You'll want to compare with the single-stack stack frame version for the 6809 to get a better feel for what is happening with stack frames, as well as with the single-stack no frame version for the 6801 to see how the 6809's addressing modes make things easier.
* 16-bit addition as example of single-stack no frame discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$2000	; MDOS says this is a good place for usr stuff.
*	SETDP	$20	; for some other assemblers
	SETDP	$2000	; for EXORsim
*
ENTRY	LBRA	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP
SSAVE	RMB	2	; a place to keep S so we can return clean
SSAVEX	EQU	6	; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE	RMB	2	; just for kicks, save U, too
USAVEX	EQU	SSAVEX+2
DPSAVE	RMB	2	; a place to keep DP so we can return clean
DPSAVEX	EQU	USAVEX+2
	RMB	4	; bumper
XWORK	RMB	2	; For saving an index register temporarily
XWORKX	EQU	DPSAVEX+6
HPPTR	RMB	2	; heap pointer (not yet managed)
HPPTRX	EQU	XWORKX+2
HPALL	RMB	2	; heap allocation pointer
HPALLX	EQU	HPPTRX+2
	RMB	4	; bumper
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	HPALLX+6
GAP1	RMB	2	; Mark the bottom of the gap
GAP1X	EQU	FINALX+4
*
LB_ADDR	EQU	ENTRY
*
*
	SETDP	0	; Not yet set up
	ORG	$2100	; Give the DP room.
	RMB	4	; a little bumper space
SSTKLIM	RMB	96	; roughly 16 levels of call
SSTKLIMX	EQU	$104
* 			; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS	RMB	4	; for canary return
SSTKBASX	EQU	SSTKLIMX+96
SSTKBMP	RMB	4	; a little bumper space
SSTKBMPX	EQU	SSTKBASX+4
*
HBASE	RMB	$1024		; Not using or managing heap yet.
HBASEX	EQU	SSTKBMPX+4
HLIM	RMB	4	; bumper
HLIMX	EQU	HBASEX+$1024
*
*
INISTK	TFR	DP,A
	CLRB
	TFR	D,X		; save old DP base for a moment
	LEAY	ENTRY,PCR	; Set up new DP base
	TFR	Y,D
	TFR	A,DP		; Now we can access DP variables correctly.
*	SETDP	$20	; some other assemblers
	SETDP	$2000	; EXORsim
	STX	<DPSAVE		; technically only need to save high byte
	STU	<USAVE
	PULS	X		; get return address
	STS	<SSAVE		; Save what the monitor gave us.
	LEAS	SSTKBMPX,Y	; Move to our own stack
	LEAY	STKUNDR,PCR	; fake return to stack underflow handler
	PSHS	Y		; 
	PSHS	Y		; one more fake return to handler
	CLRB			; A still has run-time DP
	ADDD	#HBASEX		; calculate EA
	TFR	D,Y		; as if we actually had a heap
	STY	<HPPTR
	STY	<HPALL
	JMP	,X	; return via X
*
***
* Stack after call when functions are called by MAIN
* with two parameters
* (#0 means no local variables)
* We will return result in D:X
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [--------]
* [--------] 
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit D:X D high, X low
ADD16S	LDX	#-1	; sign extend right
	TST	2,S	; sign bit, anyway
	BMI	ADD16SR
	LEAX	1,X	; 0
ADD16SR	PSHS	X	; push right extension (parameters 4 offset)
	LDX	#-1	; negative
	LDD	6,S	; left
	BMI	ADD16SL
	LEAX	1,X	; 0
ADD16SL	PSHS	X	; push left extension (parameters 6 offset)
	ADDD	6,S	; add right
	TFR	D,X	; save low
	PULS	D	; get left sign extension (parameters 4 offset)
	ADCB	1,S	; carry is still safe
	ADCA	,S	; high word complete
	LEAS	2,S	; drop temporary
	RTS		; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit D:X D high
ADD16U	LDD	4,S	; left
	ADDD	2,S	; add right
	TFR	D,X	; save low
	LDD	#0	; extend
	ADCB	#0	; extend Carry unsigned (could ROL in)
	RTS		; C, N valid, Z not valid
*
* Etc.
*
***
* Stack at entry when called by MAIN
* (#0 means no local variables)
* No return value; the caller's variable is updated through the pointer.
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [VAR1_1--]
* [VAR1_2--] <= PARAM2_1
* [PARAM2_1] (pointer to VAR1_2)
* [PARAM2_2]
* [RETADR1 ] 
*
* To show how to access caller's local variables through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to the caller's 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDX	#-1	; sign extend 1st parameter
	TST	2,S
	BMI	ADD16SIP
	LEAX	1,X
ADD16SIP	PSHS	X	; parameters now 4 offset
	LDX	6,S	; pointer -- LDD [6,S] gets the high half
	LDD	2,X	; caller's 2nd variable, low
	ADDD	4,S	; 1st parameter
	STD	2,X	; update low half
	LDD	,X	; caller's 2nd variable, high
	ADCB	1,S	; sign extension
	ADCA	,S	; high byte 
	STD	,X	; update
	LEAS	2,S	; drop temporary
	RTS		; C, N valid, Z not valid
*
*
***
* Stack after allocating local variables
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] <= SP
*
MAIN	LDD	#0	; allocate and initialize
	TFR	D,X
	PSHS	D,X
	PSHS	D,X
*
	LDX	#$1234
	LDD	#$CDEF
	PSHS	D,X
	LBSR	ADD16U	; result in D:X should be $E023
	STX	2,S
	LDD	#$8765
	STD	0,S
	LBSR	ADD16S	; result in D:X should be $FFFF6788 (and carry set)
	STX	6,S	; result in 2nd local variable
	STD	4,S
	LEAX	4,S	; calculate address of 2nd variable to pass in
	STX	2,S
	LDD	#$A5A5
	STD	,S	
	LBSR	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	LDD	4,S
	STD	<FINAL
	LDD	6,S
	STD	<FINAL+2
	LEAS	12,S	; drop both the used parameters and the local variables together
	RTS		; C, N still valid, Z still not
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= SP
***
*
START	NOP
	LBSR	INISTK
	NOP
*
*
	LBSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	<SSAVE	; restore the monitor stack pointer
	LDU	<USAVE	; restore U
	LDD	<DPSAVE	; restore the monitor DP last
	TFR	A,DP
	SETDP	0	; For lack of a better way to set it.
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	JMP	[$FFFE]	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

Again, there is not much to say about the split-stack code, other than that you'll want to compare it with the split-stack stack frame version for the 6809 and the split-stack frameless version for the 6801, for the same reasons as mentioned above.
* 16-bit addition as example of split-stack frame-free discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$2000	; MDOS says this is a good place for usr stuff.
*	SETDP	$20	; for lwasm and some other assemblers
	SETDP	$2000	; for EXORsim and some other assemblers
*
ENTRY	LBRA	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP
SSAVE	RMB	2	; a place to keep S so we can return clean
SSAVEX	EQU	6	; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE	RMB	2	; just for kicks, save U, too
USAVEX	EQU	SSAVEX+2
DPSAVE	RMB	2	; a place to keep DP so we can return clean
DPSAVEX	EQU	USAVEX+2
	RMB	4	; bumper
XWORK	RMB	2	; For saving an index register temporarily
XWORKX	EQU	DPSAVEX+6
HPPTR	RMB	2	; heap pointer (not yet managed)
HPPTRX	EQU	XWORKX+2
HPALL	RMB	2	; heap allocation pointer
HPALLX	EQU	HPPTRX+2
	RMB	4	; bumper
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	HPALLX+6
GAP1	RMB	2	; Mark the bottom of the gap
GAP1X	EQU	FINALX+4
*
LB_ADDR	EQU	ENTRY
*
*
	SETDP	0	; Not yet set up
	ORG	$2100	; Give the DP room.
	RMB	4	; a little bumper space
SSTKLIM	RMB	32	; 16 levels of call
SSTKLIMX	EQU	$104	; Skip over the DP page.
* 			; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS	RMB	4	; for canary return
SSTKBASX	EQU	SSTKLIMX+32
SSTKBMP	RMB	4	; a little bumper space
SSTKBMPX	EQU	SSTKBASX+4
PSTKLIM	RMB	64	; about 16 levels of call at two parameters per call
PSTKLIMX	EQU	SSTKBMPX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBASX	EQU	PSTKLIMX+64
*
HBASE	RMB	$1024		; Not using or managing heap yet.
HBASEX	EQU	PSTKBASX+4
HLIM	RMB	4	; bumper
HLIMX	EQU	HBASEX+$1024
*
*
* Calculate DP because we don't have DP relative in index postbyte:
INISTKS	TFR	DP,A
	CLRB
	TFR	D,X		; save old DP base for a moment
	LEAY	ENTRY,PCR	; Set up new DP base
	TFR	Y,D
	TFR	A,DP		; Now we can access DP variables correctly.
*	SETDP	$20	; some other assemblers
	SETDP	$2000	; EXORsim
	STX	<DPSAVE		; technically only need to save high byte
	STU	<USAVE
	PULS	X		; get return address
	STS	<SSAVE		; Save what the monitor gave us.
	LEAS	SSTKBMPX,Y	; Move to our own return stack
	LEAU	PSTKBASX,Y	; and our own parameter stack
	LEAY	STKUNDR,PCR	; fake return to stack underflow handler
	PSHS	Y
	PSHS	Y		; one more fake return to stack underflow handler
	CLRB			; A still has run-time DP
	ADDD	#HBASEX		; calculate EA
	TFR	D,Y		; as if we actually had a heap
	STY	<HPPTR
	STY	<HPALL
	JMP	,X	; return via X
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [RETADR1 ]
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	LDX	#-1	; sign extend right
	TST	,U	; sign bit, anyway
	BMI	ADD16SR
	LEAX	1,X	; 0
ADD16SR	PSHU	X	; push right extension (parameters 2 offset)
	LDX	#-1	; negative
	LDD	4,U	; left
	BMI	ADD16SL
	LEAX	1,X	; 0
ADD16SL	PSHU	X	; push left extension (parameters 4 offset)
	ADDD	4,U	; add right
	STD	6,U	; save low
	PULU	D	; get left sign extension (parameters 2 offset)
	ADCB	1,U	; carry is still safe
	ADCA	,U++	; high word complete, tricky postinc (parameters 0 offset)
	STD	,U
	RTS		; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit
* output parameter:
*   17-bit sum in 32-bit 
ADD16U	LDD	2,U	; left
	ADDD	,U	; add right
	STD	2,U	; save low
	LDD	#0	; extend
	ROLB		; extend Carry unsigned (could ADC #0)
	STD	,U
	RTS		; C, N valid, Z not valid
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* [32:VAR1_1--]
* [32:VAR1_2--] <= PARAM2_1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to the caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDD	#-1	; sign extend addend parameter
	TST	,U
	BMI	ADD16SIP
	LDD	#0
ADD16SIP	PSHU	D	; save sign extension (parameters 2 offset)
	LDX	4,U	; get pointer to variable
	LDD	2,X	; caller's 2nd variable, low
	ADDD	2,U	; addend parameter
	STD	2,X	; update low half
	LDD	,X	; caller's 2nd variable, high
	ADCB	1,U	; sign extension low byte
	ADCA	,U	; high byte
	STD	,X	; store result
	LEAU	6,U	; drop temporary and parameters -- no return parameter
	RTS		; C, N valid, Z not valid
*
*
***
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	LDD	#0	; allocate and initialize
	TFR	D,X
	PSHU	D,X
	PSHU	D,X
	LDX	#$1234
	LDD	#$CDEF
	PSHU	D,X	; 8 bytes local, 4 bytes parameter, 12 bytes offset
	LBSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LEAU	2,U	; drop high part (could be optimized out).
	LDD	#$8765
	PSHU	D
	LBSR	ADD16S	; result on parameter stack should be $FFFF6788 (and carry set)
	PULU	D,X	; 4 bytes of used parameters removed from stack (local variables on top)
	STX	2,U	; low half, store in local variable
	STD	,U	; high half
	LEAX	,U	; point to 2nd variable
	LDD	#$A5A5
	PSHU	D,X	; X pushed first
	LBSR	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	LDD	2,U
	STD	<FINAL+2
	LDD	,U
	STD	<FINAL
	LEAU	8,U
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= RSP (S)
***
* (who knows?) <= PSP (U)
***
*
***
* Return stack will be just the return addresses:
* [RETADRNN  ]
*
* Return stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= RSP
*
*
* Parameter stack after initialization, mark:
* [<unknown] <= PSP
*
START	LBSR	INISTKS
*
	LBSR	MAIN
*
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	<SSAVE	; restore the monitor stack pointer
	LDU	<USAVE	; restore U
	LDD	<DPSAVE	; restore the monitor DP last
	TFR	A,DP
	SETDP	0	; For lack of a better way to set it.
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	JMP	[$FFFE]	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

As always, I have stepped through the code and made sure it does what I say it does. 
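
For a cross-check of the expected values in the comments that does not need an emulator, the arithmetic itself can be modeled in a few lines. The following is a minimal Python sketch, not 6809 code. The function names only borrow the labels from the listings, and the model ignores the flags and all stack handling.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def sext16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def add16u(left, right):
    """Unsigned 16-bit add with a 32-bit result (models what ADD16U computes)."""
    return ((left & MASK16) + (right & MASK16)) & MASK32


def add16s(left, right):
    """Signed 16-bit add with a 32-bit result (models what ADD16S computes)."""
    return (sext16(left) + sext16(right)) & MASK32


def add16si(var32, addend):
    """Add a signed 16-bit addend to a 32-bit variable (models ADD16SI)."""
    return (var32 + sext16(addend)) & MASK32


# Same sequence as MAIN: the low half of each result feeds the next step.
step1 = add16u(0x1234, 0xCDEF)
step2 = add16s(step1 & MASK16, 0x8765)
step3 = add16si(step2, 0xA5A5)
print(f"{step1:08X} {step2:08X} {step3:08X}")  # 0000E023 FFFF6788 FFFF0D2D
```

The last value is what should end up in FINAL in both listings.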

If reading through it and comparing it with the other versions brings up questions that stepping through the code doesn't answer, go ahead and leave me a comment.

From here, you can either go ahead to digging into outputting binary numbers, or (when I get it ready) you can look at one more set of examples for frameless discipline, on the 68000.

--

Ah, more squirrels to chase. I mean, more daydreams.

With the stack split up, we might be able to see how a simple hysteretic spill-fill cache could significantly optimize calls and returns.

Calls and returns cost, in addition to the code and cycles to load the new PC, cycles to save and restore the old. With the combined stack, they also tend to incur code and cycle costs in moving parameters into place and saving and restoring registers.

With a cache attached to the return stack pointer, saves and restores can happen in parallel with fetching, decoding, and executing instructions, effectively hiding the basic call/return overhead. 

Here's what I mean by hysteretic spill/fill:

Say the cache has sixteen entries (32 bytes).

When pushing a new return address crosses the boundary between the 12th and 13th entry -- 3/4 full -- the cache controller starts pushing saved addresses off the other end into main RAM, to make more room. It watches the bus so that it can do so when the bus is not busy with instruction fetches or data or DMA accesses, unless the cache completely fills, in which case it gets the bus at higher priority than instruction fetches. 

It will keep pushing addresses out until the cache is half-empty again, or until a return cancels the spill. 

Returns will work in reverse. As long as it is nested more than four calls deep, it will try to keep at least four return addresses in the cache, scheduling reads to bring addresses back in from RAM when the boundary between the 5th and 4th entries is crossed.

It uses a cache base and limit register to maintain position in the stack address space, and a stack base and limit register to tell the controller when to come to a hard stop, and when to initiate stack overflow or underflow interrupt/exception processing.
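
The spill/fill behavior described above can be sketched as a small software model. This is only an illustration in Python under stated assumptions, not a hardware design. The class and method names are made up, and idle_bus_cycle() stands in for a bus cycle that the CPU and DMA are not using. The rule that a background fill stops at half full, or is cancelled by a call, is assumed by symmetry with the spill rule. The base and limit registers and the overflow/underflow exceptions are left out, except that returning from an empty stack raises an error.

```python
from collections import deque


class ReturnStackCache:
    SIZE = 16         # entries in the cache
    SPILL_START = 12  # pushing the 13th entry starts a background spill
    HALF = 8          # background transfers stop at half full
    FILL_START = 4    # dropping to 4 entries starts a background fill

    def __init__(self):
        self.cache = deque()  # right end is the top of the stack
        self.ram = []         # spilled entries, oldest first
        self.mode = "idle"    # "idle", "spilling", or "filling"

    def call(self, return_address):
        if self.mode == "filling":
            self.mode = "idle"      # assumption: a call cancels a fill
        if len(self.cache) == self.SIZE:
            self._spill_one()       # completely full: forced, high-priority spill
        self.cache.append(return_address)
        if len(self.cache) > self.SPILL_START:
            self.mode = "spilling"

    def ret(self):
        if self.mode == "spilling":
            self.mode = "idle"      # a return cancels the spill
        if not self.cache:
            if not self.ram:
                raise IndexError("return stack underflow")
            self._fill_one()        # cache ran dry: forced fill
        address = self.cache.pop()
        if len(self.cache) <= self.FILL_START and self.ram:
            self.mode = "filling"
        return address

    def idle_bus_cycle(self):
        """Use one otherwise idle bus cycle for a background transfer."""
        if self.mode == "spilling":
            self._spill_one()
            if len(self.cache) <= self.HALF:
                self.mode = "idle"
        elif self.mode == "filling":
            self._fill_one()
            if not self.ram or len(self.cache) >= self.HALF:
                self.mode = "idle"

    def _spill_one(self):
        self.ram.append(self.cache.popleft())   # oldest cached entry goes to RAM

    def _fill_one(self):
        self.cache.appendleft(self.ram.pop())   # newest spilled entry comes back


rs = ReturnStackCache()
for n in range(13):             # 13 nested calls cross the 3/4 boundary
    rs.call(0x2000 + 2 * n)
for _ in range(10):             # plenty of idle bus cycles
    rs.idle_bus_cycle()
print(len(rs.cache), len(rs.ram), rs.mode)  # 8 5 idle
```

The gap between the 3/4 trigger and the 1/2 stopping point is the hysteresis: a loop that calls and returns right at one boundary does not cause a RAM transfer on every call.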

I assume that you have noticed that splitting the stack helps relieve the costs of moving parameters into place, and can even be of some relief relative to saving and restoring registers.

The parameter stack is not as regularly structured as the return address stack, but it could profitably be cached in a similar manner, with a larger cache, either double or quadruple size.

Both of these caches should be paired, to enable fast context switching. Or maybe done in sets of four, but I'm not sure the 6809 would benefit from four of each. One for the current process and one to be writing back to RAM after a process switch should be enough.

And I guess, since I've commented after the 6801 examples about how the direct page should be a bank of memory to use as pseudo-registers, I should mention the concept of a cache for the direct page here. This would also be paired, with the switch activated when the DP is set. There would need to be several different strategies for filling the new cache and writing back the dirty entries from the old cache, plus a way of setting priority for different regions of the direct page.

Caching the direct page would conflict with using it for I/O devices, so I'm thinking the 6809 wants a second direct page (specifiable in the index post-byte) just for I/O. 

Heh. Daydreams, indeed. This is just an 8/16-bit processor with a 16-bit address space. Too greedy. Unless we had a true 16/32-bit descendant of the 6809.

Ah. Sorry for the further distractions. 

The 68000 frameless examples

Or outputting binary numbers

 
