Thursday, November 21, 2024

ALPP 02-28 -- Walking the Pontoons -- Split-stack Stack Frame Example: 6800

Well, this should be the last of the rubber bricks at the bottom of the pool, taking, as predicted, a long time, much longer than the 6809 example.

  Walking the Pontoons --
Split-stack Stack Frame Example:
6800

(Title Page/Index)

Now that we have worked out the concrete single-(combined)-stack stack frame example for the 6800, let's get a look at how the split stack discipline/paradigm improves things on this processor. 

The conversion of both the split-stack and single-stack stack frame code from the 68000 to the 6809 was straightforward. 

From the 6809 to the 6801, the single-stack conversion was rather hairy. It surprised me a little that the conversion from the 6801 to the 6800 was also rather hairy. The difference a few instructions makes can be rather dramatic, and the less support you have for address math, the more the combined single stack shows up as a bottleneck

By comparison, the conversion of the split-stack code was a little tricky from the 6809 to the 6801, but quite straightforward from the 6801 to the 6800. This really isn't a surprise, since the big jump from the 6809 to the 6801 is the code to support the software stack, and that code moved to the 6800 is primarily a matter of doing 16-bit additions and subtractions a byte at a time, which we now know is not that hard.

It's mostly just a matter of doing the less-significant byte before the more-significant, and remembering to include the carry on the more significant byte -- mostly just replacing one instruction with two.

Well, and some places where you use PSHX and PULX on the 6801, where on the 6800 you instead need to save X to a temporary and, as immediately as possible, grab it back in one or both accumulators and push it or do math on it. Or the reverse order, popping the address to accumulator(s), saving in the proper order to adjacent temporary bytes, maybe working on it, and loading it back into X as soon as possible. Oh, and make sure the stack pointer gets updated in the appropriate places.

I used both accumulators to make it clear what I was doing, but if you need to keep one accumulator available for something else, the process can be done just as efficiently with one accumulator. Even the addition and subtraction works in the same number of instructions with a single accumulator, you just have to complete moving and/or working on one byte before starting on the other.

This is, as I say, all straightforward -- because you separate the stacks. When you're working on paramaters, the parameter pointer is in X. When you're working on the return stack, you've pulled S into X. When you are pointing at other things, the pointer for that is in X.

When you have the stacks combined, there are times when you find yourself needing to point at two things at once, and finding yourself with only one X. Sure, you can save X off to a temporary, but then you have to be careful how you use that temporary for other things, or the code balloons in your face (if it doesn't just blow up).

Yes, this is a problem in allocating temporary variables and in being disciplined in how you use them, but it's also a problem in optimization against the discipline. You want to make efficient use of the on-CPU resources if you can.

I did hit one snag in the un-mark routine. With the 6801, the split stack made it unnecessary to actually have an un-mark routine, but the 6800 used enough instructions to make it worthwhile. But, in replacing the PULX instruction in that routine, I used INX instead of INS to update the return stack pointer, and lost a couple of hours staring at the code and tracing through it and wondering why the stack pointer ended up pointing to a frame boundary instead of a return address when the code was supposed to be returning from a subroutine.

And that was the hardest problem going from 6801 to 6800 in the split-stack discipline.

I want to note that I realized, while debugging this, that the FINAL variable really should have some bumper space after it. We know our test routines won't get anywhere even close to overwriting it. And it's only used at the end, so, even if a stack overflow overwrote it,  it wouldn't matter. But it still should have the buffer. Or, at least, I should be testing it to be still zero before we store the final value. It's something to keep in mind.

For reference, here's what the two stacks look like, from the discussion of the two-stack discipline on the 6801, the return stack:

* [PRETADR   ]
* [PCALLERFRM]
* [RETADR    ]
* [CALLERFRM ] <= RSP

and the (conceptual) parameter stack:

* [VARIABLES  ] <= CALLERFRM
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES  ] <= FP
* [TEMPORARIES]
* [PARAMETERS ] <= PSP

again, keeping offsets positive because we don't have the 6809's fancy negative offsets. 

I've left some routines i didn't end up using in the code, don't let that bother you.

So, the code, with the reminder that the bulk of the discussion is in the code itself. Please don't skip reading the code, watching out for comments that I forgot to update after fixing something.
* 16-bit addition as example of split-stack frame-less discipline on 6800
* using the direct page,
* with test code
* Joel Matthew Rees, October, November 2024
*
	OPT	6800
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for usr stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
ALCOUNT	RMB	1	; for counting utility routines
	RMB	1	; reserved
PSP	RMB	2	; parameter stack pointer
PTR	RMB	2	; frame pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4	; Should've had a bumper after this (but it's only used at the end?)
SSTKLIM	RMB	64	; 16 levels of call
SSTKLMX	EQU	FINALX+4
SSTKBAS	RMB	6	; for canary return
SSTKBSX	EQU	SSTKLMX+64
SSTKFAK	RMB	2	; fake frame pointer, self-link
SSTKFAX	EQU	SSTKBSX+6	; 6801 is post-dec (post-store-decrement) push
SSTKBMP	RMB	4	; a little bumper space
SSTKBMX	EQU	SSTKFAX+2	; But we are going to init S through X
PSTKLIM	RMB	64	; 16 levels of call at two parameters per call
PSTKLMX	EQU	SSTKBMX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBSX	EQU	PSTKLMX+64
PSTKBMP	RMB	4	; a little bumper space
PSTKBMX	EQU	PSTKBSX+4
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	PSTKBMX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
*
INISTKS	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE
	LDAA	LB_BASE		; bootstrap own return stack
	LDAB	LB_BASE+1
	LDX	#SSTKBSX	; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1
	ADCA	XWORK
	STAB	XWORK+1		; initial return stack pointer
	STAA	XWORK
*
	LDX	#SSTKNDR	; for fake return address
	STX	DWORK		; save it for a moment
	PULA		; pop real return address
	PULB
	LDX	XWORK	; ready own return stack pointer
	STS	SSAVE	; save stack pointer from monitor ROM
	TXS		; move to our own stack (let TXS convert it)
	PSHB		; put return address on own stack
	PSHA		; stack now ready for interrupts
*
	LDAA	DWORK	; error handler for fake return
	LDAB	DWORK+1
	STAA	0,X	; in the cell beyond empty stack pointer
	STAB	1,X	; prime the return stack with error handler
	STAA	4,X	; second fake return to error handler
	STAB	5,X
* 
	LDAA	LB_BASE		; bootstrap parameter stack
	LDAB	LB_BASE+1
	LDX	#PSTKBSX	; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1		; initial parameter stack pointer
	ADCA	XWORK
	STAA	PSP		; parameter stack now ready
	STAB	PSP+1
	STAA	FP		; initial frame pointer
	STAB	FP+1
	TSX			; pointing at return address below fake
	STAA	4,X		; empty parameter stack frame pointer for fake frame,
	STAB	5,X		; stacks primed
*
	LDAA	LB_BASE		; set up heap as if we actually had one
	LDAB	LB_BASE+1
	LDX	#HBASEX		; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1		; calculat EA
	ADCA	XWORK
	STAA	HPPTR
	STAB	HPPTR+1
	STAA	HPALL		; as if the heap were functional
	STAB	HPALL+1
	LDX	#CDBASE
	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	SUBB	#4
	SBCA	#0
	STAA	HPLIM
	STAB	HPLIM+1
	RTS
*
*
***
* General structure of the stacks, 
*
* return stack is always in pairs:
* [PRETADR   ]
* [PCALLERFRM]
* [RETADR    ]
* [CALLERFRM ] <= RSP
*
* order of elements on the parameter stack,
* when they are present:
* [PARAMETERS ]
* [VARIABLES  ] <= CALLERFRM
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES  ] <= FP
* [TEMPORARIES]
* [PARAMETERS ] <= PSP
*
* Result is returned on parameter stack
*
***
* Utility routines
PPOPD	LDX	PSP
	LDAA	0,X
	LDAB	1,X
	INX
	INX
	STX	PSP
	RTS
*
PPSHD	LDX	PSP
	DEX
	DEX
	STX	PSP
	STAA	0,X
	STAB	1,X
	RTS
*
* subroutine to make sure we don'T forget anything
MARK	PULA	; move the return address down
	PULB
	DES
	DES
	TSX	; to stack the mark
	PSHB
	PSHA
	LDAA	FP
	LDAB	FP+1
	STAA	0,X	; mark old FP, no allocate
	STAB	1,X
	LDX	PSP
	STX	FP	; update frame pointer
	RTS
*
UNMK	PULA		; get the return address out of the way
	PULB
	TSX		; point to the return stack
	LDX	0,X	; get the old frame pointer
	STX	FP	; and restore it
	INS		; drop the old frame pointer
	INS
	PSHB		; put the return address back
	PSHA
	RTS
*
* Compromise between speed and reusability
* Enter here to load PSP and initialize to 0
* 8 bytes
ALCL8	CLRA	
	CLRB
* Enter here with initial value in A:B
ALCLD8	LDX	PSP
* Enter here with PSP loaded and initial value in D
ALCLI8	DEX
	DEX
	STAA	0,X
	STAB	1,X
ALCLI6	DEX
	DEX
	STAA	0,X
	STAB	1,X
ALCLI4	DEX
	DEX
	STAA	0,X
	STAB	1,X
ALCLI2	DEX
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
	RTS
*
* six bytes
ALCL6	CLRA
	CLRB
ALCLD6	LDX	PSP
	BRA	ALCLI6
*
* four bytes
ALCL4	CLRA
	CLRB
ALCLD4	LDX	PSP
	BRA	ALCLI4
*
* two bytes
ALCL2	CLRA
	CLRB
	LDX	PSP
	BRA	ALCLI4
*
*
PDROP_8	LDAB	#8	; saves two bytes, 7 vs. 3
	CLRA
* Add A:B to PSP -- negative for allocation, positive for deallocation
ADDPSP	ADDB	PSP+1
	ADCA	PSP
	STAA	PSP
	STAB	PSP+1
	LDX	PSP	; return with X ready
	RTS
*
* Just use ADDPSP with a negative 8 or whatever
*PDROP_8	LDAB	#8	; saves two bytes, 7 vs. 3
* deallocate count in B
*PDROPB	LDX	PSP	; 5 bytes to deallocate in-line
*	ABX		; vs. 3 bytes to call this.
*	STX	PSP	; ABX is useful for deallocation
*	RTS		; 5 bytes vs. 7 total
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry, after link:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ]
* [FRMPTR0==<EMPTYP>]
* [RETADR1 ]
* [FRMPTR1 ] <= RSP
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>] <= FRMPTR0,FRMPTR1
* [32:VAR1_1--]
* [32:VAR1_2--] <= FRMPTR1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP,FP
*
****
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	JSR	MARK	; mark, no allocate, X is PSP
*
	LDAB	#(-1)	; default negative
	TBA
	JSR	ALCLI4	; allocate 2 temporary cells and init
	LDX	FP	; (could have optimized that to 2 bytes.)
	TST	2,X	; the left-hand operand sign bit
	BMI	ADD16SR
	LDX	PSP
	CLR	2,X	; positive
	CLR	3,X
ADD16SR	LDX	FP
	TST	0,X	; the right-hand operand sign bit
	BMI	ADD16SL
	LDX	PSP
	CLR	0,X	; positive
	CLR	1,X
ADD16SL	LDX	FP
	LDAA	2,X	; left hand 
	LDAB	3,X
	ADDB	1,X	; right hand
	ADCA	0,X
	STAA	2,X	; store low half
	STAB	3,X
	LDX	PSP
	LDAA	2,X
	LDAB	3,X
	ADCB	1,X
	ADCA	0,X
	LDX	FP	; wouldn't need to do this if we tracked PSP extras
	STAA	0,X
	STAB	1,X
*
	STX	PSP	; drop the temporaries
	JSR	UNMK
	RTS
*
* The alternative without link, mark, or restore will be shown in the no-frame case.
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit 
ADD16U	JSR	MARK	; mark, no allocate, X is PSP
*
	LDX	FP
	LDAA	2,X	; left
	LDAB	3,X
	ADDB	1,X	; add right
	ADCA	0,X
	STAA	2,X	; save low
	STAB	3,X
	LDAB	#0	; extend
	ADCB	#0	; extend Carry unsigned (could ROL)
	STAB	1,X	; re-use right side to store high half
	CLR	0,X	; only bit 8 can be affected
*
	JSR	UNMK
	RTS
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with one 16-bit parameter,
* after mark (no local allocation)
* [<unknown>  ] <= FRMPTR0
* [32:VAR1_1  ]
* [32:VAR1_2  ] <= FRMPTR1
* [16:PARAM2_1] <= PSP,FP
*
* To show how to walk the stack --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit addend
* target parameter in caller
*   2nd 32-bit variable at offset -2*NATWID
* no output parameter:
ADD16SI	JSR	MARK
*
	LDAB	#(-1)	; make a temporary -1
	TBA
	JSR	ALCLI2	; (default to signed)
	LDX	FP
	TST	0,X	; test high byte
	BMI	ADD16SP
	LDX	PSP
	CLR	0,X	; zero extend
	CLR	1,X
ADD16SP	TSX
	LDX	0,X	; caller's FP
	LDAA	2,X	; caller's 2nd variable, low
	LDAB	3,X
	LDX	FP
	ADDB	1,X	; parameter
	ADCA	0,X
	TSX
	LDX	0,X
	STAA	2,X	; update low half with result
	STAB	3,X
	LDAA	0,X	; 2nd variable, high half
	LDAB	1,X
	LDX	PSP
	ADCB	1,X	; sign extension half
	ADCA	0,X
	TSX
	LDX	0,X
	STAA	0,X	; update high half
	STAB	1,X
*
	LDX	FP
	INX		; drop parameter
	INX	
	STX	PSP	; and sign temporary goes bye-bye, too
	JSR	UNMK
	RTS
*
*
***
* Return stack on entry:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ] <= RSP
*
* Return stack after link:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ]
* [FRMPTR0==<EMPTYP>] <= RSP
*
* Parameter stack after mark and local allocation
* [<unknown>] <= FRMPTR0
* [VAR1_1--]
* [VAR1_2--] <= PSP,FP
*
MAIN	JSR	MARK
	JSR	ALCL8	; allocate and clear 8 bytes
	STX	FP	; Point FP to base of local variables.
*
	LDAA	#$12
	LDAB	#$34
	JSR	PPSHD
	LDAA	#$CD
	LDAB	#$EF
	JSR	PPSHD
	JSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LDAA	#$87
	LDAB	#$65
	LDX	PSP	; reuse parameter space, since order is okay
	STAA	0,X
	STAB	1,X
	JSR	ADD16S	; result on parameter stack should be $FFFF6788
	LDX	PSP
	LDAA	2,X	; result low half
	LDAB	3,X
	LDX	FP
	STAA	2,X	; to 2nd local variable low half
	STAB	3,X
	LDX	PSP
	LDAA	0,X	; result high half
	LDAB	1,X
	LDX	FP
	STAA	0,X	; to 2nd local variable high half
	STAB	1,X
	LDAA	#$A5
	TAB	
	JSR	PPSHD
	JSR	ADD16SI	; result in 2nd variable should be FFFF0D2D (Carry set)
	LDX	FP
	LDAA	2,X	; 2nd variable low half
	LDAB	3,X
	LDX	LB_BASE
	STAA	FINALX+2,X
	STAB	FINALX+3,X
	LDX	FP
	LDAA	0,X
	LDAB	1,X
	LDX	LB_BASE
	STAA	FINALX,X
	STAB	FINALX+1,X
*
	JSR	PDROP_8
	JSR	UNMK
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
*
***
* Return stack will always be in pairs:
* [RETADRNN  ]
* [CALLERFMNN]
*
* Return stack after initialization:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS <= RSP
*
* Return stack after saving previous mark:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>] <= RSP
*
* Parameter stack after initialization, mark:
* [<unknown]PSTKBAS <= PSP,FP==<EMPTYP>
*
START	JSR	INISTKS
	JSR	MARK	
*	LDAA	PSP
*	LDAB	PSP+1
*	JSR	PPSHD	; empty previous mark
*	STX	FP	; empty new mark
*
	JSR	MAIN
*
	JSR	UNMK
*
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
SSTKNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP
	NOP		; another landing pad to set breakpoint at
	NOP
	LDX	$FFFE
	JMP	0,X	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

As always, I have tested this code. it handles the stack frames properly and gets the right result in the right place. And, as always, I do not guarantee that this code can be generalized or automatically produced, by hand or by compiler.

As I keep saying, we have seen what this kind of code looks like without stack frames. I plan to come back and do these examples without stack frames, so we can get a clearer picture of how the disciplines affect the code. It may take several weeks, as I have to get back to some sort of day job to pay the rent.

I now (25 Nov.) have the functionality of this code done in single-stack discipline, without stack frames. Have a look at it.

At some point in the tutorial, where we've gone far enough to need to reference these concepts again, I will point you back here. If you haven't really followed what the stack frames business is all about, you will want to come back and re-read this then. You may want to, anyway.

For the time being, here's some simpler stuff -- getting numeric output in binary.


(Title Page/Index)


 

 

 

 

No comments:

Post a Comment