Tuesday, November 19, 2024

ALPP 02-27 -- Ascending the Wrong Island -- Single-stack Stack Frame Example: 6800

And this one has been sitting at the bottom of the pool for a while, as predicted, even longer than the 6809 example.

  Ascending the Wrong Island --
Single-stack Stack Frame Example:
6800

(Title Page/Index)

 

Now that we have seen how we can implement concrete examples of both single-stack and split stack stack frames on the 6801, let's see if we can get a better feel for what the 6801's extensions buy for us, by repeating those implementations using only the 6800's original instruction set.

The usual caveat -- I do not recommend stack frames, and I especially do not recommend combining parameters and return addresses on a single stack. Part of the reason we're doing this is to study addressing techniques, but the other part is to convince ourselves that we don't want to do this.

I started by working out an implementation of PUSHX and POPX routines, since the PSHX and PULX instructions featured so prominently in the 6801 code. Late at night, when I had time to work on this, I typed without thinking, probably in 6809 mode or something,

PUSHX	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	PSHB
	PSHA
	RTS
*
POPX	PULA
	PULB
	STAA	XWORK
	STAB	XWORK+1
	LDX	XWORK
	RTS

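As we know, the results of this code on the 6800 would be humorous: called as subroutines, PUSHX would return through the value it had just pushed, and POPX would hand back its own return address in X. I laughed at myself and went to bed.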

(If these were macros, or if we were doing it in-line, this would actually be exactly what we'd do -- leaving off the RTS, of course. And, of course, if we needed to do a software stack on the 6809, the push and pop routines would be even more straightforward.)

But we have to dance around the return address. So it ends up something like this:

SPSHX	STX	XWORK
	DES
	DES
	TSX
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	LDAA	XWORK
	LDAB	XWORK+1
	STAA	2,X
	STAB	3,X
	RTS
*
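SPULX	TSX
	LDX	2,X	; the cell under the return address
	STX	XWORK
	TSX		; X was just overwritten, point back at the stack
	LDAA	0,X	; return address
	LDAB	1,X
	STAA	2,X	; move it up over the cell we are popping
	STAB	3,X
	LDX	XWORK
	INS
	INS
	RTS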

which is disgustingly long. But necessary. Because of the return address dance.
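To make the dance easier to see, here is a small Python model of what SPSHX and SPULX have to accomplish. It is only a sketch: the class and function names are made up for illustration, and it models the stack effect, not the instruction sequence.

```python
# Toy model of the 6800 stack, only to illustrate the "return address dance".
# All names here (Stack6800, spshx, spulx) are hypothetical, not from the listing.

class Stack6800:
    def __init__(self, size=0x100, top=0xFF):
        self.mem = bytearray(size)
        self.s = top            # S points at the next free byte (post-decrement push)

    def push16(self, value):
        # low byte first, so the high byte ends up at the lower address
        for byte in (value & 0xFF, value >> 8):
            self.mem[self.s] = byte
            self.s -= 1

    def pull16(self):
        hi = self.mem[self.s + 1]
        lo = self.mem[self.s + 2]
        self.s += 2
        return (hi << 8) | lo

    def peek16(self, offset):
        # offset as seen through X after TSX (X = S + 1)
        base = self.s + 1 + offset
        return (self.mem[base] << 8) | self.mem[base + 1]

    def poke16(self, offset, value):
        base = self.s + 1 + offset
        self.mem[base] = value >> 8
        self.mem[base + 1] = value & 0xFF


def spshx(stack, x):
    """Called with the return address on top; leaves x underneath it."""
    stack.s -= 2                          # DES, DES
    stack.poke16(0, stack.peek16(2))      # move the return address down
    stack.poke16(2, x)                    # park X where the return address was


def spulx(stack):
    """Called with the return address on top; removes and returns the cell under it."""
    x = stack.peek16(2)                   # value under the return address
    stack.poke16(2, stack.peek16(0))      # move the return address up
    stack.s += 2                          # INS, INS
    return x
```

In both directions the return address has to be copied past the data cell before the routine can return, which is where all the extra instructions go.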

With that written, I at least was confident (the next time I could work on it) that the same stack frames we used on the 6801 would be workable. (If you don't have the 6801 code open in another browser window for reference, go ahead and open it up, you'll want it handy to compare.) And if the stack frame would be the same, I could just convert the link and unlink from the 6801 code:

LINKF	DES
	DES
	DES
	DES
	TSX
	LDD	4,X
	STD	0,X
	LDD	VBP
	STD	4,X
	LDD	FP
	STD	2,X
	INX
	INX
	STX	FP
	STX	VBP	
	RTS
*
* No return value on stack
UNLKF	TSX
	LDD	2,X	; get old FP, dodge return address
	STD	FP	
	LDD	4,X	; old VBP
	STD	VBP
	LDD	0,X	; return address
	STD	4,X	; copy it so we can return
	INS		; drop 4 bytes
	INS
	INS
	INS
	RTS

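It was a little bit trickier, since we don't have PSHX and PULX on the 6800 (or LDD and STD, for that matter, so each 16-bit load and store above turns into a pair of 8-bit accumulator instructions in the full listing below), but it wasn't too bad.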
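If the pointer shuffling is hard to follow, a Python model of the frame layout may help. This is a sketch under assumptions: the Machine class and its method names are invented for illustration, and the logic follows the 6800 versions of LINKF and UNLKF in the full listing (where UNLKF works from FP, so it also drops temporaries).

```python
# Toy model of LINKF / UNLKF on a byte array. Names are hypothetical;
# only the frame layout is meant to mirror the post's listing.

class Machine:
    def __init__(self):
        self.mem = bytearray(0x100)
        self.s = 0xFF          # 6800-style S: points at the next free byte
        self.fp = 0
        self.vbp = 0

    def rd16(self, addr):
        return (self.mem[addr] << 8) | self.mem[addr + 1]

    def wr16(self, addr, value):
        self.mem[addr] = (value >> 8) & 0xFF
        self.mem[addr + 1] = value & 0xFF

    def push16(self, value):
        self.s -= 2
        self.wr16(self.s + 1, value)

    def pull16(self):
        value = self.rd16(self.s + 1)
        self.s += 2
        return value


def linkf(m):
    """Entered with LINKF's own return address on top of the stack."""
    m.s -= 4                      # DES x4: room for two links
    x = m.s + 1                   # TSX
    m.wr16(x, m.rd16(x + 4))      # move LINKF's return address to the new top
    m.wr16(x + 4, m.vbp)          # old VBP sits above ...
    m.wr16(x + 2, m.fp)           # ... the old FP (the frame link)
    m.fp = m.vbp = x + 2          # both now point at the frame link


def unlkf(m):
    """Entered with UNLKF's own return address on top; drops locals too."""
    x = m.fp
    m.vbp = m.rd16(x + 2)         # restore caller's VBP
    ret = m.pull16()              # UNLKF's own return address
    m.wr16(x + 2, ret)            # park it just above the frame link
    m.s = x - 1                   # TXS
    m.fp = m.rd16(x)              # restore caller's FP
    m.s += 2                      # INS, INS: S now just below the parked address
```

After unlkf returns, the next thing on the stack is the return address of the routine that called LINKF in the first place, so that routine's own RTS works as usual.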

And then I proceeded to work on converting the addition routines. (And in the process realized I had misnamed SUB16whatever, but I've taken care of that now.)

And I discovered that moving the return value back into the X and A:B registers at the end of the routines was waste motion. 

You have to use the accumulators and the index to simply get in and out of your subroutines' stack frames, so you're just thrashing the stack at procedure entrance and exit.

So I added two variables in the direct page for the return values, RETVHI and RETVLO.

And I paused to reflect for a moment whether they would have also been useful in the 6801 code. I think it would be a wash, really, because using these direct page variables for the return values means that the caller has to load them, and the 6801 does have PSHX and PULX.

Lots of stuff like that came up during the conversion.

Another issue that came up was that the EXORsim interactive assembler has a bug that makes the FDB (form double byte constant) directive unusable.

I brought in the PSH16I routine to load 16-bit literals out of the instruction stream:

PSH16I	TSX		; point to top of return address stack
	LDX	0,X	; point into the instruction stream
	LDAA	0,X	; high byte from instruction stream
	LDAB	1,X	; low byte from instruction stream	
	INS		; drop the return address we almost have in X
	INS
	PSHB		; replace it with the constant
	PSHA
	JMP	2,X	; return to the byte after the constant.

But you have to follow that with the two bytes you want to push onto the stack, and that's an FDB in the case of addresses and 16-bit offsets.

And EXORsim's interactive assembler doesn't help us split addresses up, so there were several places I loaded the 16-bit address or offset into the index register and called SPSHX, or did similar things.
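The in-line literal trick in PSH16I can be modeled in a few lines of Python. This is a sketch only: the function name, the list-based stack, and the addresses are assumptions made for illustration.

```python
# Toy model of PSH16I: fetch a 16-bit literal that follows the call in the
# instruction stream. Names and addresses here are hypothetical.

def psh16i(code, stack):
    """code: bytes-like program image; stack: list of 16-bit cells, top at the end.
    On entry the top of stack is the return address, which points at the literal."""
    ret = stack.pop()                            # TSX / LDX 0,X ... INS INS
    literal = (code[ret] << 8) | code[ret + 1]   # high byte first (big-endian)
    stack.append(literal)                        # PSHB / PSHA
    return ret + 2                               # JMP 2,X: resume after the literal


# JSR PSH16I at $0010 (3 bytes), then FCB $12, FCB $34, then the next instruction
image = bytearray(0x20)
image[0x13] = 0x12
image[0x14] = 0x34
```

Calling psh16i(image, [0x0013]) leaves the literal on the stack and returns the address just past it, which is the whole point of returning with JMP 2,X instead of RTS.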

And the stack (and runtime) initialization (STKINI is a misnomer, isn't it?) needs these routines, so I moved stuff in there around so that I'd have the stack ready early. Some of the math had to be done by hand in the process of setting the stack up, and I ended up doing some other things by hand anyway.

But you'll see a premonition of why this is all so meaningless in a routine that is just for the initialization code, UADD16.

* Utility 16-bit add, leave result in A:B
UADD16	TSX		; no frame
	LDAB	5,X	; left
	ADDB	3,X	; right		; because we can
	LDAA	4,X	; left
	ADCA	2,X	; right
	LDX	0,X
UADROP	INS		; drop return address and parameters
	INS
	INS
	INS
	INS
	INS
	JMP	0,X	; return via X

I didn't end up using USUB16, but I left it in for your enjoyment.
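For anyone who wants to check the byte-at-a-time arithmetic in UADD16, here is the same add written out in Python. The function name is made up; the steps follow the ADDB and ADCA pair in the routine.

```python
def add16_bytewise(left, right):
    """Add two 16-bit values the way UADD16 does: low bytes first (ADDB),
    then high bytes plus the carry (ADCA). Returns (A, B, carry_out)."""
    low = (left & 0xFF) + (right & 0xFF)             # ADDB
    b = low & 0xFF
    carry = low >> 8
    high = (left >> 8) + (right >> 8) + carry        # ADCA
    a = high & 0xFF
    return a, b, high >> 8
```

The low bytes have to go first because the carry out of the low add feeds the high add. The intervening LDAA in the routine is safe because loads on the 6800 leave the carry flag alone.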

Anyway, it was by no means as straightforward as I had hoped (of course). I ended up trying a number of things that didn't help, like defining PUSHD and POPD routines, and a SBX routine.

Admittedly, some of the complexities could be avoided by simply restricting the stack from crossing a 256-byte boundary, and leaving LOUD notes in the comments about ALWAYS keeping the size and location so that it doesn't. You'll note that it is actually the case here that the size and location would allow us to optimize out carries for the stack pointer.
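The no-page-crossing condition is easy to check by hand, or with a few lines of Python. The offsets below are taken from the EQUs in the listing further down; the function and variable names are my own shorthand, so treat this as a sketch.

```python
def crosses_page(first, last):
    """True if the inclusive address range touches more than one 256-byte page."""
    return (first >> 8) != (last >> 8)

# Offsets from the listing's EQUs, relative to LB_ADDR at $2000
LB_ADDR = 0x2000
STKLIMX = 4 + 4                 # FINALX+4
STKBASX = STKLIMX + 192
STKBMPX = STKBASX + 8 + 2       # past STKBAS and STKFAK
stack_first = LB_ADDR + STKLIMX
stack_last = LB_ADDR + STKBMPX + 4 - 1   # through the bumper
```

With these numbers the whole stack region sits inside page $20, so the high byte of the stack pointer never changes and an 8-bit add on the low byte would be enough.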

But, although shouting with capitals can be done in plain text, colored text cannot, so I don't think it's a wise example ...

No, that's not it. It just doesn't really solve enough of the problems to make stack frames a reasonable option. Try it yourself if you are not convinced.

As always, read the code and the comments. Don't assume I remembered to fix the comments after every edit. If the comments don't seem to match the code, they may not, or the code may be doing things you don't understand. Take time to work through it and be sure.

* 16-bit addition as example of single-stack stack frame discipline on 6800,
* with test code
* Joel Matthew Rees, October, November 2024
*
	OPT	6800
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for user stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily in leaf functions only
DWORK	RMB	2	; For saving D temporarily in leaf functions only
RETVHI	RMB	2	; high half of 32-bit return values (because we can't push X easily)
RETVLO	RMB	2	; 16-bit return values and low half (because loading and saving is redundant)
FP	RMB	2	; frame pointer
VBP	RMB	2	; variable base pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
STKLIM	RMB	192	; roughly 16 to 20 levels of call
STKLIMX	EQU	FINALX+4
STKBAS	RMB	8	; for canary return
STKSZ	EQU	192	; for EXORsim assembler limits
STKBASX	EQU	STKLIMX+192	; must be STKLIMX+STKSZ -- assembler won't take symbol
STKFAK	RMB	2	; fake frame pointer, self-link
STKFAKX	EQU	STKBASX+8	; 6801 is post-dec (post-store-decrement) push
STKBMP	RMB	4	; a little bumper space
STKBMPX	EQU	STKFAKX+2	; But we are going to init S through X
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	STKBMPX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
*STKBASM	FDB	STKBASX	; Doesn't work within EXORsim assembler limits after all
*HBASEXM	FDB	HBASEX	; by avoiding splitting large constants up at assemble time
*
INISTK	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE		; local space functional
	LDAA	LB_BASE		; bootstrap own stack
	LDAB	LB_BASE+1
*	ADDB	STKBASM+1
*	ADCA	STKBASM
	LDX	#STKBASX	; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1
	ADCA	XWORK
*
	STAB	XWORK+1		; initial stack pointer
	STAA	XWORK
*
	LDX	#STKUNDR	; for fake return address
	STX	DWORK		; save it for a moment
*	
	PULA		; pop real return address
	PULB
	LDX	XWORK	; ready own stack pointer
	STS	SSAVE	; save stack pointer from monitor ROM
	TXS		; move to our own stack (let TXS convert it)
	PSHB		; put return address on own stack
	PSHA		; stack now ready for interrupts, utility routines
*
	LDAA	DWORK	; error handler for fake return
	LDAB	DWORK+1
	STAA	0,X	; in the cell beyond empty stack pointer
	STAB	1,X
	STAA	6,X	; full fake frame
	STAB	7,X
	LDAA	XWORK	; calculate final self-link
	LDAB	XWORK+1
	ADDB	#8
	ADCA	#0
	STAA	4,X	; fake VBP
	STAB	5,X
	STAA	8,X	; final self-link
	STAB	9,X
	INX		; prepare first fake stack frame links
	INX
	STX	FP	; get frame pointers ready
	STX	VBP
	STX	0,X	; first self-link for list terminator
*
	LDAA	LB_BASE	
	LDAB	LB_BASE+1
	PSHB
	PSHA
*	JSR	PSH16I 
*	FDB	HBASEX	; EXORsim's interactive assembler doesn't like FDBs.
	LDX	#HBASEX
	JSR	SPSHX
*
	JSR	UADD16
	STAA	HPPTR		; as if we were ready to use heap
	STAB	HPPTR+1
	STAA	HPALL
	STAB	HPALL+1
*	JSR	PSH16I	; FDBs
*	FDB	CDBASE
*	JSR	PSH16I
*	FDB	(-4)		; extra bumper
*	JSR	UADD16
	LDX	#CDBASE
	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	SUBB	#4
	SBCA	#0
*
	STAA	HPLIM
	STAB	HPLIM+1
	RTS		; finally done, now can return
*
***
* Since negative index offsets are so expensive,
* we want to create a stack frame with only positive offsets.
* And we want the frame pointer to be pushed after the call,
* on entry to the local context.
* And the saved frame pointer needs to link to the previous one.
* And when we restore the previous frame, 
* we need to be able to restore the previous frame base.
*
* Cross-section of general frame structure in called routine:
* [{LOCVAR}] for calling routine
* [{TEMP}  ] for calling routine
* [PARAM   ] from calling routine
* [RETADR  ] to calling routine
* [VARBP   ] base of local variables in calling routine
* [FRMLNK  ] at entry to calling routine
* [LOCVAR  ] for called -- current -- routine
* [TEMP    ] for called -- current -- routine
* [(PARAM) ] to be passed to a further call
*
* Broader cross-section, showing chaining for routine 3, in-flight:
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK2 ] <= FRMLNK3
* [LOCVAR2 ] <= VARBP2
* [TEMP2   ]
* [PARAM3  ]
* [RETADR2 ]
* [VARBP2  ]
* [FRMLNK3 ] <= FP (frame pointer)
* [LOCVAR3 ] <= VBP (variable base pointer)
* [TEMP3   ]
* [(PARAM4)] <= SP (return stack pointer (6800 S is byte below))
***
*
***
* Utility routines
*
* Push low half of return value
PSHLH	TSX
	LDAA	0,X		; return address
	LDAB	1,X
	PSHB
	PSHA
	LDAA	RETVLO
	LDAB	RETVLO+1
	STAA	0,X
	STAB	1,X
	RTS
*
* Avoid the math to split 16-bit constants into two 8-bit loads,
* and push them while we are here.
* The constant follows the call in the instruction stream.
* Leaves constant in A:B, as well.
PSH16I	TSX		; point to top of return address stack
	LDX	0,X	; point into the instruction stream
	LDAA	0,X	; high byte from instruction stream
	LDAB	1,X	; low byte from instruction stream	
	INS		; drop the return address we almost have in X
	INS
	PSHB		; replace it with the constant
	PSHA
	JMP	2,X	; return to the byte after the constant.
*
* 8 bytes for the meat of this vs. 3 for the call.
* We end up using it a lot since EXORsim's interactive assembler doesn't do FDBs.
SPSHX	STX	XWORK
	DES
	DES
	TSX
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	LDAA	XWORK
	LDAB	XWORK+1
	STAA	2,X
	STAB	3,X
	RTS
*
* 6 bytes for the meat of this vs. 3 for the call, instead of FDB
TXD	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	RTS
*
* Utility 16-bit add, leave result in A:B
UADD16	TSX		; no frame
	LDAB	5,X	; left
	ADDB	3,X	; right		; because we can
	LDAA	4,X	; left
	ADCA	2,X	; right
	LDX	0,X
UADROP	INS		; drop return address and parameters
	INS
	INS
	INS
	INS
	INS
	JMP	0,X	; return via X
*
* Utility 16-bit subtract, leave result in A:B
USUB16	TSX		; no frame
	LDAB	5,X	; left
	SUBB	3,X	; right		; because we can
	LDAA	4,X	; left
	SBCA	2,X	; right
	LDX	0,X
	BRA	UADROP	; drop return address and parameters
*
* Let the caller do allocation after.
LINKF	DES		; allocate room to push to
	DES
	DES
	DES
	TSX
	LDAA	4,X	; return address
	LDAB	5,X	; not sure of any reason to use or not use B
	STAA	0,X	; move it down to new top of stack
	STAB	1,X
	LDAA	VBP	; copy VBP and FP above return address
	LDAB	VBP+1
	STAA	4,X
	STAB	5,X
	LDAA	FP
	LDAB	FP+1
	STAA	2,X
	STAB	3,X
	INX
	INX
	STX	FP
	STX	VBP
	RTS
*
* No return value on stack
UNLKF	LDX	FP
	LDAA	2,X	; old VBP
	LDAB	3,X
	STAA	VBP
	STAB	VBP+1
	PULA		; get the return address	
	PULB
	STAA	2,X	; put return address in place
	STAB	3,X
	TXS		; drop temporaries and locals
	LDX	0,X	; get old FP
	STX	FP
	INS
	INS
	RTS
*
* We really don't want to put S in a temp if we can avoid it
ALOCS8	PULA
	PULB
ALOS8I	DES
	DES
ALOS6I	DES
	DES
ALOS4I	DES
	DES
ALOS2I	DES
	DES
	PSHB
	PSHA
	RTS
*
ALOCS6	PULA
	PULB
	BRA	ALOS6I
*
ALOCS4	PULA
	PULB
	BRA	ALOS4I
*
ALOCS2	PULA
	PULB
	BRA	ALOS2I
*
INI0_8	CLRA
	CLRB
* call with initialization value in A:B
INIS8	TSX
INIT8	STAA	8,X
	STAB	9,X
INIT6	STAA	6,X
	STAB	7,X
INIT4	STAA	4,X
	STAB	5,X
INIT2	STAA	2,X
	STAB	3,X
	RTS		; 0,X is return address!
*
INI0_6	CLRA
	CLRB
* call with initialization value in A:B
INIS6	TSX
	BRA	INIT6	; join the common store chain
*
INI0_4	CLRA
	CLRB
* call with initialization value in A:B
INIS4	TSX
	BRA	INIT4	; join the common store chain
*
INI0_2	CLRA
	CLRB
* call with initialization value in A:B
INIS2	TSX
	BRA	INIT2	; join the common store chain
*
DROP8	PULA
	PULB
	INS
	INS
DROP6I	INS
	INS
	INS
	INS
	INS
	INS
	PSHB
	PSHA
	RTS
*
DROP6	PULA
	PULB
	BRA	DROP6I
*
*
* Stack after LINK and allocation
* when functions are called by MAIN
* with two parameters
* We will return results in RETVHI:RETVLO in direct page
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ]
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ]
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FRMLNK0
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK0 ] <= FP,SP,VBP
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit RETVHI:RETVLO (direct page)
* Does not alter the parameters.
ADD16S	JSR	LINKF
	TSX		; no local allocations
*
	LDAA	#(-1)	; prepare for sign extension
	TST	8,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLRA		; zero extend
ADD16SR	PSHA		; push left extension
	PSHA		; left sign cell below X now
	LDAA	#(-1)	; reload
	TST	6,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLRA		; zero extend
ADD16SL	PSHA		; push right extension
	PSHA
	TSX		; point to sign extensions
	LDAA	12,X	; left-hand low cell
	LDAB	13,X
	ADDB	11,X	; right-hand low cell
	ADCA	10,X
	STAA	RETVLO	; save low half of result
	STAB	RETVLO+1
	LDAA	2,X	; left-hand extension
	LDAB	3,X
	ADCB	1,X	; right-hand extension
	ADCA	0,X
	STAA	RETVHI	; Save high half of result
	STAB	RETVHI+1
*
	JSR	UNLKF	; drops temporaries
	RTS		; result is in RETVHI:RETVLO
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit RETVHI:RETVLO (direct page)
ADD16U	JSR	LINKF
	TSX		; no local allocations
*
	LDAA	8,X	; left
	LDAB	9,X
	ADDB	7,X	; right
	ADCA	6,X
	STAA	RETVLO	; save low half
	STAB	RETVLO+1
	LDAB	#0
	ADCB	#0
	STAB	RETVHI+1	; save carry bit in high half
	CLR	RETVHI		; will never carry beyond bit 17
*
	JSR	UNLKF	; drops temporaries
	RTS		; result is in RETVHI:RETVLO
*
* Etc.
*
***
*
* Stack after LINK #0 when functions are called by MAIN
* with one input parameter
* (#0 means no local variables)
* [<SELF>  ] <= <SELF>
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ]
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ]
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FRMLNK0
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK0 ] <= FP,SP,VBP
*
* To show how to walk the stack --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit addend
* target parameter in caller
*   2nd 32-bit variable at offset -2*NATWID
* no output parameter:
ADD16SI	JSR	LINKF
	TSX		; no local variables 
*
	LDAA	#(-1)
	TST	6,X	; high byte of parameter
	BMI	ADD16SIP
	CLRA
ADD16SIP	PSHA	; save the sign extension half
	PSHA
	LDX	2,X	; get caller's VBP
	LDAA	2,X	; caller's 2nd variable, low
	LDAB	3,X
	LDX	FP
	ADDB	7,X	; parameter
	ADCA	6,X
	LDX	2,X	; caller's VBP
	STAA	2,X	; save result low half away
	STAB	3,X
	LDAA	0,X	; caller's 2nd variable, high
	LDAB	1,X
	TSX
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	FP
	LDX	2,X
	STAA	0,X	; save result high half away
	STAB	1,X
*
	JSR	UNLKF	; drops temporaries 
	RTS		; no result to load
*
*
***
* Stack after LINK
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ] 
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FP
* [32:VAR1_1]
* [32:VAR1_2] <= SP,VBP
*
MAIN	JSR	LINKF
	JSR	ALOCS8	; 2 calls, 6 bytes vs. 1 clr + 8 pushes , 9 bytes
	JSR	INI0_8
	TSX
	STX	VBP	; link and allocate complete
*
	JSR	PSH16I
*	FDB	$1234	; parameters
	FCB	$12
	FCB	$34
	JSR	PSH16I
*	FDB	$CDEF
	FCB	$CD
	FCB	$EF
	JSR	ADD16U	; result in RETVHI:RETVLO should be $0000E023
	INS		; drop one parameter, reuse other
	INS
	TSX
	LDAA	RETVLO	; four extra bytes compared to calling PSHLH
	LDAB	RETVLO+1
	STAA	0,X
	STAB	1,X	
	JSR	PSH16I
*	FDB	$8765
	FCB	$87
	FCB	$65
	JSR	ADD16S	; result in RETVHI:RETVLO should be $FFFF6788
	INS		; drop one parameter, reuse other
	INS
	LDX	VBP
	LDAA	RETVHI
	LDAB	RETVHI+1
	STAA	0,X
	STAB	1,X
	LDAA	RETVLO
	LDAB	RETVLO+1
	STAA	2,X
	STAB	3,X
	TSX
	LDAB	#$A5
	STAB	0,X	; $A5
	STAB	1,X	; $A5A5
	JSR	ADD16SI		; result in 2nd variable should be FFFF0D2D
	LDX	VBP		; get the result from our variable
	LDAA	2,X		; low half
	LDAB	3,X
	LDX	LB_BASE		; store it in FINAL, in process local space
	STAA	FINALX+2,X
	STAB	FINALX+3,X
	LDX	VBP
	LDAA	0,X		; high half
	LDAB	1,X
	LDX	LB_BASE
	STAA	FINALX,X
	STAB	FINALX+1,X
*
	JSR	UNLKF
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
* (who knows?) <= VBP
***
*
* Stack after initialization:
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,FP,VBP
* [STKUNDR ]STKBAS <= SP
***
* Stack after LINK (at call to MAIN)
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ] 
* [FRMLNKY=STKBAS+NATWID ] <= SP,FP,VBP
*
START	NOP
	JSR	INISTK
	NOP
*
	JSR	LINKF
*
	JSR	MAIN
*
	JSR	UNLKF
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	LDX	$FFFE	; alternatively, jmp through reset vector
	JMP	0,X
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

I had to use my own assembler to clean up some mistakes, but the code assembles and runs correctly in EXORsim. As always, I will make no guarantees that this code is appropriate to be generalized for compilers and such.
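If you want to cross-check the values the comments in MAIN expect without firing up the emulator, the arithmetic is easy to reproduce in Python. The function names below just echo the routine names; this is a sketch of the arithmetic, not of the stack handling.

```python
def sext16(value):
    """Sign-extend a 16-bit value to a Python int."""
    return value - 0x10000 if value & 0x8000 else value

def add16u(left, right):
    """Unsigned 16-bit add, 32-bit result (what ADD16U leaves in RETVHI:RETVLO)."""
    return (left + right) & 0xFFFFFFFF

def add16s(left, right):
    """Signed 16-bit add, 32-bit two's-complement result (ADD16S)."""
    return (sext16(left) + sext16(right)) & 0xFFFFFFFF

def add16si(var32, addend):
    """Add a sign-extended 16-bit addend into a 32-bit variable (ADD16SI)."""
    return (var32 + sext16(addend)) & 0xFFFFFFFF
```

Feeding it the constants from MAIN ($1234 + $CDEF, then the low half plus $8765, then $A5A5 added into the stored 32-bit result) should walk through the same three values the listing's comments give.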

We've seen what this kind of code looks like without stack frames, but once I get the split-stack version of this code up, I'm planning to do the functionality without frames so you can really see it and compare. 

Again, if you're getting worn out, go ahead and move on to getting numeric output in binary.



Tuesday, November 12, 2024

ALPP 02-26 -- Walking the Pontoons -- Split-stack Stack Frame Example: 6801

And this one has been sitting at the bottom of the pool for a while, as predicted, even longer than the 6809 example.

  Walking the Pontoons --
Split-stack Stack Frame Example:
6801

(Title Page/Index)

Having worked through a concrete single-(combined)-stack stack frame example for the 6801, let's look at whether the split stack discipline/paradigm improves things (as I think it should).

In the split stack discipline, we no longer need frame links. Each frame pointer is simply stacked up like the VBP was in the single stack example. Walking from frame to frame is simply moving back in the return stack -- since we've decided to store the frame pointers there in this split stack discipline.

We decided this?

Okay, I decided it.

Here are some of my reasons -- 

  • We want linkage on one stack and parameters on the other. 
  • And one of the side benefits is that it's no longer a linked list, it's a stack of pointers, really easy to walk.

We can now simply point to the actual bottom of the fixed part of each frame, instead of the frame link.

And that means that local references can now be positive offset from the frame pointer, thanks to the natural structure in logic, reason, and mathematics -- the structure in the universe, if you will allow me, that either God or the initial conditions provided us with.

We might still wish we had the SBX instruction. It would make allocation without initialization more straightforward in many cases. But the Double accumulator and adding negatives can help get us over that. Sort-of.

As with the 6809 and 68000, the return stack will always be in pairs.

However, rather than pushing the previous frame pointer immediately, we'll have the called routine push the frame pointer after allocating the local variables.

And it will be pushed in order of return address, caller frame pointer:

* [PRETADR   ]
* [PCALLERFRM]
* [RETADR    ]
* [CALLERFRM ] <= RSP

The parameter stack is just whatever is out there, but the conceptual order would be 

* [VARIABLES  ] <= CALLERFRM
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES  ] <= FP
* [TEMPORARIES]
* [PARAMETERS ] <= PSP

The difference here with the 6809 and 68000 implementations is that, instead of relying on negative offsets to access the variables as is the case there, we will move the frame pointer to the bottom of the variables after allocating them. Variables and parameters passed in will be at known positive offsets from the frame pointer.

The frame pointer we pushed before allocation was the frame pointer of the calling routine, which is as it should be. This will not allow us reliable (easy) means of accessing the calling function's temporary variables, but we shouldn't want to access them anyway.

And we note again that we no longer need to dodge around the return pointer and the saved stack frame pointer because they are on the other stack (where they should be).
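As a rough picture of the discipline, here is a toy Python model with two separate stacks. Everything in it is an assumption made for illustration: the class and method names are invented, the lists grow upward rather than downward, and frame pointers are list indexes rather than addresses.

```python
# Toy model of the split-stack discipline described above.
# Class and method names are hypothetical.

class SplitStack:
    def __init__(self):
        self.rstack = []     # return stack: return address, caller FP, ...
        self.pstack = []     # parameter stack: parameters, locals, temporaries
        self.fp = 0          # index into pstack of the current frame's base

    def call(self, return_address, n_locals):
        self.rstack.append(return_address)
        self.rstack.append(self.fp)            # caller's frame pointer
        self.fp = len(self.pstack)             # frame starts at the locals
        self.pstack.extend([0] * n_locals)

    def ret(self):
        del self.pstack[self.fp:]              # drop locals and temporaries
        self.fp = self.rstack.pop()
        return self.rstack.pop()

    def frames(self):
        """Walk saved frame pointers, newest first, without following links."""
        return [self.fp] + self.rstack[-1::-2]
```

Note that frames() never has to chase a linked list. The saved frame pointers are simply every other cell of the return stack.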

You may be wondering about the local variable allocation and initialization, since the 6801 provides no native push instruction for the parameter stack, and the clean method we used in-line in the combined frames example:

	LDX	#0	; 3:  3#	3:  3~
	PSHX		; 1:  4 	4:  7
	PSHX		; 1:  5 	4: 11
	PSHX		; 1:  6 	4: 15
	PSHX		; 1:  7 	4: 19

is no longer available. 

We might want to use this, instead:

	LDD	#0
	JSR	PPSHD
	JSR	PPSHD
	JSR	PPSHD
	JSR	PPSHD

but it comes at a time cost:

PPOPD	LDX	PSP	; 2:  2#	4:  4~
	LDD	0,X	; 2:  4 	5:  9
	INX		; 1:  5 	3: 12
	INX		; 1:  6 	3: 15
	STX	PSP	; 2:  8 	4: 19
	RTS		; 1:  9 	5: 24 (+3#, +6~ call)
*
PPSHD	LDX	PSP	; 2:  2 	4:  4
	DEX		; 1:  3 	3:  7
	DEX		; 1:  4 	3: 10
	STX	PSP	; 2:  6		4: 14
	STD	0,X	; 2:  8 	5: 19
	RTS		; 1:  9 	5: 24 (+3#, +6~ call)

Four calls to PPSHD is going to take 120 cycles (plus the load Double accumulator), six times what the PSHX costs. That hurts.
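The comparison is just arithmetic on the cycle annotations, which can be spelled out in Python. The constant and function names are invented for this sketch; the cycle counts are the ones annotated in the code above.

```python
# Cycle counts as annotated in the listings above (MC6801 timings).
LDX_IMM = 3
PSHX = 4
PPSHD_BODY = 24     # LDX, DEX, DEX, STX, STD, RTS
JSR_EXT = 6

def inline_pshx_cycles(cells):
    """LDX #0 once, then one PSHX per 16-bit cell."""
    return LDX_IMM + PSHX * cells

def ppshd_call_cycles(cells):
    """One JSR PPSHD per 16-bit cell (not counting the load of D)."""
    return (JSR_EXT + PPSHD_BODY) * cells
```

For four cells that works out to a ratio of a bit over six to one in favor of the in-line PSHX sequence.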

There is a lot of redundant loading and storing of PSP going on in those routines, not to mention the basic call overhead -- the JSR and RTS instructions. 

Would doing it in a loop help?

* Enter here to initialize to 0
ALIND0	LDD	#0	; 3:  3#	3:  3~
* count in pseudo-register BCOUNT,
* initial value in D
ALINDC	LDX	PSP	; 2:  2#	4:  4~
ALINDL	DEX		; 1:  3 	3:  7
	DEX		; 1:  4 	3: 10
	STD	0,X	; 2:  6 	5: 15  (pre-decrement, like PPSHD)
	DEC	BCOUNT	; 3:  9 	6: 21  (pseudo-register, but extended mode)
	BNE	ALINDL	; 2: 11 	3: 24 (20 * count + 4 => count 4=>84)
	STX	PSP	; 2: 13 	4: (84)+4
	RTS		; 1: 14 	5: (84)+9 (+3#, +6~ call)

That's better than our four calls to PPSHD, but still over 4 times the 4 PSHXes. And, in order to clear 16 bits at a time, I'm using a pseudo-register -- BCOUNT. 
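The loop's cost as a function of the count can be written down directly from the annotations. The function name is made up, and the numbers are the listing's.

```python
def alind_loop_cycles(count, with_call=True):
    """Cycles for the ALINDC loop: LDX PSP, then count passes of the
    20-cycle body, then STX PSP and RTS, using the listing's annotations."""
    cycles = 4 + 20 * count + 4 + 5
    if with_call:
        cycles += 6          # JSR extended
    return cycles
```

With a count of 4 that is 93 cycles before the call overhead, which is where the "over 4 times" comparison with the PSHX sequence comes from.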

And we are reminded that the 6801 does not have direct-page addressing modes for the read-modify-write (unary operator) instructions, so BCOUNT will be addressed in extended (absolute) mode. 

(Someday, I want to use programmed logic to implement a 6801 source-code compatible CPU with the op-codes moved around a bit, add SBX, and add direct-page mode op-codes for the read-modify-write -- unary -- instructions. It wouldn't save all that much time in the above loop, but I would also add address space decoding, so that the direct page could be popped out of the absolute address space, and you could do fancy things like bank-switch internal dual-ported RAM in the direct page, speeding access further and allowing really fast process context switches. Heh. Dreams.)

If we are really clever, we can line it all up and provide routines for allocating and clearing up  to four 16-bit local variables like this:

ALCL8	LDD	#0	; 3:  3#	3:  3~
	LDX	PSP	; 2:  5 	4:  7 
	DEX		; 1:  6 	3: 10
	DEX		; 1:  7 	3: 13
	STD	0,X	; 2:  9 	5: 18
ALCLI6	DEX		; 1: 10 	3: 21
	DEX		; 1: 11 	3: 24
	STD	0,X	; 2: 13 	5: 29
ALCLI4	DEX		; 1: 14 	3: 32
	DEX		; 1: 15 	3: 35
	STD	0,X	; 2: 17 	5: 40
ALCLI2	DEX		; 1: 18 	3: 43
	DEX		; 1: 19 	3: 46
	STD	0,X	; 2: 21 	5: 51
	STX	PSP	; 2: 23 	4: 55
	RTS		; 1: 24 	5: 60
*
ALCL6	LDD	#0	; 3:  3#	 3:  3~
	LDX	PSP	; 2:  5 	 4:  7
	BRA	ALCLI6	; 2:  7 	 3: 10
*			; 0:  7 	42: 52
*
ALCL4	LDD	#0	; 3:  3#	 3:  3~
	LDX	PSP	; 2:  5 	 4:  7
	BRA	ALCLI4	; 2:  7 	 3: 10
*			; 0:  7 	31: 41
*
ALCL2	LDD	#0	; 3:  3#	 3:  3~
	LDX	PSP	; 2:  5 	 4:  7 
	BRA	ALCLI2	; 2:  7 	 3: 10
*			; 0:  7 	20: 30

That's half the cycle count (not including the call) of the four calls to PPSHD, but still around three times doing the 8 bytes with four PSHX instructions. It is an improvement.

What's the fastest possible way on the 6801 short of doing dangerous (because of interrupts) things like quickly swapping the return stack pointer in for the parameter stack pointer and swapping it out when done? 

(Never do that. If you swap out the stack pointer, you will not avoid interrupts, no matter how hard you try, and your interrupts will save registers all over things. I mean, it is possible to physically strap all interrupt inputs high, but then you have the devil of a time interfacing with the real world. Keep your return stack valid at all times.)

If we replace all those DEXs with direct calculation, we can trim it up a bit, like this:

ALCLD8	LDD	#-8	; 3:  3#	3:  3~
	ADDD	PSP	; 2:  5 	5:  8
	STD	PSP	; 2:  7 	4: 12
	LDX	PSP	; 2:  9 	4: 16
	CLRA		; 1: 10 	2: 18  (D still holds the new PSP, clear both halves)
	CLRB		; 1: 11 	2: 20
	STD	0,X	; 2: 13 	5: 25
	STD	2,X	; 2: 15 	5: 30
	STD	4,X	; 2: 17 	5: 35
	STD	6,X	; 2: 19 	5: 40
	RTS		; 1: 20 	5: 45  (PSP is already updated, no STX needed)

But there's no reusable code in there, other than that the whole routine can be used as a subroutine anywhere you need to allocate and clear 8 bytes on the parameter stack.
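The idea of allocating by adding a negative constant, rather than stepping the pointer down two bytes at a time, looks like this in Python. It is a sketch only: the function name and the byte-array memory are assumptions, and the 16-bit masking stands in for the wraparound the Double accumulator gives for free.

```python
def alloc_clear(mem, psp, nbytes):
    """Allocate nbytes on a downward-growing parameter stack by adding the
    two's-complement negative to PSP (16-bit wraparound), then clear the block."""
    psp = (psp + (-nbytes & 0xFFFF)) & 0xFFFF    # LDD #-8 / ADDD PSP
    for offset in range(nbytes):                 # STD 0,X .. STD 6,X
        mem[psp + offset] = 0
    return psp
```

The pointer is adjusted once, and the stores then use fixed positive offsets from it, which is why there is nothing left in the middle of the routine to branch into for the smaller sizes.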

I'm not going to show all the approaches I've checked, but that's the best for time that I can figure out. The down side is that I can't figure out how to steal code from it, so it has to be mostly recreated for the 6 byte and 4 byte case. The 2 byte case you're just going to want to in-line. But it's always faster.

On the other hand, if you need to trim up the object code size, we do have a sort-of decent option in the DEX chain.

Doesn't this de-optimization of stack allocation have negative consequences for the split stack discipline?

Yes, but I've found, at least in my range of experience, that the negatives are not nearly as negative as they might seem, and the benefits outweigh the disadvantages.

So, the code --

As always, read the code, read the comments and be careful because sometimes the code and the comments are out-of-sync. (You can be sure the comments are closer to the code than any detailed explanation I could give in the blog.)

* 16-bit addition as example of split-stack stack frame discipline on 6801
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
	OPT	6801
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for user stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
ALCOUNT	RMB	1	; for allocation utility routines
	RMB	1	; reserved
PSP	RMB	2	; parameter stack pointer
FP	RMB	2	; frame pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
SSTKLIM	RMB	64	; 16 levels of call
SSTKLMX	EQU	FINALX+4
SSTKBAS	RMB	8	; for canary return
SSTKBSX	EQU	SSTKLMX+64
SSTKFAK	RMB	2	; fake frame pointer, self-link
SSTKFAX	EQU	SSTKBSX+8	; 6801 is post-dec (post-store-decrement) push
SSTKBMP	RMB	4	; a little bumper space
SSTKBMX	EQU	SSTKFAX+2	; But we are going to init S through X
PSTKLIM	RMB	64	; 16 levels of call at two parameters per call
PSTKLMX	EQU	SSTKBMX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBSX	EQU	PSTKLMX+64
PSTKBMP	RMB	4	; a little bumper space
PSTKBMX	EQU	PSTKBSX+4
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	PSTKBMX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
*
INISTKS	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE
	LDD	LB_BASE	
	ADDD	#HBASEX		; calculate EA
	STD	HPPTR		; as if we actually had a heap
	STD	HPALL
	LDD	#CDBASE
	SUBD	#4		; extra bumper
	STD	HPLIM
	LDD	LB_BASE
	ADDD	#PSTKBSX	; calculate the address
	STD	PSP	; initialize parameter stack pointer empty
	STD	FP	; initialize frame pointer
	LDX	#SSTKNDR	; error handler
	STX	DWORK	; save it for a bit (not ready for stack)
	LDD	LB_BASE
	ADDD	#SSTKBSX	; calculate the address
	STD	XWORK	; move it to X (not ready for stack)
	LDX	XWORK
	LDD	DWORK	; prime the stack with underflow error handler
	STD	0,X
	STD	4,X
	LDD	PSP	; and empty parameter stack frame
	STD	2,X	; stacks primed
	PULA		; get the return address
	PULB
	STS	SSAVE	; Save what the monitor gave us.
	TXS		; move to our own stack
	PSHB
	PSHA
	RTS
*
*
***
* General structure of the stacks, 
*
* return stack is always in pairs:
* [PRETADR   ]
* [PCALLERFRM]
* [RETADR    ]
* [CALLERFRM ] <= RSP
*
* order of elements on the parameter stack,
* when they are present:
* [PARAMETERS ]
* [VARIABLES  ] <= CALLERFRM
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES  ] <= FP
* [TEMPORARIES]
* [PARAMETERS ] <= PSP
*
* Result is returned on parameter stack
*
***
* Utility routines
PPOPD	LDX	PSP
	LDD	0,X
	INX
	INX
	STX	PSP
	RTS
*
PPSHD	LDX	PSP
	DEX
	DEX
	STX	PSP
	STD	0,X
	RTS
*
* subroutine to make sure we don't forget anything
MARK	PULA
	PULB
	LDX	FP
	PSHX		; mark, no allocate,
	LDX	PSP
	STX	FP
	PSHB
	PSHA
	RTS
*
*UNMK	PULA
*	PULB
*	PULX
*	STX	FP
*	PSHB
*	PSHA
*	RTS
*
* Compromise between speed and reusability
* Enter here to load PSP and initialize to 0
* 8 bytes
ALCL8	LDD	#0	; 3:  3#	3:  3~
* Enter here with initial value in D
ALCLD8	LDX	PSP	; 2:  5 	4:  7 
* Enter here with PSP loaded and initial value in D
ALCLI8	DEX		; 1:  6 	3: 10
	DEX		; 1:  7 	3: 13
	STD	0,X	; 2:  9 	5: 18
ALCLI6	DEX		; 1: 10 	3: 21
	DEX		; 1: 11 	3: 24
	STD	0,X	; 2: 13 	5: 29
ALCLI4	DEX		; 1: 14 	3: 32
	DEX		; 1: 15 	3: 35
	STD	0,X	; 2: 17 	5: 40
ALCLI2	DEX		; 1: 18 	3: 43
	DEX		; 1: 19 	3: 46
	STD	0,X	; 2: 21 	5: 51
	STX	PSP	; 2: 23 	4: 55
	RTS		; 1: 24 	5: 60
*
* six bytes
ALCL6	LDD	#0	; 3:  3#	 3:  3~
ALCLD6	LDX	PSP	; 2:  5 	 4:  7
	BRA	ALCLI6	; 2:  7 	 3: 10
*			; 0:  7 	42: 52
* four bytes
ALCL4	LDD	#0	; 3:  3#	 3:  3~
ALCLD4	LDX	PSP	; 2:  5 	 4:  7
	BRA	ALCLI4	; 2:  7 	 3: 10
*			; 0:  7 	31: 41
* two bytes
ALCL2	LDD	#0	; 3:  3#	 3:  3~
	LDX	PSP	; 2:  5 	 4:  7 
	BRA	ALCLI2	; 2:  7 	 3: 10
*			; 0:  7 	20: 30
*
*
* Add D to PSP -- negative for allocation, positive for deallocation
ADDPSP	ADDD	PSP	; plus 3 bytes for load: 6 bytes vs. 9 total
	STD	PSP
	LDX	PSP	; 3 bytes for call vs. 6 bytes in-line
	RTS
*
PDROP_8	LDAB	#8	; saves two bytes, 7 vs. 3
* deallocate count in B
PDROPB	LDX	PSP	; 5 bytes to deallocate in-line
	ABX		; vs. 3 bytes to call this.
	STX	PSP	; ABX is useful for deallocation
	RTS		; 5 bytes vs. 7 total
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry, after link:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ]
* [FRMPTR0==<EMPTYP>]
* [RETADR1 ]
* [FRMPTR1 ] <= RSP
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>] <= FRMPTR0,FRMPTR1
* [32:VAR1_1--]
* [32:VAR1_2--] <= FRMPTR1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP,FP
*
****
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	JSR	MARK	; mark, no allocate, X is PSP
*
	LDD	#(-1)	; default negative
	JSR	ALCLI4	; allocate 2 temporary cells and init
	LDX	FP	; 
	TST	2,X	; the left-hand operand sign bit
	BMI	ADD16SR
	LDX	PSP
	CLR	2,X	; positive
	CLR	3,X
ADD16SR	LDX	FP
	TST	0,X	; the right-hand operand sign bit
	BMI	ADD16SL
	LDX	PSP
	CLR	0,X	; positive
	CLR	1,X
ADD16SL	LDX	FP
	LDD	2,X	; left hand 
	ADDD	0,X	; right hand
	STD	2,X	; store low half
	LDX	PSP
	LDD	2,X
	ADCB	1,X
	ADCA	0,X
	LDX	FP	; wouldn't need to do this if we tracked PSP extras
	STD	0,X
*
	STX	PSP	; drop the temporaries
	PULX
	STX	FP
	RTS
*
* The alternative without link, mark, or restore will be shown in the no-frame case.
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit
* output parameter:
*   17-bit sum in 32-bit D1
ADD16U	JSR	MARK	; mark, no allocate, X is PSP
*
	LDX	FP
	LDD	2,X	; left
	ADDD	0,X	; add right
	STD	2,X	; save low
	LDD	#0	; extend
	ROLB		; extend Carry unsigned (could ADCB #0)
	STD	0,X	; re-use right side to store high half
*
	PULX		; restore FP
	STX	FP
	RTS
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with one 16-bit parameter,
* after mark (no local allocation)
* [<unknown>  ] <= FRMPTR0
* [32:VAR1_1  ]
* [32:VAR1_2  ] <= FRMPTR1
* [16:PARAM2_1] <= PSP,FP
*
* To show how to walk the stack --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit addend in 32-bit
* target parameter in caller
*   2nd 32-bit variable at offset -2*NATWID
* no output parameter:
ADD16SI	JSR	MARK
*
	LDD	#(-1)	; make a temporary -1
	JSR	ALCLI2	; (default to signed)
	LDX	FP
	TST	0,X	; test high byte
	BMI	ADD16SP
	LDX	PSP
	CLR	0,X	; zero extend
	CLR	1,X
ADD16SP	TSX
	LDX	0,X	; caller's FP
	LDD	2,X	; caller's 2nd variable, low
	LDX	FP
	ADDD	0,X	; parameter
	TSX
	LDX	0,X
	STD	2,X	; update low half with result
	LDD	0,X	; 2nd variable, high half
	LDX	PSP
	ADCB	1,X	; sign extension half
	ADCA	0,X
	TSX
	LDX	0,X
	STD	0,X	; update high half
*
	LDX	FP
	INX		; drop parameter
	INX	
	STX	PSP	; and sign temporary goes bye-bye, too
	PULX
	STX	FP
	RTS
*
*
***
* Return stack on entry:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ] <= RSP
*
* Return stack after link:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ]
* [FRMPTR0==<EMPTYP>] <= RSP
*
* Parameter stack after mark and local allocation
* [<unknown>] <= FRMPTR0
* [VAR1_1--]
* [VAR1_2--] <= PSP,FP
*
MAIN	JSR	MARK
	JSR	ALCL8	; allocate and clear 8 bytes
	STX	FP	; Point FP to base of local variables.
*
	LDD	#$1234
	JSR	PPSHD
	LDD	#$CDEF
	JSR	PPSHD
	JSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LDD	#$8765
	LDX	PSP	; reuse parameter space, since order is okay
	STD	0,X
	JSR	ADD16S	; result on parameter stack should be $FFFF6788
	LDX	PSP
	LDD	2,X	; result low half
	LDX	FP
	STD	2,X	; to 2nd local variable low half
	LDX	PSP
	LDD	0,X	; result high half
	LDX	FP
	STD	0,X	; to 2nd local variable high half
	LDD	#$A5A5	
	JSR	PPSHD
	JSR	ADD16SI	; result in 2nd variable should be FFFF0D2D (Carry set)
	LDX	FP
	LDD	2,X	; 2nd variable low half
	LDX	LB_BASE
	STD	FINALX+2,X
	LDX	FP
	LDD	0,X
	LDX	LB_BASE
	STD	FINALX,X
*
	JSR	PDROP_8
	PULX
	STX	FP	; restore FP	
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
*
***
* Return stack will always be in pairs:
* [RETADRNN  ]
* [CALLERFMNN]
*
* Return stack after initialization:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS <= RSP
*
* Return stack after saving previous mark:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>] <= RSP
*
* Parameter stack after initialization, mark:
* [<unknown>]PSTKBAS <= PSP,FP==<EMPTYP>
*
START	JSR	INISTKS
	LDX	PSP
	PSHX	; empty previous mark
	STX	FP	; empty new mark
*
	JSR	MAIN
*
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
SSTKNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP
	NOP		; another landing pad to set breakpoint at
	NOP
	LDX	$FFFE
	JMP	0,X	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

Again, I have tested this code. It builds the stack frames and tears them down as advertised and gets the right result in the right place. Again, I will not guarantee that this code can be generalized, whether by hand or by automaton (compiler, etc.).

[JMR202411182142 edit:] 

Just realized, while working on this for the 6800, that the name for SUB16SI did not agree with what it is doing. So I'm fixing it, calling it ADD16SI, instead.

[JMR202411182142 edit end.] 

Again, I remind you that we have seen what this kind of code looks like without stack frames. I may come back and strip the stack frames from this specific example, just to be obnoxious, but first we need to look at both kinds of stack frames on the 6800, starting with the single-stack version.

If you don't want to wait, move on to getting numeric output in binary.


(Title Page/Index)

Wednesday, November 6, 2024

ALPP 02-25 -- Ascending the Wrong Island -- Single-stack Stack Frame Example: 6801

And this one has been sitting at the bottom of the pool for a while, as predicted, even longer than the 6809 example.

  Ascending the Wrong Island --
Single-stack Stack Frame Example:
6801

(Title Page/Index)

This is a concrete example to demonstrate some approaches to the problems in single-stack stack frames on the 6801. I've taken the concrete example for the 68000 and transliterated it to the 6809, and we've taken a bit of a detour through addressing math that might be helpful, finishing with examples of some of the fancy modes on the 68000.

Here I'll work on a more concrete and, hopefully, more understandable translation of stack frames to the 6801. It can't be a transliteration, because we can't reference local variables at a negative offset from the frame pointer. But the linked list of frame pointers has to remain such that it can be walked backwards to get caller's context, and such that, when the called routine ends, it can restore the caller's context.

The advantage of this concrete example is that we won't have to push the 6801 past the workaround limits of the CPU. None of the routines require more stack pointer math than a few pushes and pops. And we'll rig something up to keep all offsets positive.

And, again, I want to emphasize that I do not recommend the single stack discipline that most of the current "modern" software engineering infrastructure is built on. This is just here for comparison. To that end, I will provide examples of both the single-stack stack frame and the split-stack stack frame for the 6801 here, the single stack version first.

Probably the biggest impediment to doing stack frames on the 6800 and 6801 is the lack of support for fast general address arithmetic. We can do the arithmetic, but it's slow enough to cause the programmer serious angst about using variables that require address math just to access. 

There are possible faster work-arounds for some parts of it (such as if you arrange to keep the stack entirely within a 256-byte page), but they have specific ranges of applicability that take time to understand, and may not allow general use.

And address math requires temporary variables, preferably in the direct page, which themselves require consideration and support at interrupt time. And, because the 6801 only has positive constant offsets, we must, if possible, arrange to only need constant positive offsets.
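To make that cost concrete, here is a minimal sketch (not part of the tested example) of reaching a cell below the frame pointer on the 6801. FP and XWORK are the direct-page pseudo-registers declared in the source listing in this chapter; the offset of 4 is only an assumption for illustration.

```asm
* Sketch only: fetch the 16-bit cell at FP-4 (hypothetical offset)
	LDD	FP	; frame pointer pseudo-register
	SUBD	#4	; the subtraction indexed mode cannot do for us
	STD	XWORK	; no D-to-X transfer, so pass through memory
	LDX	XWORK
	LDD	0,X	; the access we actually wanted
```

That is five instructions for one load, and XWORK is tied up for the duration. With a positive offset, LDX FP followed by LDD 4,X would have done the job.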

Even if we had the SBX instruction, we would prefer not to have to load B and use it, if we could.

Now the reason we use negative offsets in the 6809 and 68000 code is that they can be kept constant, and the compiler (or programmer) doesn't need to specifically remember how many temporaries and parameters have been pushed/allocated while it/she/he is generating code -- only while setting up the frame. 

And if the frame is built by pushing initial values for variables, we barely have to remember it then. (Which may be why Motorola figured SBX was not necessary.)

Here's a cross-section of what we think we want the stack frame list to look like:

* Cross-section of general frame structure in called routine:
* [{LOCVAR}] for calling routine
* [{TEMP}  ] for calling routine
* [PARAM   ] from calling routine
* [RETADR  ] to calling routine
* [FRMLNK  ] at entry to calling routine
* [LOCVAR  ] for called -- current -- routine
* [TEMP    ] for called -- current -- routine
* [(PARAM) ] to be passed to a further call

Let's take a bit broader section and show the connections that we used for the 6809 and 68000:

* Broader cross-section, showing chaining for routine 3, in-flight:
* [RETADR1 ] 
* [FRMLNK2 ] <= FRMLNK3
* [LOCVAR2 ] 
* [TEMP2   ]
* [PARAM3  ]
* [RETADR2 ]
* [FRMLNK3 ] <= FP (frame pointer)
* [LOCVAR3 ]
* [TEMP3   ]
* [(PARAM4)] <= SP (return stack pointer)

With the 6809 and 68000, we could index downward from the pointer to the frame link -- from the frame pointer. So that was the way I constructed those.

With the 6800/1, we can't use negative constant offsets in indexed mode, only positive. 

If the compiler (or programmer) keeps track of how many bytes have been pushed and popped since entering the routine, it's actually no problem to add that many bytes to the offsets needed to reach the local variables, and to skip over the frame link and return address to the parameters. 
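A minimal sketch of what that tracking means, assuming a hypothetical routine called with one 16-bit parameter pushed just before the JSR. The same parameter moves from offset 2 to offset 4 as soon as the routine pushes a 16-bit temporary:

```asm
* Sketch only: hypothetical routine, one 16-bit parameter above the return address
	TSX		; X = S+1, points at the return address
	LDD	2,X	; parameter at offset 2 (skip 2 bytes of return address)
	PSHB		; push D as a 16-bit temporary, low byte first
	PSHA
	TSX		; X now points at the temporary
	LDD	4,X	; same parameter, now at offset 2+2
```

Every push or pop between accesses changes the constant, and whoever generates the code has to carry that running count along.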

But it does make the compiler more complex, and it adds a step or two for the programmers, which provides more opportunity for mistakes and bugs of the sort that like to hide themselves until they can really bite hard.  

It would be nice to have a second pseudo-register pointing to the local variables (Call it the variable base pointer, VBP?), but how could we maintain that? Specifically, how could the calling routine restore its variable base pointer after the called routine completes and returns? All we have to help us so far is knowing at compile time how many bytes of local variables, temporaries, and parameters we have to adjust the previous frame pointer by. That is not available at run-time unless we save it somewhere.

We could stack the offset from SP to VBP and do the math at runtime to reproduce the VBP, but that's run-time math. 
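As a rough sketch of that run-time math, assume (hypothetically) that the link routine had stacked a one-byte distance from the saved frame link to the caller's variable base, in the byte pair just above the link. The unlink code would then have to rebuild VBP something like this:

```asm
* Sketch only: hypothetical frame with a byte offset stored at 2,X above the link
	LDX	FP	; X points at the saved frame link
	LDAB	2,X	; distance from the link cell to the caller's variables
	ABX		; X = X + B (unsigned), 6801 only
	STX	VBP	; caller's variable base, recomputed
```

Under those assumptions, frames are limited to 255 bytes, and every return still pays for a load and an add. The 6800 has no ABX, so it would cost more there.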

A better alternative would be to just push the VBP itself when we push the frame pointer. 

But either way ends up further fattening a stack already well-fattened by the stack frame overhead.

What we want is some way to combine the function of the frame pointer with the function of the variable base pointer without having to calculate offsets at run time. Again, that's why we liked the negative constant offsets on the 6809. We could let the CPU handle the calculations for us, and hide the address calculation time in the overall access overhead.

(You can see those calculation times in the 68000 and 68020. In the 68030 and beyond, they've added a lot of circuitry to do as much as possible of those calculations in parallel with whatever else the CPU is doing, which makes those processors significantly faster than the 68020 at the same CPU clock rates, even.)

Late last night, I was thinking that staging the linkage, moving VBP through FP on its way to the stack, would do the trick. But that's actually what we are doing with SP.

I'm not seeing any other alternatives. Either 

  • make the compiler or programmer track the number of bytes between the current SP and the first byte of the local variables in the stack; 
  • or push the base address of the local variables along with the frame pointer.

If you're going to saddle the programmer with the burden of maintaining the changing offsets anyway, what's the purpose in the discipline of maintaining a frame? It's precisely the burden of remembering what's on the stack that stack frames are supposed to "solve".

So, stack the pointer to the base address of the routine's local (dynamic) variables, too.

The single-stack example below relies on stacking the local variable base pointer along with the frame pointer. And you have to do it every time, or you have to remember that you didn't -- which is essentially stacking a flag, so why not just stack the VBP?

Or drag the entire compile-time analysis of the code with you to make it possible to run the compiled code? (Kind of like having to bury a link table in your object code just to run it.) Should every routine access its entry in the table of sizes of variable allocations when it terminates or something? Somehow, I suspect that's actually part of the code-bloat in modern code support libraries. I am not going to touch that here.

So note, in the comments in the source code, the frame structure that we are actually going to use. 

There are also a few places where I adjusted the code for the tools, such as my asm68c assembler not (presently) being able to declare blocks larger than 256 bytes with RMB.

[JMR202411141016 edit:]

Working through coding this for the 6800, I recognized I had left unnecessary code in a few places, and while I was checking that I hadn't screwed anything up removing it, discovered I had not quite completed the example. The corrections are just meaningful enough that I'm leaving the old code down below the end of the chapter.

[JMR202411141016 end edit.]

Be careful to read the comments in the code along with the code. Again, I'm giving the details of the discussion there. And watch out for when I forget to update the comments to match the code! (Read the code.)

* 16-bit addition as example of single-stack stack frame discipline on 6801,
* with test code
* Joel Matthew Rees, October 2024
*
	OPT	6801
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for usr stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
FP	RMB	2	; frame pointer
VBP	RMB	2	; variable base pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
STKLIM	RMB	192	; roughly 16 to 20 levels of call
STKLIMX	EQU	FINALX+4
STKBAS	RMB	8	; for canary return
STKBASX	EQU	STKLIMX+192
STKFAK	RMB	2	; fake frame pointer, self-link
STKFAKX	EQU	STKBASX+8	; 6801 is post-dec (post-store-decrement) push
STKBMP	RMB	4	; a little bumper space
STKBMPX	EQU	STKFAKX+2	; But we are going to init S through X
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	STKBMPX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
INISTK	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE
	LDD	LB_BASE	
	ADDD	#HBASEX		; calculate EA
	STD	HPPTR		; as if we actually had a heap
	STD	HPALL
	LDD	#CDBASE
	SUBD	#4		; extra bumper
	STD	HPLIM
	LDD	LB_BASE
	ADDD	#STKBASX+2
	STD	FP	; initialize
	STD	VBP	; initialize
	LDX	FP
	STX	0,X	; self link
	ADDD	#6
	STD	6,X	; last self link
	STD	2,X	; error VARBP
	LDX	#STKUNDR	; error handler
	STX	XWORK
	LDD	XWORK
	LDX	FP
	DEX
	DEX
	STD	0,X	; last fake return to error handler
	STD	6,X	; first fake return to error handler
	PULA		; get the return address
	PULB
	STS	SSAVE	; Save what the monitor gave us.
	TXS		; move to our own stack
	PSHB
	PSHA
	RTS
*
***
* Since negative index offsets are so expensive,
* we want to create a stack frame with only positive offsets.
* And we want the frame pointer to be pushed after the call,
* on entry to the local context.
* And the saved frame pointer needs to link to the previous one.
* And when we restore the previous frame, 
* we need to be able to restore the previous frame base.
*
* Cross-section of general frame structure in called routine:
* [{LOCVAR}] for calling routine
* [{TEMP}  ] for calling routine
* [PARAM   ] from calling routine
* [RETADR  ] to calling routine
* [VARBP   ] base of local variables in calling routine
* [FRMLNK  ] at entry to calling routine
* [LOCVAR  ] for called -- current -- routine
* [TEMP    ] for called -- current -- routine
* [(PARAM) ] to be passed to a further call
*
* Broader cross-section, showing chaining for routine 3, in-flight:
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK2 ] <= FRMLNK3
* [LOCVAR2 ] <= VARBP2
* [TEMP2   ]
* [PARAM3  ]
* [RETADR2 ]
* [VARBP2  ]
* [FRMLNK3 ] <= FP (frame pointer)
* [LOCVAR3 ] <= VBP (variable base pointer)
* [TEMP3   ]
* [(PARAM4)] <= SP (return stack pointer (6800 S is byte below))
***
*
***
* Utility routines
*
* Let the caller do allocation after.
LINKF	PULA		; get return address
	PULB
	LDX	VBP	; push frame base
	PSHX
	LDX	FP	; and link the frame in
	PSHX
	TSX		; set up new frame pointers
	STX	FP	; because we want to use the pointer at will
	STX	VBP	; link and allocate 0 complete
	PSHB		; put return address back
	PSHA
	RTS
*
* No return value
UNLKF	PULA		; get return address
	PULB
	LDX	FP	; deallocate
	TXS		; and unlink
	PULX		; restore previous
	STX	FP
	PULX
	STX	VBP
	PSHB		; restore return address
	PSHA
	RTS
*
*
* Stack after LINK and allocation
* when functions are called by MAIN
* with two parameters
* We will return result in D:X
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ]
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ]
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FRMLNK0
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK0 ] <= FP,SP,VBP
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit D:X D high, X low
* Does not alter the parameters.
ADD16S	JSR	LINKF
	TSX		; no local allocations
*
	LDAA	#(-1)	; prepare for sign extension
	TST	8,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLRA		; zero extend
ADD16SR	PSHA		; push left extension
	PSHA		; left sign cell below X now
	LDAA	#(-1)	; reload
	TST	6,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLRA		; zero extend
ADD16SL	PSHA		; push right extension
	PSHA
	TSX		; point to sign extensions
	LDD	12,X	; left-hand low cell
	ADDD	10,X	; right-hand low cell
	STD	XWORK	; save low half of result
	LDD	2,X	; left-hand extension
	ADCB	1,X	; right-hand extension
	ADCA	0,X
	STD	DWORK	; Save high half of result
*
	JSR	UNLKF	; drops temporaries
	LDX	XWORK	; get low half of result
	LDD	DWORK	; get high half of result
	RTS		; result is in D:X
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit D:X D high
ADD16U	JSR	LINKF
	TSX		; no local allocations
*
	LDD	8,X	; left
	ADDD	6,X	; right
	STD	XWORK	; save low half
	LDD	#0
	ADCB	#0
	STD	DWORK	; save carry bit in high half
*
	JSR	UNLKF	; drops temporaries
	LDX	XWORK	; get low half of result
	LDD	DWORK	; get high half of result
	RTS		; result is in D:X
*
* Etc.
*
***
*
* Stack after LINK #0 when functions are called by MAIN
* with one input parameter
* (#0 means no local variables)
* [<SELF>  ] <= <SELF>
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ]
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ]
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FRMLNK0
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK0 ] <= FP,SP,VBP
*
* To show how to walk the stack --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit addend
* target parameter in caller
*   2nd 32-bit variable at offset -2*NATWID
* no output parameter:
ADD16SI	JSR	LINKF
	TSX		; no local allocations
*
	LDAA	#(-1)
	TST	6,X	; high byte of parameter
	BMI	ADD16SIP
	CLRA
ADD16SIP	PSHA	; save the sign extension half
	PSHA
	LDX	2,X	; get caller's VBP
	LDD	2,X	; caller's 2nd variable, low
	LDX	FP
	ADDD	6,X	; parameter
	LDX	2,X	; caller's VBP
	STD	2,X	; save result low half away
	LDD	0,X	; caller's 2nd variable, high
	TSX
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	FP
	LDX	2,X
	STD	0,X	; save result high half away
*
	JSR	UNLKF	; drops temporaries 
	RTS		; no result to load
*
*
***
* Stack after LINK
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ] 
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FP
* [32:VAR1_1]
* [32:VAR1_2] <= SP,VBP
*
MAIN	JSR	LINKF
	LDX	#0
	PSHX		; four pushes is only one byte more than a call. 
	PSHX
	PSHX
	PSHX
	TSX
	STX	VBP	; link and allocate complete
*
	LDX	#$1234	; parameters
	PSHX
	LDX	#$CDEF
	PSHX
	JSR	ADD16U	; result in D:X should be $0000E023
	INS	; could reuse instead of dropping
	INS
	INS
	INS
	PSHX
	LDX	#$8765
	PSHX
	JSR	ADD16S	; result in D:X should be $FFFF6788
	INS	; could reuse instead of dropping
	INS
	INS
	INS
	STX	XWORK
	LDX	VBP
	STD	0,X
	LDD	XWORK
	STD	2,X
	LDX	#$A5A5
	PSHX
	JSR	ADD16SI		; result in 2nd variable should be FFFF0D2D
	LDX	VBP		; get the result from our variable
	LDD	2,X		; low half
	LDX	LB_BASE		; store it in FINAL, in process local space
	STD	FINALX+2,X
	LDX	VBP
	LDD	0,X		; high half
	LDX	LB_BASE
	STD	FINALX,X
*
	JSR	UNLKF
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
* (who knows?) <= VBP
***
*
* Stack after initialization:
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,FP,VBP
* [STKUNDR ]STKBAS <= SP
***
* Stack after LINK (at call to MAIN)
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ] 
* [FRMLNKY=STKBAS+NATWID ] <= SP,FP,VBP
*
START	NOP
	JSR	INISTK
	NOP
*
	JSR	LINKF
*
	JSR	MAIN
*
	JSR	UNLKF
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	LDX	$FFFE	; alternatively, jmp through reset vector
	JMP	0,X
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

I think I'm going to continue using the fake return technique to keep things better under control.

I have tested this code. It does run; it builds the stack frames and tears them down as advertised. And, as always, I will not guarantee that this code can be generalized. Nor will I guarantee that it can be generated by any real compiler.

[JMR202411182142 edit:] 

Just realized, while working on this for the 6800, that the name for SUB16SI did not agree with what it is doing. So I'm fixing it, calling it ADD16SI, instead.

[JMR202411182142 edit end.]

I am going to post the split-stack stack frame version for comparison, but this has gotten so long that it really needs to be in a separate post. Also, I'm pretty sure you'll want to compare this with that, side-by-side, in separate browser windows. The differences become that obvious.

As a reminder, we've already seen what this kind of code looks like without stack frames.

Once I get the split-stack version of this code up (It's up now.), I'll convert it to the 6800. If you're getting worn out, go ahead and move on to getting numeric output in binary.


(Title Page/Index)

 

[JMR202411141016 old code version:]

* 16-bit addition as example of single-stack stack frame discipline on 6801,
* with test code
* Joel Matthew Rees, October 2024
*
	OPT	6801
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for usr stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
FP	RMB	2	; frame pointer
VBP	RMB	2	; variable base pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
STKLIM	RMB	192	; roughly 16 to 20 levels of call
STKLIMX	EQU	FINALX+4
STKBAS	RMB	8	; for canary return
STKBASX	EQU	STKLIMX+192
STKFAK	RMB	2	; fake frame pointer, self-link
STKFAKX	EQU	STKBASX+8	; 6801 is post-dec (post-store-decrement) push
STKBMP	RMB	4	; a little bumper space
STKBMPX	EQU	STKFAKX+2	; But we are going to init S through X
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	STKBMPX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
INISTK	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE
	LDD	LB_BASE	
	ADDD	#HBASEX		; calculate EA
	STD	HPPTR		; as if we actually had a heap
	STD	HPALL
	LDD	#CDBASE
	SUBD	#4		; extra bumper
	STD	HPLIM
	LDD	LB_BASE
	ADDD	#STKBASX+2
	STD	FP	; initialize
	STD	VBP	; initialize
	LDX	FP
	STX	0,X	; self link
	ADDD	#6
	STD	6,X	; last self link
	STD	2,X	; error VARBP
	LDX	#STKUNDR	; error handler
	STX	XWORK
	LDD	XWORK
	LDX	FP
	DEX
	DEX
	STD	0,X	; last fake return to error handler
	STD	6,X	; first fake return to error handler
	PULA		; get the return address
	PULB
	STS	SSAVE	; Save what the monitor gave us.
	TXS		; move to our own stack
	PSHB
	PSHA
	RTS
*
***
* Since negative index offsets are so expensive,
* we want to create a stack frame with only positive offsets.
* And we want the frame pointer to be pushed after the call,
* on entry to the local context.
* And the saved frame pointer needs to link to the previous one.
* And when we restore the previous frame, 
* we need to be able to restore the previous frame base.
*
* Cross-section of general frame structure in called routine:
* [{LOCVAR}] for calling routine
* [{TEMP}  ] for calling routine
* [PARAM   ] from calling routine
* [RETADR  ] to calling routine
* [VARBP   ] base of local variables in calling routine
* [FRMLNK  ] at entry to calling routine
* [LOCVAR  ] for called -- current -- routine
* [TEMP    ] for called -- current -- routine
* [(PARAM) ] to be passed to a further call
*
* Broader cross-section, showing chaining for routine 3, in-flight:
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK2 ] <= FRMLNK3
* [LOCVAR2 ] <= VARBP2
* [TEMP2   ]
* [PARAM3  ]
* [RETADR2 ]
* [VARBP2  ]
* [FRMLNK3 ] <= FP (frame pointer)
* [LOCVAR3 ] <= VBP (variable base pointer)
* [TEMP3   ]
* [(PARAM4)] <= SP (return stack pointer (6800 S is byte below))
***
*
***
* Utility routines
*
* Let the caller do allocation after.
LINKF	PULA		; get return address
	PULB
	LDX	VBP	; push frame base
	PSHX
	LDX	FP	; and link the frame in
	PSHX
	TSX		; set up new frame pointers
	STX	FP	; because we want to use the pointer at will
	STX	VBP	; link and allocate 0 complete
	PSHB		; put return address back
	PSHA
	RTS
*
* No return value
UNLKF	PULA		; get return address
	PULB
	LDX	FP	; deallocate
	TXS		; and unlink
	PULX		; restore previous
	STX	FP
	PULX
	STX	VBP
	PSHB		; restore return address
	PSHA
	RTS
*
*
* Stack after LINK and allocation
* when functions are called by MAIN
* with two parameters
* We will return result in D:X
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ]
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ]
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FRMLNK0
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK0 ] <= FP,SP,VBP
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit D:X D high, X low
* Does not alter the parameters.
ADD16S	LDX	VBP
	JSR	LINKF
	TSX		; no local allocations
*
	LDAA	#(-1)	; prepare for sign extension
	TST	8,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLRA		; zero extend
ADD16SR	PSHA		; push left extension
	PSHA		; left sign cell below X now
	LDAA	#(-1)	; reload
	TST	6,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLRA		; zero extend
ADD16SL	PSHA		; push right extension
	PSHA
	TSX		; point to sign extensions
	LDD	12,X	; left-hand low cell
	ADDD	10,X	; right-hand low cell
	STD	XWORK	; save low half of result
	LDD	2,X	; left-hand extension
	ADCB	1,X	; right-hand extension
	ADCA	0,X
	STD	DWORK	; Save high half of result
*
	JSR	UNLKF	; drops temporaries
	LDX	XWORK	; get low half of result
	LDD	DWORK	; get high half of result
	RTS		; result is in D:X
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit D:X D high
ADD16U	LDX	VBP
	JSR	LINKF
	TSX		; no local allocations
*
	LDD	8,X	; left
	ADDD	6,X	; right
	STD	XWORK	; save low half
	LDD	#0
	ADCB	#0
	STD	DWORK	; save carry bit in high half
*
	JSR	UNLKF	; drops temporaries
	LDX	XWORK	; get low half of result
	LDD	DWORK	; get high half of result
	RTS		; result is in D:X
*
* Etc.
*
***
*
* Stack after LINK #0 when functions are called by MAIN
* with one input parameter
* (#0 means no local variables)
* [<SELF>  ] <= <SELF>
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ]
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ]
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FRMLNK0
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [RETADR1 ] 
* [VARBP1  ]
* [FRMLNK0 ] <= FP,SP,VBP
*
* To show how to walk the stack --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit addend
* target parameter in caller
*   2nd 32-bit variable at offset -2*NATWID
* no output parameter:
SUB16SI	LDX	VBP
	JSR	LINKF
	TSX		; no local allocations
*
	LDAA	#(-1)
	TST	6,X	; high byte of parameter
	BMI	SUB16SIP
	CLRA
SUB16SIP	PSHA	; save the sign extension half
	PSHA
	LDX	2,X	; get caller's VBP
	LDD	2,X	; caller's 2nd variable, low
	LDX	FP
	ADDD	6,X	; parameter
	LDX	2,X	; caller's VBP
	STD	2,X	; save result low half away
	LDD	0,X	; caller's 2nd variable, high
	TSX
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	FP
	LDX	2,X
	STD	0,X	; save result high half away
*
	JSR	UNLKF	; drops temporaries 
	RTS		; no result to load
*
*
***
* Stack after LINK
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ] 
* [FRMLNKY=STKBAS+NATWID ] <= FRMLNKX,VARBP0
* [RETADR0 ] 
* [VARBP0  ]
* [FRMLNKX ] <= FP
* [32:VAR1_1]
* [32:VAR1_2] <= SP,VBP
*
MAIN	LDX	VBP
	JSR	LINKF
	LDX	#0
	PSHX		; four pushes is only one byte more than a call. 
	PSHX
	PSHX
	PSHX
	TSX
	STX	VBP	; link and allocate complete
*
	LDX	#$1234	; parameters
	PSHX
	LDX	#$CDEF
	PSHX
	JSR	ADD16U	; result in D:X should be $E023
	INS	; could reuse instead of dropping
	INS
	INS
	INS
	PSHX
	LDX	#$8765
	PSHX
	JSR	ADD16S	; result in D:X should be $FFFF6788
	INS	; could reuse instead of dropping
	INS
	INS
	INS
	STX	XWORK
	LDX	VBP
	STD	0,X
	LDD	XWORK
	STD	2,X
	LDX	#$A5A5
	PSHX
	JSR	SUB16SI		; result in 2nd variable should be FFFF0D2D
	LDX	VBP		; get the result from our variable
	LDD	2,X		; low half
	LDX	LB_BASE		; store it in FINAL, in process local space
	STD	FINALX+2,X
	LDX	VBP
	LDD	0,X		; high half
	LDX	LB_BASE
	STD	FINALX,X
*
	JSR	UNLKF
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
* (who knows?) <= VBP
***
*
* Stack after initialization:
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,FP,VBP
* [STKUNDR ]STKBAS <= SP
***
* Stack after LINK (at call to MAIN)
* [<SELF>  ] <= <SELF>,VARBPY
* [STKUNDR ]
* [VARBPY  ] 
* [<SELF>  ] <= <SELF>,VARBPX,FRMLNKY
* [STKUNDR ]STKBAS
* [VARBPX  ] 
* [FRMLNKY=STKBAS+NATWID ] <= SP,FP,VBP
*
START	NOP
	JSR	INISTK
	NOP
*
	LDX	VBP	; mark
	PSHX
	LDX	FP
	PSHX
	TSX	; link
	STX	FP
	STX	VBP
*
	JSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	LDX	$FFFE	; alternatively, jmp through reset vector
	JMP	0,X
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

[JMR202411141016 end old code version.]
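As a cross-check on the expected values in the comments in MAIN ($E023, $FFFF6788, and FFFF0D2D), here is a rough C model of the arithmetic that ADD16U, ADD16S, and SUB16SI perform. It is only a sketch for running on a host machine, not part of the 6801 code. The function names (sign_extend16, add16u, add16s, add16s_into32) are made up for this illustration, and the stack frame handling is left out entirely.

```c
#include <stdint.h>

/* Widen a 16-bit value to 32 bits by copying its sign bit upward,
   the same job the LDAA #(-1) / CLRA / PSHA PSHA sequences do. */
static uint32_t sign_extend16(uint16_t v)
{
    return (v & 0x8000u) ? (0xFFFF0000u | v) : v;
}

/* Model of ADD16U: unsigned 16-bit add, carry lands in the high half. */
uint32_t add16u(uint16_t left, uint16_t right)
{
    return (uint32_t)left + (uint32_t)right;
}

/* Model of ADD16S: sign extend both operands, then add all 32 bits. */
uint32_t add16s(uint16_t left, uint16_t right)
{
    return sign_extend16(left) + sign_extend16(right);
}

/* Model of SUB16SI: add a sign-extended 16-bit parameter
   into a 32-bit variable owned by the caller. */
uint32_t add16s_into32(uint32_t var, uint16_t addend)
{
    return var + sign_extend16(addend);
}
```

Feeding in the same constants MAIN uses, add16u(0x1234, 0xCDEF), then add16s(0xE023, 0x8765), then add16s_into32(0xFFFF6788, 0xA5A5), should reproduce the three values in the comments.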

Sunday, November 3, 2024

ALPP 02-24 -- Some Address Math for the 68000

  Some Address Math
for the
68000

(Title Page/Index)

After a break for multi-byte negation, and because address math is so important, I think I should show you explicit 68000 counterparts to what I've shown you for the 6809, as well as to the routines for the 6801 and for the 6800.

When instructions become more general, they often take more bytes to encode. This is especially clear for the 68000. And when you generalize an operation, it often takes more instructions to implement -- even with a more powerful instruction set CPU. And the more you repeat those multiple instructions, the more opportunity you have to make mistakes. 

More than speed and byte count, this is why we define utility routines like we just looked at for the 6800 and 6801. We don't want to give ourselves too many opportunities for mistakes. (Macros can help with this, but we won't talk about that just yet.)

Between the 6809 and the 68000, it can be kind of a wash -- when you're working on 16-bit numbers and small applications that fit in a 64K memory space. When you start working with 32-bit numbers, it's advantage 68000, ... except then you also tend to work with 32-bit addresses, and the addresses can make byte count swell. 

I transliterated the fig implementation of Forth from 6800 to 68000, and the object image size increased by about 80% (real rough estimate). This is because I didn't want to restrict it to operating in the lower 32K of memory, minus the interrupt vector table, so the virtual machine i-codes (function addresses, really) swelled from 16-bit to 32-bit. And since the Forth is mostly a clot of i-codes, the overall image size swells. 

I started a conversion to direct call, which I got lost in (partly motivating this tutorial), and the code size does seem to improve a bit, but not completely to the size of the 6809 image.

Do look at assembly listings when you try to compare code sizes for stuff. In particular, the 68000 will often seem to take about twice the code bytes that the 6809 takes in these snippets. But when we move to concrete code where pieces come together, the code size comes down closer to the 6809 code size.

And I'll note again, being able to use single instructions instead of utility routines is nice, but it's actually more important that the 68000 has something of an optimal number of registers, so we don't have to worry about pseudo-registers in memory when switching processes.

As always, read the code and the comments in the code, and open up separate browser windows and compare side-by-side.

I'm showing the entire 68000 code in a single block because the abstract operations don't quite map the same, but I'm keeping the order roughly the same to keep it easy to find what to compare. 

[JMR202411070913 addendum:]

You may have missed the mention of the "here pointer" in the 6809 address math chapter:

LOCBAS	EQU	*

In Motorola assemblers, an asterisk where the assembler could parse an address means the location of the current instruction or directive, thus, "here". I'll need to explain more about it later, I'm sure.

[JMR202411070913 addendum end.]

How registers are mapping when moving from 6809 to 68000 --

  • I'm mapping the 6809's S to A7, of course;
  • U to A6;
  • DP will map to A5;
  • X mostly to A0;
  • Y to whatever.
  • B is sort-of mapped to D7;
  • A is sort-of mapped to D6 or the top bytes of D7 or D5 or something, depending on what I need it to do.

(And please don't just copy-and-paste code without thinking.)

* 68000 pointer math

	ORG	SOMETHING
* All of these work fine in-line, rather than called as subroutines.
* In fact, unless specified otherwise, you should in-line.
* You can substitute any data register unless specified otherwise.
*
* Likewise, you can substitute any address register,
* except that A7 should always be in-lined --
* -- except for those routines which specifically handle the return address, 
* but those routines are not really intended to be used anyway.
* Calling a subroutine and playing with the return stack 
* without handling the return address
* just is not a good way to keep control of your program.
*
* And then there is alignment. 68000 needs 16- and 32-bit accesses 
* to be 16-bit aligned, and will throw address errors if they are not.
* (Later CPUs are not so restricted.)
*
* Negate Dn in 8, 16, or 32 bits:
NEGLD7	NEG.L	D7	; .L => 32 bits, .W => 16 bits, .B => 8 bits
	RTS
* On the 6800/6801/6809, you can negate (2's complement) a byte 
* using a 1-byte instruction.
* On the 68000, it takes a 2-byte instruction.
* It takes 5 bytes of instruction to negate 16 bits on 6800/1/9,
* and 13 bytes to negate 32 bits.
* But on the 68000, it takes just two,
* the above 16-bit op-code with a couple of bits changed.
* This is a common pattern with 68000 instructions.
*
* And, for all the time I spend explaining NEG, 
* since the 68000 can subtract registers in either order, 
* we really don't need NEG here.

* Unsigned byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADDBX	AND.W	#$FF,D7	; zero extend it
	ADD.W	D7,A0	; 16-bit source sign extended to 32 bits
	RTS
* Alternative
ADDBXalt
	AND.W	#$FF,D7
	LEA	(A0,D7.W),A0	; takes more bytes
	RTS
*
* Signed byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADSBX	EXT.W	D7
	ADD.W	D7,A0	; 16-bit source sign extended to 32 bit An
	RTS
* Alternative
ADSBXalt
	EXT.W	D7
	LEA	(A0,D7.W),A0	; takes more bytes
	RTS
*
* Unsigned byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SUBBX	AND.W	#$FF,D7	; zero extend it
	SUB.W	D7,A0	; 16-bit source sign extended to 32 bits
	RTS

* Signed byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SBSBX	EXT.W	D7
	SUB.W	D7,A0	; 16-bit source sign extended to 32 bit An
	RTS
*
* Unsigned 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADDWX	AND.L	#$FFFF,D7	; zero extend it
	ADD.L	D7,A0	
	RTS
* Alternative
ADDWXalt
	AND.L	#$FFFF,D7
	LEA	(A0,D7.L),A0	; takes more bytes
	RTS
*
* Signed 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADSWX	ADD.W	D7,A0	
	RTS
* Alternative
ADSWXalt
	LEA	(A0,D7.W),A0	; takes more bytes
	RTS
* Alternative
ADSWXalt2
	LEA	(A0,A1.W),A0	; takes more bytes
	RTS
*
* Unsigned 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SUBWX	AND.L	#$FFFF,D7	; zero extend it
	SUB.L	D7,A0	
	RTS

* Signed 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SBSWX	SUB.W	D7,A0	
	RTS
*
* 32-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADDLX	ADD.L	D7,A0	
	RTS
* Alternative
ADDLXalt
	LEA	(A0,D7.L),A0	; takes more bytes
	RTS
* Alternative
ADDLXalt2
	LEA	(A0,A1.L),A0	; takes more bytes
	RTS
*
* 32-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SUBLX	SUB.L	D7,A0	
	RTS
*


*************
* For the return stack
* As explained above, just in-line the LEA.
* These are provided as a solution to a puzzle,
* not as useful code.
*
* Signed byte offset
* Just in-line the EXT.W and ADD.W
ADSBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	EXT.W	D7	; sign extend it
	ADD.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.
*
* Unsigned byte offset
* Just in-line the AND.W and ADD.W
ADDBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.W	#$FF,D7	; zero extend it
	ADD.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.

* Signed 16-bit offset
* Just in-line the ADD.W
ADSWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	ADD.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.
*
* Unsigned 16-bit offset
* Just in-line the AND.L and ADD.L
ADDWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.L	#$FFFF,D7	; zero extend it
	ADD.L	D7,A7	; 32 bits (offset was zero extended above)
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.

* 32-bit offset
* Just in-line the ADD.L
ADDLS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	ADD.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
*
* Unsigned byte offset
* Just in-line the AND.W and SUB.W
SUBBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.W	#$FF,D7	; zero extend it
	SUB.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
*
* Unsigned 16-bit offset
* Just in-line the AND.L and SUB.L
SUBWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.L	#$FFFF,D7	; zero extend it
	SUB.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
*
* Signed byte offset
* Just in-line the EXT.W and SUB.W
SBSBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	EXT.W	D7	; sign extend it
	SUB.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
*
* Signed 16-bit offset
* Just in-line the SUB.W
SBSWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	SUB.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0

* 32-bit offset
* Just in-line the SUB.L
SUBLS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	SUB.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
*

* INX and DEX trains and INS and DES trains are meaningless.
* HOWEVER, just to remind ourselves:
* (And all of these work for any other address register, too, but IN-LINE them!!)
* (They work for A7 if in-lined, as well.)
ADD16X	LEA 	16(A0),A0
	RTS
ADD14X	LEA	14(A0),A0
	RTS
SUB16X	LEA	-16(A0),A0
	RTS
* Etc. In-line these.
INX	LEA	1(A0),A0	; Sigh. In-line it. Do not make trains with it. Please.
	RTS
DEX	LEA	-1(A0),A0	; See INX. In-line it. Do not make trains with it. PLEASE.
	RTS
* Note that we can also use ADDQ and SUBQ for offset less than 9
*
* More solutions to puzzles.
* If you called these, you would have to juggle the return address as shown.
* You don't want to do that.
* Just in-line the LEA instructions.
* Then there's no return address to juggle, no messing with A0.
* DO NOT USE THIS CODE other than for examples of silly walks.
ADD16S	MOVE.L	(A7)+,A0
	LEA	16(A7),A7
	JMP	(A0)
* etc.
* Could all be replaced with just LEA	16(A7),A7 in-line!
* That's actually cheaper than just the instruction JSR!!!


* Synthetic stacks restricted within page boundaries make no sense at all
* on the 68000. Except, I suppose they could, sort-of.
*
* In the first place,
* we should be able to use an extra address register to make a third stack.
* If we do, addressing has already been covered, above.
*
* But if we want a software stack maintained by pointers in memory,
* for some reason,
* Given a pseudo-register somewhere in process local variable space
* accessed via A5:
	ORG	SOMEWHERE
	...
QSP	DS.L	1	; a synthetic stack pointer Q
* QSP-LOCBAS has to be within +/-32K on 68000, 2-byte op-code, 2-byte offset, syntax: QSP-LOCBAS(A5)
* 68020 and above allows 32-bit range, 4-byte op-code, 4-byte offset, syntax: (QSP-LOCBAS,A5)
	...
	DS.L	2	; buffer zone
QSTKLIM	DS.L	32
QSTKBAS	DS.L	2	; buffer zone
	...

* 32-bit Dn for synthetic stack (could/should be in-line):
ADDQSP	ADD.L	D7,QSP-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 256-byte page boundary
* so that carries cannot be generated:
* unsigned byte D7
ADDQSPS	ADD.B	D7,QSP+3-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 65536-byte page boundary
* so that carries cannot be generated:
* unsigned 16-bit D7
ADDQSPW	ADD.W	D7,QSP+2-LOCBAS(A5)	; 4 bytes in op-code (+/-32K); low word is at even offset 2
	RTS
*
* 32-bit Dn for synthetic stack (could/should be in-line):
SUBQSP	SUB.L	D7,QSP-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 256-byte page boundary
* so that carries cannot be generated:
* unsigned byte
SUBQSPS	SUB.B	D7,QSP+3-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 65536-byte page boundary
* so that carries cannot be generated:
* unsigned 16-bit D7
SUBQSPW	SUB.W	D7,QSP+2-LOCBAS(A5)	; 4 bytes in op-code (+/-32K); low word is at even offset 2
	RTS

* 68000 has no memory indirection
QPSHD7L	MOVE.L	QSP-LOCBAS(A5),A4	; 4 bytes in op-code
	MOVE.L	D7,-(A4)		; 2 bytes in op-code
	MOVE.L	A4,QSP-LOCBAS(A5)	; 4 bytes in op-code
	RTS
*
* 68020+ have memory indirection
QPSHD7LI
	SUBQ.L	#4,QSP-LOCBAS(A5)	; 4 bytes in op-code (SUBQ.W would be faster for medium stack)
	MOVE.L	D7,([QSP-LOCBAS,A5])	; indirect through QSP itself; 6 bytes in op-code
	RTS
*
QPOPD7L	MOVE.L	QSP-LOCBAS(A5),A4	; 4 bytes in op-code
	MOVE.L	(A4)+,D7		; 2 bytes in op-code
	MOVE.L	A4,QSP-LOCBAS(A5)	; 4 bytes in op-code
	RTS
*
* 68020+ have memory indirection
QPOPD7LI
	MOVE.L	([QSP-LOCBAS,A5]),D7	; indirect through QSP itself; 6 bytes in op-code
	ADDQ.L	#4,QSP-LOCBAS(A5)	; 4 bytes in op-code (ADDQ.W would be faster for medium stack)
	RTS


* Register offsets from A7 were dealt with above.

* Lest I forget --
* On the 6800 or 6801, this would be referenced by a process-local
* LOCALBASE or similar pseudo-register, which I almost forgot to talk about.
* On the 6809, it could be done by pseudo-register or (with some glue) by DP.
* On the 68000, we are going to use a spare address register,
* and I am going to pick A5.
* All the address math has been shown above,
* the only issue is being explicit about the assembly language idiom.
* Lest I forget --
*
* Given 
	ORG	Whatever
LOCBAS	EQU	*
*	...
VAR	DS.B	m	; or .W or .L, etc.
*
* With A5 known to be set to LOCBAS,
	LEA	LOCBAS(PC),A5
* or
	MOVEA.L	#LOCBAS,A5
*
* In-line snippets --
* For variable VAR within 256 bytes of LOCBAS:
	...
	LEA	VAR-LOCBAS(A5),A0	; that's all! (4-byte op-code)
	...
*
* When VAR is 256 bytes or more away from LOCBAS, but less than 32768
* (or, even, below LOCBAS but within -32768), in other words, signed 16-bit offset:
	...
	LEA	VAR-LOCBAS(A5),A0	; same thing!
	...
*
* It's a little messier when the signed offset doesn't fit in 16 bits, 
* less than -32768 below, or 32768 or greater above --
	...
	MOVE.L	#VAR-LOCBAS,D7		; Any Dn. An will also work, if it's not in use. 6 bytes.
	LEA	(A5,D7.L),A0		; 4 bytes. total 10 bytes. 
	...
*
* From the 68020 on, 32-bit offsets are allowed, but the op-code is also 32-bits plus displacement:
	...
	LEA	(VAR-LOCBAS,A5),A0	; 8 byte total op-code
	...
* 
* Do I really need to show this as subroutines?
* signed 16-bit offset in D7:
LEALBWX	LEA	(A5,D7.W),A0	; PLEASE just do this in-line!
	RTS
*
* 32-bit offset in D7:
LEALBLX	LEA	(A5,D7.L),A0	; PLEASE just do this in-line!
	RTS
*			;-/
* 
* I assume you're not going to be wanting to keep LOCBAS
* in a pseudo-register called LB_BASE.
* But you might want to maintain a separate allocation area
* with a pointer in AL_BASE, like this:
LOCBAS	EQU	*
	...
AL_BASE	DS.L	1
	...
* for signed 16-bit offsets in D7: 
ADDLBW	MOVE.L	AL_BASE-LOCBAS(A5),A0
	ADD.W	D7,A0	; or LEA (A0,D7.W),A0
	RTS
* for unsigned 16-bit offsets:
ADDLBU	AND.L	#$0000FFFF,D7	; unsigned offset, then fall through to ADDLBL
* for 32-bit offsets
ADDLBL	MOVE.L	AL_BASE-LOCBAS(A5),A0
	ADD.L	D7,A0	; or LEA (A0,D7.L),A0
	RTS
*
* 68020 and above allow you to do weird things like this --
	...
	LEA	([AL_BASE-LOCBAS,A5],D7.L),A0
*	...					;  8-o
* ... quite literally letting you index directly off that pseudo-register
* out there in memory.
*
* As near as I can tell,
* memory indirect modes all require an address register,
* or the PC. 
* But that's not so bad, other than some of the modes being overkill.
*
* And, in spite of my mugging, maybe this has been a good way
* to expand your grasp of the power of the 68000 addressing modes.

* Sorry about the mugging. Sort-of. ;-/

As you can see, the 68000 just basically does almost all the address math you need without subroutines.

Including, to some extent, arrays, but let's not go there yet.
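Most of the routines in the listing differ only in how the offset gets widened before it is added to the address register. Here is a small C model of that widening, as a sketch you can run on a host machine. The function names are made up for this illustration. The one hard fact it leans on is that the 68000 sign extends a 16-bit source to 32 bits when the destination is an address register.

```c
#include <stdint.h>

/* Model of ADD.W Dn,An (or LEA (An,Dn.W),An):
   the low 16 bits of Dn are sign extended to 32 bits, then added. */
uint32_t add_signed_word(uint32_t an, uint16_t dn_low)
{
    uint32_t ext = (dn_low & 0x8000u) ? (0xFFFF0000u | dn_low) : dn_low;
    return an + ext;
}

/* Model of AND.L #$FFFF,Dn followed by ADD.L Dn,An:
   the offset is zero extended, so $8000 and up still move upward. */
uint32_t add_unsigned_word(uint32_t an, uint32_t dn)
{
    return an + (dn & 0x0000FFFFu);
}

/* Model of EXT.W Dn followed by ADD.W Dn,An: signed byte offset. */
uint32_t add_signed_byte(uint32_t an, uint8_t dn_low)
{
    uint16_t w = (dn_low & 0x80u) ? (uint16_t)(0xFF00u | dn_low) : dn_low;
    return add_signed_word(an, w);
}

/* Model of AND.W #$FF,Dn followed by ADD.W Dn,An: unsigned byte offset.
   After the mask the word is at most $00FF, so the sign extension
   done by ADD.W can never turn it negative. */
uint32_t add_unsigned_byte(uint32_t an, uint16_t dn_low)
{
    return add_signed_word(an, (uint16_t)(dn_low & 0x00FFu));
}
```

The interesting case is an offset of $8000: add_signed_word moves the address down by 32768, while add_unsigned_word moves it up by 32768. That is the whole reason the unsigned 16-bit routines need AND.L and ADD.L instead of a plain ADD.W.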

As with the previous three chapters, I have not tested the code. It should run, modulo typos.
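One detail in the synthetic stack routines (ADDQSPS, ADDQSPW, and their SUB partners) is worth modeling: the 68000 is big-endian, so the low byte of the 32-bit QSP is at offset 3, and the low word is at offset 2 (word accesses have to be at even addresses). The C sketch below imitates that layout with a byte array. The helper names are hypothetical, and it assumes nothing about the real QSP beyond its being a 32-bit pointer stored in memory.

```c
#include <stdint.h>

/* Lay a 32-bit value out in memory the way the 68000 does:
   most significant byte at the lowest address. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Model of a byte-wide add to the pointer: only the byte at
   offset 3 changes, and any carry out of it is lost. */
uint32_t add_to_low_byte(uint32_t qsp, uint8_t n)
{
    uint8_t mem[4];
    store_be32(mem, qsp);
    mem[3] = (uint8_t)(mem[3] + n);
    return load_be32(mem);
}

/* Model of a word-wide add to the pointer: the low word starts
   at even offset 2, and any carry out of it is lost. */
uint32_t add_to_low_word(uint32_t qsp, uint16_t n)
{
    uint8_t mem[4];
    store_be32(mem, qsp);
    uint16_t w = (uint16_t)((mem[2] << 8) | mem[3]);
    w = (uint16_t)(w + n);
    mem[2] = (uint8_t)(w >> 8);
    mem[3] = (uint8_t)w;
    return load_be32(mem);
}
```

The lost carry is why those short forms are only safe when the whole stack, buffer zones included, stays inside one 256-byte or 65536-byte page. For example, add_to_low_byte(0x000123F0, 0x20) wraps around inside the page instead of moving on to the next one.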

The 68000 can be hard to wrap your head around. I know. If the above doesn't make sense yet, it's okay. I'll point you back here from time to time when we are working with more concrete examples of using the above.

Look at how I've been avoiding things. I think it's time to build a concrete example of stack frames on the 6801.

Or you can jump ahead to getting numeric output in binary.


(Title Page/Index)


Saturday, November 2, 2024

ALPP 02-23 -- Synthesizing Multibyte NEGate on 6809 (Applies to 6800 and 6801)

  Synthesizing Multibyte NEG
on 6809
(Applies to 6800 and 6801)

(and 6805, with modifications, but we won't talk about that)

(Title Page/Index)

The fact that the NEGD routine effectively does not change from the 6800 to the 6809 had me looking at the 68000's NEGX and NOT instructions, scratching my head as to why there is no NOTX instruction and what the reasons are for the rules for generating the X bit in NEG, NEGX, and NOT. Somewhere in there I started losing confidence in the NEGation sequence I have been using in my 6800/6801 and 6809 work:

NEGAB	COMA	; 2's complement NEGate is bit COMplement + 1
	NEGB
	BNE	NEGABX	; or BCS. but BNE works -- extends 0
	INCA
NEGABX	RTS

The theory is pretty straightforward. Radix complement negation is to subtract the number from the radix (or radix raised to the power of the number of columns) and remember your borrow (carry). 

Bit complement is 1's complement, or reduced radix complement.

So 2's complement is 1's complement plus 1. 

And then you just carry any borrow from the first column as far as it carries. 

No, there's something wrong with that description, which is what had me going.

I guess it's more accurate to say the lack of carry carries until you get a non-zero column? As soon as you get a carry, everything to the left becomes 1's complement. Because the carry means borrow. And 1 - 1 is zero, but 0 - 1 is 1. Or something.

[JMR202411050852 clarification:]

The carry generated on the NEG instruction in Motorola CPUs is the borrow from the (virtual) subtraction from zero. This is the inverse of the carry from adding one.

When you subtract from zero, there's going to be a borrow for any non-zero operand.

When you add 1 to the 1's complement (bit inverse), the carry is only going to generate when the result of the add is zero -- which is exactly the same as when the argument (operand) is zero.

So when there is no borrow from the NEG is when there is carry from the add, and you can stop when there is no carry from the add, which, for me anyway, confirms the reasoning behind stopping when you no longer get 0 as a result.

[JMR202411050852 clarification end.]
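The borrow-versus-carry relationship can be checked over all 256 byte values with a few lines of C. This is only a host-side sketch with made-up function names. It models the C flag of NEG as the borrow out of 0 - v, and compares it with the carry out of adding 1 to the bit complement. It also models the NEGAB sequence, using the A:B pair as a 16-bit value.

```c
#include <stdint.h>

/* C flag after NEG on a byte: the borrow out of (0 - v). */
int neg_borrow(uint8_t v)
{
    return (int)(((0u - (unsigned)v) >> 8) & 1u);
}

/* Carry out of adding 1 to the 1's complement of the byte. */
int com_inc_carry(uint8_t v)
{
    unsigned sum = (unsigned)(uint8_t)~v + 1u;
    return sum > 0xFFu;
}

/* Returns 1 if, for every byte value, NEG's borrow is exactly
   the inverse of the carry from complement-and-add-1. */
int carries_are_inverse(void)
{
    for (unsigned v = 0; v <= 0xFFu; v++) {
        if (neg_borrow((uint8_t)v) == com_inc_carry((uint8_t)v))
            return 0;
    }
    return 1;
}

/* Model of NEGAB: COMA, NEGB, and INCA only when B came out zero. */
uint16_t neg_ab(uint16_t d)
{
    uint8_t a = (uint8_t)(d >> 8);
    uint8_t b = (uint8_t)d;
    a = (uint8_t)~a;            /* COMA */
    b = (uint8_t)(0u - b);      /* NEGB */
    if (b == 0)                 /* BNE NEGABX skips this when B is non-zero */
        a = (uint8_t)(a + 1);   /* INCA */
    return (uint16_t)((a << 8) | b);
}
```

If the reasoning above holds, carries_are_inverse() returns 1, and neg_ab agrees with plain 16-bit negation, including the edge cases 0 and $8000, which negate to themselves.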

Anyway, I was getting lost in something, so I came up with a little routine to test every possible result against straight-out subtracting from 0:

	org $2000
start	ldu #$2100
	ldd #$5a5a
	std ,--u
	ldd #0
	std ,--U
	std ,--u
testl	ldd 2,u
	std ,u
	ldd #0
	subd 2,u
	com ,u
	neg 1,u
	bne testni
	inc ,u
testni	cmpd ,u
	bne teste
	ldd #1
	addd 2,u
	std 2,u
	bne testl
	std 4,u
	nop
	nop
teste	ldd 4,u
	nop
	leau 6,u
	nop

Set a breakpoint at teste, step through the loop a couple of times, and let it rip, and if the value at the top of the U stack is cleared when you're done, every value worked correctly.

Now, I keep saying something about BCS, so I can test that, as well. Just change the

	bne	testni

to

	bcs	testni

and let'r rip again.

And it works.
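If you don't have a 6809 simulator handy, roughly the same exhaustive test can be run as C on a host machine. This is a loose port, not a replacement for running the real thing: the function name and the use_bcs switch are my own invention for the sketch, and the C flag is modeled as the borrow out of the byte subtraction, which is how NEG sets it.

```c
#include <stdint.h>

/* Try every 16-bit value. Returns -1 if the COM/NEG/INC sequence
   matches subtraction from zero everywhere, else the first value
   that fails. use_bcs selects the BCS variant of the branch. */
long first_neg16_mismatch(int use_bcs)
{
    for (uint32_t n = 0; n <= 0xFFFFu; n++) {
        uint16_t expected = (uint16_t)(0u - n);     /* ldd #0 ; subd 2,u */
        uint8_t hi = (uint8_t)(n >> 8);
        uint8_t lo = (uint8_t)n;

        hi = (uint8_t)~hi;                          /* com ,u */
        unsigned diff = 0u - (unsigned)lo;          /* neg 1,u */
        int carry = (int)((diff >> 8) & 1u);        /* C flag = borrow */
        lo = (uint8_t)diff;

        /* bne testni (or bcs testni) skips the increment */
        int skip_inc = use_bcs ? carry : (lo != 0);
        if (!skip_inc)
            hi = (uint8_t)(hi + 1);                 /* inc ,u */

        if ((uint16_t)((hi << 8) | lo) != expected) /* cmpd ,u */
            return (long)n;
    }
    return -1;
}
```

Both first_neg16_mismatch(0) and first_neg16_mismatch(1) should come back as -1, which matches what the simulator run shows: BNE and BCS are interchangeable right after the NEG.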

And som'eres in there, it hit me like the proverbial ton of bricks going down again: bit complement (1's complement in a binary field) does not carry. So the NOT instruction is its own extending form. And, yes, you prime the NEGX loop on the 68000 with a straight NEG.

... yeah, and I guess I'm not having a smart brain day today or something ...

Well, so, here's a 4-byte negate on 6809:

* negate the 32-bit number on top of stack:
NEG32	COM	,U
	COM	1,U
	COM	2,U
	NEG	3,U
	BNE	NEG32X
	INC	2,U
	BNE	NEG32X
	INC	1,U
	BNE	NEG32X
	INC	,U
NEG32X	RTS

It ought to work.  
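To back up "ought to" a little, here is a C model of the same byte-by-byte chain, assuming the 32-bit number sits on the stack in big-endian order as it would on the 6809. The function name neg32_bytes is just for this sketch, and it checks the logic of the chain, not the assembly itself.

```c
#include <stdint.h>

/* Model of NEG32: complement the three upper bytes, negate the lowest,
   then let the increment ripple upward only while bytes come out zero. */
uint32_t neg32_bytes(uint32_t v)
{
    uint8_t b[4];
    b[0] = (uint8_t)(v >> 24);          /* ,U  (most significant) */
    b[1] = (uint8_t)(v >> 16);          /* 1,U */
    b[2] = (uint8_t)(v >> 8);           /* 2,U */
    b[3] = (uint8_t)v;                  /* 3,U (least significant) */

    b[0] = (uint8_t)~b[0];              /* COM ,U */
    b[1] = (uint8_t)~b[1];              /* COM 1,U */
    b[2] = (uint8_t)~b[2];              /* COM 2,U */
    b[3] = (uint8_t)(0u - b[3]);        /* NEG 3,U */

    if (b[3] == 0) {                    /* BNE NEG32X */
        b[2] = (uint8_t)(b[2] + 1);     /* INC 2,U */
        if (b[2] == 0) {                /* BNE NEG32X */
            b[1] = (uint8_t)(b[1] + 1); /* INC 1,U */
            if (b[1] == 0)              /* BNE NEG32X */
                b[0] = (uint8_t)(b[0] + 1); /* INC ,U */
        }
    }

    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

Comparing neg32_bytes(v) against 0 - v for values with zero low bytes, such as $00010000 and $12345600, exercises each of the three ripple steps.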

[JMR202411160655 addendum:]

It's worth noting that the above discussion of negation is another example of how the 68000 is not just a simple upgrade to the 6809 -- but you have to consider how sign and zero extension works on both CPUs to see it.

[JMR202411160655 addendum end.]



Back to the regularly scheduled programming, as soon as I finish figuring out what instruction and addressing combinations on the 68000 are relevant to what I'm demonstrating.


(Title Page/Index)