Yet another couple of useful bits,
from the bottom of the pool.
Ascending the Right Island -- Frameless Examples (Single- & Split-stack): 6809
Now that we have worked through both the single-stack and split-stack frameless examples for the 6801, we can finally get back to the code that started this detour (6809 version) and strip out the code for maintaining the stack frames.
On higher-level architectures like the 6809, the stack frame maintenance code can be so unobtrusive that it is easy to overlook.
But it can still get in the way, so I'm going ahead and showing the code without it here, in both single-stack and split-stack frameless discipline.
Frameless does mean we have to keep track of what's on the stack(s).
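To make that bookkeeping a little more concrete, here is a small Python model (not 6809 code) of how the stack-pointer-relative offsets of the two parameters shift each time a routine pushes a temporary, the way ADD16S does in the listings below. The class and item names are made up for illustration; only the offsets in the comments come from the listings.

```python
class OffsetTracker:
    """Track stack-pointer-relative offsets of named items (hypothetical helper)."""

    def __init__(self):
        self.items = []  # top of stack first

    def push(self, name, size=2):
        self.items.insert(0, (name, size))

    def pop(self):
        return self.items.pop(0)[0]

    def offset(self, name):
        off = 0
        for item, size in self.items:
            if item == name:
                return off
            off += size
        raise KeyError(name)


# Single stack: the caller pushes left, then right, and the call pushes
# the return address on top of them.
single = OffsetTracker()
for name in ("left", "right", "retadr"):
    single.push(name)
assert single.offset("right") == 2   # TST 2,S
single.push("right_ext")             # PSHS X
assert single.offset("left") == 6    # LDD 6,S
single.push("left_ext")              # PSHS X
assert single.offset("right") == 6   # ADDD 6,S

# Split stack: the return address lives on the other stack,
# so the parameters start at offset 0 from U.
split = OffsetTracker()
for name in ("left", "right"):
    split.push(name)
assert split.offset("right") == 0    # TST ,U
split.push("right_ext")              # PSHU X
assert split.offset("left") == 4     # LDD 4,U
split.push("left_ext")               # PSHU X
assert split.offset("right") == 4    # ADDD 4,U
```

Nothing in the assembler does this counting for us, which is why the listings carry comments like "parameters 4 offset" after each push.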
And there's really not much more left to talk about, although we want to remember that, because we are making specific use of the direct page, the entry address is $2000 instead of $80.
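As a rough illustration of what the direct page register does for us: the 6809 forms a direct-mode address with DP as the high byte and the instruction's 8-bit operand as the low byte. The Python sketch below models that. `direct_address` is a hypothetical helper, and the offset $1A for FINAL is my own count of the bytes laid out ahead of it in the listing that follows, so treat it as an assumption.

```python
def direct_address(dp, offset):
    """Effective address of a 6809 direct-mode access: DP supplies the high byte."""
    if not (0 <= dp <= 0xFF and 0 <= offset <= 0xFF):
        raise ValueError("DP and offset are each one byte")
    return (dp << 8) | offset


ENTRY = 0x2000
FINAL_OFFSET = 0x1A  # assumption: LBRA (3 bytes) + 3 NOPs + the variables and bumpers before FINAL

dp = ENTRY >> 8      # the high byte that TFR A,DP loads in INISTK
print(hex(direct_address(dp, FINAL_OFFSET)))    # 0x201a
print(hex(direct_address(0x00, FINAL_OFFSET)))  # 0x1a, where it would land with DP left at 0
```

This is also why the listing says `SETDP $2000` for EXORsim and `SETDP $20` for some other assemblers: both describe the same page, one as a full address and one as the page number.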
* 16-bit addition as example of single-stack no frame discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID EQU 2 ; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
ORG $2000 ; MDOS says this is a good place for usr stuff.
* SETDP $20 ; for some other assemblers
SETDP $2000 ; for EXORsim
*
ENTRY LBRA START
NOP ; Just want even addressed pointers for no reason.
NOP ; bumper
NOP
SSAVE RMB 2 ; a place to keep S so we can return clean
SSAVEX EQU 6 ; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE RMB 2 ; just for kicks, save U, too
USAVEX EQU SSAVEX+2
DPSAVE RMB 2 ; a place to keep DP so we can return clean
DPSAVEX EQU USAVEX+2
RMB 4 ; bumper
XWORK RMB 2 ; For saving an index register temporarily
XWORKX EQU DPSAVEX+6
HPPTR RMB 2 ; heap pointer (not yet managed)
HPPTRX EQU XWORKX+2
HPALL RMB 2 ; heap allocation pointer
HPALLX EQU HPPTRX+2
RMB 4 ; bumper
FINAL RMB 4 ; 32-bit Final result in DP variable (to show we can)
FINALX EQU HPALLX+6
GAP1 RMB 2 ; Mark the bottom of the gap
GAP1X EQU FINALX+4
*
LB_ADDR EQU ENTRY
*
*
SETDP 0 ; Not yet set up
ORG $2100 ; Give the DP room.
RMB 4 ; a little bumper space
SSTKLIM RMB 96 ; roughly 16 levels of call
SSTKLIMX EQU $104
* ; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS RMB 4 ; for canary return
SSTKBASX EQU SSTKLIMX+96
SSTKBMP RMB 4 ; a little bumper space
SSTKBMPX EQU SSTKBASX+4
*
HBASE RMB $1024 ; Not using or managing heap yet.
HBASEX EQU SSTKBMPX+4
HLIM RMB 4 ; bumper
HLIMX EQU HBASEX+$1024
*
*
INISTK TFR DP,A
CLRB
TFR D,X ; save old DP base for a moment
LEAY ENTRY,PCR ; Set up new DP base
TFR Y,D
TFR A,DP ; Now we can access DP variables correctly.
* SETDP $20 ; some other assemblers
SETDP $2000 ; EXORsim
STX <DPSAVE ; technically only need to save high byte
STU <USAVE
PULS X ; get return address
STS <SSAVE ; Save what the monitor gave us.
LEAS SSTKBMPX,Y ; Move to our own stack
LEAY STKUNDR,PCR ; fake return to stack underflow handler
PSHS Y ;
PSHS Y ; one more fake return to handler
CLRB ; A still has run-time DP
ADDD #HBASEX ; calculate EA
TFR D,Y ; as if we actually had a heap
STY <HPPTR
STY <HPALL
JMP ,X ; return via X
*
***
* Stack after call when functions are called by MAIN
* with two parameters
* (#0 means no local variables)
* We will return result in D:X
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [--------]
* [--------]
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ]
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
* 16-bit left 1st pushed, right 2nd
* output parameter:
* 17-bit sum in 32-bit D:X D high, X low
ADD16S LDX #-1 ; sign extend right
TST 2,S ; sign bit, anyway
BMI ADD16SR
LEAX 1,X ; 0
ADD16SR PSHS X ; push right extension (parameters 4 offset)
LDX #-1 ; negative
LDD 6,S ; left
BMI ADD16SL
LEAX 1,X ; 0
ADD16SL PSHS X ; push left extension (parameters 6 offset)
ADDD 6,S ; add right
TFR D,X ; save low
PULS D ; get left sign extension (parameters 4 offset)
ADCB 1,S ; carry is still safe
ADCA ,S ; high word complete
LEAS 2,S ; drop temporary
RTS ; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
* 16-bit left, right
* output parameter:
* 17-bit sum in 32-bit D:X D high
ADD16U LDD 4,S ; left
ADDD 2,S ; add right
TFR D,X ; save low
LDD #0 ; extend
ADCB #0 ; extend Carry unsigned (could ROL in)
RTS ; C, N valid, Z not valid
*
* Etc.
*
***
* Stack at entry when called by MAIN
* (#0 means no local variables)
* No result is returned in registers; the caller's variable is updated in place.
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [VAR1_1--]
* [VAR1_2--] <= PARAM2_1
* [PARAM2_1] (pointer to VAR1_2)
* [PARAM2_2]
* [RETADR1 ]
*
* To show how to access caller's local variables through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to the caller's 32-bit internal variable.
* input parameter:
* 16-bit pointer to 32-bit integer
* 16-bit addend
* no output parameter:
ADD16SI LDX #-1 ; sign extend 1st parameter
TST 2,S
BMI ADD16SIP
LEAX 1,X
ADD16SIP PSHS X ; parameters now 4 offset
LDX 6,S ; pointer -- LDD [6,S] gets the high half
LDD 2,X ; caller's 2nd variable, low
ADDD 4,S ; 1st parameter
STD 2,X ; update low half
LDD ,X ; caller's 2nd variable, high
ADCB 1,S ; sign extension
ADCA ,S ; high byte
STD ,X ; update
LEAS 2,S ; drop temporary
RTS ; C, N valid, Z not valid
*
*
***
* Stack after allocating local variables
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [32:VAR1_1]
* [32:VAR1_2] <= SP
*
MAIN LDD #0 ; allocate and initialize
TFR D,X
PSHS D,X
PSHS D,X
*
LDX #$1234
LDD #$CDEF
PSHS D,X
LBSR ADD16U ; result in D:X should be $E023
STX 2,S
LDD #$8765
STD 0,S
LBSR ADD16S ; result in D:X should be $FFFF6788 (and carry set)
STX 6,S ; result in 2nd local variable
STD 4,S
LEAX 4,S ; calculate address of 2nd variable to pass in
STX 2,S
LDD #$A5A5
STD ,S
LBSR ADD16SI ; result in 2nd variable should be FFFF0D2D (Carry set)
LDD 4,S
STD <FINAL
LDD 6,S
STD <FINAL+2
LEAS 12,S ; drop both the used parameters and the local variables together
RTS ; C, N still valid, Z still not
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= SP
***
*
START NOP
LBSR INISTK
NOP
*
*
LBSR MAIN
*
DONE NOP
ERROR NOP ; define error labels as something not DONE, anyway
STKUNDR NOP
LDS <SSAVE ; restore the monitor stack pointer
LDU <USAVE ; restore U
LDD <DPSAVE ; restore the monitor DP last
TFR A,DP
SETDP 0 ; For lack of a better way to set it.
NOP
NOP ; landing pad to set breakpoint at
NOP
NOP
JMP [$FFFE] ; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor,
* but not necessarily to your program in a runnable state.
Again, there's not much to say about the split-stack code, other than that you'll want to compare it with the split-stack stack frame version for the 6809 and the split-stack stack frame version for the 6801, for the same reasons as mentioned above, to get a better feel for the differences.
* 16-bit addition as example of split-stack frame-free discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID EQU 2 ; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
ORG $2000 ; MDOS says this is a good place for usr stuff.
* SETDP $20 ; for lwasm and some other assemblers
SETDP $2000 ; for EXORsim and some other assemblers
*
ENTRY LBRA START
NOP ; Just want even addressed pointers for no reason.
NOP ; bumper
NOP
SSAVE RMB 2 ; a place to keep S so we can return clean
SSAVEX EQU 6 ; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE RMB 2 ; just for kicks, save U, too
USAVEX EQU SSAVEX+2
DPSAVE RMB 2 ; a place to keep DP so we can return clean
DPSAVEX EQU USAVEX+2
RMB 4 ; bumper
XWORK RMB 2 ; For saving an index register temporarily
XWORKX EQU DPSAVEX+6
HPPTR RMB 2 ; heap pointer (not yet managed)
HPPTRX EQU XWORKX+2
HPALL RMB 2 ; heap allocation pointer
HPALLX EQU HPPTRX+2
RMB 4 ; bumper
FINAL RMB 4 ; 32-bit Final result in DP variable (to show we can)
FINALX EQU HPALLX+6
GAP1 RMB 2 ; Mark the bottom of the gap
GAP1X EQU FINALX+4
*
LB_ADDR EQU ENTRY
*
*
SETDP 0 ; Not yet set up
ORG $2100 ; Give the DP room.
RMB 4 ; a little bumper space
SSTKLIM RMB 32 ; 16 levels of call
SSTKLIMX EQU $104 ; Skip over the DP page.
* ; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS RMB 4 ; for canary return
SSTKBASX EQU SSTKLIMX+32
SSTKBMP RMB 4 ; a little bumper space
SSTKBMPX EQU SSTKBASX+4
PSTKLIM RMB 64 ; about 16 levels of call at two parameters per call
PSTKLIMX EQU SSTKBMPX+4
PSTKBAS RMB 4 ; bumper space -- parameter stack is pre-dec
PSTKBASX EQU PSTKLIMX+64
*
HBASE RMB $1024 ; Not using or managing heap yet.
HBASEX EQU PSTKBASX+4
HLIM RMB 4 ; bumper
HLIMX EQU HBASEX+$1024
*
*
* Calculate DP because we don't have DP relative in index postbyte:
INISTKS TFR DP,A
CLRB
TFR D,X ; save old DP base for a moment
LEAY ENTRY,PCR ; Set up new DP base
TFR Y,D
TFR A,DP ; Now we can access DP variables correctly.
* SETDP $20 ; some other assemblers
SETDP $2000 ; EXORsim
STX <DPSAVE ; technically only need to save high byte
STU <USAVE
PULS X ; get return address
STS <SSAVE ; Save what the monitor gave us.
LEAS SSTKBMPX,Y ; Move to our own return stack
LEAU PSTKBASX,Y ; and our own parameter stack
LEAY STKUNDR,PCR ; fake return to stack underflow handler
PSHS Y
PSHS Y ; one more fake return to stack underflow handler
CLRB ; A still has run-time DP
ADDD #HBASEX ; calculate EA
TFR D,Y ; as if we actually had a heap
STY <HPPTR
STY <HPALL
JMP ,X ; return via X
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [RETADR1 ]
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
* 16-bit left, right
* output parameter:
* 17-bit sum in 32-bit
ADD16S LDX #-1 ; sign extend right
TST ,U ; sign bit, anyway
BMI ADD16SR
LEAX 1,X ; 0
ADD16SR PSHU X ; push right extension (parameters 2 offset)
LDX #-1 ; negative
LDD 4,U ; left
BMI ADD16SL
LEAX 1,X ; 0
ADD16SL PSHU X ; push left extension (parameters 4 offset)
ADDD 4,U ; add right
STD 6,U ; save low
PULU D ; get left sign extension (parameters 2 offset)
ADCB 1,U ; carry is still safe
ADCA ,U++ ; high word complete, tricky postinc (parameters 0 offset)
STD ,U
RTS ; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
* 16-bit left, right in 32-bit
* output parameter:
* 17-bit sum in 32-bit
ADD16U LDD 2,U ; left
ADDD ,U ; add right
STD 2,U ; save low
LDD #0 ; extend
ROLB ; extend Carry unsigned (could ADC #0)
STD ,U
RTS ; C, N valid, Z not valid
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* [32:VAR1_1--]
* [32:VAR1_2--] <= PARAM2_1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to the caller's 2nd 32-bit internal variable.
* input parameter:
* 16-bit pointer to 32-bit integer
* 16-bit addend
* no output parameter:
ADD16SI LDD #-1 ; sign extend addend parameter
TST ,U
BMI ADD16SIP
LDD #0
ADD16SIP PSHU D ; save sign extension (parameters 2 offset)
LDX 4,U ; get pointer to variable
LDD 2,X ; caller's 2nd variable, low
ADDD 2,U ; addend parameter
STD 2,X ; update low half
LDD ,X ; caller's 2nd variable, high
ADCB 1,U ; sign extension low byte
ADCA ,U ; high byte
STD ,X ; store result
LEAU 6,U ; drop temporary and parameters -- no return parameter
RTS ; C, N valid, Z not valid
*
*
***
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN LDD #0 ; allocate and initialize
TFR D,X
PSHU D,X
PSHU D,X
LDX #$1234
LDD #$CDEF
PSHU D,X ; 8 bytes local, 4 bytes parameter, 12 bytes offset
LBSR ADD16U ; 32-bit result on parameter stack should be $0000E023
LEAU 2,U ; drop high part (could be optimized out).
LDD #$8765
PSHU D
LBSR ADD16S ; result on parameter stack should be $FFFF6788 (and carry set)
PULU D,X ; 4 bytes of used parameters removed from stack (local variables on top)
STX 2,U ; low half, store in local variable
STD ,U ; high half
LEAX ,U ; point to 2nd variable
LDD #$A5A5
PSHU D,X ; X pushed first
LBSR ADD16SI ; result in 2nd variable should be FFFF0D2D (Carry set)
LDD 2,U
STD <FINAL+2
LDD ,U
STD <FINAL
LEAU 8,U
RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= RSP (S)
***
* (who knows?) <= PSP (U)
***
*
***
* Return stack will be just the return addresses:
* [RETADRNN ]
*
* Return stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= RSP
*
*
* Parameter stack after initialization, mark:
* [<unknown>] <= PSP
*
START LBSR INISTKS
*
LBSR MAIN
*
*
DONE NOP
ERROR NOP ; define error labels as something not DONE, anyway
STKUNDR NOP
LDS <SSAVE ; restore the monitor stack pointer
LDU <USAVE ; restore U
LDD <DPSAVE ; restore the monitor DP last
TFR A,DP
SETDP 0 ; For lack of a better way to set it.
NOP
NOP ; landing pad to set breakpoint at
NOP
NOP
JMP [$FFFE] ; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor,
* but not necessarily to your program in a runnable state.
As always, I have stepped through the code and made sure it does what I say it does.
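If you want to cross-check the expected values in the listings' comments without an emulator, the arithmetic can be modeled in a few lines of Python. This is only a sketch of the math, word by word with an explicit carry, not a simulation of the 6809; the function names borrow the routine names from the listings, but the helpers are my own.

```python
def add_words(a, b, carry_in=0):
    """Add two 16-bit words plus carry; return (16-bit sum, carry out)."""
    total = (a & 0xFFFF) + (b & 0xFFFF) + carry_in
    return total & 0xFFFF, total >> 16


def sign_extension(word):
    """The high word a signed 16-bit value extends to."""
    return 0xFFFF if word & 0x8000 else 0x0000


def add16u(left, right):
    """Unsigned 16-bit add, widened to 32 bits."""
    low, carry = add_words(left, right)
    return (carry << 16) | low


def add16s(left, right):
    """Signed 16-bit add, widened to 32 bits; also returns the final carry."""
    low, carry = add_words(left, right)
    high, carry = add_words(sign_extension(left), sign_extension(right), carry)
    return (high << 16) | low, carry


def add16si(variable, addend):
    """Add a signed 16-bit addend to a 32-bit variable; also returns the carry."""
    low, carry = add_words(variable & 0xFFFF, addend)
    high, carry = add_words(variable >> 16, sign_extension(addend), carry)
    return (high << 16) | low, carry


first = add16u(0x1234, 0xCDEF)
second, carry2 = add16s(first & 0xFFFF, 0x8765)
final, carry3 = add16si(second, 0xA5A5)
print(f"{first:08X} {second:08X} {final:08X} carries: {carry2} {carry3}")
# 0000E023 FFFF6788 FFFF0D2D carries: 1 1
```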
If reading through it and comparing it with the other versions brings up questions that stepping through the code doesn't answer, go ahead and leave me a comment.
From here, you can either go on to dig into outputting binary numbers, or (when I get it ready) you can look at one more set of examples for frameless discipline, on the 68000.
--
Ah, more squirrels to chase. I mean, more daydreams.
With the stack split up, we might be able to see how a simple hysteretic spill-fill cache could significantly optimize calls and returns.
Calls and returns cost, in addition to the code and cycles to load the new PC, cycles to save and restore the old. With the combined stack, they also tend to incur code and cycle costs in moving parameters into place and saving and restoring registers.
With a cache attached to the return stack pointer, saves and restores can happen in parallel with fetching, decoding, and executing instructions, effectively hiding the basic call/return overhead.
Here's what I mean by hysteretic spill/fill:
Say the cache has sixteen entries (32 bytes).
When pushing a new return address crosses the boundary between the 12th and 13th entry -- 3/4 full -- the cache controller starts pushing saved addresses off the other end into main RAM, to make more room. It watches the bus so that it can do so when the bus is not busy with instruction fetches or data or DMA accesses, unless the cache completely fills, in which case it gets the bus at higher priority than instruction fetches.
It will keep pushing addresses out until the cache is half-empty again, or until a return cancels the spill.
Returns work in reverse. As long as the code is nested more than four calls deep, the controller will try to keep at least four return addresses in the cache, scheduling reads to bring addresses back in from RAM when the boundary between the 5th and 4th entries is crossed.
It uses a cache base and limit register to maintain position in the stack address space, and a stack base and limit register to tell the controller when to come to a hard stop and when to initiate stack overflow or underflow interrupt/exception processing.
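Since this is all daydream, here is a toy Python model of the spill/fill behavior described above, just to check that the idea hangs together. Everything in it is an assumption layered on the description: the class and method names are made up, one entry moves per idle bus cycle, the thresholds are occupancy counts, and refilling stops at half full (the description doesn't say where it stops). Base and limit registers and the overflow/underflow interrupts are left out.

```python
from collections import deque


class ReturnStackCache:
    """Toy model of a return-address cache with hysteresis (names hypothetical)."""

    def __init__(self, size=16, spill_start=13, spill_stop=8,
                 fill_start=4, fill_stop=8):
        self.size = size
        self.spill_start = spill_start  # occupancy at which spilling begins
        self.spill_stop = spill_stop    # spill until occupancy drops to this
        self.fill_start = fill_start    # occupancy at which filling begins
        self.fill_stop = fill_stop      # assumption: refill back up to half
        self.cache = deque()            # oldest on the left, newest on the right
        self.ram = []                   # spilled entries, oldest first
        self.mode = None                # None, "spill", or "fill"
        self.stalls = 0                 # forced transfers that block the CPU

    def call(self, return_address):
        if self.mode == "fill":
            self.mode = None            # a call cancels a pending fill
        if len(self.cache) == self.size:
            self._spill_one()           # forced, high-priority spill
            self.stalls += 1
        self.cache.append(return_address)
        if len(self.cache) >= self.spill_start:
            self.mode = "spill"

    def ret(self):
        if self.mode == "spill":
            self.mode = None            # a return cancels a pending spill
        if not self.cache:
            if not self.ram:
                raise IndexError("return stack underflow")
            self._fill_one()            # forced fill
            self.stalls += 1
        address = self.cache.pop()
        if not self.ram:
            self.mode = None            # nothing left in RAM to bring back
        elif len(self.cache) <= self.fill_start:
            self.mode = "fill"
        return address

    def idle(self):
        """One free bus cycle: move at most one entry in the background."""
        if self.mode == "spill":
            self._spill_one()
            if len(self.cache) <= self.spill_stop:
                self.mode = None
        elif self.mode == "fill":
            self._fill_one()
            if not self.ram or len(self.cache) >= self.fill_stop:
                self.mode = None

    def _spill_one(self):
        self.ram.append(self.cache.popleft())

    def _fill_one(self):
        self.cache.appendleft(self.ram.pop())
```

In this model, thirteen calls in a row switch the controller into spill mode, and five idle cycles bring the cache back down to eight entries with the five oldest addresses in RAM. If the bus never goes idle, every call past the sixteenth stalls for a forced spill, which is the case the hysteresis is meant to make rare.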
I assume that you have noticed that splitting the stack helps relieve the costs of moving parameters into place, and can even be of some relief relative to saving and restoring registers.
The parameter stack is not as regularly structured as the return address stack, but it could profitably be cached in a similar manner, with a larger cache, either double or quadruple size.
Both of these caches should be paired, to enable fast context switching. Or maybe done in sets of four, but I'm not sure the 6809 would benefit from four of each. One for the current process and one to be writing back to RAM after a process switch should be enough.
And I guess, since I've commented after the 6801 examples about how the direct page should be a bank of memory to use as pseudo-registers, I should mention the concept of a cache for the direct page here. This would also be paired, with the switch activated when the DP is set. There would need to be several different strategies for filling the new cache and writing back the dirty entries from the old cache, plus a way of setting priority for different regions of the direct page.
Caching the direct page would conflict with using it for I/O devices, so I'm thinking the 6809 wants a second direct page (specifiable in the index post-byte) just for I/O.
Heh. Daydreams, indeed. This is just an 8/16-bit processor with a 16-bit address space. Too greedy. Unless we had a true 16/32-bit descendant of the 6809.
Ah. Sorry for the further distractions.