And
this one has been sitting at the bottom of the pool
for a while, as predicted, even longer than
the 6809 example.
Walking the Pontoons --
Split-stack Stack Frame Example:
6801
Having worked through
a concrete single-(combined)-stack stack frame example
for the 6801, let's look at whether the split stack discipline/paradigm improves
things (as I think it should). In the split stack discipline, we no longer need frame links. Each frame pointer is simply stacked up like the VBP was in the single stack example. Walking from frame to frame is simply moving back in the return stack -- since we've decided to store the frame pointers there in this split stack discipline.
We decided this?
Okay, I decided it.
Here are some of my reasons --
- we want linkage on one stack and parameter on the other.
- And one of the side benefits is that it's no longer a linked list, it's a stack of pointers, really easy to walk.
We can now simply point to the actual bottom of the fixed part of each frame,
instead of the frame link.
And that means that local references can now be positive offset from the frame
pointer, thanks to the natural structure in logic, reason, and mathematics --
the structure in the universe, if you will allow me, that either God or the
initial conditions provided us with.
We might still wish we had the SBX instruction. It would make allocation
without initialization more straightforward in many cases. But the Double
accumulator and adding negatives can help get us over that. Sort-of.
As with the 6809 and 68000, the return stack will always be in pairs.
However, rather than pushing the previous frame pointer immediately, we'll have the called routine push the frame pointer after allocating the local variables.
And it will be pushed in order of return address, caller frame pointer:
* [PRETADR ]
* [PCALLERFRM]
* [RETADR ]
* [CALLERFRM ] <= RSP
The parameter stack is just whatever is out there, but the conceptual order
would be
* [VARIABLES ] <= CALLERFRM
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES ] <= FP
* [TEMPORARIES]
* [PARAMETERS ] <= PSP
The difference here with the 6809 and 68000 implementations is that, instead of relying on negative offsets to access the variables as is the case there, we will move the frame pointer to the bottom of the variables after allocating them. Variables and parameters passed in will be at known positive offsets from the frame pointer.
The frame pointer we pushed before allocation was the frame pointer of the calling routine, which is as it should be. This will not allow us reliable (easy) means of accessing the calling function's temporary variables, but we shouldn't want to access them anyway.
And we note again that we no longer need to dodge around the return pointer and the saved stack frame pointer because they are on the other stack (where they should be).
You may be wondering about the local variable allocation and initialization,
since the 6801 provides no native push instruction for the parameter stack,
and the clean method we used in-line in the combined frames example:
LDX #0 ; 3: 3# 3: 3~
PSHX ; 1: 4 4: 7
PSHX ; 1: 5 4: 11
PSHX ; 1: 6 4: 15
PSHX ; 1: 7 4: 21
is no longer available.
We might want to use this, instead:
LDX #0
JSR PPSHD
JSR PPSHD
JSR PPSHD
JSR PPSHD
but it comes at a time cost:
PPOPD LDX PSP ; 2: 2# 4: 4~
LDD 0,X ; 2: 4 5: 9
INX ; 1: 5 3: 12
INX ; 1: 6 3: 15
STX PSP : 2: 8 4: 19
RTS : 1: 9 5: 24 (+3#, +6~ call)
*
PPSHD LDX PSP ; 2: 2 4: 4
DEX ; 1: 3 3: 7
DEX ; 1: 4 3: 10
STX PSP : 2: 6 4: 14
STD 0,X : 2: 8 5: 19
RTS : 1: 9 5: 24 (+3#, +6~ call)
Four calls to PPSHD is going to take 120 cycles (plus the load Double
accumulator), six times what the PSHX costs. That hurts.
There is a lot of redundant loading and storing of PSP going on in those routines, not to mention the basic call overhead -- the JSR and RTS instructions.
Would doing it in a loop help?
* count in low byte of DWORD
ALIND0 LDD #0 ; 3: 3# 3: 3~
* count in pseudo-register BCOUNT,
* initial value in D
ALINDC LDX PSP ; 2: 2# 4: 4~
ALINDL STD 0,X ; 2: 4 5: 9
DEX ; 1: 5 3: 12
DEX ; 1: 6 3: 15
DEC BCOUNT ; 3: 9 6: 21 (pseudo-register, but extended mode)
BNE ALINDL ; 2: 11 3: 24 (20 * count + 4 => count 4=>84)
STX PSP ; 2: 13 4: (84)+4
RTS ; 1: 14 5: (84)+9 (+3#, +6~ call)
That's better than our four calls to PPSHD, but still over 4 times the 4 PSHXes. And, in order to clear 16 bits at a time, I'm using a pseudo-register -- BCOUNT.
And we are reminded that the 6801 does not have direct-page addressing modes for the read-modify-write (unary operator) instructions, so BCOUNT will be addressed in extended (absolute) mode.
(Someday, I want to use programmed logic to implement a 6801 source-code
compatible CPU with the op-codes moved around a bit, add SBX, and add
direct-page mode op-codes for the read-modify-write -- unary -- instructions.
It wouldn't save all that much time in the above loop, but I would also add
address space decoding, so that the direct page could be popped out of the
absolute address space, and you could do fancy things like bank-switch
internal dual-ported RAM in the direct page, speeding access further and
allowing really fast process context switches. Heh. Dreams.)
If we are really clever, we can line it all up and provide routines for
allocating and clearing up to four 16-bit local variables like this:
ALCL8 LDD #0 ; 3: 3# 3: 3~
LDX PSP ; 2: 5 4: 7
DEX ; 1: 6 3: 10
DEX ; 1: 7 3: 13
STD 0,X ; 2: 9 5: 18
ALCLI6 DEX ; 1: 10 3: 21
DEX ; 1: 11 3: 24
STD 0,X ; 2: 13 5: 29
ALCLI4 DEX ; 1: 14 3: 32
DEX ; 1: 15 3: 35
STD 0,X ; 2: 17 5: 40
ALCLI2 DEX ; 1: 18 3: 43
DEX ; 1: 19 3: 46
STD 0,X ; 2: 21 5: 51
STX PSP ; 2: 23 4: 55
RTS ; 1: 24 5: 60
*
ALCL6 LDD #0 ; 3: 3# 3: 3~
LDX PSP ; 2: 5 4: 7
BRA ALCLI6 ; 2: 7 3: 10
* ; 0: 7 42: 52
*
ALCL4 LDD #0 ; 3: 3# 3: 3~
LDX PSP ; 2: 5 4: 7
BRA ALCLI4 ; 2: 7 3: 10
* ; 0: 7 31: 41
*
ALCL2 LDD #0 ; 3: 3# 3: 3~
LDX PSP ; 2: 5 4: 7
BRA ALCLI4 ; 2: 7 3: 10
* ; 0: 7 20: 30
That's half the cycle count (not including the call) of the four calls to PPSHD,
but still around three times doing the 8 bytes with four PSHX instructions. It
is an improvement.What's the fastest possible way on the 6801 short of doing dangerous (because of interrupts) things like quickly swapping the return stack pointer in for the parameter stack pointer and swapping it out when done?
(Never do that. If you swap out the stack pointer, you will not avoid interrupts, no matter how hard you try, and your interrupts will save registers all over things. I mean, it is possible to physically strap all interrupt inputs high, but then you have the devil of a time interfacing with the real world. Keep your return stack valid at all times.)
If we replace all those DEXs with direct calculation, we can trim it up a bit,
like this:
ALCLD8 LDD #-8 ; 3: 3# 3: 3~
ADDD PSP ; 2: 5 4: 7
STD PSP ; 2: 7 4: 11
LDX PSP ; 2: 9 4: 15
CLRB ; 1: 10 2: 17
STD 0,X ; 2: 12 5: 22
STD 2,X ; 2: 14 5: 27
STD 4,X ; 2: 16 5: 32
STD 6,X ; 2: 18 5: 37
STX PSP ; 2: 20 4: 41
RTS ; 1: 21 5: 46
But there's no reusable code in there, other than that the whole routine can
be used as a subroutine anywhere you need to allocate and clear 8 bytes on the
parameter stack.
I'm not going to show all the approaches I've checked, but that's the best for time that I can figure out. The down side is that I can't figure out how to steal code from it, so it has to be mostly recreated for the 6 byte and 4 byte case. The 2 byte case you're just going to want to in-line. But it's always faster.
On the other hand, if you need to trim up the object code size, we do have a
sort-of decent option in the DEX chain.
Doesn't this de-optimization of stack allocation have negative consequences for the split stack discipline?
Yes, but I've found, at least in my range of experience, that the negatives are not nearly as negative as they might seem, and the benefits outweigh the disadvantages.
So, the code --
As always, read the code, read the comments and be careful because sometimes the code and the comments are out-of-sync. (You can be sure the comments are closer to the code than any detailed explanation I could give in the blog.)
* 16-bit addition as example of split-stack stack frame discipline on 6801
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
OPT 6801
NATWID EQU 2 ; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
ORG $80 ; MDOS says this is a good place for usr stuff.
*
ENTRY JMP START
NOP ; Just want even addressed pointers for no reason.
NOP ; bumper
NOP ; 6 bytes to this point.
SSAVE RMB 2 ; a place to keep S so we can return clean
RMB 4 ; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK RMB 2 ; For saving an index register temporarily
DWORK RMB 2 ; For saving D temporarily
ALCOUNT RMB 1 ; for allocation utility routines
RMB 1 ; reserved
PSP RMB 2 ; parameter stack pointer
FP RMB 2 ; frame pointer
LB_BASE RMB 2 ; For process local variables
HPPTR RMB 2 ; heap pointer (not yet managed)
HPALL RMB 2 ; heap allocation pointer
HPLIM RMB 2 ; heap limit
* End of pseudo-registers
RMB 4 ; bumper
GAP1 RMB 2 ; Mark the bottom of the gap
*
*
*
ORG $2000 ; Give the DP room.
LB_ADDR RMB 4 ; a little bumper space
FINAL RMB 4 ; 32-bit Final result in DP variable (to show we can)
FINALX EQU 4
SSTKLIM RMB 64 ; 16 levels of call
SSTKLMX EQU FINALX+4
SSTKBAS RMB 8 ; for canary return
SSTKBSX EQU SSTKLMX+64
SSTKFAK RMB 2 ; fake frame pointer, self-link
SSTKFAX EQU SSTKBSX+8 ; 6801 is post-dec (post-store-decrement) push
SSTKBMP RMB 4 ; a little bumper space
SSTKBMX EQU SSTKFAX+2 ; But we are going to init S through X
PSTKLIM RMB 64 ; 16 levels of call at two parameters per call
PSTKLMX EQU SSTKBMX+4
PSTKBAS RMB 4 ; bumper space -- parameter stack is pre-dec
PSTKBSX EQU PSTKLMX+64
PSTKBMP RMB 4 ; a little bumper space
PSTKBMX EQU PSTKBSX+4
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE RMB 1 ; $1024 or something ; Not using or managing heap yet.
HBASEX EQU PSTKBMX+4
*HLIM RMB 4 ; bumper
*HLIMX EQU HBASEX+$100 ; 1024
*
*
ORG $3000
CDBASE JMP ERROR ; more bumpers
NOP
*
INISTKS LDX #LB_ADDR ; set up process local space
STX LB_BASE
LDD LB_BASE
ADDD #HBASEX ; calculat EA
STD HPPTR ; as if we actually had a heap
STD HPALL
LDD #CDBASE
SUBD #4 ; extra bumper
STD HPLIM
LDD LB_BASE
ADDD #PSTKBSX ; calculate the address
STD PSP ; initialize parameter stack pointer empty
STD FP ; initialize frame pointer
LDX #SSTKNDR ; error handler
STX DWORK ; save it for a bit (not ready for stack)
LDD LB_BASE
ADDD #SSTKBSX ; calculate the address
STD XWORK ; move it to X (not ready fof stack)
LDX XWORK
LDD DWORK ; prime the stack with underflow error handler
STD 0,X
STD 4,X
LDD PSP ; and empty parameter stack frame
STD 2,x ; stacks primed
PULA ; get the return address
PULB
STS SSAVE ; Save what the monitor gave us.
TXS ; move to our own stack
PSHB
PSHA
RTS
*
*
***
* General structure of the stacks,
*
* return stack is always in pairs:
* [PRETADR ]
* [PCALLERFRM]
* [RETADR ]
* [CALLERFRM ] <= RSP
*
* order of elements on the parameter stack,
* when they are present:
* [PARAMETERS ]
* [VARIABLES ] <= CALLERFRM
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES ] <= FP
* [TEMPORARIES]
* [PARAMETERS ] <= PSP
*
* Result is returned on parameter stack
*
***
* Utility routines
PPOPD LDX PSP
LDD 0,X
INX
INX
STX PSP
RTS
*
PPSHD LDX PSP
DEX
DEX
STX PSP
STD 0,X
RTS
*
* subroutine to make sure we don'T forget anything
MARK PULA
PULB
LDX FP
PSHX ; mark, no allocate,
LDX PSP
STX FP
PSHB
PSHA
RTS
*
*UNMK PULA
* PULB
* PULX
* STX FP
* PSHB
* PSHA
* RTS
*
* Compromise between speed and reusability
* Enter here to load PSP and initialize to 0
* 8 bytes
ALCL8 LDD #0 ; 3: 3# 3: 3~
* Enter here with initial value in D
ALCLD8 LDX PSP ; 2: 5 4: 7
* Enter here with PSP loaded and initial value in D
ALCLI8 DEX ; 1: 6 3: 10
DEX ; 1: 7 3: 13
STD 0,X ; 2: 9 5: 18
ALCLI6 DEX ; 1: 10 3: 21
DEX ; 1: 11 3: 24
STD 0,X ; 2: 13 5: 29
ALCLI4 DEX ; 1: 14 3: 32
DEX ; 1: 15 3: 35
STD 0,X ; 2: 17 5: 40
ALCLI2 DEX ; 1: 18 3: 43
DEX ; 1: 19 3: 46
STD 0,X ; 2: 21 5: 51
STX PSP ; 2: 23 4: 55
RTS ; 1: 24 5: 60
*
* six bytes
ALCL6 LDD #0 ; 3: 3# 3: 3~
ALCLD6 LDX PSP ; 2: 5 4: 7
BRA ALCLI6 ; 2: 7 3: 10
* ; 0: 7 42: 52
* four bytes
ALCL4 LDD #0 ; 3: 3# 3: 3~
ALCLD4 LDX PSP ; 2: 5 4: 7
BRA ALCLI4 ; 2: 7 3: 10
* ; 0: 7 31: 41
* two bytes
ALCL2 LDD #0 ; 3: 3# 3: 3~
LDX PSP ; 2: 5 4: 7
BRA ALCLI4 ; 2: 7 3: 10
* ; 0: 7 20: 30
*
*
* Add D to PSP -- negative for allocation, positive for deallocation
ADDPSP ADDD PSP ; plus 3 bytes for load: 6 bytes vs. 9 total
STD PSP
LDX PSP ; 3 bytes for call vs. 6 bytes in-line
RTS
*
PDROP_8 LDAB #8 ; saves two bytes, 7 vs. 3
* deallocate count in B
PDROPB LDX PSP ; 5 bytes to deallocate in-line
ABX ; vs. 3 bytes to call this.
STX PSP ; ABX is useful for deallocation
RTS ; 5 bytes vs. 7 total
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry, after link:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ]
* [FRMPTR0==<EMPTYP>]
* [RETADR1 ]
* [FRMPTR1 ] <= RSP
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>] <= FRMPTR0,FRMPTR1
* [32:VAR1_1--]
* [32:VAR1_2--] <= FRMPTR1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP,FP
*
****
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
* 16-bit left, right
* output parameter:
* 17-bit sum in 32-bit
ADD16S JSR MARK ; mark, no allocate, X is PSP
*
LDD #(-1) ; default negative
JSR ALCLI4 ; allocate 2 temporary cells and init
LDX FP ;
TST 2,X ; the left-hand operand sign bit
BMI ADD16SR
LDX PSP
CLR 2,X ; positive
CLR 3,X
ADD16SR LDX FP
TST 0,X ; the right-hand operand sign bit
BMI ADD16SL
LDX PSP
CLR 0,X ; positive
CLR 1,X
ADD16SL LDX FP
LDD 2,X ; left hand
ADDD 0,X ; right hand
STD 2,X ; store low half
LDX PSP
LDD 2,X
ADCB 1,X
ADCA 0,X
LDX FP ; wouldn't need to do this if we tracked PSP extras
STD 0,X
*
STX PSP ; drop the temporaries
PULX
STX FP
RTS
*
* The alternative without link, mark, or restore will be shown in the no-frame case.
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
* 16-bit left, right in 32-bit
* output parameter:
* 17-bit sum in 32-bit D1
ADD16U JSR MARK ; mark, no allocate, X is PSP
*
LDX FP
LDD 2,X ; left
ADDD 0,X ; add right
STD 2,X ; save low
LDD #0 ; extend
ROLB ; extend Carry unsigned (could ADC #0)
STD 0,X ; re-use right side to store high half
*
PULX ; restore FP
STX FP
RTS
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with one 16-bit parameters,
* after mark (no local allocation)
* [<unknown> ] <= FRMPTR0
* [32:VAR1_1 ]
* [32:VAR1_2 ] <= FRMPTR1
* [16:PARAM2_1] <= PSP,FP
*
* To show how to walk the stack --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
* 16-bit addend in 32-bit
* target parameter in caller
* 2nd 32-bit variable at offset -2*NATWID
* no output parameter:
ADD16SI JSR MARK
*
LDD #(-1) ; make a temporary -1
JSR ALCLI2 ; (default to signed)
LDX FP
TST 0,X ; test high byte
BMI ADD16SP
LDX PSP
CLR 0,X ; zero extend
CLR 1,X
ADD16SP TSX
LDX 0,X ; caller's FP
LDD 2,X ; caller's 2nd variable, low
LDX FP
ADDD 0,X ; parameter
TSX
LDX 0,X
STD 2,X ; update low half with result
LDD 0,X ; 2nd variable, high half
LDX PSP
ADCB 1,X ; sign extension half
ADCA 0,X
TSX
LDX 0,X
STD 0,X ; update high half
*
LDX FP
INX ; drop parameter
INX
STX PSP ; and sign temporary goes bye-bye, too
PULX
STX FP
RTS
*
*
***
* Return stack on entry:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ] <= RSP
*
* Return stack after link:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>]
* [RETADR0 ]
* [FRMPTR0==<EMPTYP>] <= RSP
*
* Parameter stack after mark and local allocation
* [<unknown>] <= FRMPTR0
* [VAR1_1--]
* [VAR1_2--] <= PSP,FP
*
MAIN JSR MARK
JSR ALCL8 ; allocate and clear 8 bytes
STX FP ; Point FP to base of local variables.
*
LDD #$1234
JSR PPSHD
LDD #$CDEF
JSR PPSHD
JSR ADD16U ; 32-bit result on parameter stack should be $0000E023
LDD #$8765
LDX PSP ; reuse parameter space, since order is okay
STD 0,X
JSR ADD16S ; result on parameter stack should be $FFFF6788
LDX PSP
LDD 2,X ; result low half
LDX FP
STD 2,X ; to 2nd local variable low half
LDX PSP
LDD 0,X ; result high half
LDX FP
STD 0,X ; to 2nd local variable high half
LDD #$A5A5
JSR PPSHD
JSR ADD16SI ; result in 2nd variable should be FFFF0D2D (Carry set)
LDX FP
LDD 2,X ; 2nd variable low half
LDX LB_BASE
STD FINALX+2,X
LDX FP
LDD 0,X
LDX LB_BASE
STD FINALX,X
*
JSR PDROP_8
PULX
STX FP ; restore FP
RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
*
***
* Return stack will always be in pairs:
* [RETADRNN ]
* [CALLERFMNN]
*
* Return stack after initialization:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS <= RSP
*
* Return stack after saving previous mark:
* [SSTKNDR ]
* [<EMPTYP>]
* [SSTKNDR ]SSTKBAS
* [FRMPTRm1==<EMPTYP>] <= RSP
*
* Parameter stack after initialization, mark:
* [<unknown]PSTKBAS <= PSP,FP==<EMPTYP>
*
START JSR INISTKS
LDX PSP
PSHX ; empty previous mark
STX FP ; empty new mark
*
JSR MAIN
*
*
DONE NOP
ERROR NOP ; define error labels as something not DONE, anyway
SSTKNDR NOP
LDS SSAVE ; restore the monitor stack pointer
NOP
NOP
NOP ; another landing pad to set breakpoint at
NOP
LDX $FFFE
JMP 0,X ; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor,
* but not necessarily to your program in a runnable state.
Again, I have tested this code. It builds the stack frames and tears them down as advertised and gets the right result in the right place. Again, I will not guarantee that this code can be generalized, whether by hand or by automaton (compiler, etc.).
[JMR202411182142 edit:]
Just realized, while working on this for the 6800, that the name for SUB16SI
did not agree with what it is doing. So I'm fixing it, calling it ADD16SI,
instead.
[JMR202411182142 edit end.]
Again, I remind you that we have seen what
this kind of code looks like without stack frames. I may come back and strip the stack frames from this specific example, just
to be obnoxious, but we need to look at both kinds of stack frames on the
6800,
single stack, first.
If you don't want to wait, move on to
getting numeric output in binary.
[JMR202411290957 addendum:]
I've mentioned several times how I'd have liked a subtract B from X SBX instruction in the 6801. We might have put it to good use it here, as well.
Another thing that would be useful, if we must have stack frames, would be a version of the JSR call instruction(s) that pushes both the address of the next instruction and the 2-byte frame pointer variable from the direct page, with a corresponding return that pops both the PC and the frame pointer, something like
CALL <address>:PSP
where PSP is the direct page variable.
The return might look like
RET PSP
That would be asking for too much, really, but it would streamline the split-stack discipline.
It could also help with single-stack stack frames.
(Sigh. Daydreams.)
[JMR202412152250: I wasted a few brain cycles on more meanderings on extending the 6801, here.
[JMR202411290957 addendum end.]
No comments:
Post a Comment