Sunday, December 15, 2024

ALPP 02-35 -- Tentative Op-code Map of RK0801 CPU (Extension of M6801)

One final bit of treasure from the bottom of the pool.

  Tentative Op-code Map of
RK0801 CPU
(Extension of 6801)

(Title Page/Index)

 

This is a tentative op-code map of extensions to the 6801 CPU that I think would make it significantly more efficient without blowing the semiconductor real estate (gates) budget for an 8-bit CPU core. It draws on some older ideas I've had for a while (direct page unaries and SBX) and some new ideas suggested by the addressing math and stack frame examples.

New in this map:

  • SBX: Subtract B from X, the corollary to the existing ABX. This optimizes small-to-medium allocations where the size is not known at compile/assemble time, and also helps when following relative links around.
    (Adding an op-code to add D to X might be another possibility, but would require sign-extending into A.)
  • Add signed Immediate byte to X and S, ADIX/ADIS. This optimizes small-to-medium stack and other allocations where size is known at compile/assemble time.
    (Add and Subtract unsigned byte immediate is an option, but requires more op-codes in the very tight primary op-code table. Add 16-bit immediate is yet another option, but is less efficient with code size, enough so as to make the most common case, add plus or minus 2, meaningless.)
    (Considered dropping INX/DEX and INS/DES, but that de-optimizes byte string operations.)
  • Direct-page versions of unary/read-modify-write byte instructions,
    • NEG (NEGate byte),
    • COM (bit COMplement byte),
    • LSR (Logical Shift Right byte),
    • ROR (ROtate Right byte through carry)
    • ASR (Arithmetic Shift Right byte, copying sign),
    • ROL (ROtate Left byte through carry),
    • DEC (DECrement byte),
    • INC (INCrement byte),
    • TST (TeST byte),
    • CLR (CLeaR byte).

    (These are, really, more appropriate in direct-page mode than in extended mode, to provide effective pseudo-registers.)
    (Also, it might be useful to provide address function code outputs that distinguish between direct page and extended mode, providing an effective separate address space for pseudo-registers and I/O, with all addressing modes enabled on it.)

  • 16-bit read/modify/write instructions:
    • DINC, (Double-byte INCrement)
      (including INCD, INCrement Double accumulator),
    • DDEC (Double-byte DECrement)
      (including DECD, DECrement Double accumulator),
    • DASL (Double-byte Arithmetic Shift Left)
      (including ASLD, Arithmetic Shift Left Double accumulator),
    • DLSR (Double-byte Logical Shift Right)
      (including LSRD, Logical Shift Right Double accumulator).

    (DASL and DLSR are moved from their position in the 6801 map to the corresponding position in the new 16-bit ranks.)
    (16-bit increment and decrement in the direct page will be especially helpful for software stacks.)

  • JMP to direct-page target (not in 6801 op-codes).
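
As a rough sketch of what SBX and ADIX/ADIS would do to the registers, here is a small Python model. It is not part of the op-code map itself: the function names are made up for illustration, and treating B as an unsigned byte in SBX is an assumption carried over from how ABX behaves on the 6801.

```python
MASK16 = 0xFFFF

def abx(x, b):
    """Existing 6801 ABX: add the unsigned byte in B to X."""
    return (x + (b & 0xFF)) & MASK16

def sbx(x, b):
    """Proposed SBX (assumed unsigned, mirroring ABX): subtract B from X."""
    return (x - (b & 0xFF)) & MASK16

def adi(reg, imm8):
    """Proposed ADIX/ADIS: add a sign-extended immediate byte to X or S."""
    imm8 &= 0xFF
    offset = imm8 - 0x100 if imm8 & 0x80 else imm8
    return (reg + offset) & MASK16

# Allocate a block whose size is only known at run time, then release it.
x = sbx(0x2100, 24)      # X now points 24 bytes lower
x = abx(x, 24)           # back where we started

# Allocate and drop one 16-bit cell with a size known at assemble time.
s = adi(0x2168, 0xFE)    # immediate $FE is -2
s = adi(s, 0x02)         # immediate $02 is +2
```

The same signed-immediate helper serves both X and S, which is the point of giving ADIX and ADIS one operand format.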

Adding the FDIV and IDIV instructions that the 68HC11 has would be fun, but would likely shoot the gates budget. Likewise, adding the 68HC11's bit testing and manipulation instructions or an additional stack register would require using pre-bytes, and I don't want to do that, either.

Instead of moving the op-codes around, the missing op-codes could be squeezed into empty codes in the 6801 map, but that would require gates that could be used for something else. 

Using a pre-byte and putting the direct page op-codes in a second op-code map would partially erase the advantage of direct-page op-codes.

Left half of the op-code table (mnemonics):

        UNARY            BRANCH           UNARY
        **ACCA  **INH    REL     INH      **ACCB  *Dir  Ind  Ext
        0       1        2       3        4       5     6    7
  0     NEG     ***CBA   BRA     TSX      NEG     NEG
  1             NOP      BRN     [INS]    INCD    *DINC
  2             ***SBA   BHI     PULA     DECD    *DDEC
  3     COM     ***ABA   BLS     PULB     COM     COM
  4     LSR     ***TAB   BCC     [DES]    LSR     LSR
  5             ***TBA   BCS     TXS      **ASLD  *DASL
  6     ROR     TAP      BNE     PSHA     ROR     ROR
  7     ASR     TPA      BEQ     PSHB     ASR     ASR
  8     ASL     [INX]    BVC     PULX     ASL     ASL
  9     ROL     [DEX]    BVS     RTS      ROL     ROL
  A     DEC     CLV      BPL     ABX      DEC     DEC
  B             SEV      BMI     RTI      ***LSRD *DLSR
  C     INC     CLC      BGE     PSHX     INC     INC
  D     TST     SEC      BLT     MUL      TST     TST
  E     ***DAA  CLI      BGT     WAI      *SBX    *JMP
  F     CLR     SEI      BLE     SWI      CLR     CLR

(The mnemonic in the last position of each row applies to columns 5, 6 and 7 alike.)

*Not in 6801
*No JMP dp in 6801

**Moved in 0801

***Both row and column moved.

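Right half of the op-code table (mnemonics):

        BINARY
        ACCA                       ACCB
        Imm      Dir  Ind  Ext     Imm      Dir  Ind  Ext
        8        9    A    B       C        D    E    F
  0     SUB
  1     CMP
  2     SBC
  3     SUBD                       ADDD
  4     AND
  5     BIT
  6     LDA
  7              STA                        STA
  8     EOR
  9     ADC
  A     ORA
  B     ADD
  C     CPX                        LDD
  D     BSR      JSR                        STD
  E     LDS                        LDX
  F     *~ADIS   STS               *~ADIX   STX

(Each mnemonic applies from its own column rightward to the next entry in the row; a mnemonic alone in its row applies to all eight columns.)

*Not in 6801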

*~ADIS and ADIX take a signed byte constant

Expanding the address map via segment registers or widened address registers is tempting, but I'm thinking to simply be satisfied with two additional address function outputs to allow distinction between

  • code space (PC relative),
  • return address stack space (S relative),
  • direct page space (DP mode),
  • general data (everything else).

Four address spaces won't really even double available address space because of issues in indexing and hard space separation, but it will make it possible to reach or somewhat exceed full 64 K addressing.
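
To make the idea of the two function-code outputs concrete, here is a minimal Python sketch. Everything in it is hypothetical: the numeric codes, the access-kind names, and the notion that external hardware simply places the two bits above the 16 address lines are assumptions for illustration, not anything fixed by the op-code map.

```python
from enum import IntEnum

class Space(IntEnum):
    # The numeric codes are arbitrary placeholders for this sketch.
    CODE = 0
    RETURN_STACK = 1
    DIRECT_PAGE = 2
    DATA = 3

def function_code(access):
    """Pick the address space from how the address was formed."""
    if access in ("fetch", "pc_relative"):
        return Space.CODE
    if access == "s_relative":
        return Space.RETURN_STACK
    if access == "direct":
        return Space.DIRECT_PAGE
    return Space.DATA          # indexed, extended, and everything else

def decoded_address(access, addr16):
    """Two function-code bits above the 16 CPU address bits: 18 bits."""
    return (function_code(access) << 16) | (addr16 & 0xFFFF)

# The same 16-bit address lands in different spaces by addressing mode.
print(hex(decoded_address("direct", 0x0080)))
print(hex(decoded_address("extended", 0x0080)))
```

Because the space is chosen by the addressing mode rather than by the program, a pointer held in X cannot reach the code or direct-page spaces in this model, which is the "hard space separation" cost mentioned above.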

On the other hand, it would not be hard to give the '0801 widened X and PC and maybe S, or segment registers for two, three or all four of the above address spaces or something similar. If segment registers, I would want to use either full-width segment registers, or have the segment registers offset a full byte. None of that 4-bit offset namby-pamby.
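
For comparison, here is a small Python sketch of the two offsetting schemes. The 16-bit segment register width and the function names are assumptions made for the example; the 4-bit shift is the scheme the 8086 uses.

```python
def byte_offset_address(segment, offset):
    """Segment register offset a full byte: (segment << 8) + offset."""
    return ((segment & 0xFFFF) << 8) + (offset & 0xFFFF)

def nybble_offset_address(segment, offset):
    """The 4-bit offset, for contrast: (segment << 4) + offset."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

# Highest address each scheme can form from 16-bit segment and offset.
print(hex(byte_offset_address(0xFFFF, 0xFFFF)))    # a little over 16 MB
print(hex(nybble_offset_address(0xFFFF, 0xFFFF)))  # a little over 1 MB
```

With the byte offset, segments start on 256-byte boundaries, so the segment value is simply the upper bytes of the address and the arithmetic lines up with the 8-bit data path.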

Further extensions, such as a second stack and widened address registers, and the Y register, bit operators, and IDIV and FDIV from the 'HC11, would warrant another part number, say 2801, the "2" indicating two stacks. 

Or a second 16-bit accumulator, such as the 6309 has, would make it a 16-bit CPU, so maybe 21601. But borrowing from the 6309 tends to point to the idea that, beyond a certain point, we'd want to move up to an extended derivative of the 6809.

Well, I don't think I have anything more for this rabbit hole at this moment, so you can return to the irregularly scheduled assembly language tutorial, continuing with getting numbers output.




Sunday, December 8, 2024

ALPP 02-34 -- Ascending the Right Island -- Frameless Examples (Single- & Split-stack): 68000

These should be the final bits of treasure I have to drag up from the bottom of the pool before I get back to I/O.

  Ascending the Right Island --
 Frameless Examples (Single- & Split-stack):
68000


 

Having worked through the 6809 versions of the stack frame example code with the stack frame code stripped out, let's continue full circle and look at the 68000 version with the stack frame code stripped out. 

Again, there is not a whole lot to say here that can't be seen fairly easily in the code. Let's start with the single-stack no-frame code, comparing it with the single-stack stack frame code for the 68000 and the same for the 6809 in a new window. You'll want to assemble this and trace through it, stopping appropriately to look at the registers, stack, and memory.

	OPT LIST,SYMTAB	; Options we want for the stand-alone assembler.
	MACHINE MC68000	; because there are a lot the assembler can do.
	OPT DEBUG	; We want labels for debugging.
	OUTPUT
***********************************************************************
*
* 16-bit addition as example of single-stack no-frame discipline on 68000
* with test code
* Joel Matthew Rees, October, November 2024
*
NATWID	EQU	4	; 4 bytes in the CPU's natural integer
HLFNAT	EQU	2	; half natural integer
*
*
	EVEN
LB_ADDR	EQU	*
ENTRY	BRA.W	START
	NOP		; A little buffer zone.
	NOP
A4SAVE	DS.L	1	; a place to keep A4 to A7 so we can return clean
A5SAVE	DS.L	1	; using it as pseudo-DP
A6SAVE	DS.L	1	; 
A7SAVE	DS.L	1	; SP
	DS.L	2	; gap
HPPTR	DS.L	1	; heap pointer (not yet managed)
HPALL	DS.L	1	; heap allocation pointer
	DS.L	2	; gap
FINAL	DS.L	1	; statically allocated variable (MAIN stores the final result here)
GAP1	DS.L	51	; gap, make it an even 256 bytes.
*
	DS.L	2	; a little bumper space
SSTKLIM	DS.L	16*8	; 16 levels of call, with room for stack frames
* 			; 68000 is pre-dec (pre-store-decrement) push
SSTKBAS	DS.L	4	; for canary return
	DS.L	2	; bumper
HBASE	DS.L	$1000	; heap space (not yet managing it)
HLIM	DS.L	2	; bumper
*
*
	EVEN
INISTKS	MOVEM.L	(A7)+,A0	; get the return address from the BIOS-provided stack
	LEA	LB_ADDR(PC),A3	; point to our process-local area
	MOVEM.L	A4-A7,A4SAVE-LB_ADDR(A3)	; Store away what the BIOS gives us.
	MOVE.L	A3,A5	; set up our local base (pseudo-DP) in A5
	LEA	SSTKBAS+4*NATWID-LB_ADDR(A5),A7	; set up our return stack
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	LEA	HBASE-LB_ADDR(A5),A4	; as if we actually had a heap
	MOVE.L	A4,HPPTR-LB_ADDR(A5)
	MOVE.L	A4,HPALL-LB_ADDR(A5)
	JMP	(A0)		; return via A0
*

***
* Stack after entry when functions are called by MAIN
* with two parameters
* We will return the result in D1 (D0 is used as scratch)
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [--------]
* [--------] 
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] <= SP 
*

* Signed 16 bit add to 32 bit result
* Why do this? Stack cell is 32-bit, parameters are 16.
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right in 32-bit cell on stack
* output parameter:
*   17-bit sum in 32-bit D1
ADD16S	MOVE.W	NATWID+HLFNAT(A7),D0	; right (16-bit only)
	EXT.L	D0
	MOVE.W	2*NATWID+HLFNAT(A7),D1	; add to left (16-bit only)
	EXT.L	D1
	ADD.L	D0,D1			; 32-bit result
	RTS		; return, *** all flags valid!! ***
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit cell on stack
* output parameter:
*   17-bit sum in 32-bit D1
ADD16U	CLR.L	D0
	MOVE.W	NATWID+HLFNAT(A7),D0	; right (16-bit only)
	CLR.L	D1
	MOVE.W	2*NATWID+HLFNAT(A7),D1	; add to left (16-bit only)
	ADD.L	D0,D1			; 32-bit result
	RTS		; return, *** all flags valid!! ***
*
* Etc.
*

***
* Stack after entry when functions are called by MAIN
* with two parameters (pointer and addend)
* We will return result in D0:D1
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [VAR1_1--]
* [VAR1_2--] <= PARAM2_1
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] <= SP
* To show how to access caller's local variables through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to caller's 32-bit internal variable.
* input parameter:
*   32-bit pointer to 32-bit integer
*   16-bit addend (in 32-bit cell)
* no output parameter:
ADD16SI	MOVE.W	NATWID+HLFNAT(A7),D1	; skip over return address
	EXT.L	D1
	MOVE.L	2*NATWID(A7),A0		; get pointer to (internal) variable
	ADD.L	D1,(A0)			; add to variable pointed to
	RTS		; return, *** all flags valid!! ***
*
*
***
* Stack on entry
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= SP
*
MAIN	CLR.L	-(A7)	; 2 variables
	CLR.L	-(A7)
	MOVE.L	#$1234,-(A7)
	MOVE.L	#$CDEF,-(A7)
	BSR.W	ADD16U	; result in D1 should be $E023
	LEA	2*NATWID(A7),A7	; could reuse, instead
	MOVE.L	D1,-(A7)
	MOVE.L	#$8765,-(A7)
	BSR.W	ADD16S	; result in D1 should be $FFFF6788 (and carry set)
	LEA	2*NATWID(A7),A7	; drop the parameters
	MOVE.L	D1,(A7)	; store result in 2nd local variable
	PEA	(A7)
	MOVE.L	#$A5A5,-(A7)
	BSR.W	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	MOVE.L	2*NATWID(A7),FINAL-LB_ADDR(A5)	; store the result
	LEA	4*NATWID(A7),A7	; drop the parameters and locals
	RTS
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP (A7)
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= SP
***
*
START	BSR.W	INISTKS
*
	NOP
*
	BSR.W	MAIN
*
	NOP
*
DONE	NOP
ERROR	NOP		; stack underflow and ERROR skip DONE
STKUNDR	NOP
	MOVEM.L	A4SAVE-LB_ADDR(A5),A4-A7	; restore the monitor's A4-A7
	NOP
	NOP		; landing pad
	NOP
	NOP
* One way to return to the OS or other calling program
	CLR.W	-(A7)	; there should be enough room on the caller's stack
	TRAP	#1	;	quick exit
*

I have stepped through the code myself. It runs, puts the correct results where they are supposed to go, and restores the stack as it should.
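
If you want to check the expected values in the comments before stepping through, the arithmetic can be modeled in a few lines of Python. This is only a sketch of the three additions MAIN performs, not a 68000 simulator; the helper names are made up here, and only the carry flag is modeled.

```python
MASK32 = 0xFFFFFFFF

def ext_l(word):
    """Model EXT.L: sign-extend a 16-bit value to 32 bits."""
    word &= 0xFFFF
    return word | 0xFFFF0000 if word & 0x8000 else word

def add_l(src, dst):
    """Model ADD.L: return (32-bit result, carry flag)."""
    total = (src & MASK32) + (dst & MASK32)
    return total & MASK32, total > MASK32

def add16u(left, right):
    """Unsigned: both 16-bit parameters are zero-extended before adding."""
    return add_l(right & 0xFFFF, left & 0xFFFF)

def add16s(left, right):
    """Signed: both 16-bit parameters are sign-extended before adding."""
    return add_l(ext_l(right), ext_l(left))

r1, c1 = add16u(0x1234, 0xCDEF)      # first call in MAIN
r2, c2 = add16s(r1, 0x8765)          # low word of the first result is the left operand
r3, c3 = add_l(ext_l(0xA5A5), r2)    # ADD16SI adds into the local variable
print(f"{r1:08X} {r2:08X} {r3:08X}")  # prints "0000E023 FFFF6788 FFFF0D2D"
print(c1, c2, c3)                     # prints "False True True"
```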

Now let's look at it with a split-stack frameless discipline, comparing it with both the above in a new browser window and the split-stack stack frame version for the 68000, and with the split-stack version for the 6809. Assemble this one, as well, and step through it, watching memory and registers.

	OPT LIST,SYMTAB	; Options we want for the stand-alone assembler.
	MACHINE MC68000	; because there are a lot the assembler can do.
	OPT DEBUG	; We want labels for debugging.
	OUTPUT
***********************************************************************
*
* 16-bit addition as example of split-stack, frameless discipline on 68000
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	4	; 4 bytes in the CPU's natural integer
HLFNAT	EQU	2	; half natural integer
*
*
	EVEN
LB_ADDR	EQU	*
ENTRY	BRA.W	START
	NOP		; A little buffer zone.
	NOP
A4SAVE	DS.L	1	; a place to keep A4 to A7 so we can return clean
A5SAVE	DS.L	1	; using it as pseudo-DP
A6SAVE	DS.L	1	; 
A7SAVE	DS.L	1	; SP
	DS.L	2	; gap
HPPTR	DS.L	1	; heap pointer (not yet managed)
HPALL	DS.L	1	; heap allocation pointer
	DS.L	2	; gap
FINAL	DS.L	1	; statically allocated variable (MAIN stores the final result here)
GAP1	DS.L	51	; gap, make it an even 256 bytes.
*
	DS.L	2	; a little bumper space
SSTKLIM	DS.L	16*2	; 16 levels of call, with room for frame pointers
* 			; 68000 is pre-dec (pre-store-decrement) push
SSTKBAS	DS.L	2	; for canary return
SSTKBMP	DS.L	2	; bumper
PSTKLIM	DS.L	16*4	; roughly 16 levels of call
PSTKBAS	DS.L	2	; bumper
HBASE	DS.L	$1000	; heap space (not yet managing it)
HLIM	DS.L	2	; bumper
*
*
	EVEN
INISTKS	MOVE.L	(A7)+,A0	; get the return address from the other (BIOS) stack
	LEA	LB_ADDR(PC),A3
	MOVEM.L	A4-A7,A4SAVE-LB_ADDR(A3)	; Store away what the BIOS gives us.
	MOVE.L	A3,A5	; set up our local base (pseudo-DP)
	LEA	SSTKBMP-LB_ADDR(A5),A7	; set up our return stack
	LEA	PSTKBAS-LB_ADDR(A5),A6	; set up our parameter stack
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	PEA	STKUNDR(PC)	; fake return to stack underflow handler
	LEA	HBASE-LB_ADDR(A5),A4	; as if we actually had a heap
	MOVE.L	A4,HPPTR-LB_ADDR(A5)
	MOVE.L	A4,HPALL-LB_ADDR(A5)
	JMP	(A0)		; return via A0
*

***
* Return stack when functions are called by MAIN
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [RETADR1 ] <= RSP
*
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*

* Signed 16 bit add with 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right in 16-bit cells
* output parameter:
*   17-bit sum in 32-bit cell
ADD16S	MOVEM.W	(A6)+,D0/D1	; D0 lowest, but 16-bit sign extends!
*	EXT.L	D0		; right
*	EXT.L	D1		; left
	ADD.L	D0,D1		; add right to left
	MOVE.L	D1,-(A6)
	RTS		; return, *** all flags valid!! ***
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 16-bit cells
* output parameter:
*   17-bit sum in 32-bit cell
ADD16U	CLR.L	D0
	CLR.L	D1
*	MOVEM.W	(A6)+,D0/D1	; D0 lowest, but 16-bit sign extends!
	MOVE.W	(A6)+,D0	; right
	MOVE.W	(A6)+,D1	; left
	ADD.L	D0,D1		; add right to left
	MOVE.L	D1,-(A6)
	RTS		; return, *** all flags valid!! ***
*
* Etc.
*

***
* Parameter stack when called by MAIN
* with two parameters, 32-bit pointer and 16-bit addend
* [32:VAR1_1--]
* [32:VAR1_2--] <= PARAM2_1 
* [32:PARAM2_1]
* [16:PARAM2_2] <= PSP

* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to caller's 2nd 32-bit internal variable.
* input parameter:
*   32-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	MOVE.W	(A6)+,D1		; addend
	EXT.L	D1
	MOVE.L	(A6)+,A0		; get caller's internal variable pointer
	ADD.L	D1,(A0)	; add to caller's 2nd variable
	RTS		; return, *** all flags valid!! ***
*
*
***
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	CLR.L	-(A6)		; allocate and initialize
	CLR.L	-(A6)		; allocate and initialize
	MOVE.W	#$1234,-(A6)
	MOVE.W	#$CDEF,-(A6)
	BSR.W	ADD16U	; result on parameter stack should be $E023
	LEA	HLFNAT(A6),A6	; adjust to 16 bit, could be optimized out
	MOVE.W	#$8765,-(A6)
	BSR.W	ADD16S	; result on parameter stack should be $FFFF6788 (and carry set)
	MOVE.L	(A6),NATWID(A6)	; save in local
	LEA	NATWID(A6),A0
	MOVE.L	A0,(A6)		; save the pointer.
	MOVE.W	#$A5A5,-(A6)	; push the addend.
	BSR.W	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	MOVE.L	(A6),FINAL-LB_ADDR(A5)	; store the result
	LEA	2*NATWID(A6),A6	; drop the local variables
	RTS
*
***
* Stack at START:
* (what BIOS/OS gave us) <= RSP (A7)
***
* (who knows?) <= PSP (A6)
***
*
***
* Return stack will be just the return addresses:
* [RETADRNN  ]
*
* Return stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= RSP
*
* Parameter stack after initialization, mark:
* [<unknown>] <= PSP
*
START	BSR.W	INISTKS
*
	NOP
*
	BSR.W	MAIN
*
	NOP
*
DONE	NOP
	NOP		; landing pad
ERROR	NOP
STKUNDR	NOP
	MOVE.L	(A7)+,A4
	MOVEM.L	A4SAVE-LB_ADDR(A5),A4-A7	; restore the monitor's A4-A7
	NOP
	NOP		; landing pad
	NOP
	NOP
* One way to return to the OS or other calling program
	CLR.W	-(A7)	; there should be enough room on the caller's stack
	TRAP	#1	;	quick exit
*

Man, I'm worn out. On the other hand, I think this detour will help me focus as we progress through the I/O examples, starting with binary numeric output.

(It definitely helped me figure out some daydreams about extending the 6801, if you're interested.)




Wednesday, December 4, 2024

ALPP 02-33 -- Ascending the Right Island -- Frameless Examples (Single- & Split-stack): 6809

Yet another couple of useful bits, from the bottom of the pool.

Ascending the Right Island --
 Frameless Examples (Single- & Split-stack):
6809


 

Now that we have worked through both the single-stack and split-stack frameless examples for the 6801, we can finally get back to the code that started this detour (6809 version) and strip out the code for maintaining the stack frames. 

On higher-level architectures like the 6809, the stack frame maintenance code can be so non-intrusive that it can be easy to fail to notice it. 

But it can still get in the way. So I'm going ahead and showing the code without it here, in single-stack no-frame and split-stack frameless discipline. 

Frameless does mean we have to keep track of what's on the stack(s).

And there's really not much more left to talk about, although we want to remember that, because we are making specific use of the direct page, the entry address is $2000 instead of $80.

First the single-stack version. You'll want to compare with the single-stack stack frame version for the 6809 to get a better feel for what is happening with stack frames, as well as with the single-stack no frame version for the 6801 to see how the 6809's addressing modes make things easier.
* 16-bit addition as example of single-stack no frame discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$2000	; MDOS says this is a good place for usr stuff.
*	SETDP	$20	; for some other assemblers
	SETDP	$2000	; for EXORsim
*
ENTRY	LBRA	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP
SSAVE	RMB	2	; a place to keep S so we can return clean
SSAVEX	EQU	6	; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE	RMB	2	; just for kicks, save U, too
USAVEX	EQU	SSAVEX+2
DPSAVE	RMB	2	; a place to keep DP so we can return clean
DPSAVEX	EQU	USAVEX+2
	RMB	4	; bumper
XWORK	RMB	2	; For saving an index register temporarily
XWORKX	EQU	DPSAVEX+6
HPPTR	RMB	2	; heap pointer (not yet managed)
HPPTRX	EQU	XWORKX+2
HPALL	RMB	2	; heap allocation pointer
HPALLX	EQU	HPPTRX+2
	RMB	4	; bumper
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	HPALLX+6
GAP1	RMB	2	; Mark the bottom of the gap
GAP1X	EQU	FINALX+4
*
LB_ADDR	EQU	ENTRY
*
*
	SETDP	0	; Not yet set up
	ORG	$2100	; Give the DP room.
	RMB	4	; a little bumper space
SSTKLIM	RMB	96	; roughly 16 levels of call
SSTKLIMX	EQU	$104
* 			; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS	RMB	4	; for canary return
SSTKBASX	EQU	SSTKLIMX+96
SSTKBMP	RMB	4	; a little bumper space
SSTKBMPX	EQU	SSTKBASX+4
*
HBASE	RMB	$1024		; Not using or managing heap yet.
HBASEX	EQU	SSTKBMPX+4
HLIM	RMB	4	; bumper
HLIMX	EQU	HBASEX+$1024
*
*
INISTK	TFR	DP,A
	CLRB
	TFR	D,X		; save old DP base for a moment
	LEAY	ENTRY,PCR	; Set up new DP base
	TFR	Y,D
	TFR	A,DP		; Now we can access DP variables correctly.
*	SETDP	$20	; some other assemblers
	SETDP	$2000	; EXORsim
	STX	<DPSAVE		; technically only need to save high byte
	STU	<USAVE
	PULS	X		; get return address
	STS	<SSAVE		; Save what the monitor gave us.
	LEAS	SSTKBMPX,Y	; Move to our own stack
	LEAY	STKUNDR,PCR	; fake return to stack underflow handler
	PSHS	Y		; 
	PSHS	Y		; one more fake return to handler
	CLRB			; A still has run-time DP
	ADDD	#HBASEX		; calculate EA
	TFR	D,Y		; as if we actually had a heap
	STY	<HPPTR
	STY	<HPALL
	JMP	,X	; return via X
*
***
* Stack after call when functions are called by MAIN
* with two parameters
* (#0 means no local variables)
* We will return result in D:X
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [--------]
* [--------] 
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit D:X D high, X low
ADD16S	LDX	#-1	; sign extend right
	TST	2,S	; sign bit, anyway
	BMI	ADD16SR
	LEAX	1,X	; 0
ADD16SR	PSHS	X	; push right extension (parameters 4 offset)
	LDX	#-1	; negative
	LDD	6,S	; left
	BMI	ADD16SL
	LEAX	1,X	; 0
ADD16SL	PSHS	X	; push left extension (parameters 6 offset)
	ADDD	6,S	; add right
	TFR	D,X	; save low
	PULS	D	; get left sign extension (parameters 4 offset)
	ADCB	1,S	; carry is still safe
	ADCA	,S	; high word complete
	LEAS	2,S	; drop temporary
	RTS		; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit D:X D high
ADD16U	LDD	4,S	; left
	ADDD	2,S	; add right
	TFR	D,X	; save low
	LDD	#0	; extend
	ADCB	#0	; extend Carry unsigned (could ROL in)
	RTS		; C, N valid, Z not valid
*
* Etc.
*
***
* Stack at entry when called by MAIN
* (#0 means no local variables)
* No result is returned; the caller's variable is updated in place.
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [VAR1_1--]
* [VAR1_2--] <= PARAM2_1
* [PARAM2_1] (pointer to VAR1_2)
* [PARAM2_2]
* [RETADR1 ] 
*
* To show how to access caller's local variables through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to caller's 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDX	#-1	; sign extend 1st parameter
	TST	2,S
	BMI	ADD16SIP
	LEAX	1,X
ADD16SIP	PSHS	X	; parameters now 4 offset
	LDX	6,S	; pointer -- LDD [6,S] would get the high half
	LDD	2,X	; caller's 2nd variable, low
	ADDD	4,S	; 1st parameter
	STD	2,X	; update low half
	LDD	,X	; caller's 2nd variable, high
	ADCB	1,S	; sign extension
	ADCA	,S	; high byte 
	STD	,X	; update
	LEAS	2,S	; drop temporary
	RTS		; C, N valid, Z not valid
*
*
***
* Stack after allocating local variables
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] <= SP
*
MAIN	LDD	#0	; allocate and initialize
	TFR	D,X
	PSHS	D,X
	PSHS	D,X
*
	LDX	#$1234
	LDD	#$CDEF
	PSHS	D,X
	LBSR	ADD16U	; result in D:X should be $E023
	STX	2,S
	LDD	#$8765
	STD	0,S
	LBSR	ADD16S	; result in D:X should be $FFFF6788 (and carry set)
	STX	6,S	; result in 2nd local variable
	STD	4,S
	LEAX	4,S	; calculate address of 2nd variable to pass in
	STX	2,S
	LDD	#$A5A5
	STD	,S	
	LBSR	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	LDD	4,S
	STD	<FINAL
	LDD	6,S
	STD	<FINAL+2
	LEAS	12,S	; drop both the used parameters and the local variables together
	RTS		; C, N still valid, Z still not
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= SP
***
*
START	NOP
	LBSR	INISTK
	NOP
*
*
	LBSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	<SSAVE	; restore the monitor stack pointer
	LDU	<USAVE	; restore U
	LDD	<DPSAVE	; restore the monitor DP last
	TFR	A,DP
	SETDP	0	; For lack of a better way to set it.
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	JMP	[$FFFE]	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

Again, not much to say about the split-stack code, other than that you'll want to compare it with the split-stack stack frame version for the 6809 and the split-stack stack frame version for the 6801, for the same reasons as mentioned above, to get a better feel for the differences.
* 16-bit addition as example of split-stack frame-free discipline on 6809
* using the direct page,
* with test code
* Joel Matthew Rees, October 2024
*
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$2000	; MDOS says this is a good place for usr stuff.
*	SETDP	$20	; for lwasm and some other assemblers
	SETDP	$2000	; for EXORsim and some other assemblers
*
ENTRY	LBRA	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP
SSAVE	RMB	2	; a place to keep S so we can return clean
SSAVEX	EQU	6	; manufacture offsets for assemblers that can't do SSAVE-ENTRY
USAVE	RMB	2	; just for kicks, save U, too
USAVEX	EQU	SSAVEX+2
DPSAVE	RMB	2	; a place to keep DP so we can return clean
DPSAVEX	EQU	USAVEX+2
	RMB	4	; bumper
XWORK	RMB	2	; For saving an index register temporarily
XWORKX	EQU	DPSAVEX+6
HPPTR	RMB	2	; heap pointer (not yet managed)
HPPTRX	EQU	XWORKX+2
HPALL	RMB	2	; heap allocation pointer
HPALLX	EQU	HPPTRX+2
	RMB	4	; bumper
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	HPALLX+6
GAP1	RMB	2	; Mark the bottom of the gap
GAP1X	EQU	FINALX+4
*
LB_ADDR	EQU	ENTRY
*
*
	SETDP	0	; Not yet set up
	ORG	$2100	; Give the DP room.
	RMB	4	; a little bumper space
SSTKLIM	RMB	32	; 16 levels of call
SSTKLIMX	EQU	$104	; Skip over the DP page.
* 			; 6809 is pre-dec (pre-store-decrement) push
SSTKBAS	RMB	4	; for canary return
SSTKBASX	EQU	SSTKLIMX+32
SSTKBMP	RMB	4	; a little bumper space
SSTKBMPX	EQU	SSTKBASX+4
PSTKLIM	RMB	64	; about 16 levels of call at two parameters per call
PSTKLIMX	EQU	SSTKBMPX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBASX	EQU	PSTKLIMX+64
*
HBASE	RMB	$1024		; Not using or managing heap yet.
HBASEX	EQU	PSTKBASX+4
HLIM	RMB	4	; bumper
HLIMX	EQU	HBASEX+$1024
*
*
* Calculate DP because we don't have DP relative in index postbyte:
INISTKS	TFR	DP,A
	CLRB
	TFR	D,X		; save old DP base for a moment
	LEAY	ENTRY,PCR	; Set up new DP base
	TFR	Y,D
	TFR	A,DP		; Now we can access DP variables correctly.
*	SETDP	$20	; some other assemblers
	SETDP	$2000	; EXORsim
	STX	<DPSAVE		; technically only need to save high byte
	STU	<USAVE
	PULS	X		; get return address
	STS	<SSAVE		; Save what the monitor gave us.
	LEAS	SSTKBMPX,Y	; Move to our own return stack
	LEAU	PSTKBASX,Y	; and our own parameter stack
	LEAY	STKUNDR,PCR	; fake return to stack underflow handler
	PSHS	Y
	PSHS	Y		; one more fake return to stack underflow handler
	CLRB			; A still has run-time DP
	ADDD	#HBASEX		; calculate EA
	TFR	D,Y		; as if we actually had a heap
	STY	<HPPTR
	STY	<HPALL
	JMP	,X	; return via X
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ]
* [RETADR1 ]
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	LDX	#-1	; sign extend right
	TST	,U	; sign bit, anyway
	BMI	ADD16SR
	LEAX	1,X	; 0
ADD16SR	PSHU	X	; push right extension (parameters 2 offset)
	LDX	#-1	; negative
	LDD	4,U	; left
	BMI	ADD16SL
	LEAX	1,X	; 0
ADD16SL	PSHU	X	; push left extension (parameters 4 offset)
	ADDD	4,U	; add right
	STD	6,U	; save low
	PULU	D	; get left sign extension (parameters 2 offset)
	ADCB	1,U	; carry is still safe
	ADCA	,U++	; high word complete, tricky postinc (parameters 0 offset)
	STD	,U
	RTS		; C, N valid, Z not valid
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit
* output parameter:
*   17-bit sum in 32-bit 
ADD16U	LDD	2,U	; left
	ADDD	,U	; add right
	STD	2,U	; save low
	LDD	#0	; extend
	ROLB		; extend Carry unsigned (could ADC #0)
	STD	,U
	RTS		; C, N valid, Z not valid
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* [32:VAR1_1--]
* [32:VAR1_2--] <= PARAM2_1
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDD	#-1	; sign extend addend parameter
	TST	,U
	BMI	ADD16SIP
	LDD	#0
ADD16SIP	PSHU	D	; save sign extension (parameters 2 offset)
	LDX	4,U	; get pointer to variable
	LDD	2,X	; caller's 2nd variable, low
	ADDD	2,U	; addend parameter
	STD	2,X	; update low half
	LDD	,X	; caller's 2nd variable, high
	ADCB	1,U	; sign extension low byte
	ADCA	,U	; high byte
	STD	,X	; store result
	LEAU	6,U	; drop temporary and parameters -- no return parameter
	RTS		; C, N valid, Z not valid
*
*
***
* Return stack on entry:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	LDD	#0	; allocate and initialize
	TFR	D,X
	PSHU	D,X
	PSHU	D,X
	LDX	#$1234
	LDD	#$CDEF
	PSHU	D,X	; 8 bytes local, 4 bytes parameter, 12 bytes offset
	LBSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LEAU	2,U	; drop high part (could be optimized out).
	LDD	#$8765
	PSHU	D
	LBSR	ADD16S	; result on parameter stack should be $FFFF6788 (and carry set)
	PULU	D,X	; 4 bytes of used parameters removed from stack (local variables on top)
	STX	2,U	; low half, store in local variable
	STD	,U	; high half
	LEAX	,U	; point to 2nd variable
	LDD	#$A5A5
	PSHU	D,X	; X pushed first
	LBSR	ADD16SI		; result in 2nd variable should be FFFF0D2D (Carry set)
	LDD	2,U
	STD	<FINAL+2
	LDD	,U
	STD	<FINAL
	LEAU	8,U
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= RSP (S)
***
* (who knows?) <= PSP (U)
***
*
***
* Return stack will be just the return addresses:
* [RETADRNN  ]
*
* Return stack after initialization:
* [STKUNDR ]
* [STKUNDR ]SSTKBAS <= RSP
*
*
* Parameter stack after initialization, mark:
* [<unknown] <= PSP
*
START	LBSR	INISTKS
*
	LBSR	MAIN
*
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	<SSAVE	; restore the monitor stack pointer
	LDU	<USAVE	; restore U
	LDD	<DPSAVE	; restore the monitor DP last
	TFR	A,DP
	SETDP	0	; For lack of a better way to set it.
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	JMP	[$FFFE]	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

As always, I have stepped through the code and made sure it does what I say it does. 
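
For readers who want to see the sign-extension trick in ADD16S without the stack juggling, here is a Python sketch of the same arithmetic: a 16-bit ADDD on the low halves, then ADCB and ADCA on the two sign-extension words so the carry ripples into the high half. The function names are invented for this sketch, and only the carry flag is modeled.

```python
def sign_ext_word(value16):
    """The LDX #-1 / BMI / LEAX 1,X idiom: $FFFF if negative, else $0000."""
    return 0xFFFF if value16 & 0x8000 else 0x0000

def add_with_carry8(a, b, carry):
    """Model ADCA/ADCB: 8-bit add with carry in; returns (result, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry
    return total & 0xFF, total >> 8

def add16s_6809(left, right):
    """Return (high word, low word, carry), as ADD16S leaves them in D, X and C."""
    low = (left & 0xFFFF) + (right & 0xFFFF)                     # ADDD
    carry = low >> 16
    low &= 0xFFFF
    lext, rext = sign_ext_word(left), sign_ext_word(right)
    b, carry = add_with_carry8(lext & 0xFF, rext & 0xFF, carry)  # ADCB
    a, carry = add_with_carry8(lext >> 8, rext >> 8, carry)      # ADCA
    return (a << 8) | b, low, carry

high, low, carry = add16s_6809(0xE023, 0x8765)   # second call in MAIN
print(f"{high:04X}{low:04X} carry={carry}")      # prints "FFFF6788 carry=1"
```

The unsigned version is the same thing with both extension words fixed at zero, which is why ADD16U can get away with a single ADCB #0 or ROLB.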

If reading through it and comparing it with the other versions brings up questions that stepping through the code doesn't answer, go ahead and leave me a comment.

From here, you can either go ahead to digging into outputting binary numbers, or (when I get it ready) you can look at one more set of examples for frameless discipline, on the 68000.

--

Ah, more squirrels to chase. I mean, more daydreams.

With the stack split up, we might be able to see how a simple hysteretic spill-fill cache could significantly optimize calls and returns.

Calls and returns cost, in addition to the code and cycles to load the new PC, cycles to save and restore the old. With the combined stack, they also tend to incur code and cycle costs in moving parameters into place and saving and restoring registers.

With a cache attached to the return stack pointer, saves and restores can happen in parallel with fetching, decoding, and executing instructions, effectively hiding the basic call/return overhead. 

Here's what I mean by hysteretic spill/fill:

Say the cache has sixteen entries (32 bytes).

When pushing a new return address crosses the boundary between the 12th and 13th entry -- 3/4 full -- the cache controller starts pushing saved addresses off the other end into main RAM, to make more room. It watches the bus so that it can do so when the bus is not busy with instruction fetches or data or DMA accesses, unless the cache completely fills, in which case it gets the bus at higher priority than instruction fetches. 

It will keep pushing addresses out until the cache is half-empty again, or until a return cancels the spill. 

Returns will work in reverse. As long as it is nested more than four calls deep, it will try to keep at least four return addresses in the cache, scheduling reads to bring addresses back in from RAM when the boundary between the 5th and 4th entries is crossed.

It uses a cache base and limit register to maintain position in the stack address space, and a stack base and limit register to tell the controller when to come to a hard stop, and when to initiate stack overflow or underflow interrupt/exception processing.
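
As a rough sketch of the thresholds just described, here is a toy Python model of the return-address cache. The class name, method names, and the choice to stop a background fill at half full are my own assumptions for illustration; bus arbitration is reduced to a single "idle bus cycle" call, and the base/limit registers are not modeled at all.

```python
from collections import deque


class ReturnStackCache:
    """Toy model of a hysteretic spill/fill cache (all names hypothetical)."""

    def __init__(self, size=16, spill_at=12, spill_to=8, fill_at=4, fill_to=8):
        self.size = size
        self.spill_at = spill_at  # start spilling when occupancy exceeds this
        self.spill_to = spill_to  # stop spilling at this occupancy
        self.fill_at = fill_at    # start filling when occupancy drops to this
        self.fill_to = fill_to    # stop filling here (assumed, not in the text)
        self.cache = deque()      # left = oldest entry, right = top of stack
        self.ram = []             # spilled entries, oldest first
        self.mode = "idle"        # "idle", "spill", or "fill"

    def call(self, return_address):
        if len(self.cache) == self.size:
            self._spill_one()     # completely full: forced, high-priority spill
        self.cache.append(return_address)
        if len(self.cache) > self.spill_at:
            self.mode = "spill"
        elif self.mode == "fill":
            self.mode = "idle"    # a call cancels a pending fill

    def ret(self):
        if not self.cache:
            if not self.ram:
                raise IndexError("return stack underflow")
            self._fill_one()      # completely empty: forced fill
        address = self.cache.pop()
        if self.ram and len(self.cache) <= self.fill_at:
            self.mode = "fill"
        elif self.mode == "spill":
            self.mode = "idle"    # a return cancels a pending spill
        return address

    def idle_bus_cycle(self):
        """Use one otherwise idle bus cycle for background traffic."""
        if self.mode == "spill":
            self._spill_one()
            if len(self.cache) <= self.spill_to:
                self.mode = "idle"
        elif self.mode == "fill":
            if self.ram:
                self._fill_one()
            if not self.ram or len(self.cache) >= self.fill_to:
                self.mode = "idle"

    def _spill_one(self):
        self.ram.append(self.cache.popleft())

    def _fill_one(self):
        self.cache.appendleft(self.ram.pop())


rs = ReturnStackCache()
for depth in range(13):                      # 13th call crosses the 12/13 boundary
    rs.call(0x3000 + 2 * depth)
print(rs.mode, len(rs.cache), len(rs.ram))   # spill 13 0
while rs.mode == "spill":
    rs.idle_bus_cycle()
print(rs.mode, len(rs.cache), len(rs.ram))   # idle 8 5
```

The gap between the start and stop thresholds is the point: spilling does not start until the cache is three-quarters full and does not stop until it is back down to half, so a run of short calls and returns near one boundary does not generate bus traffic on every call.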

I assume that you have noticed that splitting the stack helps relieve the costs of moving parameters into place, and can even be of some relief relative to saving and restoring registers.

The parameter stack is not as regularly structured as the return address stack, but it could profitably be cached in a similar manner, with a larger cache, either double or quadruple size.

Both of these caches should be paired, to enable fast context switching. Or maybe done in sets of four, but I'm not sure the 6809 would benefit from four of each. One for the current process and one to be writing back to RAM after a process switch should be enough.

And I guess, since I've commented after the 6801 examples about how the direct page should be a bank of memory to use as pseudo-registers, I should mention the concept of a cache for the direct page here. This would also be paired, with the switch activated when the DP is set. There would need to be several different strategies for filling the new cache and writing back the dirty entries from the old cache, plus a way of setting priority for different regions of the direct page.

Caching the direct page would conflict with using it for I/O devices, so I'm thinking the 6809 wants a second direct page (specifiable in the index post-byte) just for I/O. 

Heh. Daydreams, indeed. This is just an 8/16-bit processor with a 16-bit address space. Too greedy. Unless we had a true 16/32-bit descendant of the 6809.

Ah. Sorry for the further distractions. 

The 68000 frameless examples

Or outputting binary numbers

 

(Title Page/Index)


 

 

 

 

Friday, November 29, 2024

ALPP 02-32 -- More Ascending the Right Island -- Split-stack No Frame Example: 6801

Still digging into that treasure from the bottom of the pool.

  Ascending the Right Island --
Split-stack No Frame Example:
6801

(Title Page/Index)

 

About the only thing I want to point out here is that, with the support for 16-bit operations on the 6801, it becomes easier to see how splitting the return addresses off onto their own stack allows a more seamless approach to passing parameters than the single-stack no-frame example we just finished. 

Hopefully the code is mostly self-explanatory by now. (We've been looking at the meat of it for so long ...)

Compare with both the single-stack example for the 6801 and the split-stack example for the 6800 to help see what is and is not going on.

As always, read the code and step through it:

* 16-bit addition as example of split-stack frame-free discipline on 6801
* using the direct page,
* with test code
* Joel Matthew Rees, October, November 2024
*
	OPT	6801
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for usr stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
PSP	RMB	2	; parameter stack pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
	RMB	4	; Put a bumper after the process static variables
SSTKLIM	RMB	64	; 16 levels of call
SSTKLMX	EQU	FINALX+8
SSTKBAS	RMB	4	; for canary return
SSTKBSX	EQU	SSTKLMX+64
SSTKFAK	RMB	2	; fake frame pointer, self-link
SSTKFAX	EQU	SSTKBSX+8	; 6801 is post-dec (post-store-decrement) push
SSTKBMP	RMB	4	; a little bumper space
SSTKBMX	EQU	SSTKFAX+2	; But we are going to init S through X
PSTKLIM	RMB	64	; 16 levels of call at two parameters per call
PSTKLMX	EQU	SSTKBMX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBSX	EQU	PSTKLMX+64
PSTKBMP	RMB	4	; a little bumper space
PSTKBMX	EQU	PSTKBSX+4
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	PSTKBMX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
*
INISTKS	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE
	LDD	LB_BASE		; bootstrap own return stack
	ADDD	#SSTKBSX
	STD	XWORK
	LDX	XWORK		; initial return stack pointer
*
	LDD	#SSTKNDR
	STD	0,X	; in the cell beyond empty stack pointer
	STD	2,X	; and the next cell, for good measure
	PULA		; pop real return address
	PULB
	STS	SSAVE	; save stack pointer from monitor ROM
	TXS		; move to our own stack (let TXS convert it)
	PSHB		; put return address on own stack
	PSHA		; stack now ready for interrupts
*
	LDD	LB_BASE		; bootstrap parameter stack
	ADDD	#PSTKBSX
	STD	PSP		; parameter stack now ready
*
	LDAA	LB_BASE		; set up heap as if we actually had one
	LDAB	LB_BASE+1
	ADDD	#HBASEX
	STD	HPPTR
	STD	HPALL		; as if the heap were functional
	LDD	#CDBASE
	SUBD	#4
	STD	HPLIM		; store both bytes of the limit
	RTS
*
*
***
* General structure of the stacks, 
*
* return stack is just the return address:
* [PRETADR   ]
* [RETADR    ] <= SP
*
* order of elements on the parameter stack,
* when they are present:
* [PARAMETERS ]
* [VARIABLES  ]
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES  ]
* [TEMPORARIES]
* [PARAMETERS ] <= PSP
*
* Result is returned on parameter stack
*
***
*
* Utility routines
*
PPOPD	LDX	PSP
	LDD	0,X
	INX
	INX
	STX	PSP
	RTS
*
* This saves bytes:
ALCL2	CLRA
	CLRB	; fall through
* 
PPSHD	LDX	PSP
ALCLI2	DEX
	DEX
	STX	PSP
	STD	0,X
	RTS
*
* Compromise between speed and reusability
* Returns with PSP in X.
* Enter here to load PSP and initialize to 0
* 8 bytes
ALCL8	CLRA	
	CLRB
* Enter here with initial value in A:B
ALCLD8	LDX	PSP
* Enter here with PSP loaded and initial value in D
ALCLI8	DEX
	DEX
	STD	0,X
ALCLI6	DEX
	DEX
	STD	0,X
ALCLI4	DEX
	DEX
	STD	0,X
	BRA	ALCLI2
*
* six bytes
ALCL6	CLRA
	CLRB
ALCLD6	LDX	PSP
	BRA	ALCLI6
*
* four bytes
ALCL4	CLRA
	CLRB
ALCLD4	LDX	PSP
	BRA	ALCLI4
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry
* [SSTKNDR ]
* [SSTKNDR ]SSTKBAS
* [RETADR0 ]
* [RETADR1 ] <= SP
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after entry (before temporary allocation)
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
****
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	LDD	#(-1)	; default negative
	JSR	ALCLD4	; returns with PSP in X
	TST	6,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLR	2,X	; positive
	CLR	3,X
ADD16SR	TST	4,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLR	0,X	; positive
	CLR	1,X
ADD16SL	LDD	6,X	; left hand 
	ADDD	4,X	; right hand
	STD	6,X	; store low half
	LDD	2,X
	ADCB	1,X
	ADCA	0,X
	STD	4,X
*
	LDAB	#4	; shorter and faster than 4*INX, walks on B
	ABX
	STX	PSP	; drop the temporaries
	RTS
*
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 32-bit
* output parameter:
*   17-bit sum in 32-bit
ADD16U	LDX	PSP
	LDD	2,X	; left
	ADDD	0,X	; add right
	STD	2,X	; save low
	LDD	#0	; extend
	ROLB		; extend Carry unsigned (could ADCB #0)
	STD	0,X	; re-use right side to store high half
*
	RTS		; PSP unchanged
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with two 16-bit parameters (pointer and addend),
* after entry (before temporary allocation)
* [<unknown>  ]
* [32:VAR1_1  ]
* [32:VAR1_2  ]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDD	#(-1)	; make a temporary -1
	JSR	PPSHD	; (default to signed) returns with PSP in X, 2 bytes on stack
	TST	2,X	; test parameter high byte
	BMI	ADD16SP
	CLR	0,X	; zero extend
	CLR	1,X
ADD16SP	LDX	4,X	; pointer to caller's local
	LDD	2,X	; caller's 2nd variable, low
	LDX	PSP
	ADDD	2,X	; parameter
	LDX	4,X	; pointer
	STD	2,X	; update low half with result
	LDD	0,X	; 2nd variable, high half
	LDX	PSP
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	4,X	; pointer
	STD	0,X	; update high half
*
	LDX	PSP
	LDAB	#6	; drop sign temporary and two parameters
	ABX
	STX	PSP
	RTS
*
*
***
* Return stack on entry:
* [SSTKNDR ]
* [SSTKNDR ]SSTKBAS
* [RETADR0 ] <= SP
*
* Parameter stack after local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	JSR	ALCL8	; allocate and clear 8 bytes
*
	LDD	#$1234
	JSR	PPSHD
	LDD	#$CDEF
	JSR	PPSHD
	JSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LDX	PSP	; order is okay, low half where we want it (PSP returned in X anyway)
	LDD	#$8765	; reuse high half
	STD	0,X
	JSR	ADD16S	; result on parameter stack should be $FFFF6788
	LDX	PSP	; (PSP returned in X anyway)
	LDD	2,X	; result low half
	STD	6,X	; to 2nd local variable low half
	LDD	0,X	; result high half
	STD	4,X	; to 2nd local variable high half
	LDD	PSP	; address of 2nd local variable
	ADDD	#4
	STD	2,X	; pointer is 1st arg
	LDD	#$A5A5
	STD	0,X	; addend is 2nd arg
	JSR	ADD16SI	; result in 2nd variable should be FFFF0D2D (Carry set)
	LDX	PSP	; unnecessary, ...
	LDD	2,X	; 2nd variable low half
	LDX	LB_BASE
	STD	FINALX+2,X
	LDX	PSP
	LDD	0,X
	LDX	LB_BASE
	STD	FINALX,X
*
	LDD	PSP
	ADDD	#8	; deallocate the locals
	STD	PSP
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
*
***
* Return stack will be just the return address:
* [RETADRNN  ]
*
* Return stack after initialization:
* [SSTKNDR ]
* [SSTKNDR ]SSTKBAS <= SP
*
*
* Parameter stack after initialization, mark:
* [<unknown>]PSTKBAS <= PSP
*
START	JSR	INISTKS
*
	JSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
SSTKNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP
	NOP		; another landing pad to set breakpoint at
	NOP
	LDX	$FFFE
	JMP	0,X	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

Again, I have tested the code and it produces the correct results without stack frames.
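
If you want to double-check the expected values in the comments without firing up the simulator, the arithmetic can be modeled in a few lines of Python. This is only a sketch of the math, not of the stack handling; the function names are made up for this illustration.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def add16u(left, right):
    """Unsigned 16-bit add, result widened to 32 bits."""
    return (left & 0xFFFF) + (right & 0xFFFF)


def add16s(left, right):
    """Signed 16-bit add, result as a 32-bit two's-complement pattern."""
    return (to_signed16(left) + to_signed16(right)) & 0xFFFFFFFF


def add16si(var32, addend):
    """Add a sign-extended 16-bit addend into a 32-bit variable."""
    return (var32 + to_signed16(addend)) & 0xFFFFFFFF


step1 = add16u(0x1234, 0xCDEF)            # what ADD16U leaves on the stack
step2 = add16s(step1 & 0xFFFF, 0x8765)    # low half reused as left operand
step3 = add16si(step2, 0xA5A5)            # added into the 2nd local variable
print(f"{step1:08X} {step2:08X} {step3:08X}")   # 0000E023 FFFF6788 FFFF0D2D
```

The three values match the comments on the calls to ADD16U, ADD16S, and ADD16SI in MAIN.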

If you think you've seen enough for now, go ahead and move ahead with getting numeric output in binary. Otherwise, I'll be cleaning the stack frame support code out of the 6809 examples next. 

[JMR202412050835 daydream addendum:]

Before we leave this topic behind, if you've been following what I've been talking about on this detour, I could mention a bit more of my daydreams.

I think I've mentioned it in passing, but I have often thought it was unfortunate that Motorola didn't push the opcodes around a bit and keep direct-mode addressing for the unary operator -- INC, ROL, TST, etc. -- instructions. (They did so on the 6809.) 

In fact, I would have preferred that they had kept direct page and left out extended mode for unary instructions. (They did exactly that for the 6805.)

"Unary" operators on the 68XX CPUs are mostly read-modify-write instructions that would benefit greatly, in terms of timing and object code efficiency, from having short-addressed versions, and they would also help make the direct page area even more of a pseudo-register memory file.

But we didn't really understand principles of locality in coding back then, so we can, shifting ourselves back to the context of the 1960s and '70s, understand why they saw it as a reasonable tradeoff, and why they wanted to leave as many op-codes as possible available for "inherent" mode operators that didn't seem to fit the unary/binary operator partition they were using -- like Add B to A (ABA), et al.

If they had, or if, in producing the 6801 as an object-code compatible upgrade to the 6800, they had been willing to produce a mnemonic-level compatible object-code incompatible version of the 6801 with direct-page versions of the unary operators -- daydream warning! -- it should have been possible to shave at least two cycles off the timing, compared to the 6801's extended mode timing (6 cycles extended, vs. potentially 4 cycles direct-page), giving more meaning to the idea of pseudo-registers -- or making the direct page more of a static cache. 

And if the RAM were going to be built-in (as it pretty much always was in 6801 SOCs), it might even have been possible to shave off yet another cycle, bringing DP variables within a cycle of accumulator timing.

And ... well, the 6801 has 16-bit shifts of the double accumulator,  so why not have 16-bit shifts and increments/decrements for direct page variables? Yeah, maybe that's just being greedy.

And, then, here's yet another step out into alternate reality -- a couple of extra address lines (48-pin DIP packages?) for address space, and it would be possible to distinguish between accessing code, data, stack, and the direct page, helping expand the address range beyond the tight squeeze of 64K.

Erk. Lost in my daydreams again. No wonder it takes me so long to get things done.

Okay, moving on to the 6809 examples, or skipping ahead to getting numeric output in binary.

[JMR202412050835 daydream addendum end.]


(Title Page/Index)


 

 

 

 

Thursday, November 28, 2024

ALPP 02-31 -- More Looking in the Rear-view Mirror -- Single-stack No Frame Example: 6801

More treasure from the bottom of the pool.

  More Looking in the Rear View Mirror --
Single-stack No Frame Example:
6801

(Title Page/Index)

 

Not much to say that I haven't already said. We've seen frameless for the 6800, both the single-stack frameless discipline of one chapter back and the split-stack frameless discipline that we just finished. I'm not sure but what I should leave the 6801, 6809, and 68000 versions as exercises for the interested reader, but I'm a sucker for easy puzzles, so I'll post them anyway. There are plenty of things an interested reader can think of to try for him- or herself.

One thing to pay attention to as you go through is the fact that I have left the utility routines out. Doing them in-line is not that much more code than a JSR, and I didn't want to hide what's going on. That's how much of an improvement the 6801 is over the 6800.

The down side of doing it in-line (by hand) is that there are more opportunities for mistakes.

Go ahead and read the code and compare, and if you are not sure you understand what's going on, single-step through the code.
* 16-bit addition as example of single-stack no frame discipline on 6801,
* with test code
* Joel Matthew Rees, October, November 2024
*
	OPT	6801
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for usr stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
	RMB	4	; buffer
STKLIM	RMB	192	; roughly 16 to 20 levels of call
STKLIMX	EQU	FINALX+8
STKBAS	RMB	4	; for canary return
STKBASX	EQU	STKLIMX+192
STKFAK	RMB	2	; fake frame pointer, self-link
STKFAKX	EQU	STKBASX+4	; 6801 is post-dec (post-store-decrement) push
STKBMP	RMB	4	; a little bumper space
STKBMPX	EQU	STKFAKX+2	; But we are going to init S through X
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	STKBMPX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
INISTK	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE		; local space functional
	LDD	LB_BASE		; bootstrap own stack
	ADDD	#STKBASX
	STD	XWORK	; avoid using BIOS stack
	LDX	XWORK	; ready own stack pointer
*
	PULA		; pop real return address
	PULB
	STS	SSAVE	; save stack pointer from monitor ROM
	TXS		; move to our own stack (let TXS convert it)
	PSHB		; put return address on own stack
	PSHA		; stack now ready for interrupts, utility routines
*
	LDD	#STKUNDR
	STD	0,X	; in the cell beyond empty stack pointer
	STD	2,X	; and the next cell, for good measure
*
	LDD	LB_BASE	
	ADDD	#HBASEX
	STD	HPPTR		; as if we were ready to use heap
	STD	HPALL
	LDD	#CDBASE
	SUBD	#4
	STD	HPLIM
	RTS		; finally done, now can return
*
***
* Not generating a stack frame
*
* Cross-section of general stack structure in called routine:
* [{LOCVAR}] for calling routine
* [{TEMP}  ] for calling routine
* [PARAM   ] from calling routine
* [RETADR  ] to calling routine
* [LOCVAR  ] for called -- current -- routine
* [TEMP    ] for called -- current -- routine
* [(PARAM) ] to be passed to a further call
*
* Broader cross-section, showing nesting for routine 3, in-flight:
* [RETADR1 ] 
* [LOCVAR2 ]
* [TEMP2   ]
* [PARAM3  ]
* [RETADR2 ]
* [LOCVAR3 ]
* [TEMP3   ]
* [(PARAM4)] <= SP (return stack pointer (6800 S is byte below))
***
*
***
* Utility routines left out
*
* Let the caller do allocation after.
*
* Stack at entry, before allocation
* when functions are called by MAIN
* with two 32-bit parameters
* We will return result in D:X
* [STKUNDR ]
* [STKUNDR ]STKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2]
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] <= SP (return stack pointer (6800 S is byte below))
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit D:X D high, X low
* Does not alter the parameters.
ADD16S	TSX		; no local allocations
	LDAA	#(-1)	; prepare for sign extension
	TST	4,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLRA		; zero extend
ADD16SR	PSHA		; push left extension
	PSHA		; left sign cell below X now
	LDAA	#(-1)	; reload
	TST	2,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLRA		; zero extend
ADD16SL	PSHA		; push right extension
	PSHA
	TSX		; point to sign extensions (4 temporary bytes on stack)
	LDD	8,X	; left-hand low cell
	ADDD	6,X	; right-hand low cell
	STD	XWORK	; save low half of result
	LDD	2,X	; left-hand extension
	ADCB	1,X	; right-hand extension
	ADCA	0,X	; high half done
*
	INS		; fastest to just drop the temporaries
	INS
	INS
	INS
	LDX	XWORK	; get low half of result
	RTS		; result is in D:X
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit D:X D high
ADD16U	TSX		; no local allocations
	LDD	4,X	; left
	ADDD	2,X	; right
	STD	XWORK	; save low half
	LDD	#0
	ADCB	#0
*
	LDX	XWORK	; get low half of result
	RTS		; result is in D:X
*
* Etc.
*
***
*
* Stack at entry when functions are called by MAIN
* with two input parameters
* (no frame, no local variables)
* [STKUNDR ]
* [STKUNDR ]STKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] 
* [PARAM2_1] (pointer)
* [PARAM2_2] (addend)
* [RETADR1 ] <= SP (return stack pointer (6800 S is byte below))
*
* To show how to access caller's local through pointer
* instead of walking stack --
* Add 16-bit signed parameter
* to 32 bit caller's 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	TSX		; no local allocations up front
	LDAA	#(-1)
	TST	2,X	; high byte of parameter
	BMI	ADD16SIP
	CLRA
ADD16SIP	PSHA	; save the sign extension half (2 temporary bytes on stack)
	PSHA
	LDX	4,X	; get caller's pointer
	LDD	2,X	; caller's 2nd variable, low
	TSX
	ADDD	4,X	; parameter
	LDX	6,X	; caller's pointer
	STD	2,X	; save result low half away
	LDD	0,X	; caller's 2nd variable, high
	TSX
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	6,X	; caller's pointer
	STD	0,X	; save result high half away
*
	INS		; drop temporary 
	INS
	RTS		; no result to load
*
*
***
* Stack after local allocation
* [STKUNDR ]
* [STKUNDR ]STKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] <= SP
*
MAIN	LDX	#0
	PSHX		; four pushes is only one byte more than a call. 
	PSHX
	PSHX
	PSHX
*
	LDX	#$1234	; parameters
	PSHX
	LDX	#$CDEF
	PSHX
	JSR	ADD16U	; result in D:X should be $E023
	INS		; could reuse instead of dropping
	INS
	INS
	INS
	PSHX		; low half
	LDX	#$8765
	PSHX
	JSR	ADD16S	; result in D:X should be $FFFF6788
	STX	XWORK
	STD	DWORK
	INS		; could reuse instead of dropping
	INS
	INS
	INS
	TSX
	LDD	XWORK
	STD	2,X
	LDD	DWORK
	STD	0,X
*	LDAB	#0	; calculate pointer
*	ABX		; would use ABX here if there were an offset.
	PSHX
	LDX	#$A5A5
	PSHX
	JSR	ADD16SI		; result in 2nd variable should be FFFF0D2D
	INS		; drop parameters
	INS
	INS
	INS
	TSX
	LDD	2,X		; low half
	LDX	LB_BASE		; store it in FINAL, in process local space
	STD	FINALX+2,X
	TSX
	LDD	0,X		; high half
	LDX	LB_BASE
	STD	FINALX,X
*
	TSX
	LDAB	#8
	ABX
	TXS
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
* (who knows?) <= FP
***
* (who knows?) <= VBP
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]STKBAS <= SP
***
*
START	NOP
	JSR	INISTK
	NOP
*
	JSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	LDX	$FFFE	; alternatively, jmp through reset vector
	JMP	0,X
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

If you've seen enough, binary output is still waiting. (And it will still be waiting in a few more hours or days, really.) 

If not, split stack with no stack frames is also great on the 6801, even a bit better than what we saw here.

 -- 

Maybe this would be a good place to bring up (again?) the regrets I have that Motorola didn't include an SBX (subtract B from X) instruction in the 6801. It would have been useful in the stack allocation code, as you can see from where I used (and didn't use) ABX. It would also have been useful to have an add immediate to index (AIX) op-code, possibly 16-bit to do both allocation and deallocation, or signed 8-bit, or unsigned, paired with a subtract immediate from X (SIX?) instruction.

Yeah, more daydreams. Sorry. --
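
To make the wish a little more concrete, here is a small Python model of what the existing ABX does next to what the wished-for instructions would do. ABX is real 6801; `sbx` and `aix_signed` below are hypothetical, just my guess at reasonable semantics (unsigned B for SBX, a signed 8-bit immediate for AIX), and the stack address is an arbitrary example.

```python
def abx(x, b):
    """Existing 6801 ABX: add unsigned 8-bit B to 16-bit X."""
    return (x + (b & 0xFF)) & 0xFFFF


def sbx(x, b):
    """Hypothetical SBX: subtract unsigned 8-bit B from 16-bit X."""
    return (x - (b & 0xFF)) & 0xFFFF


def aix_signed(x, imm8):
    """Hypothetical AIX: add a signed 8-bit immediate to 16-bit X."""
    imm8 &= 0xFF
    offset = imm8 - 0x100 if imm8 & 0x80 else imm8
    return (x + offset) & 0xFFFF


sp = 0x20D0                       # arbitrary example stack address
allocated = sbx(sp, 8)            # allocate 8 bytes: the stack grows downward
assert allocated == 0x20C8
assert abx(allocated, 8) == sp    # ABX already covers deallocation
assert aix_signed(sp, -8 & 0xFF) == allocated
assert aix_signed(allocated, 8) == sp
```

Since the stacks grow downward, ABX only covers the deallocation half; allocation is the half that has to go through D or a string of DEX/DES instructions.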


 (Title Page/Index)

 


 

 

Wednesday, November 27, 2024

ALPP 02-30 -- Ascending the Right Island -- Split-stack No Frame Example: 6800

Leaving those rubber bricks at the side of the pool, let's keep going down for more treasure.

  Ascending the Right Island --
Split-stack No Frame Example:
6800

(Title Page/Index)

 

At this point, from working through the single-stack example for the 6800 without stack frames, you might be seeing the reasoning behind stack frames. It can be really difficult figuring out where your data is and where it should be heading without some frame of reference, and stack frames do provide a frame of reference when you're deep in the arcane definitions of some routine. 

But building the code to support the stack frames tends to consume time and energy that you'd rather devote to the actual problem at hand, unless your CPU provides high-level support for the frames. It tends to end up a mixed blessing at best, with net costs usually, in my opinion, outweighing benefits, even when your CPU  supports it.

Here on the 6800, we can see those costs most clearly by looking carefully at the code I present here, reading the source code in a text editor while stepping through it in the simulator, and comparing it with the split-stack stack frame version and the single-stack versions. 

Before you get to wondering why anyone wanted to use a stack frame in the first place, it's worth noting that stack frames' utility became especially apparent in very large procedures with complex logic. When your procedure extends to hundreds of lines of code (or more) with dozens of variables (or more), you use tools in the assembler to name your local variables by their offset from the frame base pointer, and it helps greatly to manage the complexity. 

And it helps in constructing compilers, especially in the initial "bootstrap stages" of development. The compiler may be able to manage constructing and tearing down the frames more easily than it could handle remembering changing offsets.

But.

The frames get in the way. 

Especially when return addresses are inside the stack frames, they get in the way.

All the benefits of stack frames can, in fact, be found in this simple example of split-stack frameless coding discipline. You might think it's just my opinion, but I'll explain further as we go.

I think the code explains itself, particularly when comparing it to the split-stack example with stack frames and the single-stack example without frames, that we just finished.

One thing that might be a point of interest: I had thought I would use an ADDDX (Add double accumulator to X) routine in MAIN, 

* Could use this in the single-stack no frames example, too.
LEADPX	LDX	PSP	; Add D to PSP and load to X
* Add A:B to X through XWORK.
* Returned in both X and A:B.
* But we won't use it, after all.
ADDDX	STX	XWORK
	ADDB	XWORK+1
	ADCA	XWORK	; add with carry from the low byte
	STAA	XWORK
	STAB	XWORK+1
	LDX	XWORK
	RTS	

to calculate the effective address of the variable that we are passing, but it worked out to be a wash. Took almost as much code to set it up as to just do it there in place.
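
For anyone who wants to see why a byte-wise 16-bit add on the 6800 has to take the carry from the low byte into the high byte (ADDB followed by ADCA), here is a small Python model of adding A:B to X. The function names are hypothetical, and the second function deliberately drops the carry to show what goes wrong.

```python
def add_d_to_x(a, b, x):
    """Add A:B to X one byte at a time, propagating the carry."""
    low = (b & 0xFF) + (x & 0xFF)               # like ADDB XWORK+1
    carry = low >> 8                            # carry out of the low byte
    high = (a & 0xFF) + (x >> 8) + carry        # like ADCA XWORK
    return ((high & 0xFF) << 8) | (low & 0xFF)


def add_d_to_x_no_carry(a, b, x):
    """Same, but ignoring the carry (like using ADDA for the high byte)."""
    low = (b & 0xFF) + (x & 0xFF)
    high = (a & 0xFF) + (x >> 8)
    return ((high & 0xFF) << 8) | (low & 0xFF)


assert add_d_to_x(0x00, 0x04, 0x20FE) == 0x2102            # correct
assert add_d_to_x_no_carry(0x00, 0x04, 0x20FE) == 0x2002   # off by $100
```

The two only differ when the low-byte add wraps past $FF, which is exactly the kind of bug that hides until a stack or buffer happens to straddle a page boundary.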

Read the code, step through it, compare to what we've worked through so far. Note in particular how we are passing the return values back here, and how it is different from the way we use when working with various kinds of stack frames, and even different from the method of the frameless single-stack discipline:

* 16-bit addition as example of split-stack frame-free discipline on 6800
* using the direct page,
* with test code
* Joel Matthew Rees, October, November 2024
*
	OPT	6800
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for usr stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily
DWORK	RMB	2	; For saving D temporarily
PSP	RMB	2	; parameter stack pointer
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
	RMB	4	; Put a bumper after the process static variables
SSTKLIM	RMB	64	; 16 levels of call
SSTKLMX	EQU	FINALX+8
SSTKBAS	RMB	6	; for canary return
SSTKBSX	EQU	SSTKLMX+64
SSTKFAK	RMB	2	; fake frame pointer, self-link
SSTKFAX	EQU	SSTKBSX+6	; 6801 is post-dec (post-store-decrement) push
SSTKBMP	RMB	4	; a little bumper space
SSTKBMX	EQU	SSTKFAX+2	; But we are going to init S through X
PSTKLIM	RMB	64	; 16 levels of call at two parameters per call
PSTKLMX	EQU	SSTKBMX+4
PSTKBAS	RMB	4	; bumper space -- parameter stack is pre-dec
PSTKBSX	EQU	PSTKLMX+64
PSTKBMP	RMB	4	; a little bumper space
PSTKBMX	EQU	PSTKBSX+4
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	PSTKBMX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
*
INISTKS	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE
	LDAA	LB_BASE		; bootstrap own return stack
	LDAB	LB_BASE+1
	LDX	#SSTKBSX	; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1
	ADCA	XWORK
	STAB	XWORK+1		; initial return stack pointer
	STAA	XWORK
*
	LDX	#SSTKNDR	; for fake return address
	STX	DWORK		; save it for a moment
	PULA		; pop real return address
	PULB
	LDX	XWORK	; ready own return stack pointer
	STS	SSAVE	; save stack pointer from monitor ROM
	TXS		; move to our own stack (let TXS convert it)
	PSHB		; put return address on own stack
	PSHA		; stack now ready for interrupts
*
	LDAA	DWORK	; error handler for fake return
	LDAB	DWORK+1
	STAA	0,X	; in the cell beyond empty stack pointer
	STAB	1,X	; prime the return stack with error handler
	STAA	2,X	; second fake return to error handler
	STAB	3,X
* 
	LDAA	LB_BASE		; bootstrap parameter stack
	LDAB	LB_BASE+1
	LDX	#PSTKBSX	; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1		; initial parameter stack pointer
	ADCA	XWORK
	STAA	PSP		; parameter stack now ready
	STAB	PSP+1
*
	LDAA	LB_BASE		; set up heap as if we actually had one
	LDAB	LB_BASE+1
	LDX	#HBASEX		; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1		; calculate EA
	ADCA	XWORK
	STAA	HPPTR
	STAB	HPPTR+1
	STAA	HPALL		; as if the heap were functional
	STAB	HPALL+1
	LDX	#CDBASE
	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	SUBB	#4
	SBCA	#0
	STAA	HPLIM
	STAB	HPLIM+1
	RTS
*
*
***
* General structure of the stacks, 
*
* return stack is only the return address
* (and maybe extremely ephemeral temporaries):
* [PRETADR   ]
* [RETADR    ]
*
* order of elements on the parameter stack,
* when they are present:
* [PARAMETERS ]
* [VARIABLES  ]
* [TEMPORARIES]
* [PARAMETERS ]
* [VARIABLES  ]
* [TEMPORARIES]
* [PARAMETERS ] <= PSP
*
* Result is returned on parameter stack
*
***
* Utility routines
*
* Could use this in the single-stack no frames example, too.
*LEADPX	LDX	PSP	; Add D to PSP and load to X
* Add A:B to X through XWORK.
* Returned in both X and A:B.
* But we won't use it, after all.
*ADDDX	STX	XWORK
*	ADDB	XWORK+1
*	ADCA	XWORK	; ADCA, not ADDA, to keep the carry from the low byte
*	STAA	XWORK
*	STAB	XWORK+1
*	LDX	XWORK
*	RTS	
*
PPOPD	LDX	PSP
	LDAA	0,X
	LDAB	1,X
	INX
	INX
	STX	PSP
	RTS
*
* This saves bytes:
ALCL2	CLRA
	CLRB	; fall through
* 
PPSHD	LDX	PSP
	DEX
	DEX
	STX	PSP
	STAA	0,X
	STAB	1,X
	RTS
*
* Compromise between speed and reusability
* Returns with PSP in X.
* Enter here to load PSP and initialize to 0
* 8 bytes
ALCL8	CLRA	
	CLRB
* Enter here with initial value in A:B
ALCLD8	LDX	PSP
* Enter here with PSP loaded and initial value in D
ALCLI8	DEX
	DEX
	STAA	0,X
	STAB	1,X
ALCLI6	DEX
	DEX
	STAA	0,X
	STAB	1,X
ALCLI4	DEX
	DEX
	STAA	0,X
	STAB	1,X
ALCLI2	DEX		; PPSHD usually costs less.
	DEX
	STAA	0,X
	STAB	1,X
	STX	PSP
	RTS
*
* six bytes
ALCL6	CLRA
	CLRB
ALCLD6	LDX	PSP
	BRA	ALCLI6
*
* four bytes
ALCL4	CLRA
	CLRB
ALCLD4	LDX	PSP
	BRA	ALCLI4
*
* two bytes
*ALCL2	CLRA
*	CLRB
*	LDX	PSP
*	BRA	ALCLI2
*
*
PDROP8	LDAB	#8	; saves two bytes, 7 vs. 3
PDROP_B	CLRA
* Add A:B to PSP -- negative for allocation, positive for deallocation
ADDPSP	ADDB	PSP+1
	ADCA	PSP
	STAA	PSP
	STAB	PSP+1
	LDX	PSP	; return with X ready
	RTS
*
PDROP6	LDAB	#6
	BRA	PDROP_B	
*
PDROP4	LDAB	#4
	BRA	PDROP_B	
*
PDROP2	LDAB	#2	; JSR is 3 bytes, LDX PSP; INX; INX; STX PSP is 6
	BRA	PDROP_B	
*
*
***
* Return stack when functions are called by MAIN
* Return stack on entry, after link:
* [SSTKNDR ]
* [SSTKNDR ]SSTKBAS
* [RETADR0 ]
* [RETADR1 ]
*
* Parameter stack when called by MAIN
* with two 32-bit local variables
* and two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>]
* [32:VAR1_1--]
* [32:VAR1_2--]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
****
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit
ADD16S	LDX	PSP
	LDAB	#(-1)	; default negative
	TBA
	JSR	ALCLI4	; allocate 2 temporary cells and init (leaves PSP in X)
	TST	6,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLR	2,X	; positive
	CLR	3,X
ADD16SR	TST	4,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLR	0,X	; positive
	CLR	1,X
ADD16SL	LDAA	6,X	; left hand 
	LDAB	7,X
	ADDB	5,X	; right hand
	ADCA	4,X
	STAA	6,X	; store low half
	STAB	7,X
	LDAA	2,X
	LDAB	3,X
	ADCB	1,X
	ADCA	0,X
	STAA	4,X	; store high half
	STAB	5,X
	JSR	PDROP4
	RTS
*
* The alternative, without link, mark, or restore?
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right in 2 16-bit
* output parameter:
*   17-bit sum in 32-bit
ADD16U	LDX	PSP
	LDAA	2,X	; left
	LDAB	3,X
	ADDB	1,X	; add right
	ADCA	0,X
	STAA	2,X	; save low in left side
	STAB	3,X
	LDAB	#0	; extend
	ADCB	#0	; extend Carry unsigned (could ROL)
	STAB	1,X	; re-use right side to store high half
	CLR	0,X	; the carry only reaches bit 16, so the top byte is zero
	RTS
*
* Etc.
*
*
***
* Parameter stack when called by MAIN
* with two 16-bit parameters,
* after mark (no local allocation)
* [<unknown>  ]
* [32:VAR1_1  ]
* [32:VAR1_2  ]
* [16:PARAM2_1]
* [16:PARAM2_2] <= PSP
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameter:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	LDAB	#(-1)	; make a temporary -1
	TBA
	JSR	PPSHD	; default to signed (leaves PSP in X)
	TST	2,X	; test high byte
	BMI	ADD16SP
	CLR	0,X	; zero extend
	CLR	1,X
ADD16SP	LDX	4,X	; get pointer to target
	LDAA	2,X	; target low
	LDAB	3,X
	LDX	PSP
	ADDB	3,X	; parameter
	ADCA	2,X
	LDX	4,X	; pointer to target
	STAA	2,X	; update low half with result
	STAB	3,X
	LDAA	0,X	; target, high half
	LDAB	1,X
	LDX	PSP
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	4,X	; target
	STAA	0,X	; update high half
	STAB	1,X
	JSR	PDROP6	; drop temporary and parameters
	RTS
*
*
***
* Return stack on entry:
* [SSTKNDR ]
* [SSTKNDR ]SSTKBAS
* [RETADR0 ] <= RSP
*
*
* Parameter stack after mark and local allocation
* [<unknown>]
* [VAR1_1--]
* [VAR1_2--] <= PSP
*
MAIN	JSR	ALCL8	; allocate and clear 8 bytes
	LDAA	#$12
	LDAB	#$34
	JSR	PPSHD
	LDAA	#$CD
	LDAB	#$EF
	JSR	PPSHD
	JSR	ADD16U	; 32-bit result on parameter stack should be $0000E023
	LDAA	#$87	; ADD16U leaves PSP in X
	LDAB	#$65
	STAA	0,X	; reuse low half of result space, overwrite high half
	STAB	1,X
	JSR	ADD16S	; result on parameter stack should be $FFFF6788
	LDAA	2,X	; result low half -- ADD16S leaves PSP in X
	LDAB	3,X	; put result away
	STAA	6,X	; to 2nd local variable low half
	STAB	7,X
	LDAA	0,X	; result high half
	LDAB	1,X
	STAA	4,X	; to 2nd local variable high half
	STAB	5,X
	STX	XWORK	; instead of JSR ADDDX: 
	LDAB	XWORK+1	; LDAB #4; CLRA; JSR ADDDX; LDX PSP; STAB 3,X; STAA 2,X
	LDAA	XWORK	; Moving results around takes a lot of code,
	ADDB	#4 	; So just do it here.
	ADCA	#0
	STAB	3,X
	STAA	2,X
	LDAA	#$A5
	TAB		; don't really need to use both, just making things clear.
	STAA	0,X
	STAB	1,X
	JSR	ADD16SI	; result in 2nd variable should be FFFF0D2D (Carry set)
	LDAA	2,X	; 2nd variable low half -- ADD16SI leaves PSP in X
	LDAB	3,X
	LDX	LB_BASE
	STAA	FINALX+2,X
	STAB	FINALX+3,X
	LDX	PSP
	LDAA	0,X
	LDAB	1,X
	LDX	LB_BASE
	STAA	FINALX,X
	STAB	FINALX+1,X
	JSR	PDROP8	; ADD16SI also dropped its arguments for us, so only locals
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
*
***
* Return stack will only contain return addresses (and very ephemeral temporaries):
* [RETADRNN  ]
*
* Return stack after initialization:
* [SSTKNDR ]
* [SSTKNDR ]SSTKBAS <= RSP
*
*
* Parameter stack after initialization:
* [<unknown>]PSTKBAS <= PSP
*
START	JSR	INISTKS
*
	JSR	MAIN
*
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
SSTKNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP
	NOP		; another landing pad to set breakpoint at
	NOP
	LDX	$FFFE
	JMP	0,X	; alternatively, jmp through reset vector
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

As always, I have tested this code. It produces the correct results without stack frames, passing both input and return parameters on the parameter stack. The exception is the utility routines, which use lower-level register protocols that are not available to higher-level routines.
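The expected values in the MAIN comments can also be cross-checked away from the 6800. The following is a minimal sketch in Python, not part of the original program. The function names `add16u`, `add16s`, and `add32_s16` are made up for this illustration and only mirror the roles of ADD16U, ADD16S, and ADD16SI. It assumes the operand order traced from MAIN above.

```python
# Cross-check of the three additions that MAIN performs.
# All names here are hypothetical; they are not part of the 6800 listing.

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret a 16-bit pattern as a signed integer."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def add16u(left, right):
    """Unsigned 16-bit add; the 17-bit sum is returned as a 32-bit pattern."""
    return ((left & MASK16) + (right & MASK16)) & MASK32


def add16s(left, right):
    """Signed 16-bit add; the sum is sign-extended into a 32-bit pattern."""
    return (sign_extend16(left) + sign_extend16(right)) & MASK32


def add32_s16(target, addend):
    """Add a signed 16-bit addend to a 32-bit value (ADD16SI does this in place)."""
    return ((target & MASK32) + sign_extend16(addend)) & MASK32


step1 = add16u(0x1234, 0xCDEF)          # first call in MAIN
step2 = add16s(step1 & MASK16, 0x8765)  # low half of step1 is reused as the left operand
step3 = add32_s16(step2, 0xA5A5)        # added into the 2nd local variable

print(f"${step1:08X} ${step2:08X} ${step3:08X}")
# prints "$0000E023 $FFFF6788 $FFFF0D2D"
```

The three values agree with the comments on the JSR lines in MAIN.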

I will be pointing you back here later. If this talk about stack frames and parameter passing methods seems a little fuzzy at this point, it's okay to move ahead for now.

You may want to move ahead with getting numeric output in binary, or you might want to see how single-stack, no-frame parameter passing works on the 6801, next.


(Title Page/Index)

Sunday, November 24, 2024

ALPP 02-29 -- Putting That Wrong Island in the Rear-view Mirror -- Single-stack No Frame Example: 6800

So we've got most of those rubber bricks off the bottom of the pool but there's some treasure down there, too.

  Putting That Wrong Island in the Rear View Mirror --
Single-stack No Frame Example:
6800

(Title Page/Index)

 

As I have said, even though we’ve just looked at an example of how split-stack stack frames can be done on the 6800 and we've even seen a parallel example of single-stack stack frames on the same, I do not recommend stack frames. 

But I think I have made it clear that, if you have to do stack frames, I recommend split-stack over single-stack.

In this chapter we are going to look at the same functional example of three kinds of addition using a single stack without a stack frame.

Single-stack no frame, if you are allowed to do it and learn how to do it right, will produce cleaner, more optimal code than single-stack with stack frames.

But I'm going to repeat myself. I cannot recommend this. You have to track what is on that stack, and the return address just gets in the way of your calculations and your memory. It's a bit (16 bits on the 6800) of distracting data that isn't relevant to the calculations the function is doing, and every time you look for something on the stack, it either sticks out like a sore thumb, distracting you, or you forget it's there and miss what you are aiming at. And walk on it. Or try to get it from where it isn't and end up executing data or garbage instead of instructions.

What we have to acknowledge is that, without the frame pointer(s), we end up having to track how much of what we have on the stack at any particular point in the code.

But we have to keep track of that anyway, really, even though a frame pointer can help. If we don't know what's there, we don't know where we've put things, and that's a terrible state for a program (and a programmer) to be in -- and that's one reason people avoid reading the assembly language output of compilers.
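To make that bookkeeping concrete, here is a small model in Python. It is not 6800 code and it is not from the original listing. The names `Stack`, `push`, and `offset` are hypothetical. It tracks what a single stack holds when a routine like ADD16S in the listing below is entered and then pushes two sign-extension cells. Offsets are in bytes from the address TSX would put in X.

```python
# Model of a single stack of 16-bit cells, newest cell first.
# Class and method names are hypothetical, for illustration only.

class Stack:
    def __init__(self):
        self.cells = []              # index 0 is the top of the stack

    def push(self, name):
        self.cells.insert(0, name)

    def offset(self, name):
        """Byte offset of the named cell from the top of the stack."""
        return 2 * self.cells.index(name)


stack = Stack()
stack.push("left")       # the caller pushes the parameters
stack.push("right")
stack.push("retadr")     # JSR pushes the return address

names = ("retadr", "right", "left")
at_entry = {name: stack.offset(name) for name in names}

stack.push("left_ext")   # the routine pushes two sign-extension cells
stack.push("right_ext")

after_temps = {name: stack.offset(name) for name in names}

print(at_entry)      # {'retadr': 0, 'right': 2, 'left': 4}
print(after_temps)   # {'retadr': 4, 'right': 6, 'left': 8}
```

Every offset moves by four bytes after the two pushes, and the return address ends up sitting between the temporaries and the parameters. That is the tracking burden described above.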

Just looking at the code below, you may not see how much we've ripped out -- that's because we've been hiding what we could in subroutines. But tracing through the code should feel rather different, because you can hide code from the programmer, but you can't hide it from the processor.

You'll really want to compare the code with the stack frame version, and re-read the code and the comments. Take time to trace through both, watching the source as you do.

* 16-bit addition as example of single-stack discipline sans stack frame on 6800,
* with test code
* Joel Matthew Rees, October, November 2024
*
	OPT	6800
NATWID	EQU	2	; 2 bytes in the CPU's natural integer
*
*
* Blank line will end assembly.
	ORG	$80	; MDOS says this is a good place for user stuff.
*
ENTRY	JMP	START
	NOP		; Just want even addressed pointers for no reason.
	NOP		; bumper
	NOP		; 6 bytes to this point.
SSAVE	RMB	2	; a place to keep S so we can return clean
	RMB	4	; bumper
* All of the pseudo-registers must be saved and restored on context switch,
* cannot be accessed during interrupt service.
XWORK	RMB	2	; For saving an index register temporarily in leaf functions only
DWORK	RMB	2	; For saving D temporarily in leaf functions only
RETVHI	RMB	2	; high half of 32-bit return values (because we can't push X easily)
RETVLO	RMB	2	; 16-bit return values and low half (because loading and saving is redundant)
LB_BASE	RMB	2	; For process local variables
HPPTR	RMB	2	; heap pointer (not yet managed)
HPALL	RMB	2	; heap allocation pointer
HPLIM	RMB	2	; heap limit
* End of pseudo-registers
	RMB	4	; bumper
GAP1	RMB	2	; Mark the bottom of the gap
*
*
*
	ORG	$2000	; Give the DP room.
LB_ADDR	RMB	4	; a little bumper space
FINAL	RMB	4	; 32-bit Final result in DP variable (to show we can)
FINALX	EQU	4
	RMB	4	; buffer
STKLIM	RMB	192	; roughly 16 to 20 levels of call
STKLIMX	EQU	FINALX+8
STKBAS	RMB	8	; for canary return
STKSZ	EQU	192	; for EXORsim assembler limits
STKBASX	EQU	STKLIMX+192	; must be STKLIMX+STKSZ -- assembler won't take symbol
STKFAK	RMB	2	; fake frame pointer, self-link
STKFAKX	EQU	STKBASX+8	; 6801 is post-dec (post-store-decrement) push
STKBMP	RMB	4	; a little bumper space
STKBMPX	EQU	STKFAKX+2	; But we are going to init S through X
*
* My assembler limits RMBs to $100 long, so we'll use a different way.
HBASE	RMB	1	; $1024 or something	; Not using or managing heap yet.
HBASEX	EQU	STKBMPX+4
*HLIM	RMB	4	; bumper
*HLIMX	EQU	HBASEX+$100	; 1024
*
*
	ORG	$3000
CDBASE	JMP	ERROR		; more bumpers
	NOP
*STKBASM	FDB	STKBASX	; Doesn't work within EXORsim assembler limits after all
*HBASEXM	FDB	HBASEX	; by avoiding splitting large constants up at assemble time
*
INISTK	LDX	#LB_ADDR	; set up process local space
	STX	LB_BASE		; local space functional
	LDAA	LB_BASE		; bootstrap own stack
	LDAB	LB_BASE+1
*	ADDB	STKBASM+1
*	ADCA	STKBASM
	LDX	#STKBASX	; Instead of FDB
	STX	XWORK 
	ADDB	XWORK+1
	ADCA	XWORK
*
	STAB	XWORK+1		; initial stack pointer
	STAA	XWORK
*
	LDX	#STKUNDR	; for fake return address
	STX	DWORK		; save it for a moment
*	
	PULA		; pop real return address
	PULB
	LDX	XWORK	; ready own stack pointer
	STS	SSAVE	; save stack pointer from monitor ROM
	TXS		; move to our own stack (let TXS convert it)
	PSHB		; put return address on own stack
	PSHA		; stack now ready for interrupts, utility routines
*
	LDAA	DWORK	; error handler for fake return
	LDAB	DWORK+1
	STAA	0,X	; in the cell beyond empty stack pointer
	STAB	1,X
	STAA	2,X	; and the next cell, for good measure
	STAB	3,X
*
	LDAA	LB_BASE	
	LDAB	LB_BASE+1
	PSHB
	PSHA
*	JSR	PSH16I 
*	FDB	HBASEX	; EXORsim's interactive assembler doesn't like FDBs.
	LDX	#HBASEX
	JSR	SPSHX
*
	JSR	UADD16
	STAA	HPPTR		; as if we were ready to use heap
	STAB	HPPTR+1
	STAA	HPALL
	STAB	HPALL+1
*	JSR	PSH16I	; FDBs
*	FDB	CDBASE
*	JSR	PSH16I
*	FDB	(-4)		; extra bumper
*	JSR	UADD16
	LDX	#CDBASE
	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	SUBB	#4
	SBCA	#0
*
	STAA	HPLIM
	STAB	HPLIM+1
	RTS		; finally done, now can return
*
***
* Since negative index offsets are so expensive,
* we want to create a stack frame with only positive offsets.
* And we want the frame pointer to be pushed after the call,
* on entry to the local context.
* And the saved frame pointer needs to link to the previous one.
* And when we restore the previous frame, 
* we need to be able to restore the previous frame base.
*
* Cross-section of general frame structure in called routine:
* [{LOCVAR}] for calling routine
* [{TEMP}  ] for calling routine
* [PARAM   ] from calling routine
* [RETADR  ] to calling routine
* [LOCVAR  ] for called -- current -- routine
* [TEMP    ] for called -- current -- routine
* [(PARAM) ] to be passed to a further call
*
* Broader cross-section, showing chaining for routine 3, in-flight:
* [RETADR1 ] 
* [LOCVAR2 ]
* [TEMP2   ]
* [PARAM3  ]
* [RETADR2 ]
* [LOCVAR3 ]
* [TEMP3   ]
* [(PARAM4)] <= SP (return stack pointer (6800 S is byte below))
***
*
***
* Utility routines
*
* Push low half of return value
* (Didn't use it there, don't use it here.)
PSHLH	TSX
	LDAA	0,X		; return address
	LDAB	1,X
	PSHB
	PSHA
	LDAA	RETVLO
	LDAB	RETVLO+1
	STAA	0,X
	STAB	1,X
	RTS
*
* Avoid the math to split 16-bit constants into two 8-bit loads,
* and push them while we are here.
* The constant follows the call in the instruction stream.
* Leaves constant in A:B, as well.
PSH16I	TSX		; point to top of return address stack
	LDX	0,X	; point into the instruction stream
	LDAA	0,X	; high byte from instruction stream
	LDAB	1,X	; low byte from instruction stream	
	INS		; drop the return address we almost have in X
	INS
	PSHB		; replace it with the constant
	PSHA
	JMP	2,X	; return to the byte after the constant.
*
* 8 bytes for the meat of this vs. 3 for the call.
* We end up using it a lot since EXORsim's interactive assembler doesn't do FDBs.
SPSHX	STX	XWORK
	DES
	DES
	TSX
	LDAA	2,X
	LDAB	3,X
	STAA	0,X
	STAB	1,X
	LDAA	XWORK
	LDAB	XWORK+1
	STAA	2,X
	STAB	3,X
	RTS
*
* 6 bytes for the meat of this vs. 3 for the call, instead of FDB
* (Didn't use it there, don't use it here.)
TXD	STX	XWORK
	LDAA	XWORK
	LDAB	XWORK+1
	RTS
*
* Utility 16-bit add, leave result in A:B
UADD16	TSX		; no frame
	LDAB	5,X	; left
	ADDB	3,X	; right		; because we can
	LDAA	4,X	; left
	ADCA	2,X	; right
	LDX	0,X
UADROP	INS		; drop return address and parameters
	INS
	INS
	INS
	INS
	INS
	JMP	0,X	; return via X
*
* Utility 16-bit sub, leave result in A:B
* (Didn't use it there, don't use it here.)
USUB16	TSX		; no frame
	LDAB	5,X	; left
	SUBB	3,X	; right		; because we can
	LDAA	4,X	; left
	SBCA	2,X	; right
	LDX	0,X
	BRA	UADROP	; drop return address and parameters
*
*
* We really don't want to put S in a temp if we can avoid it
ALOCS8	PULA
	PULB
ALOS8I	DES
	DES
ALOS6I	DES
	DES
ALOS4I	DES
	DES
ALOS2I	DES
	DES
	PSHB
	PSHA
	RTS
*
ALOCS6	PULA
	PULB
	BRA	ALOS6I
*
ALOCS4	PULA
	PULB
	BRA	ALOS4I
*
ALOCS2	PULA
	PULB
	BRA	ALOS2I
*
INI0_8	CLRA
	CLRB
* call with initialization value in A:B
INIS8	TSX
INIT8	STAA	8,X
	STAB	9,X
INIT6	STAA	6,X
	STAB	7,X
INIT4	STAA	4,X
	STAB	5,X
INIT2	STAA	2,X
	STAB	3,X
	RTS		; 0,X is return address!
*
INI0_6	CLRA
	CLRB
* call with initialization value in A:B
INIS6	TSX
	BRA	INIT6	; join the common store sequence (BRA INIS6 would loop forever)
*
INI0_4	CLRA
	CLRB
* call with initialization value in A:B
INIS4	TSX
	BRA	INIT4	; join the common store sequence (BRA INIS4 would loop forever)
*
INI0_2	CLRA
	CLRB
* call with initialization value in A:B
INIS2	TSX
	BRA	INIT2	; join the common store sequence (BRA INIS2 would loop forever)
*
DROP8	PULA
	PULB
	INS
	INS
DROP6I	INS
	INS
	INS
	INS
	INS
	INS
	PSHB
	PSHA
	RTS
*
DROP6	PULA
	PULB
	BRA	DROP6I
*
*
* Stack at entry
* when functions are called by MAIN
* with two parameters
* We will return results in RETVHI:RETVLO in direct page
* [STKUNDR ]
* [STKUNDR ]STKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] <= VARBP1
* [PARAM2_1]
* [PARAM2_2]
* [RETADR1 ] 
*
* Signed 16 bit add to 32 bit result
* Handle sign overflow without losing precision.
* input parameters:
*   16-bit left 1st pushed, right 2nd
* output parameter:
*   17-bit sum in 32-bit in RETVHI:RETVLO
* Does not alter the parameters.
ADD16S	TSX		; no local variables
	LDAA	#(-1)	; prepare for sign extension
	TST	4,X	; the left-hand operand sign bit
	BMI	ADD16SR
	CLRA		; zero extend
ADD16SR	PSHA		; push left extension (only need one byte, though, really)
	PSHA		; left sign cell below X now
	LDAA	#(-1)	; reload
	TST	2,X	; the right-hand operand sign bit
	BMI	ADD16SL
	CLRA		; zero extend
ADD16SL	PSHA		; push right extension
	PSHA
	TSX		; point to sign extensions
	LDAA	8,X	; left-hand low cell
	LDAB	9,X
	ADDB	7,X	; right-hand low cell
	ADCA	6,X
	STAA	RETVLO	; save low half of result
	STAB	RETVLO+1
	LDAA	2,X	; left-hand extension
	LDAB	3,X
	ADCB	1,X	; right-hand extension
	ADCA	0,X
	STAA	RETVHI	; Save high half of result
	STAB	RETVHI+1
	INS		; drop sign extension temporaries
	INS		; 4 INS is one byte more than JSR DROP4
	INS
	INS
	RTS		; result is in RETVHI:RETVLO
*
* Unsigned 16 bit add to 32 bit result
* input parameters:
*   16-bit left, right
* output parameter:
*   17-bit sum in 32-bit RETVHI:RETVLO
ADD16U	TSX		; no local allocations
	LDAA	4,X	; left
	LDAB	5,X
	ADDB	3,X	; right
	ADCA	2,X
	STAA	RETVLO	; save low half
	STAB	RETVLO+1
	LDAB	#0
	ADCB	#0
	STAB	RETVHI+1	; save carry bit in high half
	CLR	RETVHI		; will never carry beyond bit 17
	RTS
*
* Etc.
*
***
*
* Stack after LINK #0 when functions are called by MAIN
* with two input parameters
* (#0 means no local variables)
* [STKUNDR ]
* [STKUNDR ]STKBAS
* [RETADR0 ] 
* [32:VAR1_1] <= PARAM2_1
* [32:VAR1_2]
* [PARAM2_1] (pointer)
* [PARAM2_2] (addend)
* [RETADR1 ] 
*
* Instead of walking the stack, pass in a pointer --
* Add 16-bit signed parameter
* to 32 bit caller's 2nd 32-bit internal variable.
* input parameters:
*   16-bit pointer to 32-bit integer
*   16-bit addend
* no output parameter:
ADD16SI	TSX		; no own local variables 
	LDAA	#(-1)
	TST	2,X	; high byte of addend parameter
	BMI	ADD16SIP
	CLRA
ADD16SIP	PSHA	; save the sign extension half
	PSHA
	LDX	4,X	; get pointer to target
	LDAA	2,X	; target low
	LDAB	3,X
	TSX		; SP[ sign, retadr, addend, long ptr ]
	ADDB	5,X	; addend parameter (stack is two lower, now)
	ADCA	4,X
	LDX	6,X	; target pointer
	STAA	2,X	; save result low half away
	STAB	3,X
	LDAA	0,X	; target high half
	LDAB	1,X
	TSX
	ADCB	1,X	; sign extension half
	ADCA	0,X
	LDX	6,X	; target
	STAA	0,X	; save result high half away
	STAB	1,X
	INS		; three bytes for INS and RTS vs. two bytes for branch
	INS
	RTS		; no result to load
*
*
***
* Stack after variable allocation
* [STKUNDR ]
* [STKUNDR ]STKBAS
* [RETADR0 ] 
* [32:VAR1_1]
* [32:VAR1_2] <= SP
*
MAIN	JSR	ALOCS8	; 2 calls, 6 bytes vs. 1 clr + 8 pushes , 9 bytes
	JSR	INI0_8
	TSX
*
	JSR	PSH16I
*	FDB	$1234	; parameters
	FCB	$12
	FCB	$34
	JSR	PSH16I
*	FDB	$CDEF
	FCB	$CD
	FCB	$EF
	JSR	ADD16U	; result in RETVHI:RETVLO should be $E023
	INS		; drop one parameter, reuse other
	INS
	TSX
	LDAA	RETVLO	; four extra bytes compared to calling PSHLH
	LDAB	RETVLO+1
	STAA	0,X
	STAB	1,X	
	JSR	PSH16I
*	FDB	$8765
	FCB	$87
	FCB	$65
	JSR	ADD16S	; result in RETVHI:RETVLO should be $FFFF6788
	TSX		; reuse both parameters
	LDAA	RETVHI
	LDAB	RETVHI+1
	STAA	4,X		; 2nd local variable high half
	STAB	5,X
	LDAA	RETVLO
	LDAB	RETVLO+1
	STAA	6,X
	STAB	7,X
	STX	XWORK	; calculate address of second variable
	LDAB	XWORK+1
	ADDB	#4
	STAB	3,X
	LDAA	XWORK
	ADCA	#0	; don't lose the carry
	STAA	2,X
	LDAB	#$A5
	STAB	0,X	; $A5
	STAB	1,X	; $A5A5
	JSR	ADD16SI		; result in 2nd variable should be FFFF0D2D
	INS			; drop the parameters
	INS
	INS
	INS
	TSX
	LDAA	2,X		; low half
	LDAB	3,X
	LDX	LB_BASE		; store it in FINAL, in process local space
	STAA	FINALX+2,X
	STAB	FINALX+3,X
	TSX
	LDAA	0,X		; high half
	LDAB	1,X
	LDX	LB_BASE
	STAA	FINALX,X
	STAB	FINALX+1,X
	JSR	DROP8
	RTS
*
*
***
* Stack at START:
* (what BIOS/OS gave us) <= SP
***
*
* Stack after initialization:
* [STKUNDR ]
* [STKUNDR ]STKBAS <= SP
*
***
*
START	NOP
	JSR	INISTK
	NOP
*
	JSR	MAIN
*
DONE	NOP
ERROR	NOP	; define error labels as something not DONE, anyway
STKUNDR	NOP
	LDS	SSAVE	; restore the monitor stack pointer
	NOP
	NOP		; landing pad to set breakpoint at
	NOP
	NOP
	LDX	$FFFE	; alternatively, jmp through reset vector
	JMP	0,X
*
* Anyway, if running in EXORsim, after RESETting,
* Ctrl-C should bring you back to EXORsim monitor, 
* but not necessarily to your program in a runnable state.

I probably spent a good six hours or more figuring out all the places I had messed up the offsets and lost track of what was on the stack. Sure, that was because I'd quit using the frame pointers for reference. It was also because I was running low on sleep. But it was more so because of that distracting presence of the return addresses right in the middle of the data.

If you haven't traced through this code, do so. Otherwise, you won't really believe me.

And then go take a look at the split-stack version of this.

[JMR202411260841 addendum:]

Speaking of the split-stack version, while working through that, I realized I could have used a load effective address routine here for calculating the address of the second local variable in MAIN, something like

* Add D to S and load to X as a pointer
LEADSX	TSX	; make it a pointer
	INX	; adjust for return address the cheap way
	INX
	STX	XWORK
	ADDB	XWORK+1
	STAB	XWORK+1
	ADCA	XWORK
	STAA	XWORK
	LDX	XWORK
	RTS

[JMR202411260841 addendum end.]
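The arithmetic in LEADSX can be checked away from the 6800 as well. This is a sketch in Python with the hypothetical names `add8` and `leadsx` and a made-up stack pointer value. It assumes S points one byte below the return address inside the routine, as on the 6800, and it mirrors the ADDB/ADCA pair byte by byte.

```python
# Model of LEADSX: X = (S + 1) + 2 + D, computed with 8-bit adds and a carry.
# Function names and the stack pointer value are hypothetical.

def add8(a, b, carry_in=0):
    """8-bit add; returns (result byte, carry out), like ADDB or ADCA."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def leadsx(s, d):
    """Pointer LEADSX would leave in X for stack pointer s and offset d."""
    x = (s + 1 + 2) & 0xFFFF     # TSX, then INX twice to skip the return address
    low, carry = add8(d & 0xFF, x & 0xFF)             # ADDB XWORK+1
    high, _ = add8((d >> 8) & 0xFF, x >> 8, carry)    # ADCA XWORK
    return (high << 8) | low


s_inside = 0x20FA                # chosen so that the low byte carries
caller_top = s_inside + 3        # what TSX gives the caller after the return
pointer = leadsx(s_inside, 4)

print(f"${pointer:04X}")         # prints "$2101"
```

With an offset of 4 the result is four bytes above the caller's top of stack, which is where the second local variable sits in MAIN at that point. A negative offset in D works the same way, because the carry out of the high byte is simply dropped.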

(Note that, this time, I'm not suggesting you move ahead if you are getting tired. You've come this far, it's only a little farther along this path until you can decide whether I'm a fool for thinking split stack with no stack frames is so great -- or maybe see what I see.)

 (Title Page/Index)