
Sunday, December 15, 2024

ALPP 02-35 -- Tentative Op-code Map of RK0801 CPU (Extension of M6801)

One final bit of treasure from the bottom of the pool.

  Tentative Op-code Map of
RK0801 CPU
(Extension of 6801)

(Title Page/Index)

This is a tentative op-code map of extensions to the 6801 CPU that I think would make it significantly more efficient without blowing the semiconductor real estate (gates) budget for an 8-bit CPU core. It draws on some older ideas I've had for a while (direct page unaries and SBX) and some new ideas suggested by the addressing math and stack frame examples.

New in this map:

  • SBX: Subtract B from X, the corollary to the existing ABX. This optimizes small-to-medium allocations where the size is not known at compile/assemble time, and it also helps when following relative links around.
    (Adding an op-code to add D to X might be another possibility, but would require sign-extending into A.)
  • Add signed Immediate byte to X and S, ADIX/ADIS. This optimizes small-to-medium stack and other allocations where size is known at compile/assemble time.
    (Add and Subtract unsigned byte immediate is an option, but requires more op-codes in the very tight primary op-code table. Add 16-bit immediate is yet another option, but is less efficient with code size, enough so as to make the most common case, add plus or minus 2, meaningless.)
    (Considered dropping INX/DEX and INS/DES, but that de-optimizes byte string operations.)
  • Direct-page versions of unary/read-modify-write byte instructions,
    • NEG (NEGate byte),
    • COM (bit COMplement byte),
    • LSR (Logical Shift Right byte),
    • ROR (ROtate Right byte through carry),
    • ASR (Arithmetic Shift Right byte, copying sign),
    • ROL (ROtate Left byte through carry),
    • DEC (DECrement byte),
    • INC (INCrement byte),
    • TST (TeST byte),
    • CLR (CLeaR byte).

    (These are, really, more appropriate in direct-page mode than in extended mode, to provide effective pseudo-registers.)
    (Also, it might be useful to provide address function code outputs that distinguish between direct page and extended mode, providing an effective separate address space for pseudo-registers and I/O, with all addressing modes enabled on it.)

  • 16-bit read/modify/write instructions:
    • DINC, (Double-byte INCrement)
      (including INCD, INCrement Double accumulator),
    • DDEC (Double-byte DECrement)
      (including DECD, DECrement Double accumulator),
    • DASL (Double-byte Arithmetic Shift Left)
      (including ASLD, Arithmetic Shift Left Double accumulator),
    • DLSR (Double-byte Logical Shift Right)
      (including LSRD, Logical Shift Right Double accumulator).

    (ASLD and LSRD are moved from their positions in the 6801 map to the corresponding positions in the new 16-bit ranks, alongside the new DASL and DLSR.)
    (16-bit increment and decrement in the direct page will be especially helpful for software stacks.)

  • JMP to direct-page target (not in 6801 op-codes).
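To make the proposed index arithmetic concrete, here is a small Python model of ABX next to the proposed SBX and ADIX/ADIS. It is only a sketch under assumptions: SBX is taken to treat B as unsigned (mirroring ABX), the immediate byte is sign-extended to 16 bits, and results wrap modulo 65536. The function names are made up for the illustration and are not part of any assembler.

```python
def abx(x, b):
    """Existing 6801 ABX: add the unsigned byte in B to the 16-bit X."""
    return (x + (b & 0xFF)) & 0xFFFF


def sbx(x, b):
    """Proposed SBX (assumed unsigned, like ABX): subtract B from X."""
    return (x - (b & 0xFF)) & 0xFFFF


def sign_extend8(byte):
    """Interpret an 8-bit value as a signed number in -128..127."""
    byte &= 0xFF
    return byte - 0x100 if byte & 0x80 else byte


def adix(x, imm8):
    """Proposed ADIX (ADIS would do the same to S): add a signed byte constant."""
    return (x + sign_extend8(imm8)) & 0xFFFF


# Allocate 16 bytes below X, then release them again.
frame = sbx(0x1000, 0x10)       # 0x0FF0
restored = abx(frame, 0x10)     # 0x1000

# The common "plus or minus 2" case with a one-byte constant.
down_two = adix(0x2000, 0xFE)   # 0xFE is -2 as a signed byte
up_two = adix(0x2000, 0x02)
```

The point of the signed form is visible in the last two lines: one op-code covers both directions, at the cost of halving the reach to -128..127.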

Adding the FDIV and IDIV instructions that the 68HC11 has would be fun, but would likely shoot the gates budget. Likewise, adding the 68HC11's bit testing and manipulation instructions or an additional stack register would require using pre-bytes, and I don't want to do that, either.

Instead of moving the op-codes around, the missing op-codes could be squeezed into empty codes in the 6801 map, but that would require gates that could be used for something else. 

Using a pre-byte and putting the direct page op-codes in a second op-code map would partially erase the advantage of direct-page op-codes.

Left half of the op-code table (mnemonics; columns are the high nybble of the op-code, rows are the low nybble; the Dir, Ind and Ext columns share one entry per row; - marks an empty cell):

     UNARY             BRANCH           UNARY
     **ACCA  **INH     REL     INH      **ACCB   *Dir / Ind / Ext
     0       1         2       3        4        5 / 6 / 7
0    NEG     ***CBA    BRA     TSX      NEG      NEG
1    -       NOP       BRN     [INS]    INCD     *DINC
2    -       ***SBA    BHI     PULA     DECD     *DDEC
3    COM     ***ABA    BLS     PULB     COM      COM
4    LSR     ***TAB    BCC     [DES]    LSR      LSR
5    -       ***TBA    BCS     TXS      **ASLD   *DASL
6    ROR     TAP       BNE     PSHA     ROR      ROR
7    ASR     TPA       BEQ     PSHB     ASR      ASR
8    ASL     [INX]     BVC     PULX     ASL      ASL
9    ROL     [DEX]     BVS     RTS      ROL      ROL
A    DEC     CLV       BPL     ABX      DEC      DEC
B    -       SEV       BMI     RTI      ***LSRD  *DLSR
C    INC     CLC       BGE     PSHX     INC      INC
D    TST     SEC       BLT     MUL      TST      TST
E    ***DAA  CLI       BGT     WAI      *SBX     *JMP
F    CLR     SEI       BLE     SWI      CLR      CLR

*Not in 6801. (For JMP, only the direct-page form is new: there is no JMP dp in the 6801.)

**Moved in 0801.

***Both row and column moved.

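Right half of the op-code table (mnemonics; the Dir, Ind and Ext columns of each accumulator share one entry per row; - marks an empty cell):

     BINARY
     ACCA                       ACCB
     Imm      Dir / Ind / Ext   Imm      Dir / Ind / Ext
     8        9 / A / B         C        D / E / F
0    SUB      SUB               SUB      SUB
1    CMP      CMP               CMP      CMP
2    SBC      SBC               SBC      SBC
3    SUBD     SUBD              ADDD     ADDD
4    AND      AND               AND      AND
5    BIT      BIT               BIT      BIT
6    LDA      LDA               LDA      LDA
7    -        STA               -        STA
8    EOR      EOR               EOR      EOR
9    ADC      ADC               ADC      ADC
A    ORA      ORA               ORA      ORA
B    ADD      ADD               ADD      ADD
C    CPX      CPX               LDD      LDD
D    BSR      JSR               -        STD
E    LDS      LDS               LDX      LDX
F    *~ADIS   STS               *~ADIX   STX

*Not in 6801.

*~ADIS and ADIX take a signed byte constant.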

Expanding the address map via segment registers or widened address registers is tempting, but I'm thinking to simply be satisfied with two additional address function outputs to allow distinction between

  • code space (PC relative),
  • return address stack space (S relative),
  • direct page space (DP mode),
  • general data (everything else).

Four address spaces won't really even double the available address space, because of issues in indexing and hard space separation, but it will make it possible to reach or somewhat exceed full 64K addressing.

On the other hand, it would not be hard to give the '0801 widened X and PC and maybe S, or segment registers for two, three or all four of the above address spaces, or something similar. If segment registers, I would want to use either full-width segment registers, or have the segment registers offset a full byte. None of that 4-bit offset namby-pamby.
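For a feel of what the segment offset buys, here is a small Python sketch comparing a 4-bit segment shift (the scheme used by the 8086) with a full-byte shift. Nothing here is a defined '0801 design: the 16-bit segment register width and the function name are assumptions made only for the illustration.

```python
def physical_address(segment, offset, shift):
    """Combine a segment register and a 16-bit offset.

    'shift' is how many bits the segment is offset before the add:
    4 for the 8086-style scheme, 8 for a full-byte offset.
    """
    return (segment << shift) + (offset & 0xFFFF)


# Same segment and offset values, two different shifts.
narrow = physical_address(0x1234, 0x0010, 4)   # 0x12350
wide = physical_address(0x1234, 0x0010, 8)     # 0x123410

# Highest address reachable with a 16-bit segment register (assumed width).
top_4bit = physical_address(0xFFFF, 0xFFFF, 4)
top_8bit = physical_address(0xFFFF, 0xFFFF, 8)
```

With the 4-bit shift, segments start every 16 bytes and overlap heavily, so a 16-bit segment register only stretches the reach to a little over 1 MB. With the full-byte shift, segments start every 256 bytes and the same register reaches a little over 16 MB.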

Further extensions, such as a second stack and widened address registers, and the Y register, bit operators, and IDIV and FDIV from the 'HC11, would warrant another part number, say 2801, the "2" indicating two stacks. 

Or a second 16-bit accumulator, such as the 6309 has, would make it a 16-bit CPU, so maybe 21601. But borrowing from the 6309 tends to point to the idea that, beyond a certain point, we'd want to move up to an extended derivative of the 6809.

Well, I don't think I have anything more for this rabbit hole at this moment, so you can return to the irregularly scheduled assembly language tutorial, continuing with getting numbers output.


(Title Page/Index)


Sunday, November 3, 2024

ALPP 02-24 -- Some Address Math for the 68000

  Some Address Math
for the
68000

(Title Page/Index)

After a break for multi-byte negation, because address math is so important, I think I should show you explicit 68000 corollaries for what I've shown you for the 6809, as well as the routines for the 6801 and for the 6800.

When instructions become more general, they often take more bytes to encode. This is especially clear for the 68000. And when you generalize an operation, it often takes more instructions to implement -- even with a more powerful instruction set CPU. And the more you repeat those multiple instructions, the more opportunity you have to make mistakes. 

More than speed and byte count, this is why we define utility routines like we just looked at for the 6800 and 6801. We don't want to give ourselves too many opportunities for mistakes. (Macros can help with this, but we won't talk about that just yet.)

Between the 6809 and the 68000, it can be kind of a wash -- when you're working on 16-bit numbers and small applications that fit in a 64K memory space. When you start working with 32-bit numbers, it's advantage 68000, ... except then you also tend to work with 32-bit addresses, and the addresses can make byte count swell. 

I transliterated the fig implementation of Forth from 6800 to 68000, and the object image size increased by about 80% (real rough estimate). This is because I didn't want to restrict it to operating in the lower 32K of memory, minus the interrupt vector table, so the virtual machine i-codes (function addresses, really) swelled from 16-bit to 32-bit. And since the Forth is mostly a clot of i-codes, the overall image size swells. 

I started a conversion to direct call, which I got lost in (partly motivating this tutorial), and the code size does seem to improve a bit, but not completely to the size of the 6809 image.

Do look at assembly listings when you try to compare code sizes for stuff. In particular, the 68000 will often seem to take about twice the code bytes that the 6809 takes in these snippets. But when we move to concrete code where pieces come together, the code size comes down closer to the 6809 code size.

And I'll note again, being able to use single instructions instead of utility routines is nice, but it's actually more important that the 68000 has something of an optimal number of registers, so we don't have to worry about pseudo-registers in memory when switching processes.

As always, read the code and the comments in the code, and open up separate browser windows and compare side-by-side.

I'm showing the entire 68000 code in a single block because the abstract operations don't quite map the same, but I'm keeping the order roughly the same to keep it easy to find what to compare. 

[JMR202411070913 addendum:]

You may have missed the mention of the "here pointer" in the 6809 address math chapter:

LOCBAS	EQU	*

In Motorola assemblers, an asterisk where the assembler could parse an address means the location of the current instruction or directive, thus, "here". I'll need to explain more about it later, I'm sure.

[JMR202411070913 addendum end.]
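If the "here" idea is still fuzzy, a toy model of the assembler's location counter may help. This Python sketch is not how any real assembler is written. It only tracks the counter through ORG, EQU and DS directives, and the origin address and the AFTER label are made up for the example. It also ignores the even-address alignment a real 68000 assembler may apply to DS.W and DS.L.

```python
SIZES = {"DS.B": 1, "DS.W": 2, "DS.L": 4}   # bytes reserved per unit


def layout(lines):
    """Walk (label, directive, operand) entries and return the symbol table."""
    here = 0                                 # the location counter, i.e. "*"
    symbols = {}
    for label, directive, operand in lines:
        if directive == "ORG":
            here = operand                   # move the location counter
        elif directive == "EQU":
            # "*" as the operand means "the current location"
            symbols[label] = here if operand == "*" else operand
        elif directive in SIZES:
            if label:
                symbols[label] = here        # a label takes the current location
            here += SIZES[directive] * operand
        else:
            raise ValueError("unsupported directive: " + directive)
    return symbols


source = [
    (None, "ORG", 0x2000),                   # hypothetical origin
    ("LOCBAS", "EQU", "*"),                  # LOCBAS = here
    (None, "DS.B", 6),                       # six bytes of something else
    ("VAR", "DS.L", 1),
    ("AFTER", "EQU", "*"),
]
symbols = layout(source)
var_offset = symbols["VAR"] - symbols["LOCBAS"]   # what VAR-LOCBAS evaluates to
```

The last line is the expression that shows up in operands like VAR-LOCBAS(A5): a small constant offset from the base, independent of where ORG put the block.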

How the registers map when moving from the 6809 to the 68000 --

  • I'm mapping the 6809's S to A7, of course;
  • U to A6;
  • DP will map to A5;
  • X mostly to A0;
  • Y to whatever.
  • B is sort-of mapped to D7;
  • A is sort-of mapped to D6 or the top bytes of D7 or D5 or something, depending on what I need it to do.

(And please don't just copy-and-paste code without thinking.)

* 68000 pointer math

	ORG	SOMETHING
* All of these work fine in-line, rather than called as subroutines.
* In fact, unless specifically specified otherwise, you should in-line.
* You can substitute any data register unless specified otherwise.
*
* Likewise, you can substitute any address register,
* except that A7 should always be in-lined --
* -- except for those routines which specifically handle the return address, 
* but those routines are not really intended to be used anyway.
* Calling a subroutine and playing with the return stack 
* without handling the return address
* just is not a good way to keep control of your program.
*
* And then there is alignment. 68000 needs 16- and 32-bit accesses 
* to be 16-bit aligned, and will throw address errors if they are not.
* (Later CPUs are not so restricted.)
*
* Negate Dn in 8, 16, or 32 bits:
NEGLD7	NEG.L	D7	; .L => 32 bits, .W => 16 bits, .B => 8 bits
	RTS
* On the 6800/6801/6809, you can negate (2's complement) a byte 
* using a 1-byte instruction.
* On the 68000, it takes a 2-byte instruction.
* It takes 5 bytes of instruction to negate 16 bits on 6800/1/9,
* and 13 bytes to negate 32 bits.
* But on the 68000, it takes just two,
* the above 16-bit op-code with a couple of bits changed.
* This is a common pattern with 68000 instructions.
*
* And, for all the time I spend explaining NEG, 
* since the 68000 can subtract registers in either order, 
* we really don't need NEG here.

* Unsigned byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADDBX	AND.W	#$FF,D7	; zero extend it
	ADD.W	D7,A0	; 16-bit source sign extended to 32 bits
	RTS
* Alternative
ADDBXalt
	AND.W	#$FF,D7
	LEA	(A0,D7.W),A0	; takes more bytes
	RTS
*
* Signed byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADSBX	EXT.W	D7
	ADD.W	D7,A0	; 16-bit source sign extended to 32 bit An
	RTS
* Alternative
ADSBXalt
	EXT.W	D7
	LEA	(A0,D7.W),A0	; takes more bytes
	RTS
*
* Unsigned byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SUBBX	AND.W	#$FF,D7	; zero extend it
	SUB.W	D7,A0	; 16-bit source sign extended to 32 bits
	RTS

* Signed byte offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SBSBX	EXT.W	D7
	SUB.W	D7,A0	; 16-bit source sign extended to 32 bit An
	RTS
*
* Unsigned 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADDWX	AND.L	#$FFFF,D7	; zero extend it
	ADD.L	D7,A0	
	RTS
* Alternative
ADDWXalt
	AND.L	#$FFFF,D7
	LEA	(A0,D7.L),A0	; takes more bytes
	RTS
*
* Signed 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADSWX	ADD.W	D7,A0	
	RTS
* Alternative
ADSWXalt
	LEA	(A0,D7.W),A0	; takes more bytes
	RTS
* Alternative
ADSWXalt2
	LEA	(A0,A1.W),A0	; takes more bytes
	RTS
*
* Unsigned 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SUBWX	AND.L	#$FFFF,D7	; zero extend it
	SUB.L	D7,A0	
	RTS

* Signed 16-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SBSWX	SUB.W	D7,A0	
	RTS
*
* 32-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
ADDLX	ADD.L	D7,A0	
	RTS
* Alternative
ADDLXalt
	LEA	(A0,D7.L),A0	; takes more bytes
	RTS
* Alternative
ADDLXalt2
	LEA	(A0,A1.L),A0	; takes more bytes
	RTS
*
* 32-bit offset
* Should in-line. Any data register, any address register.
* A7 must in-line (see below).
SUBLX	SUB.L	D7,A0	
	RTS
*


*************
* For the return stack
* As explained above, just in-line the LEA.
* These are provided as a solution to a puzzle,
* not as useful code.
*
* Signed byte offset
* Just in-line the EXT.W and ADD.W
ADSBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	EXT.W	D7	; sign extend it
	ADD.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.
*
* Unsigned byte offset
* Just in-line the AND.W and ADD.W
ADDBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.W	#$FF,D7	; zero extend it
	ADD.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.

* Signed 16-bit offset
* Just in-line the ADD.W
ADSWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	ADD.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.
*
* Unsigned 16-bit offset
* Just in-line the AND.L and ADD.L
ADDWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.L	#$FFFF,D7	; zero extend it
	ADD.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
* See above about LEA instead of ADD.

* 32-bit offset
* Just in-line the ADD.L
ADDLS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	ADD.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
*
* Unsigned byte offset
* Just in-line the AND.W and SUB.W
SUBBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.W	#$FF,D7	; zero extend it
	SUB.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
*
* Unsigned 16-bit offset
* Just in-line the AND.L and SUB.L
SUBWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	AND.L	#$FFFF,D7	; zero extend it
	SUB.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
*
* Signed byte offset
* Just in-line the EXT.W and SUB.W
SBSBS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	EXT.W	D7	; sign extend it
	SUB.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0
*
* Signed 16-bit offset
* Just in-line the SUB.W
SBSWS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	SUB.W	D7,A7	; 16-bit source sign extended to 32 bits
	JMP	(A0)	; return via A0

* 32-bit offset
* Just in-line the SUB.L
SUBLS	MOVE.L	(A7)+,A0	; get return address, restore stack address
	SUB.L	D7,A7	; 32 bits
	JMP	(A0)	; return via A0
*

* INX and DEX trains and INS and DES trains are meaningless.
* HOWEVER, just to remind ourselves:
* (And all of these work for any address register, too, but IN-LINE them!!)
* (They work for A7 if in-lined, as well.)
ADD16X	LEA 	16(A0),A0
	RTS
ADD14X	LEA	14(A0),A0
	RTS
SUB16X	LEA	-16(A0),A0
	RTS
* Etc. In-line these.
INX	LEA	1(A0),A0	; Sigh. In-line it. Do not make trains with it. Please.
	RTS
DEX	LEA	-1(A0),A0	; See INX. In-line it. Do not make trains with it. PLEASE.
	RTS
* Note that we can also use ADDQ and SUBQ for offset less than 9
*
* More solutions to puzzles.
* If you called these, you would have to juggle the return address as shown.
* You don't want to do that.
* Just in-line the LEA instructions.
* Then there's no return address to juggle, no messing with A0.
* DO NOT USE THIS CODE other than for examples of silly walks.
ADD16S	MOVE.L	(A7)+,A0
	LEA	16(A7),A7
	JMP	(A0)
* etc.
* Could all be replaced with just LEA	16(A7),A7 in-line!
* That's actually cheaper than just the instruction JSR!!!


* Synthetic stacks restricted within page boundaries make no sense at all
* on the 68000. Except, I suppose they could, sort-of.
*
* In the first place,
* we should be able to use an extra address register to make a third stack.
* If we do, addressing has already been covered, above.
*
* But if we want a software stack maintained by pointers in memory,
* for some reason,
* Given a pseudo-register somewhere in process local variable space
* accessed via A5:
	ORG	SOMEWHERE
	...
QSP	DS.L	1	; a synthetic stack pointer Q
* QSP-LOCBAS has to be within +/-32K on 68000, 2-byte op-code, 2-byte offset, syntax: QSP-LOCBAS(A5)
* 68020 and above allows 32-bit range, 4-byte op-code, 4-byte offset, syntax: (QSP-LOCBAS,A5)
	...
	DS.L	2	; buffer zone
QSTKLIM	DS.L	32
QSTKBAS	DS.L	2	; buffer zone
	...

* 32-bit Dn for synthetic stack (could/should be in-line):
ADDQSP	ADD.L	D7,QSP-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 256-byte page boundary
* so that carries cannot be generated:
* unsigned byte D7
ADDQSPS	ADD.B	D7,QSP+3-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 65536-byte page boundary
* so that carries cannot be generated:
* unsigned 16-bit D7
ADDQSPW	ADD.W	D7,QSP+2-LOCBAS(A5)	; low word is at +2 (even address); 4 bytes in op-code (+/-32K)
	RTS
*
* 32-bit Dn for synthetic stack (could/should be in-line):
SUBQSP	SUB.L	D7,QSP-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 256-byte page boundary
* so that carries cannot be generated:
* unsigned byte
SUBQSPS	SUB.B	D7,QSP+3-LOCBAS(A5)	; 4 bytes in op-code (+/-32K)
	RTS
* If QSTKLIM-8 to QSTKBAS+7 are within an even 65536-byte page boundary
* so that carries cannot be generated:
* unsigned 16-bit D7
SUBQSPW	SUB.W	D7,QSP+2-LOCBAS(A5)	; low word is at +2 (even address); 4 bytes in op-code (+/-32K)
	RTS

* 68000 has no memory indirection
QPSHD7L	MOVE.L	QSP-LOCBAS(A5),A4	; 4 bytes in op-code
	MOVE.L	D7,-(A4)		; 2 bytes in op-code
	MOVE.L	A4,QSP-LOCBAS(A5)	; 4 bytes in op-code
	RTS
*
* 68020+ have memory indirection
QPSHD7LI
	SUBQ.L	#4,QSP-LOCBAS(A5)	; 4 bytes in op-code (SUBQ.W would be faster for medium stack)
	MOVE.L	D7,([QSP-LOCBAS,A5])	; indirect through QSP in memory; 6 bytes in op-code
	RTS
*
QPOPD7L	MOVE.L	QSP-LOCBAS(A5),A4	; 4 bytes in op-code
	MOVE.L	(A4)+,D7		; 2 bytes in op-code
	MOVE.L	A4,QSP-LOCBAS(A5)	; 4 bytes in op-code
	RTS
*
* 68020+ have memory indirection
QPOPD7LI
	MOVE.L	([QSP-LOCBAS,A5]),D7	; indirect through QSP in memory; 6 bytes in op-code
	ADDQ.L	#4,QSP-LOCBAS(A5)	; 4 bytes in op-code (ADDQ.W would be faster for medium stack)
	RTS


* Register offsets from A7 were dealt with above.

* Lest I forget --
* On the 6800 or 6801, this would be referenced by a process-local
* LOCALBASE or similar pseudo-register, which I almost forgot to talk about.
* On the 6809, it could be done by pseudo-register or (with some glue) by DP.
* On the 68000, we are going to use a spare address register,
* and I am going to pick A5.
* All the address math has been shown above,
* the only issue is being explicit about the assembly language idiom.
* Lest I forget --
*
* Given 
	ORG	Whatever
LOCBAS	EQU	*
*	...
VAR	DS.B	m	; or .W or .L, etc.
*
* With A5 known to be set to LOCBAS,
	LEA	LOCBAS(PC),A5
* or
	MOVEA.L	#LOCBAS,A5
*
* In-line snippets --
* For variable VAR within 256 bytes of LOCBAS:
	...
	LEA	VAR-LOCBAS(A5),A0	; that's all! (4-byte op-code)
	...
*
* When VAR is 256 bytes or more away from LOCBAS, but less than 32768
* (or, even, below LOCBAS but within -32768), in other words, signed 16-bit offset:
	...
	LEA	VAR-LOCBAS(A5),A0	; same thing!
	...
*
* It's a little messier when the signed offset doesn't fit in 16 bits, 
* less than -32768 below, or 32768 or greater above --
	...
	MOVE.L	#VAR-LOCBAS,D7		; Any Dn. An will also work, if it's not in use. 6 bytes.
	LEA	(A5,D7.L),A0		; 4 bytes. total 10 bytes. 
	...
*
* From the 68020 on, 32-bit offsets are allowed, but the op-code is also 32-bits plus displacement:
	...
	LEA	(VAR-LOCBAS,A5),A0	; 8 byte total op-code
	...
* 
* Do I really need to show this as subroutines?
* signed 16-bit offset in D7:
LEALBWX	LEA	(A5,D7.W),A0	; PLEASE just do this in-line!
	RTS
*
* 32-bit offset in D7:
LEALBLX	LEA	(A5,D7.L),A0	; PLEASE just do this in-line!
	RTS
*			;-/
* 
* I assume you're not going to be wanting to keep LOCBAS
* in a pseudo-register called LB_BASE.
* But you might want to maintain a separate allocation area
* with a pointer in AL_BASE, like this:
LOCBAS	EQU	*
	...
AL_BASE	DS.L	1
	...
* for signed 16-bit offsets in D7: 
ADDLBW	MOVE.L	AL_BASE-LOCBAS(A5),A0
	ADD.W	D7,A0	; or LEA (A0,D7.W),A0
	RTS
* for unsigned 16-bit offsets:
ADDLBU	AND.L	#$0000FFFF,D7	; unsigned offset, then fall through to ADDLBL
* for 32-bit offsets
ADDLBL	MOVE.L	AL_BASE-LOCBAS(A5),A0
	ADD.L	D7,A0	; or LEA (A0,D7.L),A0
	RTS
*
* 68020 and above allow you to do weird things like this --
	...
	LEA	([AL_BASE-LOCBAS,A5],D7.L),A0
*	...					;  8-o
* ... quite literally letting you index directly off that pseudo-register
* out there in memory.
*
* As near as I can tell,
* memory indirect modes all require an address register,
* or the PC. 
* But that's not so bad, other than some of the modes being overkill.
*
* And, in spite of my mugging, maybe this has been a good way
* to expand your grasp of the power of the 68000 addressing modes.

* Sorry about the mugging. Sort-of. ;-/

As you can see, the 68000 just basically does almost all the address math you need without subroutines.

Including, to some extent, arrays, but let's not go there yet.
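One detail behind the unsigned and signed routines in the listing above is that a word-sized add to an address register sign-extends its source before doing a full 32-bit add. The following Python model is only a sketch of that behavior. The function names are invented for the illustration, and registers are modeled as plain 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits as a signed number in -32768..32767."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def add_w_to_an(an, dn):
    """Model ADD.W Dn,An: low word of Dn, sign-extended, added to all 32 bits."""
    return (an + sign_extend16(dn)) & MASK32


def add_l_to_an(an, dn):
    """Model ADD.L Dn,An: plain 32-bit add."""
    return (an + (dn & MASK32)) & MASK32


def add_unsigned16(an, dn):
    """Model AND.L #$FFFF,Dn followed by ADD.L Dn,An."""
    return add_l_to_an(an, dn & 0xFFFF)


base = 0x00010000
signed_result = add_w_to_an(base, 0x8000)       # $8000 counts as -32768
unsigned_result = add_unsigned16(base, 0x8000)  # $8000 counts as +32768
```

This is why the unsigned 16-bit routines mask and then use the long add, while the unsigned byte routines can get away with the word add: after masking to $00FF the word can never look negative.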

As with the previous three chapters, I have not tested the code. It should run, modulo typos.

The 68000 can be hard to wrap your head around. I know. If the above doesn't make sense yet, it's okay. I'll point you back here from time to time when we are working with more concrete examples of using the above.

Look at how I've been avoiding things. I think it's time to build a concrete example of stack frames on the 6801.

Or you can jump ahead to getting numeric output in binary.


(Title Page/Index)

Friday, November 1, 2024

ALPP 02-22 -- Some Address Math for the 6809

  Some Address Math
for the
6809

(Title Page/Index)

Maybe it feels like going around in circles, but address math is so important that I think I should show you explicit 6809 corollaries for the utility address math routines I've shown you for the 6801 and for the 6800.

When instructions become more general, they often take more bytes to encode.  And when you generalize an operation, it often takes more instructions to implement -- even with a more powerful instruction set CPU. And the more you repeat those multiple instructions, the more opportunity you have to make mistakes. 

This is why we define utility routines like we just looked at for the 6800 and 6801.

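But in practice, with the 6809, this is usually not the case: most of the address math can be done with single instructions, so the utility routines are rarely needed.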

To get a sense of how the size is affected in real code, you will want to compare these examples I give to the concrete examples I have given -- and give later -- for the other processors.

As much as having actual instructions to do the work for you improves things, the more important improvement is eliminating almost all need for pseudo-registers that have to be managed when switching processes.

Remember to read the code and the comments in the code, and open up separate browser windows to compare side-by-side with the 6800 and 6801. Reading code is important.

Let me say it again: 

No need for pseudo-registers on either the 6809 or 68000!

Unless you really want to synthesize a third stack or something on the 6809. 

Almost -- That's modulo per-process global variables, depending on how you handle them. And modulo some use of stack as temporaries instead of pseudo-registers, because stack is just a better place for temporaries, and is so easily accessed on the 6809.

Let's look at the 6809 code. 

You'll (hopefully) notice that mapping the abstract operations to the 6809 works out somewhat differently than for the 6800 and 6801. So I'm showing the 6809 code in a single block and relying more on comments in the code. The order of presentation is roughly the same, so it should be easy enough to find what to compare with what.

One of the reasons I demonstrate an alternate way to NEGate the Double accumulator is to demonstrate a very useful way to use the stack to avoid using temporary variables in memory. (I guess I need to go back and make this explicit in the 6801 and 6800 address math chapters.)
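If you want to convince yourself that the two ways of negating D in the 6809 listing in this chapter agree, here is a Python model of both. It is a sketch only: A and B are modeled as plain 8-bit integers, the stack as a Python list, and condition codes other than the zero test are ignored.

```python
def negd_com_neg(a, b):
    """Model COMA / NEGB / BNE skip / INCA."""
    a = ~a & 0xFF               # COMA: one's complement of the high byte
    b = -b & 0xFF               # NEGB: two's complement of the low byte
    if b == 0:                  # BNE skips the INCA unless B came out zero
        a = (a + 1) & 0xFF      # INCA: carry the +1 into the high byte
    return a, b


def negd_via_stack(a, b):
    """Model PSHS D / LDD #0 / SUBD ,S++."""
    stack = []
    stack.append((a << 8) | b)          # PSHS D
    d = 0                               # LDD #0
    d = (d - stack.pop()) & 0xFFFF      # SUBD ,S++ (pops as it subtracts)
    return d >> 8, d & 0xFF


# Compare the two routines over every possible 16-bit value.
mismatches = [
    (a, b)
    for a in range(256)
    for b in range(256)
    if negd_com_neg(a, b) != negd_via_stack(a, b)
]
```

The exhaustive comparison also shows why BNE is enough after NEGB: the +1 of the two's complement only carries into the high byte when the low byte was zero to begin with.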

Do not miss the fact that the 6809 has four indexable registers, and all the address math instructions work for all four indexable registers -- where the routines may not! Where I say in-line, that means just use the instructions rather than calling the routines.  

[JMR202411070913 addendum:]

I don't think I've explained the "here pointer" symbol and idiom yet:

ESPHIB	EQU	*

In Motorola assemblers, an asterisk where the assembler could parse an address means the location of the current instruction or directive, thus, "here". I will have to explain it further later.

[JMR202411070913 addendum end.]

(If you're wondering, fix the mnemonics for the required register -- LEAX for X, LEAY for Y, LEAU for U, LEAS for S, etc. And don't forget the addressing mode index registers. And, no, don't include the RTS at the end when you're inserting the code in-line. 8-/ I know you caught all that, but some people just copy-and-paste without thinking.)

* 6809 pointer math
*	ORG	$80
*	...
*XOFFA	RMB	1 ; don't need these at all
*XOFFB	RMB	1
*XOFFSV	RMB	2
*	...

	ORG	SOMETHING
* All of these work fine in-line, rather than called as subroutines
*
* Two ways to negate D on the 6809:
NEGD	COMA		; 6800 version -- still no NEGD
	NEGB            ; and sign extending doesn't help.
	BNE	NEGDX	; or BCS. but BNE works -- extends 0
	INCA
NEGDX	RTS
*
NEGDS	PSHS	D	; slightly slower, uses stack
	LDD	#0
	SUBD	,S++
	RTS
*
* Unsigned byte offset
* Absolutely should in-line. X only.
ADDBX	ABX	; X only
	RTS
*
* For unsigned byte offset other than X, zero extend B into A
* Destroys A.
* Should in-line for Y or U. Should use ABX for X. Must in-line for S.
ADDBY	CLRA	; for Y/U/S, zero extend B for unsigned offset
	LEAY	D,Y
	RTS
*
* Signed byte offset
* Should in-line for X, Y or U. Must in-line for S.
ADSBX	LEAX	B,X	; sign extended B, Y/U/S also
	RTS
*
* Signed byte offset
* Should in-line for X, Y or U. Must in-line for S.
SBSBX	NEGB		; signed subtract B, Y/U/S also
	LEAX	B,X
	RTS
*
* Unsigned byte offset, zero extend A
* Destroys A
* Could in-line for X, Y or U. Must in-line for S.
SUBBX	CLRA	; B is unsigned, therefore positive
* 16-bit offset, must in-line for S.
SUBDX	COMA		; no NEGD
	NEGB
	BNE	ADDDX	; or BCS. but BNE works -- extends
	INCA
* 16-bit offset, must in-line for S
ADDDX	LEAX	D,X	; Y/U/S also
	RTS

* Alternatively, use D for explicit subtraction
* Here as an example of math that can be done,
* probably not as a useful subroutine.
SUBBXS	CLRA	; B is unsigned, destroys A
SUBDXS	PSHS	D	; for subtraction
	EXG	X,D	; X to subtract, save D
	SUBD	,S++	; do the subtraction
	EXG	X,D	; Offset result to X, restore D
	RTS

* No particular reason to try to use ABX in signed byte offset.
* This is a solution to a puzzle, not useful code.
* You don't really want to do this.
ADDSBX	TSTB
	BPL	ADDSBXA
	LEAX	B,X	; Absolutely no reason not to use this in the first place.
	RTS
ADDSBXA	ABX
	RTS

*************
* For S stack
* As mentioned above, just in-line the LEAS.
* These are also provided as a solution to a puzzle,
* not as useful code.
*
* Signed byte offset
ADSBS	PULS	X	; get return address, restore stack address
	LEAS	B,S	; you really could just in-line this.
	JMP	,X	; return via X
*
* Unsigned byte offset, zero extend A, destroys A, X
ADDBS	CLRA		; just in-line the CLRA and the LEAS D,S
* 16-bit offset
ADDDS	PULS	X	; get return address, restore stack address
	LEAS	D,S
	JMP	,X	; return
*
* Do you really want to do this?
* Unsigned byte offset, zero extend into A, destroys A
SUBBS	CLRA
SUBDS	COMA
	NEGB
	BNE	ADDDS	; or BCS. but BNE works -- extend
	INCA
	BRA	ADDDS	; let ADDDS handle the return address and the math
* Do the math in D for explicit subtraction
* No more useful than the rest of this for X.
* Here just as an example of math that can be done.
SUBBSS	CLRA	; B is unsigned, destroys A
SUBDSS	LDX	,S	; get return address
	STD	,S	; save D
	TFR	S,D	; get S without endangering the stack
	ADDD	#2	; adjust for having D on the stack
	SUBD	,S	; finally subtract the offset
* Alternative 1, leaves D destroyed
	TFR	D,S	; update stack pointer
	JMP	,X	; return via X
* Alternative 2, restores offset in D
	PSHS	D	; working realllllly hard not to destroy D.
	LDD	2,S	; got the offset
	LDS	,S	; update S
	JMP	,X

* INX and DEX trains and INS and DES trains are meaningless.
* HOWEVER, just to remind ourselves:
* (And all of these work for Y and U, too but IN-LINE them!!)
* (They work for S if in-lined, as well.)
ADD16X	LEAX	16,X
	RTS
ADD14X	LEAX	14,X
	RTS
SUB16X	LEAX	-16,X
	RTS
* Etc. In-line these.
INX	LEAX	1,X	; Sigh. In-line it. Do not make trains with it. Please.
	RTS
DEX	LEAX	-1,X	; See INX. In-line it. Do not make trains with it. PLEASE.
	RTS
*
* More solutions to puzzles.
* If you called these, you would have to juggle the return address as shown.
* You don't want to do that.
* Just in-line the LEAS instructions.
* Then there's no return address to juggle, no messing with X.
* DO NOT USE THIS CODE other than examples of silly walks.
ADD16S	PULS	X
	LEAS	16,S
	JMP	,X
* etc.
* Could all be replaced with just LEAS	16,S; in-line!
* That's actually cheaper than just the instruction JSR!!!


* And stacks restricted within page boundaries make no sense at all on the 6809.
* Pseudo-register somewhere in DP:
QSP	RMB	2	; a synthetic stack pointer Q
	...
	ORG	SOMETHING
	RMB	4	; buffer zone
QSTKLIM	RMB	64
QSTKBAS	RMB	4	; buffer zone
SSTKLIM	RMB	32
SSTKBAS	RMB	4	; buffer zone
	...

* signed B for synthetic stack:
ADBQSP	LDX	QSP
ADBQSX	LEAX	B,X	; does the whole pointer, negatives, too
	STX	QSP
	RTS
*
* unsigned B and D for synthetic stack:
ADUQSP	CLRA		; unsigned B entry point
ADDQSP	LDX	QSP
ADDQSX	LEAX	D,X	; does the whole pointer, negatives, too
	STX	QSP
	RTS
*
* Choose whether you want to negate D or move it around, and see above.
* Or just decide you can add a negative instead of subtracting
*
* Destroys A
SBSQSP	SEX	; sign extend B into A (Yes, that's the mnemonic.)
	BRA	SBDQSP
SBUQSP	CLRA	; B is unsigned, therefore positive
* 16-bit offset
SBDQSP	COMA		; no NEGD
	NEGB
	BNE	ADDQSP	; or BCS. but BNE works -- extends
	INCA
	BRA	ADDQSP	; ADDQSP loads QSP into X before the LEAX

* Alternatively, use D for explicit subtraction
SBSQSPS	SEX	; sign extend B into A (Yes, that's the mnemonic.)
	BRA	SBDQSPS
SBUQSPS	CLRA	; B is unsigned, destroys A
SBDQSPS	PSHS	D	; for subtraction
	LDD	QSP	; Get things in the right place
	SUBD	,S++	; do the subtraction
	STD	QSP	; update
	RTS

* More stuff that there is no reason to do.
* Just in-line the LEAS B,S
ADBSP	PULS	X	; return address
	LEAS	B,S	; signed B, but full 16-bit address math.
	JMP	,X
*
* Just in-line the LEAS D,S
* D for return stack (but we saw this above):
ADDSP	PULS	X	; return address
	LEAS	D,S
	JMP	,X
*
* Just in-line the NEGB	and LEAS B,S, Still cheaper than the call.
* signed B for return stack:
SBBSP	PULS	X	; return address
	NEGB
	LEAS	B,S	; full 16-bit address math
	JMP	,X	; return address is in X, not on the stack
*
* This one might be worth a routine for,
* if you actually have to do it.
* D for return stack (but we saw this above):
SBDSP	PULS	X	; return address
	COMA
	NEGB
	BNE	SBDSPM
	INCA
SBDSPM	LEAS	D,S
	JMP	,X
* or
SBDSPS	LDX	,S	; return address	
	STD	,S	; offset
	TFR	S,D
	ADDD	#2	; adjust it
	SUBD	,S
	TFR	D,S
	JMP	,X

As you can see, the 6809 just basically does almost all the address math you need without subroutines.

Uhm, until we get to arrays, but let's not do that yet.

[JMR202411031752 correction:]

In the comments to the code, I suggested (or asserted?) that there would be no reason on the 6809 to allocate a stack entirely within a single page so that the stack pointer math would never overflow, and the increment and decrement could be handled with the INC and DEC instructions only, ignoring overflow.

On my way to bed last night, I realized that would not entirely be true.

Pointer variables in the direct page cannot be indirected without loading the variable into an index register. So if your top of stack pointer is process local, there would be no point in not using the auto-inc/dec modes and LEA instructions to do the index updates.

But if the synthesized stack or queue is global to all processes (such as a system resource allocation stack or queue), it may be reasonable to use absolute (extended mode) addressing, in which case memory indirection is available. In that case, it may be completely sensible to use the optimization of no-overflow INC or DEC in a stack or queue allocated entirely within a single page:

* A synthetic stack contained entirely in a page,
* using absolute (extended mode) addressing:
	ORG	$400	; anywhere that ESPLOB to ESPHIB-1 are all within a page
ESPLOB	RMB	4	; bumper, lowest related address
ESPLIM	RMB	64	; 32 2-byte items possible on stack
ESPBAS	RMB	4	; bumper
ESPHIB	EQU	*	; highest related address (plus 1)
	...
ESP	RMB	2	; only the low byte will change
	...
EPSHD	DEC	ESP+1	; stack all within a page!
	DEC	ESP+1	; no carry
	STD	[ESP]	; indirection
	RTS
*
EPOPD	LDD	[ESP]	; indirection
	INC	ESP+1	; stack all within a page!
	INC	ESP+1	; no carry
	RTS
*
ADDBESP	ADDB	ESP+1	; signed
	STB	ESP+1	
	RTS
*
SUBBESP	PSHS	B	; unsigned
	LDB	ESP+1	; low byte only, stack all within a page
	SUBB	,S+
	STB	ESP+1
	RTS
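
For reference, here is a sketch of how these routines might be called, assuming the declarations above. Initializing ESP to ESPBAS as the empty-stack value is an assumption on my part, based on EPSHD decrementing before it stores. Untested, like the rest:

	LDD	#ESPBAS	; assumed empty-stack value, just past the highest item
	STD	ESP
	...
	LDD	#$1234
	JSR	EPSHD	; ESP is now ESPBAS-2, with the item stored there
	...
	JSR	EPOPD	; D is $1234 again, ESP is back at ESPBAS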

Hopefully, I can devote a chapter or three to giving this proper treatment somewhere down the road.

[JMR202411031752 correction end.]

Oh, and I have mentioned, I think, the DP register, and how it isn't as fully supported as I'd have liked.

The DP can be used as a base for per-process global variables (in other words, variables local to the process, but globally/statically allocated within the process). I discussed this to a certain extent in the 6800 addressing math chapter.

* On the 6800 or 6801, this would be referenced by a process-local
* LOCALBASE or similar pseudo-register, which I almost forgot to talk about.
* How to get the effective address of a variable in DP:
* Instead of 
*	LEAX	<VAR
* or
*	LEAX	VAR,DP
* or even 
*	LEAX	VAR-DPBASE,DP
* which we do not have in the 6809,
* we can do this --
*
* Given 
	ORG	$nn00		; even 256-byte page address
	SETDP	$nn
DPBASE	EQU	*
*	...
VAR	RMB	m
*
* In-line snippets --
* For variable VAR within 256 bytes of DPBASE:
	...
	LDB	#VAR-DPBASE	; put the offset in DP in B (unsigned)
	TFR	DP,A		; pull the base address high byte into A
	TFR	D,X		; move it to X
	...
*
* Using DP when VAR is 256 bytes or more away from DPBASE:
	...
	TFR	DP,A		; pull the base address high byte into A
	CLRB			; make the full base address
	ADDD	#VAR-DPBASE	; add the offset
	TFR	D,X		; move it to X
	...
*
* Or, if the assembler lets us split the offset up with advanced math:
	...
	TFR	DP,A
	LDB	#(VAR-DPBASE)&$FF	; bit-and mask -- no carry!
	ADDA	#(VAR-DPBASE)/$100	; add the high byte
	TFR	D,X
	...
* 
* As subroutines --
* unsigned offset in B:
LEADPUX	TFR	DP,A		; pull the base address high byte into A
	TFR	D,X		; move it to X
	RTS
*
* unsigned offset in D:
LEADPDX	TFR	D,X		; park the offset in X before we reuse D
	TFR	DP,A		; pull the base address high byte into A
	CLRB			; make the full base address
	LEAX	D,X		; add the base to the offset
	RTS
*
* Because DP is not in the index post-byte,
* in some applications, it may be better to keep 
* LOCBAS as a pseudo-register,
* in which case it would look like this --
* for small offsets < 128: 
ADDLBB	LDX	<LOCBAS	; but do this in-line!
	LEAX	B,X
	RTS
* for 127 < offset < 256, maybe, maybe not:
ADDLBU	CLRA		; unsigned offset
* for larger offsets
ADDLBD	LDX	<LOCBAS	; and definitely do this in-line, too!
	LEAX	D,X
	RTS	
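
A sketch of the in-line version, for when the offset is known at assembly time. LOCVAR is a hypothetical name for a variable's offset from the per-process base, not something defined above. Untested:

LOCVAR	EQU	6	; hypothetical offset of some variable from LOCBAS
	...
	LDX	<LOCBAS	; per-process base
	LDD	LOCVAR,X	; constant-offset indexed mode does the add for us
* or, if the address itself is what we want:
	LDX	<LOCBAS
	LEAX	LOCVAR,X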

As with the previous two chapters, I have not tested the code. It should run, modulo typos.

Even though I keep saying things like "in-line this", and "you don't need that", it may be hard to visualize the impact that 6809 addressing modes have on addressing math until we compare the stack frame code for the 6800 and 6801 to the stack frame code for the 6809.

Likewise the 68000. But let's get an overview of addressing math on the 68000 before we take a look at a concrete example of stack frames on the 6801. And on our way to addressing math on the 68000, let's take a detour for multi-byte negation on the 6809.

Or you can jump ahead to getting numeric output in binary.


(Title Page/Index)


 

 

 

 

Wednesday, October 30, 2024

ALPP 02-21 -- Some Address Math for the 6801

  Some Address Math
for the
6801

(Title Page/Index)

I had thought I would not need to show this for the 6801, but the difference between addressing math on the 6800 and on the 6801, due to being able to add and subtract the double accumulator and being able to push and pop X, is dramatic enough that I guess I should.

This chapter, then, will be an extension of the handwaving and conceptualizing in the unsteady footing chapter.

Even if you aren't interested in stack frames, this discussion of addressing math should be useful, although I'm adding it a bit earlier than I had planned.

In the 6801, as I keep noting, we have ABX to help us with address math, but no corollary SBX. 

But the D register math is wide enough to do addresses, the big problem being in moving addresses between D and X. Two pushes and a pop, or two pops and a push, is not bad, but going through a pseudo-register in the direct page works quicker, though it takes more bytes of object code. And sometimes you don't want to use the whole D accumulator.
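
For comparison, here is a sketch of the transfers in question. The byte counts are for the instructions as shown, and XOFFSV is the direct-page pseudo-register declared further down. Untested, like the rest:

* D to X through the stack, 3 bytes:
	PSHB		; low byte first, so it lands at the higher address
	PSHA
	PULX
* D to X through a direct-page pseudo-register, 4 bytes:
	STD	XOFFSV
	LDX	XOFFSV
* X to D through the stack, 3 bytes:
	PSHX
	PULA		; high byte comes off first
	PULB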

Now that I think of it, a sign-extend B into A instruction like the 6809's sign-extend instruction, SEX, might have been helpful in a few places. (cough.) Still, just using D is not an onerous burden.

We still have to use a pseudo-register for many/most of the calculations.

-- Which causes issues at interrupt time, unless we have separate pseudo-registers used by separate routines for interrupt-time, and copy the user tasks' pseudo-registers out and in on context switch.

Here are those NEGate D snippets, modified for 6801:

* For reference -- NEGate a 16-bit value in D (same as 6800) --
NEGD	COMA	; 2's complement NEGate is bit COMplement + 1
	NEGB
	BNE	NEGDX	; or BCS. but BNE works -- extends 0
	INCA
NEGDX	RTS
*
* Another way (use Double accumulator subtract):
NEGDS	PSHB
	PSHA
	CLRB	; 0 - D
	CLRA
	TSX
	SUBD	0,X
	INS
	INS
	RTS	
*
* Same thing using Double accumulator and a temporary
* somewhere in DP:
	...
SCRCHD	RMB	2
	...
* somewhere else
NEGDV	STD	SCRCHD
	LDD	#0	; 0 - D
	SUBD	SCRCHD
	RTS
	...

Remember to read the code and the comments in the code, and open up a separate browser window to compare side-by-side with the 6800. Read through my transliterations from the 6800, but don't jump to conclusions before you get to the very end.

Again, assume you have these declarations for the pseudo-registers:

	ORG	$80
	...
XOFFA	RMB	1
XOFFB	RMB	1
XOFFSV	RMB	2
	...

Using D is so much faster than either 8-bit accumulator that it really doesn't make much sense to provide anything but D-offset, but I've kept the 8-bit and subtract-by-negating entry points for reference. Lack of a negate D means this way to subtract de-optimizes subtraction, and, since the D offset is 16-bit, it's quicker to just load a negative offset in D and call ADDDX instead of bothering with using the SUBDX entry point.

ADDBX	CLRA
ADDDX	STX	XOFFSV
	ADDD	XOFFSV
	STD	XOFFSV
	LDX	XOFFSV
	RTS
SUBBX	CLRA	; B is unsigned
SUBDX	COMA
	NEGB
	BNE	ADDDX	; or BCS. but BNE works -- extends
	INCA
	BRA	ADDDX
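
For instance, backing X up by six bytes with a negative offset might look like this, assuming the ADDDX entry point above (a sketch, untested):

	LDD	#-6	; 16-bit two's complement, $FFFA
	JSR	ADDDX	; X and XOFFSV now hold the old X minus 6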

If you want a SUBDX entry point for some reason, it may be worth keeping the logic separate and moving the operands. The Double accumulator math speeds this up significantly.

* Alternative, don't use ADDDX, use XOFFA and XOFFB instead 
SUBBX	CLRA	; B is unsigned
SUBDX	STD	XOFFA	; subtraction does not commute.
	STX	XOFFSV	; Handle operand order.
	LDD	XOFFSV
	SUBD	XOFFA
	STD	XOFFSV
	LDX	XOFFSV
	RTS

Just so I don't gloss over ABX, here's ADDBX as a subroutine. 8-bit offset SUBBX remains as it was for the 6800, except using ABX for the add means there's no code sharing:

* Working in byte offsets just takes that much more code than D,
* these are all superfluous.
* Well, the ABX instruction can be useful in-line.
* Alternative unsigned byte only
* subtract needs to be checked again
* range 0 to 255
ADDBX	ABX
	STX	XOFFSV
	RTS
* No improvements here without just using D.
SUBBX	STX	XOFFSV
	NEGB		; offset becomes $FF:B, unless B was 0
	BEQ	SUBBXX	; subtracting 0, nothing to do
	ADDB	XOFFSV+1	; inverting the add
	BCS	SUBBXL	; carry out cancels the $FF in the high byte
	DEC	XOFFSV	; no carry, so bring the -1 in
SUBBXL	STAB	XOFFSV+1
	LDX	XOFFSV
SUBBXX	RTS

Using ABX for the positive half of the signed 8-bit routines also emphasizes the lack of SBX in the 6801:

* ABX partially improves the positive half of things here,
* but you really don't want to do this.
* Needs to be checked again.
ADDSBX	STX	XOFFSV
	TSTB	; sign extend B
*	BEQ	ADSBXD	; use only if we really want to optimize 0
	BPL	ADSBXU
	ADDB	XOFFSV+1	; negative: high byte of the offset is $FF (-1)
	BCS	ADSBXN	; carry out cancels the -1
	DEC	XOFFSV	; no carry, so add the -1
ADSBXN	STAB	XOFFSV+1
	LDX	XOFFSV
ADSBXD	RTS
ADSBXU	ABX
	STX	XOFFSV
	RTS

Return stack pointer math with byte offsets loses its meaning on the 6801. You really want the speed when doing math on S, so you're just going to use D.

PSHX and PULX help with handling the return address.

Again, you should recognize that the call writes the return address into the allocated space on allocation, so if you've stored before allocation, you'll be walking on what you stored.

The declarations (note that we are adding SOFFA for the double accumulator):

* For S stack
* Even though we really don't want to be bumping the return stack that far,
* Using D is just faster on the 6801
	ORG	$90
	...
SOFFA	RMB	1
SOFFB	RMB	1
SOFFSV	RMB	2

And the code, watch the return address code:

* Here's what we can use the 6801 extensions for when doing unsigned byte offsets,
* but, really, use D instead:
	ORG	SOMETHING
ADDBS	PULX	; get return address, restore stack address
	STS	SOFFSV
ADDBSA	ADDB	SOFFSV+1	; can't use ABX because we need X for return
	BCC	ADDBSL
	INC	SOFFSV
ADDBSL	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return through X
SUBBS	NEGB		; offset becomes $FF:B, unless B was 0
	BEQ	ADDBS	; subtracting 0 is adding 0
	PULX	; get return address, restore stack address
	STS	SOFFSV
	DEC	SOFFSV	; add the $FF high byte
	BRA	ADDBSA

Doing it with D instead, but use negative offsets instead of the SUBDS entry point:

* Do it with D, instead, but use negative offsets instead of SUBDS:
ADDDS	PULX	; get return address, restore stack address
	STS	SOFFSV
	ADDD	SOFFSV	; can't use ABX because we need X for return
ADDDSL	STD	SOFFSV
	LDS	SOFFSV
	JMP	0,X	; return through X
SUBDS	COMA
	NEGB
	BNE	ADDDS	; or BCS. but BNE works -- extend
	INCA
	BRA	ADDDS

Moving the operands around, if we think we must subtract positive offsets instead of adding negative offsets, gets a lot of improvement. Again, just use D instead and call SUBDS instead of trying to optimize with the 8-bit B accumulator:

* use SOFFB instead of ADDBS
* range -128 to 127 as written (the BPL/INC sign-extend B);
* for unsigned 0 to 255, drop the BPL and the INC
* Need to check again
SUBBS	PULX	; get return address, restore stack pointer
	STS	SOFFSV
	STAB	SOFFB
	BPL	SUBBSM
	INC	SOFFSV	; subtract -1 (I think )
SUBBSM	LDAB	SOFFSV+1
	SUBB	SOFFB
	BCC	SUBBSL
	DEC	SOFFSV	; subtract the borrow
SUBBSL	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return through X
* Do it with D, instead
* use SOFFA instead of ADDDS
SUBDS	PULX	; get return address, restore stack pointer
	STS	SOFFSV
	STD	SOFFA
	LDD	SOFFSV
	SUBD	SOFFA
	STD	SOFFSV
	LDS	SOFFSV
	JMP	0,X	; return through X

At this point, I think it is obvious that long trains of INX are meaningless on the 6801: Two to four, in-line, sure. More, no.

Long trains for S also become questionable, but PULX can make an appearance, which is interesting, if only as something to think about:

* For small decrements, 8 <= decrement <= 16
* Uses X for return
* Note that this writes the return address in the upper portion of the allocation,
* which may not be desirable.
SUB14S	PULX
	BRA	ISB14S
SUB12S	PULX
	BRA	ISB12S
SUB10S	PULX
	BRA	ISB10S
SUB8S	PULX
	BRA	ISB8S
SUB16S	PULX
ISB16S	DES
	DES
ISB14S	DES
	DES
ISB12S	DES
	DES
ISB10S	DES
	DES
ISB8S	DES
	DES
	DES	; SUB7S and less are shorter in-line
	DES
	DES	
	DES
	DES
	DES
	JMP	0,X
* For small increments, 6 <= increment <= 16
* Uses X for return
ADD14S	PULX
	BRA	IAD14S
ADD12S	PULX
	BRA	IAD12S
ADD10S	PULX
	BRA	IAD10S
ADD8S	PULX
	BRA	IAD8S
ADD6S	PULX
	BRA	IAD6S
ADD16S	PULX
IAD16S	INS
	INS
IAD14S	INS
	INS
IAD12S	INS
	INS
IAD10S	INS
	INS
IAD8S	INS
	INS
IAD6S	INS	; ADD5S and less are shorter in-line
	INS
	INS	
	INS
	INS
	INS
	JMP	0,X

I guess, since I'm being noisy about SBX not being implemented on the 6801, I should also be noisy about ABS (add B to S) and SBS (subtract B from S) being missing.

But so much of the above really becomes irrelevant if we just liberate ourselves from the stack frame mentality/paradigm. Stack frames really ought to be classed among Monty Python's silly walks. 

Stacks allocated entirely within a single page

Concerning the optimization of allocating stacks entirely within a page and only doing math on the low byte, the 6801 offers no improvements to it, and only makes the optimization less meaningful. I'll repeat it, with the full address math below for comparison.

Oh, but working directly on the parameter stack pointer becomes more interesting.

* And stacks restricted within page boundaries no longer make as much sense on the 6801.
* Pseudo-registers somewhere in DP:
PSP	RMB	2
XOFFSV	RMB	2
XOFFA	RMB	1
XOFFB	RMB	1
SOFFA	RMB	1
SOFFB	RMB	1
SOFFSV	RMB	2
	...
	ORG	$500	; or something
	RMB	4	; buffer zone
PSTKLIM	RMB	64
PSTKBAS	RMB	4	; buffer zone
SSTKLIM	RMB	32
SSTKBAS	RMB	4	; buffer zone
	...

* B for parameter stack:
ADBPSX	STX	PSP
ADBPSP	ADDB	PSP+1	; Stack allocated completely within page, never carries.
	STAB	PSP+1
	LDX	PSP
	RTS
*
* D for parameter stack:
ADDPSX	STX	PSP
ADDPSP	ADDD	PSP
	STD	PSP	; does the whole pointer, negatives, too
	LDX	PSP
	RTS
*
* B for parameter stack:
SBBPSX	STX	PSP
SBBPSP	STAB	XOFFB
	LDAB	PSP+1
	SUBB	XOFFB	; Stack allocated completely within page, never carries.
	STAB	PSP+1
	LDX	PSP
	RTS
*
* D for parameter stack:
SBDPSX	STX	PSP
SBDPSP	STD	XOFFA
	LDD	PSP
	SUBD	XOFFA	; does the whole pointer
	STD	PSP
	LDX	PSP
	RTS

* B for return stack:
ADBSP	PULX	; return address
	STS	SOFFSV
	ADDB	SOFFSV+1	; Stack allocated completely within page, never carries.
	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return via X
*
* D for return stack (but we saw this above):
ADDSP	PULX	; return address
	STS	SOFFSV
	ADDD	SOFFSV	; does the whole pointer, negatives, too
	STD	SOFFSV
	LDS	SOFFSV
	JMP	0,X	; return via X

* B for return stack:
SBBSP	PULX	; return address
	STS	SOFFSV
	STAB	SOFFB
	LDAB	SOFFSV+1
	SUBB	SOFFB		; Stack allocated completely within page, never carries
	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return via X
*
* D for return stack (but we saw this above):
SBDSP	PULX	; return address
	STS	SOFFSV
	STD	SOFFA
	LDD	SOFFSV
	SUBD	SOFFA		; does the whole pointer
	STD	SOFFSV
	LDS	SOFFSV
	JMP	0,X	; return via X

As with the last chapter, I have not tested the code. I do think it should run, modulo typos.

[JMR202411021012 addendum:]

 Not stack frame related, but address math. I discussed it in the 6800 address math chapter, and I want to show the 6801 version of the code. 

This is for accessing per-process global variables that don't need access fast enough to justify slowing process switches down to save and restore them, which covers almost all per-process variables except when the hardware application has only a few very limited processes. See the discussion before the 6800 snippets.

* In the DP somewhere, we have a base pointer for the per-process
* variable space, and a working register, something like
	...
LOCBAS	RMB	2
LBXPTR	RMB	2
	...
*
* And, to get the address of variables in the per-process variable space,
* something like these functions --
ADDLBB	CLRA			; entry point for the byte offset in B
ADDLBD	ADDD	LOCBAS		; entry point for larger offsets in A:B
	STD	LBXPTR
	RTS
*
ADDLBX	BSR	ADDLBB	; and load X
	LDX	LBXPTR
	RTS
*
ADDLDX	BSR	ADDLBD	; and load X
	LDX	LBXPTR
	RTS

[JMR202411021012 addendum end.]

And with this in mind, too, while thinking about how the 6801's enhanced instruction set can make some of the above code much less intransigent, let's remind ourselves why the 6809 and 68000 don't need routines like these before we take a look at a concrete example of stack frames on the 6801.

Or you can jump ahead to getting numeric output in binary.


(Title Page/Index)


 

 

 

 

ALPP 02-20 -- Some Address Math for the 6800

  Some Address Math
for the
6800

(Title Page/Index)

Perhaps I would not have gotten so tangled up in the discussion of stack frames if I had simply written this chapter immediately after the demonstration of 16- and 32-bit arithmetic on the 68000. But sometimes you just need to see a reason for doing something before you see someone doing it, or it blows your mind.

What is the difference between address math and other math?

Not a lot. You still have to pay attention to signs and stuff, and watch what happens when you wrap around the limits of your registers. Rings are fun, but you have to get used to them. 

Ah, yes, right. One thing about general address math is that you need to be aware of the limits of your registers. You often don't know in advance where in memory the address you're working on is going to be.

Not to say you don't have to be aware of limits in non-address math -- rather, where the limits hit and how they hit can be different, so you have to watch a different way.

One other difference is that, for general math, you want your call and result parameters in places where they can be easily carried from one stage in calculations to the next. That's why I have been demonstrating the use of the parameter stack versus global variables (versus registers).

For address math, if possible, you absolutely want your parameters and the result in registers, specifically the result in a particular register that can be used in addressing.

In the earliest CPUs, the math itself was hard enough (unknown enough) that addressing seemed to be an afterthought -- or even outside the plans. You can't plan well without knowing what you're planning for -- and what you're planning.

We really didn't know what we were doing. 

Intel, for instance, almost killed themselves in the mid-1970s working on a CPU design that was supposed to be the be-all-and-end-all of CPUs, the iAPX 432. But there was too much theory without experience, and it was slow and fought with itself to get work done. When they saw deadlines pass without end in sight, especially when rumors of what Motorola was doing hit the backyard fence, they scrambled and used part of what they had learned and produced the 8086, and the 8086 was definitely an improvement on the 8080 -- and saved their bacon when the 432 that was delivered didn't live up to promise. And the 8086 was a small enough step forward that it was easy for customers to adopt -- setting the stage for Intel to lead by adopting small improvements in steps that could be handled. But the 8086 also was, and its descendants still are, more than a little baroque.

Motorola, for their part, had figured out they needed to do something radical to stay competitive, and had started examining source code for the 6800 that they had access to, looking for ways to relieve computational bottlenecks. They used that research in the original design of the 68000, and there was a parallel team that had access to the research and put it to use in the design of the 6809.

And they hit a home run on the 6809 -- almost. Brought three runners in and left the DP register stranded on 3rd, so to speak. If you think of DP as the pinch runner or something. Okay, the metaphor doesn't quite work, unless you think of the DP register as the pinch runner for a wider address space, which it almost was.

The 68000 was another home run -- out of season and some overkill. And it has some warts, too.

Every real CPU is going to have warts. It's a mathematical requirement. 

I'm not kidding. There is an axiom in systems science:

Every model is insufficient to reality.

And that has some consequences:

  • Every system has vulnerabilities, and
  • every system contains the seeds of its own undoing, and
  • every market window is a sandpit.

Translated into general science, we know in advance that every theory and every law will eventually fail.

But that kind of cold water just is not popular in the sales department, so, instead of emblazoning it on the halls of all higher learning and in the chambers of legislatures, we hide it away. 

(Mostly -- there is some recognition at times -- POSIWID.) 

All of that to warn you:

Ugly code in here. 

I did some handwaving and conceptualizing for the 6801 in the unsteady footing chapter. I'm continuing with more handwaving and untested code in this chapter, but for the 6800.

First, in the 6800, we have nothing special to add a constant to the index register with anything but an ephemeral result. That's great for some things like constant offsets (thus, the 6800's indexed mode), but not so great for some other things. And it's always a positive constant, which makes some stack-related uses hard.

In the 6801, we have ABX to add a small offset -- unsigned, less than 256 -- but no SBX to subtract an offset, and no signed ASBX or whatever.

The way the instruction set is constructed, we end up having to use a variable in memory to do the math, and because we have to use X to index the stack(s), passing the offset in as a dynamically allocated parameter is a case of trying to resolve a cyclic dependency.

Thus, we simply have to use a pseudo-register -- preferably in the direct page. 

-- Which causes issues at interrupt time, unless we have separate pseudo-registers used by separate routines for interrupt-time, and copy the user tasks' pseudo-registers out and in on context switch.

We'll be using 16-bit negation rather frequently, so keep a couple or three snippets in mind:

* For reference -- NEGate a 16-bit value in A:B --
NEGAB	COMA	; 2's complement NEGate is bit COMplement + 1
	NEGB
	BNE	NEGABX	; or BCS. but BNE works -- extends 0
	INCA
NEGABX	RTS
*
* Another way, using stack for temporary:
NEGABS	PSHB
	PSHA
	CLRB	; 0 - A:B
	CLRA
	TSX
	SUBB	1,X
	SBCA	0,X
	INS
	INS
	RTS	
*
* Same thing using a temporary
* somewhere in DP:
	...
SCRCHA	RMB	1
SCRCHB	RMB	1
	...
* somewhere else
NEGABV	STAA	SCRCHA
	STAB	SCRCHB
	CLRA		; 0 - A:B
	CLRB
	SUBB	SCRCHB
	SBCA	SCRCHA
	RTS
	...

I'm going to assume that you'll be reading the code and the comments closely enough to tell when you should doubt me.

Assume you have these declarations for the pseudo-registers:

	ORG	$80
	...
XOFFA	RMB	1
XOFFB	RMB	1
XOFFSV	RMB	2
	...

These entry points should add and subtract offsets in A:B. Note that the code inverts A:B to do the subtraction, to avoid commutation issues. (Note carefully the INCA. I think I have this right for handling the NEGB when B is zero.)

ADDBX	CLRA
ADDDX	STX	XOFFSV
	ADDB	XOFFSV+1
	ADCA	XOFFSV
	STAB	XOFFSV+1
	STAA	XOFFSV
	LDX	XOFFSV
	RTS
SUBBX	CLRA	; B is unsigned
SUBDX	COMA
	NEGB
	BNE	ADDDX	; or BCS. but BNE works -- extends
	INCA
	BRA	ADDDX
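
As a hand check of that INCA, here is a trace of the negation for two offsets, written as comment lines. It is a worked example, not more code to assemble:

* A:B = $0001 (1):
*	COMA	A = $FF
*	NEGB	B = $FF, not zero, so the BNE is taken
*	result A:B = $FFFF (-1)
* A:B = $0100 (256):
*	COMA	A = $FE
*	NEGB	B = $00, zero, so we fall through
*	INCA	A = $FF
*	result A:B = $FF00 (-256)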

As an alternative, we could move the operands around using more pseudo-registers (and remembering the consequences). This code may be a little easier to believe in, but it does mean two more bytes to save away and restore on context switch.

* Alternative, don't use ADDDX, use XOFFA and XOFFB instead 
SUBBX	CLRA	; B is unsigned
SUBDX	STAA	XOFFA	; subtraction does not commute.
	STAB	XOFFB	; Handle operand order.
	STX	XOFFSV
	LDAA	XOFFSV
	LDAB	XOFFSV+1
	SUBB	XOFFB
	SBCA	XOFFA
	STAA	XOFFSV
	STAB	XOFFSV+1
	LDX	XOFFSV
	RTS

You can optimize the above a bit if you limit offsets to 0 to 255, which is a completely reasonable restriction for many applications. I won't show those. I don't want to wear you out with too much untested code.

Signed byte offset (-128 to 127) is also completely reasonable for many applications, and may offer some aesthetic satisfaction:

* this is faster than SUBDX and almost as fast as ADDDX, 
* Range is -128 to 127 which should be enough for many purposes.
* But unsigned byte-only can be faster.
* Needs to be checked again.
ADDSBX	STX	XOFFSV
	TSTB	; sign extend B
*	BEQ	ADSBXD	; use only if we really want to optimize 0
	BPL	ADSBXU
	ADDB	XOFFSV+1	; negative: high byte of the offset is $FF (-1)
	BCS	ADSBXL	; carry out cancels the -1
	DEC	XOFFSV	; no carry, so add the -1
	BRA	ADSBXL
ADSBXU	ADDB	XOFFSV+1
	BCC	ADSBXL
	INC	XOFFSV
ADSBXL	STAB	XOFFSV+1
	LDX	XOFFSV
ADSBXD	RTS

And we can do similar things with the return stack, S. S, in particular, should never need offsets larger than 255 on the 6800, so we'll focus on the unsigned byte options. 

The stack has the additional constraints of requiring some means of handling the return address.

One more thing, you should recognize that the call writes the return address into the allocated space on allocation. If there is something important there, it's toast.

The declarations:

* For S stack
* unsigned byte only,
* because we really don't want to be bumping the return stack that much
	ORG	$90
	...
SOFFB	RMB	1
SOFFSV	RMB	2

And the code, watch the return address code:

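ADDBS	TSX
	LDX	0,X	; get return address
	INS
	INS		; restore stack address
	STS	SOFFSV
ADDBSA	ADDB	SOFFSV+1
	BCC	ADDBSL
	INC	SOFFSV
ADDBSL	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return through X
SUBBS	NEGB		; offset becomes $FF:B, unless B was 0
	BEQ	ADDBS	; subtracting 0 is adding 0
	TSX
	LDX	0,X	; get return address
	INS
	INS		; restore stack address
	STS	SOFFSV
	DEC	SOFFSV	; add the $FF high byte
	BRA	ADDBSA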

Again, the subtraction can alternatively move the operands into the right order, at the cost of using another pseudo-register:

* use SOFFB instead of ADDBS
* range -128 to 127 as written (the BPL/INC sign-extend B);
* for unsigned 0 to 255, drop the BPL and the INC
* Need to check again
SUBBS	TSX
	LDX	0,X	; get return address
	INS
	INS		; restore stack pointer
	STS	SOFFSV
	STAB	SOFFB
	BPL	SUBBSM
	INC	SOFFSV	; subtract -1 (I think )
SUBBSM	LDAB	SOFFSV+1
	SUBB	SOFFB
	BCC	SUBBSL
	DEC	SOFFSV	; subtract the borrow
SUBBSL	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return through X

You'll remember I made reference to long trains of INX and DEX as a substitute for direct math on X:

* For small increments <= 16
ADD16X	INX
	INX
ADD14X	INX
	INX
ADD12X	INX
	INX
ADD10X	INX
	INX
ADD8X	INX
	INX
ADD6X	INX
	INX
	INX	; ADD4X and less shorter in-line
	INX
	INX	
	INX
	RTS

* For small decrements <= 16
SUB16X	DEX
	DEX
SUB14X	DEX
	DEX
SUB12X	DEX
	DEX
SUB10X	DEX
	DEX
SUB8X	DEX
	DEX
SUB6X	DEX
	DEX
	DEX	; SUB4X and less shorter in-line
	DEX
	DEX	
	DEX
	RTS

Just jump to the label for the offset you need to add or subtract.
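
For example, to move X forward ten bytes using the labels above (a sketch, untested):

	JSR	ADD10X	; runs the last ten INX of the train, then the RTS
* while something as small as three is shorter in-line:
	INX
	INX
	INX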

I know it looks ... ugly. But it works, and it avoids the use of pseudo-registers, and it's fast, and it actually doesn't use up more code space than the general routines we've looked at. These are worth considering.

And you're thinking, well, that's not going to work for the return stack? 

Hah!

* For small decrements, 8 <= decrement <= 16
* Uses X for return
* Note that this writes the return address in the upper portion of the allocation,
* which may not be desirable.
SUB14S	TSX
	LDX	0,X
	BRA	ISB14S
SUB12S	TSX
	LDX	0,X
	BRA	ISB12S
SUB10S	TSX
	LDX	0,X
	BRA	ISB10S
SUB8S	TSX
	LDX	0,X
	BRA	ISB8S
SUB16S	TSX
	LDX	0,X
ISB16S	DES
	DES
ISB14S	DES
	DES
ISB12S	DES
	DES
ISB10S	DES
	DES
ISB8S	DES
	DES
	DES	; SUB7S and less are shorter in-line
	DES
	DES	
	DES	; two less because of the return address
	JMP	0,X

* For small increments, 8 <= increment <= 16
* Uses X for return
ADD14S	TSX
	LDX	0,X
	BRA	IAD14S
ADD12S	TSX
	LDX	0,X
	BRA	IAD12S
ADD10S	TSX
	LDX	0,X
	BRA	IAD10S
ADD8S	TSX
	LDX	0,X
	BRA	IAD8S
ADD16S	TSX
	LDX	0,X
IAD16S	INS
	INS
IAD14S	INS
	INS
IAD12S	INS
	INS
IAD10S	INS
	INS
IAD8S	INS
	INS
	INS	; ADD7S and less are shorter in-line
	INS
	INS	
	INS
	INS
	INS
	INS	; two more to cover the return address
	INS
	JMP	0,X

What's that? Do I hear complaints about the smell?

It's ugly, but it could be useful.

Stacks allocated entirely within a 256-byte page

Finally, if we are talking about stacks (and other largish things in memory), it may be possible to arrange them in memory so that the stacks lie completely within a single 256 byte page, such that the high byte of address does not change. This particular trick was used to great effect on the 6502 and 6805, in particular. 

We can use it on the 6800 in some cases, if we can be absolutely sure that everybody who ever touches the code is aware of the requirement to keep each stack entirely within a single page.

* Stacks within page boundaries:
* Pseudo-registers somewhere in DP:
PSP	RMB	2
XOFFSV	RMB	2
XOFFB	RMB	1
SOFFB	RMB	1
SOFFSV	RMB	2
	...
	ORG	$500	; or something
	RMB	4	; buffer zone
PSTKLIM	RMB	64
PSTKBAS	RMB	4	; buffer zone
SSTKLIM	RMB	32
SSTKBAS	RMB	4	; buffer zone
	...

* For parameter stack:
ADBPSX	STX	PSP
ADBPSP	ADDB	PSP+1	; Stack allocated completely within page, never carries.
	STAB	PSP+1
	LDX	PSP
	RTS
*
SBBPSX	STX	PSP
SBBPSP	STAB	XOFFB
	LDAB	PSP+1
	SUBB	XOFFB	; Stack allocated completely within page, never carries.
	STAB	PSP+1
	LDX	PSP
	RTS

* For return stack:
ADBSP	TSX
	LDX	0,X	; return address
	ADDB	#2		; faster, same byte count
	STS	SOFFSV
	ADDB	SOFFSV+1	; Stack allocated completely within page, never carries.
	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return via X

SBBSP	TSX
	LDX	0,X	; return address
	SUBB	#2		; faster, same byte count (we are subtracting B, so adjust the other way)
	STS	SOFFSV
	STAB	SOFFB
	LDAB	SOFFSV+1
	SUBB	SOFFB		; Stack allocated completely within page, never carries
	STAB	SOFFSV+1
	LDS	SOFFSV
	JMP	0,X	; return via X

Again, I have not tested the code. It should run. I think.

As a reminder, we've already seen what code looks like without stack frames. The only reason I'm showing you this stuff is so that you understand why stack frames may not be preferred for many applications (and, if you can understand that, maybe you can sometime see it for all applications).

Well, no, not the only reason. Maybe the only reason I'm showing it to you now rather than later.

[JMR202411020931 addendum:]

This is not stack frame related, but it's address math related, and I think it would be good to discuss it here, lest I forget --

There are two approaches to per-process variables. 
  • Pseudo-registers like PSP, XWORK, XOFFSV, SOFFSV, etc. will either be saved and restored on process switch or will have separate versions for each task, if there are not too many.
  • Most per-process variables with global allocation should be in a per-process address space. 

You'll usually use both: a few pseudo-registers for variables that need quick access, and there should be just a few, to keep the management overhead on task/process switch to a minimum. Every pseudo-register must be saved and restored on process switch --

Except for a couple of special cases, 

  • It's useful to keep system pseudo-registers separate from non-system pseudo-registers, complete with separate routines to manage them.
  • If there are just a few non-system processes in a small hardware application, it may be useful to give each process its own pseudo-registers, along with the routines to manage them.

What kinds of things need to be pseudo-registers? 

XWORK and other such temporaries, including SOFFSV and such above.

And PSP, as well. (Note that, if the system functions use a parameter stack, it should be a separate SPSP or something, which would have to have its own support routines.)

If there are a lot of per-process variables, you would need, separate from pseudo-registers, a process-local space. And you would need a pointer to that space, with routines to access the variables there:

* In the DP somewhere, we have a base pointer for the per-process
* variable space, and a working register, something like
	...
LOCBAS	RMB	2
LBXPTR	RMB	2
	...
*
* And, to get the address of variables in the per-process variable space,
* something like these --
ADDLBB	CLRA			; entry point for the byte offset in B
ADDLBD	ADDB	LOCBAS+1	; entry point for larger offsets in A:B
	ADCA	LOCBAS
	STAA	LBXPTR
	STAB	LBXPTR+1	; let other code load X
	RTS
*
ADDLBX	BSR	ADDLBB	; and load X
	LDX	LBXPTR
	RTS
*
ADDLDX	BSR	ADDLBD	; and load X
	LDX	LBXPTR
	RTS

[JMR202411020931 addendum end.]

With all this in mind, look at how the 6801's enhanced instruction set can make some of the above code much less intransigent before we take a look at a concrete example of stack frames on the 6800.

Or you can jump ahead to getting numeric output in binary.


(Title Page/Index)