Saturday, January 4, 2025

ALPP 03-XX (18) -- Multiplying by Small Constants (Shift Left and Add)

I'm putting this chapter here for reference. I discovered I was heading too deep, too early, but I want to keep this chapter handy. I haven't checked the code or the explanations carefully, be careful if you read or try to use what I have here.

Multiplying by Small Constants
(Shift Left and Add)

(Title Page/Index)

 

We've worked out multiplying by some constant powers of 2, hopefully enough to have some confidence in using bit shifts for multiplying.  

(Note that we are not using Trachtenberg methods.

So, I said that multiplying by small constants that are not powers of ten would still be easy. Considering I said that multiplying by powers of two would be easy, you may be doubting me.

Fair enough.

But let's look at multiplying by constant 3. 

We can look at it several different ways:

3X == X + X + X

Multiplying something by three is adding it to itself three times. That's what the algebra says, and when I say it in somewhat ordinary English, it makes sense. Sort of. But if we are very careful about how we say this, we have to say adding the number to zero three times. 

Adding it to itself three times could actually mean something else, something we just talked about in the last chapter, and are still talking about here.

But multiplying something by two -- doubling -- is adding it to zero twice, or adding it to itself (once). Anyway,

3X == 2X + X

Multiplying something by three is multiplying by two and then adding it again -- or left-shifting in binary once and then adding it to the product again.

For a byte on the 6809, catching carries:

	CLRA	; for carries
	LDB	,U	; get the byte
	LSLB		; 2X
	ROLA		; grab any high bit
	ADDB	,U	; 2X+X==3X
	ADCA	#0	; get any carry

Let's do that to a 16-bit integer, as a subroutine. And capture the carries.

MUL3	LDD	,U
	CLR	,-U	; for carries
	LSLB		; 2X
	ROLA
	ROL	,U	; get carries
	ADDD	1,U	; + X makes 3X
	BCC	MUL3NC
	INC	,U	; get carries
MUL3NC	STD	1,U
	RTS

If we weren't capturing carries and setting it up as a subroutine, it would be really short. 

Speaking of capturing carries, you will note that this subroutine accepts a 16-bit integer as input, but returns a 24-bit integer with the carries in the added most significant byte. 

Let's not tell anyone, but , rather than using shifts, we might have simply loaded and then added twice, also:

MUL3ADD	LDD	,U
	CLR	,U	; for carries
	LDD	1,U		; 1X
	ADDD	1,U	; 2X
	ROL	,U	; get first carry
	ADDD	1,U	; 3X
	BCC	MUL3NC
	INC	,U	; get 2nd carry
MUL3ANC	STD	1,U
	RTS

The first load is the same as adding it to zero, so that's really just multiplying by 3 the hard way, adding it up three times. 

Adding a number to itself is another way to shift it left by one bit, by the way.

How about another way that works on the 6809 and 6801? Both have the MUL A × B instruction that leaves the product in D:

MUL3MUL	LDD	,U
	PSHU	A	; save high byte
	LDA	#3
	MUL		; multiply low byte in B
	STD	1,U	; save 16-bit result
	LDB	,U	; get high byte back
	CLR	,U	; for result addition
	LDA	#3
	MUL		; multiply high byte in B
	ADD	,U
	STD	,U
	RTS

MUL takes 10 cycles on the 6801 and 11 on the 6809, so it would take a little longer than the shift method, but not much, and with a similar byte count.

But this routine using MUL could be generalized into an 8-bit by 16-bit routine that could be fairly quick for both small variables and small constants:

* 16 by 8 to 24-bit multiply
* 16-bit multiplicand in 2nd and 3rd bytes,
* 8-bit multiplier in 8 bits on top:
MUL16X8	LDB	2,U	; low byte
	LDA	,U	; multiplier
	MUL
	STB	2,U	; store result low byte
	PSHU	A	; save result middle byte temp
	LDD	1,U	; multiplier in A, high byte in B
	MUL
	ADDB	,U+	; add result middle byte, pop it
	ADCA	#0	; add the carry into the high byte
	STD	,U	; save middle and high bytes of result
	RTS
* 
MUL3CAL	LDB	#3
MUL3ROB	PSHU	B
	BRA	MUL16X8	; rob the RTS
MUL5CAL	LDB	#5
	BRA	MUL3ROB	; rob the PSHU
MUL10CAL	LDB	#10
	BRA	MUL3ROB	; rob the PSHU

Cool enough?

How about multiplying by 5 via shift-and-add, just to compare? Five is four plus one:

MUL5	LDD	,U
	CLR	,-U	; for carries
	LSLB		; 2X
	ROLA
	ROL	,U	; carries
	LSLB		; 2X again makes 4X
	ROLA
	ROL	,U	; carries
	ADDD	,U	; + X makes 5X
	BCC	MUL5NC
	INC	,U	; get carries
MUL5NC	STD	1,U
	RTS

Similar length to the MUL method, but just a little faster.

How about 10? 

Ten is five times 2.

Shifting left and then calling MUL5 would lose us a carry. But we can call MUL5 and then shift once more left. Or use the shift-and-add MUL5 code and hang another shift on the end:

* Multiply a 16-bit integer by 10,
* return as a 24 bit integer, to keep the carries.
* For the 6809
MUL10	LDD	,U
	CLR	,-U	; push a zero for overflow
	LSLB		; 2X
	ROLA
	ROL	,U	; catch the overflow
	LSLB		; 2X again makes 4X
	ROLA
	ROL	,U	; catch the overflow
	ADDD	1,U	; + X makes 5X
	BCC	MUL10NC
	INC	,U	; carry in the addition
MUL10NC		LSLB		; 2X again makes 10X
	ROLA
	ROL	,U	; catch the overflow
	STD	1,U
	RTS

But if we have the MUL16X8 routine, it's probably going to be about as fast to call that.

That's the 6809. How about actual code on the other processors?

The conversion is straightforward. I recommend trying it. 

For the 6801, you will be loading X from PSP and updating PSP appropriately. Also, you will be able to replace LSLB ; ROLA pairs with LSLD.

For the 6800, in addition to loading and updating PSP, you will need to split the double accumulator operators into individually working on A and B.

As a rough guess, given how fast the MUL is on 6801 and 6809, it would be reasonable to just use MUL and ADD on both. Shift-and-add will be useful on the 6800, however.

For the 68000, don't try to be too literal. Give 32-bit results instead of 24.

Oh, okay, let's look at code for the 68000:

MUL3_16:	; guessing it'll be about 48 processor cycles
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	LSL.L	#1,D7
	ADD.L	(A6)
	MOVE.L	D7,(A6)	; return in 32 bits
	RTS

MUL3_16_ADD:
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	ADD.L	(A6)		; Or maybe do it in-register?
	ADD.L	(A6)
	MOVE.L	D7,(A6)
	RTS

* No 8-bit MUL in 68000
* MULU takes 38 + 2 per bit processor clocks.
MUL3_16_MUL:	; MULU by #3 takes about 46 processor cycles, about 12 memory clocks.
	MOVE.W	(A6)+,D7	; pop it
	MULU.W	#3,D7	; ignore source high, 32-bit result in D7
	MOVE.L	D7,-(A6)
	RTS

* 16 by 16 to 32 bit multiply
* 16-bit multiplicand in 3rd and 4th bytes,
* 16-bit multiplier in 16 bits on top:
MUL16X16:
	MOVE.W	2(A6),D7
	MULU.W	(A6),D7	; ignore source high, 32-bit result in D7
	MOVE.L	D7,(A6)
	RTS
* 
MUL3CAL_16:
	MOVE.W	#3,-(A6)
	BRA.S	MUL16X16	; rob the code
MUL5CAL_16:
	MOVE.W	#5,-(A6)
	BRA.S	MUL16X16	; rob the code
MUL10CAL_16:
	MOVE.W	#10,-(A6)
	BRA.S	MUL16X16	; rob the code

MUL5_16:
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	LSL.L	#2,D7	; 4X
	ADD.L	(A6)	; 4X+X
	MOVE.L	D7,(A6)	; return in 32 bits
	RTS

MUL10_16:
	CLR.W	-(A6)	; for carries
	MOVE.L	(A6),D7
	LSL.L	#2,D7	; 4X
	ADD.L	(A6)	; 4X+X
	LSL.L	#1,D7	; 2(4X+X)
	MOVE.L	D7,(A6)	; return in 32 bits
	RTS

And 16 bit integers ought to be big enough for anyone, right? I mean, who would ever do anything with an integer bigger than 30,000?

(Which is part of where the old "Who needs more than 64K of memory?" excuse for corner-cutting in the early-to-mid 1980s  came from.)

However, we do often want to do this with 32-bit operands, and may want to catch the carries, from that, even. So let's look at giving 32-bit results with carries in 40, I mean, 48 bits. (We shouldn't want to do math in odd numbers of bytes on the 68000.):

Draw your own conclusions about optimization.

 

Recap

 

And now we can do part of what is necessary for decimal input and output. That was quick. 

 

 

 

 

 

Dividing from the top.

 

 

multiply byte by ten, remainder from divide by 100?


***********

 


 


(Title Page/Index)

 

 

 

 

No comments:

Post a Comment