!pr3
Speed vs. Space.............................Bob Sander-Cederlof

There are always tradeoffs.  If you have plenty of memory, you can write faster code.  If you have plenty of time, you can write smaller code.  In an "academic" situation you may have plenty of both, so you can write "creative" code, stretching the frontiers of knowledge.  In a "real" world it seems there is never enough time or memory, so jobs have to be finished on a very short schedule, fit in a tiny ROM or RAM, and run like greased lightning.

A case in point is last month's segment of the DP18 series:  the SHIFT.MAC.RIGHT.ONE subroutine on page 8 takes about 1827 clock cycles, and fits in 25 bytes.  Upon reflection, I see a way to write a 34-byte version that takes only 1029 cycles.  If I can use nine more bytes, I can shave about 800 microseconds off each and every multiply.  (Maybe a total of a whole minute per day!)  That might be important, or it might not; but seeing the two techniques side-by-side is probably valuable.

!lm+5
1970 SHIFT.MAC.RIGHT.ONE
1980      LDY #4     4 BITS RIGHT
1990 .0   LDX #1     20 BYTES
2000      LSR MAC
2010 .1   ROR MAC,X
2020      INX        NEXT BYTE
2030      PHP
2040      CPX #20
2050      BCS .2     NO MORE BYTES
2060      PLP
2070      JMP .1
2080 .2   PLP
2090      DEY        NEXT BIT
2100      BNE .0
2110      RTS

1970 SHIFT.MAC.RIGHT.ONE
1980      LDX #0     FOR X=0 TO 19
1990      TXA        NEW 1ST NYBBLE = 0
2000 .1   STA TEMP   SAVE FOR HI NYBBLE
2010      LDA MAC,X  MOVE LOW NYBBLE
2020      ASL            TO HI SIDE
2030      ASL
2040      ASL
2050      ASL
2060      PHA        SAVE ON STACK
2070      LDA MAC,X  MOVE HI NYBBLE
2080      LSR            TO LOW SIDE
2090      LSR
2100      LSR
2110      LSR
2120      ORA TEMP   MERGE WITH NEW
2130      STA MAC,X      HI NYBBLE
2140      PLA        HI NYBBLE OF NEXT BYTE
2150      INX        NEXT X
2160      CPX #20
2170      BCC .1
2180      RTS
!lm-5
The smaller method uses two nested loops.  The inner loop shifts all 20 bytes of MAC right one bit.  The outer loop does the inner loop four times.  If I counted cycles correctly, the time is 4*(19*23+18)+7.  The faster method uses one loop to scan through the twenty bytes one time.  The timing works out as 20*51+9.
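The equivalence of the two routines can be checked with a small model.  What follows is only a sketch in Python, not 6502 code: it assumes MAC is a list of 20 byte values with MAC[0] the most significant, and the function names are made up for illustration.  The cycle formulas are the ones quoted above.

```python
def shift_right_bitwise(mac):
    """Model of the smaller routine: four passes, one bit per pass."""
    mac = list(mac)
    for _ in range(4):                  # outer loop (LDY #4 ... DEY, BNE)
        carry = 0                       # LSR MAC shifts a zero into the top bit
        for x in range(len(mac)):       # LSR MAC, then ROR MAC,X for the rest
            new_carry = mac[x] & 1      # bit that falls out of this byte
            mac[x] = (carry << 7) | (mac[x] >> 1)
            carry = new_carry
    return mac


def shift_right_nybble(mac):
    """Model of the faster routine: one pass, moving whole nybbles."""
    mac = list(mac)
    temp = 0                            # new first nybble = 0 (TXA with X=0)
    for x in range(len(mac)):
        saved = (mac[x] << 4) & 0xFF    # low nybble moved to high side (PHA)
        mac[x] = (mac[x] >> 4) | temp   # high nybble to low side, ORA TEMP
        temp = saved                    # PLA, then STA TEMP on the next pass
    return mac


# Arbitrary test pattern; both models must agree.
mac = [(17 * i + 0x35) & 0xFF for i in range(20)]
assert shift_right_bitwise(mac) == shift_right_nybble(mac)

cycles_small = 4 * (19 * 23 + 18) + 7
cycles_fast = 20 * 51 + 9
print(cycles_small, cycles_fast, cycles_small - cycles_fast)  # 1827 1029 798
```

The difference of 798 cycles is the "about 800 microseconds" mentioned earlier, assuming roughly one microsecond per clock cycle.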

Upon still further reflection, it dawned on me that a 38-byte version could run in 840 cycles!  This version processes the bytes from right to left instead of left to right; eliminates the PHA-PLA and STA-ORA TEMP of the second version above; and loops only 19 times rather than 20.  The timing is 19*43+23.

!lm+5
1970 SHIFT.MAC.RIGHT.ONE
1980      LDX #19      FOR X = 19 TO 1 STEP -1
1990 .1   LDA MAC,X    SHIFT HI- TO LO-
2000      LSR
2010      LSR
2020      LSR
2030      LSR
2040      STA MAC,X    SAVE IN FORM 0X
2050      LDA MAC-1,X  GET LO- OF HIGHER BYTE
2060      ASL
2070      ASL
2080      ASL
2090      ASL
2100      ORA MAC,X    MERGE THE NYBBLES
2110      STA MAC,X
2120      DEX          NEXT X
2130      BNE .1       ...UNTIL 0
2140      LDA MAC      PROCESS HIGHEST BYTE
2150      LSR          INTRODUCE LEADING ZERO
2160      LSR
2170      LSR
2180      LSR
2190      STA MAC
2200      RTS
!lm-5
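Why the right-to-left order needs no temporary may be easier to see in a model.  This is a sketch in Python under the same assumptions as before (MAC as a list of byte values, most significant first; the function name is hypothetical).  Because X counts down, the byte at X-1 has not been touched yet when its low nybble is fetched.

```python
def shift_right_nybble_rtl(mac):
    """Model of the right-to-left routine: no TEMP, no stack."""
    mac = list(mac)
    for x in range(len(mac) - 1, 0, -1):   # FOR X = 19 TO 1 STEP -1
        lo = mac[x] >> 4                   # four LSRs: byte in form 0X
        hi = (mac[x - 1] << 4) & 0xFF      # four ASLs on MAC-1,X
        mac[x] = hi | lo                   # merge the nybbles
    mac[0] >>= 4                           # highest byte gets a leading zero
    return mac


print([hex(b) for b in shift_right_nybble_rtl([0x12, 0x34, 0x56])])
# ['0x1', '0x23', '0x45']

print(19 * 43 + 23)  # 840
```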

Of course an even faster approach is to emulate the loops I wrote for shifting 10 bytes left or right 4 bits: write out the shift of each byte in line, with no inner loop at all.  The program would look like this:

!lm+5
1970 SHIFT.MAC.RIGHT.ONE
1980      LDY #4
1990 .1   LSR MAC
2000      ROR MAC+1
           .
           .
           .
2180      ROR MAC+19
2190      DEY
2200      BNE .1
2210      RTS
!lm-5

This version takes 2+3*20+4 = 66 bytes.  Yet the timing is only 4*(6*20+5)+7 = 507 clock cycles.  And by writing out the four loops all the way, we use 4*3*20 = 240 bytes; the time would be 4*6*20 or 480 cycles.

How about another example?  The MULTIPLY.ARG.BY.N subroutine on the same page last month was nice and short, but very slow.  The subroutine is called once for each non-zero digit in the multiplier, or up to 20 times.  What it does is add the multiplicand to MAC the number of times corresponding to the current multiplier digit.  If we assume the distribution of digits is random, with equal probability for any digit 1...9 in any position, the average number of adds will be 5.  Actually there will be zero digits too, so the average will be 4.5 instead of 5, with the subroutine not even being called for zero digits.

For 20 digits at 4.5 addition loops per digit, that is an average of 90 addition loops.  The maximum, when all digits are 9, is 180 addition loops.

Now, if there is enough RAM around, we can pre-calculate all partial products from 1 to 9 of the multiplicand and save them in a buffer area.  Each partial product will take 11 bytes.  We already have the first one in ARG, so for 2...9 we will need 8*11 or 88 bytes of storage.  It will take 8 addition loops to form these partial products.  Once they are all stored, the MULTIPLY.ARG.BY.N subroutine will always do exactly one addition loop no matter what the non-zero digit is. Therefore the maximum number of addition loops is 8+20 or 28, compared to 180!  And the average (assuming there will be 2 zero digits out of 20 on the average) will be 26 addition loops.
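The counting argument can be sketched as follows.  This is a simplified model in Python, not the DP18 code: it uses ordinary integers instead of packed decimal, takes the multiplier digits most significant first, and stands in for the digit shift with a multiply by 10.  The function names are hypothetical.  Only the number of additions is meant to correspond to the text.

```python
def multiply_repeated_add(arg, digits):
    """Old method: add ARG once per unit of each multiplier digit."""
    mac, adds = 0, 0
    for d in digits:
        mac *= 10                   # stands in for shifting MAC one digit
        for _ in range(d):
            mac += arg              # one addition loop
            adds += 1
    return mac, adds


def multiply_with_table(arg, digits):
    """New method: precompute 1*ARG ... 9*ARG, then one add per digit."""
    table = [0, arg]                # 1*ARG is already available
    adds = 0
    for _ in range(2, 10):          # 8 addition loops build 2*ARG ... 9*ARG
        table.append(table[-1] + arg)
        adds += 1
    mac = 0
    for d in digits:
        mac *= 10
        if d != 0:                  # zero digits cost no addition
            mac += table[d]
            adds += 1
    return mac, adds


arg = 123456789
worst = [9] * 20                    # all digits 9
assert multiply_repeated_add(arg, worst)[0] == multiply_with_table(arg, worst)[0]
print(multiply_repeated_add(arg, worst)[1], multiply_with_table(arg, worst)[1])
# 180 28

typical = [0, 0] + [5] * 18         # 2 zero digits out of 20
print(multiply_with_table(arg, typical)[1])  # 26
```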

The inner loop in MULTIPLY.ARG.BY.N, called "addition loop" above, takes 20 cycles.  If we implement this new method, we will have shortened the average case from 1800 to 520 cycles, and the maximum from 3600 to 560 cycles.  Of course the whole DMULT routine includes more time-consuming code, but this subroutine was the biggest factor.  Taking the SHIFT.MAC.RIGHT.ONE improvements also, we have shortened the overall time in the average case by 2078 cycles, or 2 milliseconds per multiply.  In the maximum case, the savings is nearly 4 milliseconds.
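The savings figures follow from the numbers already given.  In this Python sketch the shift saving is taken from the 34-byte version (1827 less 1029 cycles), which is the choice that reproduces the 2078 quoted above.  The variable names are only for illustration, and one cycle is treated as roughly one microsecond.

```python
CYCLES_PER_ADD = 20                      # one addition loop

old_avg = 90 * CYCLES_PER_ADD            # 1800
old_max = 180 * CYCLES_PER_ADD           # 3600
new_avg = 26 * CYCLES_PER_ADD            # 520
new_max = 28 * CYCLES_PER_ADD            # 560

shift_saving = 1827 - 1029               # 25-byte vs. 34-byte shift routine

avg_saving = (old_avg - new_avg) + shift_saving
max_saving = (old_max - new_max) + shift_saving
print(avg_saving, max_saving)            # 2078 3838
```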

Of course, it takes more code space as well as the 88-byte partial product buffer for the new method.  And it will take more time to write such a program.  You have to make tradeoffs.
1
