Fastest 6502 Multiplication Yet................Charles Putney
                                    Shankill, Dublin, Ireland

Here is an 8x8 multiply routine that will blow your socks off!  The maximum time, including both a calling JSR and a returning RTS, is only 66 cycles!  The minimum is 60 cycles, and most factors will multiply in 63 cycles.  Recall that the fastest time in Bob S-C's January 1986 AAL article for a 6502 was 132 cycles.  My new one is twice as fast!

As with most fast routines, there is a trade-off in memory space.  My program uses 1024 bytes of lookup tables.  That is not a bad price if you really need or want a 2:1 speed advantage.

My routine is based on the fact that:

       4 * X * Y = (X+Y)^2 - (X-Y)^2

I got this idea from an article in EDN Magazine by Arch D. Robison (October 13, 1983, pages 263-4).  His routine used the fact that:

       2 * X * Y = X^2 + Y^2 - (X-Y)^2

Robison's method requires three dips into the lookup tables.  Recast to use the same parameter-passing convention as my routine, his method takes either 74 or 77 cycles.  Here is my rendition of his method:





   <<<<listing of Robison's program>>>>
The entries in the two tables (SQL and SQH) are the squares of the numbers from 0 to 255, divided by two.  The low bytes are in the SQL table, and the high bytes are in SQH.  Dividing by two throws away an important bit for odd factors, but lines 1160-1170 compensate for the loss.
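
As an illustration only (this is a Python model, not the assembly listing), the half-square method can be sketched as follows.  The table names SQL and SQH follow the text; the helper names `half_square` and `robison_mult8` are hypothetical.  The correction step assumes the compensation amounts to adding 1 when both factors are odd, which is what the arithmetic requires; the listing's lines 1160-1170 may achieve it differently.

```python
# Half-square tables: floor(n*n/2) for n = 0..255, split into low and high bytes.
SQL = [((n * n) // 2) & 0xFF for n in range(256)]
SQH = [((n * n) // 2) >> 8 for n in range(256)]


def half_square(n):
    """Reassemble the 16-bit table entry floor(n*n/2)."""
    return (SQH[n] << 8) | SQL[n]


def robison_mult8(x, y):
    """Model of X*Y = X^2/2 + Y^2/2 - (X-Y)^2/2 using three table lookups."""
    total = half_square(x) + half_square(y) - half_square(abs(x - y))
    # Assumption: when both factors are odd, each of the two added entries
    # lost 1/2 to truncation while the subtracted entry (even difference)
    # lost nothing, so the result is 1 too small and must be corrected.
    if (x & 1) and (y & 1):
        total += 1
    return total & 0xFFFF
```

When exactly one factor is odd, the difference is odd too, so one added entry and the subtracted entry each lose 1/2 and the errors cancel without any correction.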

I looked for a way to add fewer table entries together and came upon the sum^2 - diff^2 identity, which needs only two dips into the tables instead of three.  Since the sum can be as large as 255+255=510, the table must cover twice as many entries, so I need twice as much table space.  Lest you despair of typing in such a large table, let me offer an Applesoft program which will write a text file of the source code for the table:







       <<<listing of Applesoft source creator>>>>







My tables contain the squares divided by four.  I can hear you saying, "Wait a minute!  You can't just divide by four and truncate!"  Well, even squares are all multiples of four; odd squares are all one more than a multiple of four.  The sum of two numbers and the difference of the same numbers are either both even or both odd.  So either neither table entry is truncated, or both lose the same 1/4, and the loss cancels in the subtraction.  Therefore, we never lose anything by throwing away our truncated 1.
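
The truncation argument can be checked exhaustively with a short Python model.  This is a sketch of the arithmetic only, not of the MULT8 listing; the names `SQ4_LO`, `SQ4_HI` and `mult8` are hypothetical, and the byte-wise subtraction with borrow is an assumption about how a 16-bit subtract would be done on an 8-bit machine.

```python
# Quarter-square tables: floor(n*n/4) for n = 0..511, split into bytes.
# 512 entries x 2 bytes = 1024 bytes, the table size quoted earlier.
QUARTER_SQUARES = [(n * n) // 4 for n in range(512)]
SQ4_LO = [q & 0xFF for q in QUARTER_SQUARES]
SQ4_HI = [q >> 8 for q in QUARTER_SQUARES]


def mult8(a, x):
    """Model of A*X = (A+X)^2/4 - (A-X)^2/4 with truncated table entries."""
    s = a + x        # sum, 0..510, indexes the full 512-entry table
    d = abs(a - x)   # difference, 0..255
    # Subtract low bytes first, then high bytes with the borrow.
    lo = SQ4_LO[s] - SQ4_LO[d]
    borrow = 1 if lo < 0 else 0
    hi = SQ4_HI[s] - SQ4_HI[d] - borrow
    return ((hi & 0xFF) << 8) | (lo & 0xFF)
```

Looping over all 65536 factor pairs and comparing `mult8(a, x)` with `a * x` confirms that no product is off by the discarded remainder.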

The number of cycles my MULT8 takes depends on the values of the two factors.  You call MULT8 with one factor in the A-register and the other in the X-register.  If (A) is less than (X), it takes an extra 3 cycles to perform a complement operation.  If the sum of the factors is greater than 255, add another three cycles.  To summarize,

                |  A>=X  |  A<X
       ---------------------------
       sum<256  |   60   |   63
       sum>255  |   63   |   66
       ---------------------------
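
The timing table can be restated as a small Python function and tallied over every factor pair.  The function name `mult8_cycles` is hypothetical, and the figures come only from the table above, not from counting instructions in the listing.

```python
from collections import Counter


def mult8_cycles(a, x):
    """Cycle count (including JSR and RTS) per the timing table above."""
    cycles = 60
    if a < x:
        cycles += 3   # extra complement operation
    if a + x > 255:
        cycles += 3   # sum needs the upper half of the tables
    return cycles


# Tally how many of the 65536 factor pairs fall in each timing class.
tally = Counter(mult8_cycles(a, x) for a in range(256) for x in range(256))
average = sum(c * n for c, n in tally.items()) / 65536
```

Under these assumptions exactly half of all factor pairs (32768) take 63 cycles, 16512 take 60, and 16256 take 66, for an average just under 63 cycles.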

Just for fun, I also wrote a program to generate the square/4 tables.  This takes less time than loading the tables from disk, so it could mean faster booting for some hi-resolution game program that needs super-fast multiplications.  It is in lines 1560-2100 below.
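
One way to build such tables without any multiplication is to use running additions.  The Python sketch below shows the idea; it is not a transcription of the assembly routine, which may work differently, and the name `build_quarter_square_tables` is hypothetical.  It relies on the fact that floor((n+1)^2/4) exceeds floor(n^2/4) by floor((n+1)/2).

```python
def build_quarter_square_tables(size=512):
    """Build low-byte and high-byte tables of floor(n*n/4) by addition only."""
    lo = [0] * size
    hi = [0] * size
    value = 0                      # floor(0*0/4)
    for n in range(size):
        lo[n] = value & 0xFF
        hi[n] = value >> 8
        # Step to the next entry: the increment is floor((n+1)/2).
        value += (n + 1) // 2
    return lo, hi
```

Each entry costs one 16-bit add, which is why generating the tables at startup can be quicker than reading 1024 bytes from disk.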

The origin I used in my program is meant just to allow me to test it.  I wrote an Applesoft program to call TEST at $6000 (CALL 24576).  The program POKEd two factors at $FA and $FB, called TEST, and then checked the result at the same two locations.  If you want to use MULT8, you should just assemble it along with the rest of your program, without any special origin.  You should make sure that each table starts exactly on a page boundary (an address ending in $00), or it will cost you up to 8 cycles extra for indexing across a page boundary.
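
The page-boundary penalty comes from the 6502 itself: an indexed read takes one extra cycle whenever adding the index carries into the high byte of the address.  The Python sketch below models that check; the function name `crosses_page` and the example addresses are hypothetical and are not taken from the listing.

```python
def crosses_page(base, index):
    """True if base+index falls in a different 256-byte page than base."""
    return (base & 0xFF) + index > 0xFF


# A table starting on a page boundary never pays the penalty ...
aligned_ok = not any(crosses_page(0x6100, i) for i in range(256))
# ... while one starting mid-page pays it for the upper part of its range.
misaligned_hits = sum(crosses_page(0x6180, i) for i in range(256))
```

With a table base at the hypothetical address $6180, half of the 256 possible index values would incur the extra cycle.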
