!pr2
Prime Benchmark for 65802..................Bob Sander-Cederlof

Jim Gilbreath really started something.  He is the one who popularized the use of the Sieve of Eratosthenes as a benchmark program for microcomputers and their various languages.  You can read about it in BYTE September 1981, "A High-Level Language Benchmark"; and later in BYTE January 1983, "Eratosthenes Revisited".

In a nutshell, the benchmark creates an array of 8192 bytes, representing the odd numbers from 1 to 16383.  The prime numbers in this array are flagged by the program using the Eratosthenes algorithm.  All of the times published in the BYTE articles are for ten repetitions of the algorithm.
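As a rough illustration of the layout just described, here is a small Python model (not any of the published programs). It assumes byte i of the array stands for the odd number 2*i+1, and the names `sieve_flags` and `primes_from_flags` are made up for this sketch.

```python
def sieve_flags(size=8192):
    """Return a bytearray in which flags[i] != 0 means 2*i+1 is composite."""
    flags = bytearray(size)            # all zero: nothing flagged yet
    limit = 2 * size - 1               # largest number represented (16383)
    i = 1                              # index 1 stands for the number 3
    while (2 * i + 1) ** 2 <= limit:
        if flags[i] == 0:              # 2*i+1 is prime, so sift its multiples
            p = 2 * i + 1
            # Start at p*p; moving p bytes ahead adds 2*p to the number,
            # which skips the even multiples automatically.
            for j in range((p * p - 1) // 2, size, p):
                flags[j] = 1
        i += 1
    return flags


def primes_from_flags(flags):
    """List the odd primes: every unflagged index except 0 (the number 1)."""
    return [2 * i + 1 for i in range(1, len(flags)) if flags[i] == 0]
```

Calling `primes_from_flags(sieve_flags())` gives the odd primes below 16384, starting 3, 5, 7, 11.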

The second article lists page after page of timing results for various computers and languages.  They range from .0078 seconds for an assembly language version running in an IBM 3033, to 5740 seconds for a Cobol version in a Xerox 820.

There are many factors which affect the results, not just the basic speed of the computer involved.  The language used is obviously significant, as some languages are more efficient than others for particular purposes.  Slight variations in the implementation of the Eratosthenes algorithm can be very significant.  The skill and persistence of the programmer are also very important.

Gilbreath's times for the Apple II vary from 2806 seconds for an Applesoft version to 160 seconds for a Pascal version.  The same table shows an OSI Superboard, using a 6502 like the Apple, ran an assembly language version in 13.9 seconds.  (I don't know what the clock rate of the OSI board was.)

We have published a series of articles in AAL on the same subject.  "Sifting Primes Faster and Faster", in October 1981, gave programs in Apple assembly language by William Robert Savoie and myself.  At the time I had overlooked the fact that BYTE's times were for ten trips through the program, so I was perhaps a little overly enthusiastic.  The table below shows the adjusted times for ten repetitions.

!lm+5
       Version                 Time in seconds
       
     My Integer BASIC version      1880
     Mike Laumer's Int BASIC       2380
     Mike's compiled by FLASH!      200
     Bill Savoie's 6502 assembly   13.9
     My first re-write of Bill's    9.3
     My 6502 version                7.4
     My 6502 with faster clear      6.9
!lm-5

I challenged you readers to do it faster, and some of you did.  Charles Putney ("Even Faster Primes", Feb 1982 AAL) knocked the time for ten trips down to 3.3 seconds.  Tony Brightwell ("Faster than Charlie", Nov 1982 AAL) combined tricks from number theory with a faster array clear technique to trim the time to 1.83 seconds.

Peter McInerney sent us an implementation he did on the DTACK Grounded 68000 board, which uses a 12.5 MHz clock.  His program ("68000 Sieve Benchmark", July 1984 AAL) did 10 repetitions in .4 seconds.  (An 8 MHz time was logged in the BYTE article at .49 seconds.  Upping the clock speed does not always speed everything up proportionally, due to the need to wait for slower memory chips.)  I translated Peter's code back to 6502 code in "Updating the 6502 Prime Sifter", same issue.  My time for ten loops was 1.75 seconds.  In that article I stated, "...it remains to be seen what the 65802 could do."

David Eyes, in his new book on 65816 Assembly Language, presents a version which uses the expanded capabilities in that chip.  He evidently did not build on our base, because his time for a 4 MHz 65816 was 1.56 seconds.  I presume that means if the clock rate was the same as Apple's it would have taken 6.24 seconds.  I have been previewing David's book, from the galleys, but the listing of that program was not included in the material I received from the typesetter.

I decided to try updating my 1984 version to 65802 code, using whatever tricks I could come up with.  The result runs ten times in 1.4 seconds in the 65802 plugged into my Apple II Plus.  I suppose that means a 4 MHz version would run in .35 seconds, or faster than a 12.5 MHz 68000!

Lines 1100-1210 are an outer shell to drive the PRIME program.  The shell begins and ends by ringing the Apple bell, to help me run my stopwatch.  I ran the PRIME program 1000 times, and then divided the time by 100 to get the seconds for ten repetitions.  In between ringing the bells everything is done in 65802 mode.  Lines 1110-1120 turn on "native" mode, and lines 1190-1200 restore "emulation" mode.

When you switch on native mode the M and X bits always come up as 1's.  That is, both are set to 8-bit mode.  The M-bit controls the size of operations on the A-register, and the X-bit controls the size for the X- and Y-registers.  Line 1130 turns on 16-bit mode for the A-register.  I use this setting throughout the rest of the program, until we go back to emulation mode.  All operations which affect the A-register will be 16-bits, while I will only use X and Y with 8-bit values.

Lines 1140-1180 call PRIME 1000 times.  Since I have Mbit=0, line 1140 uses the 16-bit LDA immediate.  STA COUNT stores both bytes:  the low byte at COUNT and the high byte at COUNT+1.  DEC COUNT decrements the full 16-bit value, returning a .NE. status until both bytes are zero.  This is certainly a lot easier than a two-byte decrement in 6502 code:

       LDA COUNT
       BNE .1
       DEC COUNT+1
  .1   DEC COUNT
       BNE ...      ...NOT AT 0000 YET
       LDA COUNT+1
       BNE ...      ...NOT AT 0000 YET
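
To check that the seven-line 6502 sequence really matches a single 16-bit DEC, here is a small Python model of both. It is only a sketch: the function names are invented here, and the "not zero" result stands in for the .NE. status tested by the BNE instructions.

```python
def dec16_two_byte(lo, hi):
    """Mimic the 6502 sequence: borrow from the high byte when low is 0."""
    if lo == 0:                      # LDA COUNT / BNE .1 falls through
        hi = (hi - 1) & 0xFF         # DEC COUNT+1
    lo = (lo - 1) & 0xFF             # .1 DEC COUNT
    not_zero = lo != 0 or hi != 0    # the two BNE tests that follow
    return lo, hi, not_zero


def dec16_native(value):
    """Mimic DEC COUNT with the M-bit clear (16-bit memory operand)."""
    value = (value - 1) & 0xFFFF
    return value, value != 0
```

Running both over every starting value from 0 to 65535 gives the same count and the same zero test each time.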

Line 1140 may need some explanation, since there are now at least four assemblers available for the Apple which handle 65802 assembly language.  Each of the four has chosen a different way to inform the assembler about the number of bytes to assemble for immediate operands.  S-C Macro uses a "#" to indicate an 8-bit operand, and "##" to indicate a 16-bit immediate operand.  This seems to me to be the easiest to figure out when I come back to read a program listing after several weeks of working on something else.  The "double #" is an immediate visual clue (pun intended) that the immediate operand is double size.

Since ORCA/M was a Hayden Software product, and David Eyes was product manager of ORCA/M at Hayden as well as an early contributor to 65816 design, ORCA/M turned out to be the first assembler to include 65816 support.  Mike Westerfield had a version running before the rest of us even knew the 65816 was going to exist.  Consequently, Mike's and David's choices for assembly syntax and rules have achieved the honor of being used in the 65816 data sheet and in David's book.

Mike and David decided to inform the assembler what size immediate operands to use with two assembler directives.  LONGA controls the size of immediate operands on LDA, CMP, ADC, ORA, EOR, AND, BIT, and SBC:  LONGA ON makes them 16-bits, LONGA OFF makes them 8-bits.  Likewise, LONGI ON or OFF controls the immediate operands on LDX, LDY, CPX, and CPY.  You have to sprinkle your code with these so that the assembler always knows which size to use.  Since the directives may not be close to the affected lines of code, it can be a chore to read unfamiliar source code.

Merlin Pro uses a single directive to inform the assembler as to the settings of M and X which will be in effect at execution time.  The directive is called "MX", and can have an operand of 0, 1, 2, or 3 (or a symbol whose value is 0-3).  The bits of the value correspond to the M- and X-bit settings:

       MX 0    M=0, X=0  (both 16-bits)
       MX 1    M=0, X=1  (A/16, XY/8)
       MX 2    M=1, X=0  (A/8, XY/16)
       MX 3    M=1, X=1  (A/8, XY/8)
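
The mapping from the MX operand to register sizes can be spelled out in a few lines of Python. This is only a sketch that assumes what the M= and X= columns of the table show: bit 1 of the operand is the M-bit, bit 0 is the X-bit, and a set bit means 8-bit mode. The function `decode_mx` is a made-up name, not part of Merlin.

```python
def decode_mx(value):
    """Translate an MX operand (0-3) into M, X and the register widths."""
    if not 0 <= value <= 3:
        raise ValueError("MX operand must be 0, 1, 2 or 3")
    m = (value >> 1) & 1             # assumed: bit 1 mirrors the M-bit
    x = value & 1                    # assumed: bit 0 mirrors the X-bit
    return {
        "M": m,
        "X": x,
        "A_bits": 8 if m else 16,    # a set bit selects 8-bit mode
        "XY_bits": 8 if x else 16,
    }
```

With this reading, `decode_mx(1)` describes the setting used in the PRIME program: a 16-bit A-register with 8-bit X and Y.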

I understand that the latest version of Lazerware's Lisa Assembler supports the 65816, but I don't have a copy.  I do not know how Randy Hyde indicates immediate operand size.

By the way, in all of the assemblers it is entirely up to you, the programmer, to keep all the immediate sizes correct.  There is no way for an assembler to second-guess you on this.  If you tell it to make a 16-bit operand, and then execute that instruction in 8-bit mode, the third byte will be treated as the next opcode.  Vice versa is just as bad.  I have blown it many times already, with the result that I am a lot more careful now.
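
To make the mismatch concrete, here is a small Python sketch of how the processor would split the same bytes under each M-bit setting. It assumes the instruction is LDA immediate (opcode $A9) with a 16-bit operand of 1000 ($03E8), followed by an STA absolute opcode ($8D). The helper `decode_lda_immediate` is hypothetical and models only this one opcode.

```python
def decode_lda_immediate(code, m_flag):
    """Split bytes starting with LDA # (opcode $A9) according to the M-bit.

    Returns the operand value and the bytes the processor would fetch next.
    """
    if code[0] != 0xA9:
        raise ValueError("this sketch only models LDA immediate")
    size = 1 if m_flag else 2                    # M=1: 8-bit, M=0: 16-bit
    operand = int.from_bytes(code[1:1 + size], "little")
    return operand, code[1 + size:]


# Assembled as a 16-bit immediate: A9 E8 03, then the STA opcode 8D.
code = bytes([0xA9, 0xE8, 0x03, 0x8D])
good = decode_lda_immediate(code, m_flag=0)      # operand $03E8, next byte $8D
bad = decode_lda_immediate(code, m_flag=1)       # operand $E8, next byte $03
```

In the mismatched case the $03 byte, which was meant to be the high half of the operand, is the next thing fetched as an opcode.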

Now let's look at the PRIME subroutine itself.  The first section clears an array of 8192 bytes, storing $00 in each byte.  There are a lot of ways to store zeroes.  The most obvious is with a loop of STA addr,X lines, such as we used in previous versions.  The 65802 has a STZ instruction, which stores zero without using the A-register, but it is not faster.  We could store a zero at the beginning of the array and then use an overlapping MVN instruction to copy that zero through the whole array:

       LDX ##BASE
       LDY ##BASE+1
       LDA ##8190
       MVN 0,0

That would be simple, but it would take over 56000 cycles.  We can do a lot better than that.
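
The overlapping move and its cost can be modeled in Python as below. This is a sketch under two assumptions: MVN copies in ascending order at 7 cycles per byte moved, and BASE is $6000, where the array sits in this article. The function `mvn` is a stand-in written for this example, not a real library call.

```python
def mvn(memory, src, dst, count_minus_one):
    """Model MVN within one bank: copy count_minus_one+1 bytes, ascending."""
    n = count_minus_one + 1
    for k in range(n):
        # Each byte is read after the previous byte was written, so a zero
        # at src ripples forward when dst is src+1.
        memory[dst + k] = memory[src + k]
    return 7 * n                      # assumed cost: 7 cycles per byte


BASE = 0x6000
memory = bytearray([0xFF]) * 0x10000  # pretend RAM full of garbage
memory[BASE] = 0                      # seed the first byte with zero
cycles = mvn(memory, BASE, BASE + 1, 8190)
# cycles is 7 * 8191 = 57337, and BASE..BASE+8191 is now all zero
```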

My version uses the PHD instruction 4096 times to push 8192 zeroes on the stack.  I start by setting the stack register to point at the last byte of my array (BASE+8191).  Each PHD pushes the direct page register (which is currently set to $0000) on the stack.  My loop includes 16 PHD's, so 256 times around will fill the array (or empty it, if you like).  All this action is in lines 1320-1380.  To save space in the source code, rather than write 16 lines of PHD's, I wrote them out as hex strings in lines 1350-1360.
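
Here is a Python model of that clearing loop, offered as a sketch rather than a transcription of lines 1320-1380. It assumes a 16-bit push stores the high byte at S and the low byte at S-1, then drops S by two. The name `clear_with_phd` is invented for the example.

```python
def clear_with_phd(memory, base, size=8192, d_register=0x0000):
    """Fill memory[base:base+size] by pushing the D-register repeatedly."""
    s = base + size - 1                        # S starts at the last byte
    pushes = 0
    for _ in range(256):                       # 256 trips around the loop
        for _ in range(16):                    # sixteen PHDs per trip
            memory[s] = d_register >> 8        # high byte goes in first
            memory[s - 1] = d_register & 0xFF  # then the low byte
            s -= 2
            pushes += 1
    return s, pushes
```

After the call the stack pointer is left at base-1, just below the array, and 4096 pushes have been made. At 4 cycles per PHD that is 16384 cycles for the pushes themselves, before loop overhead, against the 56000-plus for the MVN approach.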

Lines 1310, 1390-1410 save and restore the original stack pointer.  (At first I didn't do this, with disastrous results!  The stack pointer was sitting just below the cleared array.  When I did an RTS, the next opcode encountered was $00, which is a BRK.  Since I was in native mode, the BRK vectored through $FFE6,7 instead of $FFFE,F.  Et cetera.)  Note that the TSX only saves the low byte of S, because X is in 8-bit mode.  I am assuming that the high byte was $01, since I came from normal Apple 6502 code.  Lines 1390-1400 put $01 in front of the low byte, and the TCS puts both bytes back in the S-register.

Lines 1430-1440 push the address of the fifth byte in the array onto the stack.  Since the 65802 has a stack-relative addressing mode, we can access the pointer with an address of "1,S".  Remember the bytes in the array represent the odd numbers.  The fifth byte represents the number 9, which is the square of the first odd prime (3).  (At a very slight penalty in speed, we can change line 1430 to "LDA ##BASE" and delete line 1460.)

Lines 1480-1520 update the pointer we are keeping on the stack to point to the next square.  For an explanation of how this works, go to the July 1984 and Nov 1982 articles.  Lines 1530-1540 skip the sifting process for numbers that have already been flagged as non-prime.
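
For readers without those back issues handy, the arithmetic can be sketched in Python. This is one way to step from square to square, and not necessarily the way lines 1480-1520 do it. With byte i standing for 2*i+1, the square of that number sits at index 2*i*(i+1), so successive squares are 4*(i+1) bytes apart. Both function names are made up for the sketch.

```python
def square_index(i):
    """Array index of the square of the odd number 2*i+1."""
    p = 2 * i + 1
    return (p * p - 1) // 2           # same as 2*i*(i+1)


def walk_squares(last_index):
    """Step the square pointer by addition only, one odd number at a time."""
    idx = square_index(1)             # index 4: the fifth byte, number 9
    out = []
    for i in range(1, last_index + 1):
        out.append((2 * i + 1, idx))
        idx += 4 * (i + 1)            # gap to the next odd number's square
    return out
```

The first three entries are (3, 4), (5, 12) and (7, 24), matching 9, 25 and 49.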

Lines 1550-1580 compute the prime number itself from the index (2*index+1) and store it into the operand bytes of the "ADC ##" instruction at line 1630.  Ouch! Self-modifying code!  But that is often the price of speed.

Line 1590 picks up the pointer to the square of the prime, which is the first number that must be flagged as non-prime, from our holding location on the stack.  Lines 1610-1640 get tricky.  Line 1610 puts the current pointer in the D-register, which tells where in RAM the direct page starts.  This means that the "STX 0" in line 1620 stores into the byte pointed to.  X was holding the current index, so we are storing a non-zero number into that byte, which flags it as being non-prime.

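As a pleasant side effect, the non-zero numbers being stored in the array have meaning.  If we double the value we stored and add one, we will get the value of a prime factor of the non-prime number.  After the whole PRIME program has executed, the flag value will produce the largest prime factor that was actually sifted for that number.  Since sifting for each prime starts at its square, that is the largest prime factor whose square does not exceed the number.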
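
A short Python model shows what ends up in the flags. It is a sketch that assumes the flag stored is the index of the sifting prime and that sifting for each prime begins at its square, as described above. The names `sieve_with_index_flags` and `recorded_factor` are hypothetical.

```python
def sieve_with_index_flags(size=8192):
    """Sieve the odd numbers, storing the sifting prime's index as the flag."""
    flags = bytearray(size)
    i = 1
    while (2 * i + 1) ** 2 <= 2 * size - 1:
        if flags[i] == 0:
            p = 2 * i + 1
            for j in range((p * p - 1) // 2, size, p):
                flags[j] = i          # later primes overwrite earlier ones
        i += 1
    return flags


def recorded_factor(flags, n):
    """Return the prime recorded for odd n, or None if n was never flagged."""
    f = flags[(n - 1) // 2]
    return None if f == 0 else 2 * f + 1
```

Under this model `recorded_factor(flags, 16383)` is 127, while 15 comes back as 3 rather than 5, because the sifting for 5 does not begin until 25.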

In the loop of lines 1610-1640, we keep adding the prime number to the pointer value in the A-register, and transferring the result to the D-register.  Hence the STX 0 will store X at multiples of the prime number.  The loop terminates when the pointer value in the A-register goes negative.  Why?  Because we carefully positioned the array from $6000 to $7FFF.  The first time we add the prime to the pointer and get an address $8000 or higher, we know we went off the end of the array.  Addresses of $8000 or higher will set the negative status flag, so our loop terminates.
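
The termination trick can be tried out in Python with real addresses. This is a sketch of the loop's logic only, assuming the array sits at $6000-$7FFF and that the order is store, add, then test the sign bit. The name `sift_one_prime` is invented here.

```python
BASE = 0x6000                         # array occupies $6000-$7FFF


def sift_one_prime(memory, index):
    """Flag the multiples of 2*index+1, starting at its square."""
    prime = 2 * index + 1
    a = BASE + (prime * prime - 1) // 2   # pointer to the square
    while a < 0x8000:                 # bit 15 clear, so still "positive"
        memory[a] = index             # TCD, then STX 0 stores X at D+0
        a = (a + prime) & 0xFFFF      # add the prime to the 16-bit pointer
    return a                          # first address with bit 15 set
```

For index 1 (the prime 3) the first bytes flagged are BASE+4 and BASE+7, which stand for 9 and 15, and nothing at $8000 or above is ever touched.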

Lines 1660-1680 bump the prime index by one, and test for having reached the largest prime of interest.  If not, we go back to sift out the next one.  If we are finished, lines 1690-1700 restore the D-register to point to true page zero.  Line 1710 pops the pointer off the stack, and that's all there is to it!



<<<<listing>>>>

Here is an Applesoft program which will look through the array PRIME produces.  Every zero byte in the array indicates a prime number.  The value of the prime number at ARRAY+I is I*2+1, since the array only represents odd numbers.  This program prints out the value 1 first, which really is not considered a prime number, but it does make the table easier to read.

The program is designed to display 10 8-character fields on a line, which works well on the Apple 80-column screen.  I left out the code to print a RETURN after 10 numbers, because the Apple screen automatically goes to the next line.

Line 120 prints out the primes.  Delete line 125 if all you want to see is primes.  Line 125 prints the prime factor recorded in the flag byte of each nonprime, followed by "*" and the other factor (which may not be prime).  For example, 16383 is printed as 127*129.

100  HIMEM: 24576
110  FOR A = 24576 TO 32767
120  IF  PEEK (A) = 0 THEN
      PRINT RIGHT$("       " + STR$((A - 24576)*2+1),8);
125  IF  PEEK (A) <> 0 THEN
      F1 =  PEEK (A) * 2 + 1
      : F2 = ((A - 24576) * 2 + 1) / F1
      : PRINT RIGHT$("      "+STR$(F1)+"*"+STR$(F2),8);
140  NEXT 
