Sunday, February 24, 2019

Taking the Sieve Too Far

A post on a Fortran program for counting primes in the Vintage {computers | microprocessors | microcontrollers } FB group inspired me to dig an old post on small primes back up and refurbish it a bit.

(If you don't know how this works, refer to the post on small primes.)

/* Archetypical implementation of the sieve of eratosthenes in C
** By Joel Rees, Amagasaki, Japan, 2015, 2018
** All rights reserved.
** Permission granted by the author to use this code
** for any purpose,
** on condition that substantial use
** shall retain this copyright and permission notice.

#include <stdio.h>
#include <stdlib.h>

#define MAXSIEVE 100000000L

#define FINALPASS ( ( MAXSIEVE + 1 ) / 2 )

char * sieve;

int main( int argc, char * argv[] )
    unsigned long index, prime;
    unsigned long count = 0;
    unsigned long maxsieve, finalpass;
    char * unscanned;

    if ( argc > 1 )
       maxsieve = strtoul( argv[ 1 ], &unscanned, 0 );
       if ( ! ( unscanned > argv[ 1 ] ) || ( maxsieve > MAXSIEVE ) )
       {  maxsieve = MAXSIEVE;
    finalpass = ( maxsieve + 1 ) / 2;
    sieve = malloc( maxsieve );
    if ( sieve == NULL )
       printf( "Can't allocate %lu byte array, quitting.\n", maxsieve );
       return EXIT_FAILURE;

    sieve[ 0 ] = 0;
    sieve[ 1 ] = 0;
    for ( index = 2; index < maxsieve; index = index + 1 )
    {   sieve[ index ] = 1;

    for ( prime = 2; prime < finalpass; prime = prime + 1)
        if ( sieve[ prime ] == 1 )
        {   for ( index = prime + prime; index < maxsieve; index = index + prime )
            {   sieve[ index ] = 0;

    for ( index = 2; index < maxsieve; index = index + 1 )
        if ( sieve[ index ]  != 0 )
    printf( "%lu primes less than %lu\n", count, maxsieve );
    return EXIT_SUCCESS;

This version only counts primes, and it alows you to specify the largest number.

Execution time on this old 1.2 GHz single processor:

$ time ./primecount 100000000
5761455 primes less than 100000000

real    0m14.008s
user    0m13.733s
sys    0m0.196s

It uses the same sieve-in-array approach as the small primes example, so you probably should not run it at a hundred million if you don't have at least a gigabyte of RAM.

I hope there's no bug and that's the correct answer.

A note:

I had the array statically allocated at one point, and the compile actually took several seconds to write that big empty array. But the execute time was not quite a second shorter. The extra execution time for the dynamic array may have simply been the system thrashing after the previous version (because this box only has one gigabyte of RAM and a 100 Meg of a Gig is enough to push things around in virtual memory).

And here's how it runs under Cygwin on a much more recent 2.5 GHz dual core Intel CPU:
$ time ./primecount 100000000
5761455 primes less than 100000000

real    0m11.349s
user    0m11.107s
sys     0m0.062s

It's only using one core, but why doesn't the double clock speed halve the execution time? I dunno. You tell me.

I could use the second core by forking the process and having the second process lag behind the first one. The algorithm becomes semi-non-obvious:

First, the first processor sets the flags in the lower half of the array, and the second process sets the flags in the upper half. Maybe bus contention would not eat up the savings in halving the load.

We'll start the parent process on a single pass at multiples of 2, and it will quickly try to do this to the first part of the sieve array:

0123456789 10111213141516171819 20212223242526272829
0011010101 0101010101 0101010101

We'll start the child process on a single pass at multiples of 3, and it will quickly try to do this to the first part of the sieve array:

0123456789 10111213141516171819 20212223242526272829
0011110110 1101101101 1011011011

Note that the effects are independent of each other.

Now, if either the parent or child finishes its first single pass before the other finishes even starts its first, the only damage is an unneeded extra pass on 4, and maybe even 6, 8, and 9. For small arrays, trying to use both processors could end up being (slightly) slower. For large arrays, we can be pretty sure that there won't be more than a few unnecessary passes.

Ideally, by the time one or the other finishes its first pass, the other will have already made its way through the first 15 of its pass, and it will have

0123456789 10111213141516171819 20212223242526272829
0011010100 0101000101 0001010001

and this is good enough to let the process avoid wasting a pass on 4 and start 5.

But the next process to finish will also see 5. (Check for yourself why probing the second or third in the pass isn't sufficient.)

So it looks like we need a shared variable for each to store its current pass on. And there will be some non-zero probability of read-interleaved write, so we should probably protect it with some semaphore or other mutual-exclusion technique.

This shared variable will help at the end, as well, when passes become very quick.

Lack of such mutual exclusion won't make it give the wrong result, it will just result in wasted passes.

Maybe I'll do the code for multiple processes later on.