Parallel Sieve

2025-04-19 07:45:00

This is part 5 of a series based on thread design patterns. In this post we will implement a parallel version of the Sieve of Eratosthenes to generate 20k primes on the Super 8008.

Updates to the Super-8008 Mame Emulator

Memory Access

Mame provides some extensive support for memory mappers and memory banking that I was not taking advantage of in the first implementation. Before I was setting up function pointers within the memory range and handling the different address and bus flags within those functions.

For example on each blaster, the old code would do something like this:


map(0x0000, 0x3fff).rw(
		FUNC(super8008_blaster_device::blaster_memory_read), 
		FUNC(super8008_blaster_device::blaster_memory_write));

...

uint8_t super8008_blaster_device::blaster_memory_read(offs_t offset)
{
	int jumper = m_ext_take->read();
	if (BIT(take_state, jumper) && 0 <= offset && offset < m_ram->size()){
		uint8_t data = m_ram->pointer()[offset];
		return data;
	} else {
		//Selected when the ext_take line is low
		return 0x0;
	}
}

This type of code can be switched to a memory view:


map(0x0000, 0x3fff).view(m_view);
	m_view[0](0x0000, 0x3fff).ram().share("ram");
	m_view[1](0x0000, 0x3fff).noprw();

...

void super8008_blaster_device::ext_take(int state)
 
{
	int jumper = m_ext_take->read();
	m_take_state = state;
	if (!BIT(m_take_state, jumper))
	{
		m_view.select(1);
	}
	else
	{
		m_view.select(0);
	}
}

Now when the bus signals change, we just switch to the correct view and we can remove the custom read/write functions.

On the master card a 74670 4x4 register file is used for the memory mapper. I implemented this as four distinct memory views.


void super8008_state::select_memory_view(uint8_t index)
{
	memory_map_info mmap_info = get_mmap_info(index);
	uint8_t bank_index = ((mmap_info.ra13 << 1) | mmap_info.ra12);
	
    if (!mmap_info.rom_cs)
	{
		m_reg_views[mmap_info.mmap_index]->select(0);
		m_reg_rom_banks[mmap_info.mmap_index]->set_entry(bank_index);
	}
	else if (!mmap_info.ext_cs)
	{
		m_reg_views[mmap_info.mmap_index]->select(1);
	}
	else
	{
		m_reg_views[mmap_info.mmap_index]->select(2);
		m_reg_ram_banks[mmap_info.mmap_index]->set_entry(bank_index);
	}
}

If you are curious, this commit has the bulk of the changes: Changed memory access to use MAME's memory system

Detached Mode

Most of the time when I'm programming the Super-8008 I use the bitbanged serial. This allows me to use SyncTerm to upload code and test it. Using SyncTerm is a lot faster than updating the Mame code to load different roms. That also means that I can use the routines in the system rom. The previous post about the super-8008 has instructions for setting up the serial connection to Mame.

Here is a demo of using sync term to upload a hellorld program. Notice in the first memory dump it is showing the assembly code:


        include "bitfuncs.inc"
        cpu 8008new

        org 1000H

    puts: equ 26D6H
    prompt: equ 2095H

    textout:   mvi h,hi(text)
               mvi l,lo(text) 
               call puts
               call prompt

        org 1100H
    text:
        db "\r\n"
        db "Hellorld\r\n",0

Both puts and prompt are rom routines from the monitor. A full list of the rom routines can be found in the Symbol Table at the bottom of the monitor listing created by the asl assembler.

In this demo I am using the front panel in detached mode to play a game similar to Kill the Bit. The objective is to take one of the blasters and then try to restart it to line up its LED with its neighbors:

Parallel Sieve of Eratosthenes

If you are reading this I'm pretty sure you are familiar with the algorithm to find primes. You start with a list of all numbers between 0 and N. Then starting at number two cross off 2*2, 2*3, 2*4.... Then move to the next number, 3 and do the same. Anything not crossed off when you reach sqrt(N) is prime.

Similar to what we did for the ray-tracer, we can partition the grid and assign workers to each section. The worker just needs to calculate it's offset into the table and then loop. Inside the loop it increments the offset by the number being checked and crosses off numbers in its partition.

Here's a animation of how that works. The animation gets faster for each number being checked. Each number is assigned a different color from the color wheel divided by sqrt(n).

Calculating offsets

Before starting on the assembly code, I created a simple multithreaded version in python. One thing to notice in this code is the offset calculation:


def get_offset(i, x, y):
    if x <= i <= y:
        off = i - x
        return off
    elif i < x:
        if (x % i) == 0:
            return 0
        num = ((x // i) * i) + i
        if x <= num <= y:
            off = num - x
            return off
        else:
            return -1
    else:
        return -1

This calculates the first location of a multiple of the number that is being checked inside of a partition. For example if we are checking and crossing off for 13 in a partition of 1000 to 2000. It first checks to see if 13 is in range 1000 <= 13 <= 2000. Since it is not, it then checks to see if 1000 is divisible by 13. If it was, we would start at the zero offset in this partition. Since it is not we divide 1000 by 13, take the floor and then multiply by 13: 76 * 13 = 988. This will give us a number less than the start of the partition. If we add 13 to that number we should be in range: 988 + 13 = 1001. We subtract off the start of the partition and get the offset of 1.

The intel 8008 was introduced in 1972 based off of the Datapoint 3300. Below is a Datapoint 2200 from the Datapoint museum that used to be in San Antonio Texas.

Datapoint Datapoint Insides

The 8008 is limited even in terms of 8-bit architectures. It only has 8-bit adds and subtracts, the carry flag is weird, it doesn't have a general purpose stack and the call stack is fixed at 16 entries.

The first order of business was to write 16-bit math routines for addition, subtraction, multiplication and division. They are not the most optimized versions. They are using binary multiplication and binary long division. If you would like more details on these routines let me know.

Decimal Output

Besides the math needed for the offset calculation, I also wanted base-10 output instead of hex. I ported this 6502 code I found on the nesdev wiki. It is using this neat 8,4,2,1 trick to speed things up a bit.

This is what it looks like in python:


def decout_8421(x):
    s = ""
    digits = "0123456789"
    inc = [8,4,2,1]
    stab = sub_table()
    i = 0
    while i < len(stab):
        digit = 0
        for j in inc:
            if x >= stab[i]:
                x = x - stab[i]
                digit += j
            i+=1
        s += digits[digit]        
    s += digits[x]
    return s

The precomputed tables for this can be found in decout_sub in the assembly code. I used this code to create a standalone 16-bit hexadecimal to decimal converter.

Code and Demo

Printing 10k primes on Master only: prime-master.asm
Printing 20k primes using all Blasters: primes.asm

One thing I noticed when I started timing things was that the non-parallel version was faster. I was using a smaller list of numbers for testing. The main benefit from running the algorithm in parallel is to speedup crossing off numbers and that only happens up to the sqrt(n).

Besides the extra processors, using the blasters provides a lot more memory since each blaster has 4 banks of ram for a max prime table size of 3 * 4096 = 12k. I'm only multiplying by three here since one bank is used for the code. To run for a larger number when not using the blasters I needed to add bank switching to prime-master.asm.

Another change I made to make the two implementations closer is to add a menu item that will precompute the offset table before starting the Sieve algorithm. You can see this as the Initialize blasters step in the demo.

Conclusion

This was a fun project I got to learn a good bit about the 8008. Also, I wanted a way to dogfood the changes I had made to the memory system and this allowed me to do that. It has been a long time since I wrote thousands of lines of assembly code so it was probably good for the gray matter too.