
Simulating A Systolic Array With SIMD (Kind Of)

A simple simulation of a systolic array using SIMD operations in Rust to speed up matrix multiplication.

#TPU #rust #systolic array #artificial intelligence


I decided to try to understand how a TPU works. Since I don’t have any hardware expertise, I chose to simulate a systolic array using SIMD operations instead. The result is not a “real” simulation, because I did not try to model the hardware itself, only the idea behind it and (partially) its dataflow.

I wanted to make my CPU perform matrix multiplications faster using a strategy that vaguely resembles how a systolic array works.

What is a systolic array?

Simply put, it is a device that arranges a grid of identical processing elements (PEs) to perform a predefined operation. The one we will focus on is matrix multiplication, but a systolic array can also do other things, like correlation, convolution, or even sorting data.

A systolic array is the primary building block of a TPU (tensor processing unit), which has become much more interesting to me nowadays because of artificial intelligence. AI workloads can ultimately be reduced to a ton of matrix multiplications, and a CPU is comparatively bad at performing them.

In the case of matrix multiplication, each PE receives a value from the “top” and one from the “left,” multiplies the two, adds the product to an accumulator register, and passes both values forward. The values keep cascading down and to the right until the last PE (at the bottom-right corner) performs the final calculation.
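To make that concrete, here is a minimal sketch of a single PE in Rust. None of this code is used later in the article and the names are made up for illustration; it only shows the multiply-accumulate-and-forward behavior described above.

// One processing element (PE) of a matmul systolic array (illustrative only).
struct Pe {
    acc: f32, // accumulator holding this PE's partial dot product
}

impl Pe {
    // One clock tick: multiply the value arriving from the left (an element of A)
    // by the value arriving from the top (an element of B), add it to the
    // accumulator, and return both inputs so they can be forwarded to the
    // neighbors on the right and below.
    fn tick(&mut self, from_left: f32, from_top: f32) -> (f32, f32) {
        self.acc += from_left * from_top;
        (from_left, from_top)
    }
}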

Here is a diagram that illustrates the dataflow:

For comparison, a 256x256 systolic array can perform up to 65,536 multiply-accumulate operations per clock tick (one per PE), whereas a CPU (assuming scalar, single-core execution) can do one operation per clock tick. The comparison is not 100% fair, but it illustrates the discrepancy in parallel throughput.

Right now, when this kind of operation needs to be done, a GPU is used (hence Nvidia stocks 👀). But imagine a scenario where you are playing a game and one of the enemies is controlled by a more sophisticated AI model. The GPU might already be busy rendering the graphics, the CPU is occupied feeding data to the GPU and running the OS, and, as we already know, CPUs are not great at matrix multiplication anyway. That is the scenario where a TPU comes in handy, and it is one of the main reasons I think edge computing hardware will become more relevant in the future.

What is SIMD?

Modern CPUs usually include special instructions, commonly referred to as SIMD (single instruction, multiple data), that perform one operation on multiple values at the same time (within the same core). They are easier to understand if you try to think in “Assembly mode”. An example in pseudo-code could look like this (I don’t know Assembly):

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, 1, 3, 5, 8, 13, 21, 34]
C = [0; 8]

C[0] = A[0] * B[0]
C[1] = A[1] * B[1]
C[2] = A[2] * B[2]
C[3] = A[3] * B[3]
C[4] = A[4] * B[4]
C[5] = A[5] * B[5]
C[6] = A[6] * B[6]
C[7] = A[7] * B[7]

Every multiplication operation is performed one after the other.

Using SIMD, a pseudo-code that can illustrate what happens could look like this:

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, 1, 3, 5, 8, 13, 21, 34]
C = A * B

Unfortunately, I can’t think of a better way to represent this, but the takeaway is that in the first example the CPU loads the values of A and B one by one into its registers, performs each multiplication, and stores each result back. In the second case (using SIMD), all the values of A are loaded at once into a “very big register,” the same happens for B, the multiplication is applied element-wise, and the result is stored back to memory in one go.
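To make the second example a bit more concrete, here is a small sketch using Rust’s portable_simd API (the same API used later in this article); the values mirror the pseudo-code above.

#![feature(portable_simd)]
use std::simd::Simd;

fn main() {
    let a = Simd::<f32, 8>::from_array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
    let b = Simd::<f32, 8>::from_array([1.0, 1.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0]);
    // One SIMD multiplication computes all eight element-wise products at once.
    let c = a * b;
    println!("{:?}", c.to_array());
}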

This starts to make a difference once we get to matrix multiplication. For instance, multiplying a 256x256 matrix by another 256x256 matrix (ignoring possible optimizations) requires 256 × 256 × 256, which is almost 17 million, individual multiplications alone. On top of that there are additions, loads from memory, and other operations involved. It can get pretty slow (and it will).

Throughout this article, I’ll be using the more generic term SIMD to refer to the instruction set available on my PC’s CPU, which is AVX2. Its registers are 256 bits wide, which is enough to hold eight 32-bit floating point numbers. The code examples shown will most likely not work with other kinds of SIMD instruction sets.

Implementing

For the implementation I decided to use Rust.

The first step is to implement a traditional matrix multiplication on the CPU, scalar and single-core, to have something to compare against. But before that we need a way to create matrices, so I wrote a function that takes the number of rows and columns as input and generates a vector of vectors filled with random values using the rand crate.

use rand::Rng;

fn gen_rand_matrix(rows: usize, cols: usize) -> Vec<Vec<f32>> {
    let mut rng = rand::rng();
    let mut matrix = vec![vec![0.0f32; cols]; rows];
    for i in 0..rows {
        for j in 0..cols {
            matrix[i][j] = rng.random_range(0.0..1.0);
        }
    }
    matrix
}

Now, the function that multiplies the two matrices, plus a main function to run it and measure the execution time:

use std::time::Instant;
use rand::Rng;

fn gen_rand_matrix(rows: usize, cols: usize) -> Vec<Vec<f32>> {...}

fn scalar_matrix_mult(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let rows = a.len();
    let inner = a[0].len();
    let cols = b[0].len();
    assert_eq!(
        inner,
        b.len(),
        "Matrix dimensions do not match for multiplication"
    );
    let mut result = vec![vec![0.0f32; cols]; rows];
    for i in 0..rows {
        for j in 0..cols {
            let mut sum = 0.0f32;
            for k in 0..inner {
                sum += a[i][k] * b[k][j];
            }
            result[i][j] = sum;
        }
    }
    result
}

fn main() {
    let a = gen_rand_matrix(256, 256);
    let b = gen_rand_matrix(256, 256);
    let start_classical = Instant::now();
    let _ = scalar_matrix_mult(&a, &b);
    let duration_classical = start_classical.elapsed().as_millis();
    println!("scalar matrix multiplication: {duration_classical} ms");
}

On my machine it takes around 350ms to multiply these two square matrices.

Now we need to implement this with SIMD. To do that, we have to use the nightly Rust toolchain and enable the portable_simd feature:

#![feature(portable_simd)]
use std::simd::num::SimdFloat;
use std::simd::Simd;
use std::time::Instant;
use rand::Rng;
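For reference, one way to set this up (assuming you use rustup; the exact commands are not the point of this article) is to switch the project to the nightly toolchain and add the rand dependency:

rustup override set nightly   # portable_simd is only available on nightly
cargo add rand                # the rand crate used by gen_rand_matrix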

I also defined a constant that represents the number of f32 values that fit in one SIMD register, and a type alias to represent a lane; written out, they look like this:
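const N: usize = 8;       // number of f32 values that fit in one 256-bit AVX2 register
type Lane = Simd<f32, 8>; // one SIMD "lane" of eight f32 values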

But before implementing the matrix multiplication function, we need to consider the data “layout”. Since we are trying to simulate a systolic array, it makes sense to organize the data of each matrix the way it would be fed into a systolic array, with rows from the first matrix coming from the “left” and columns from the second matrix coming from the “top.” The functions pack_rows and pack_cols take a matrix (a vector of vectors) as input and return the same matrix organized into lanes (a vector of vectors of Lanes). Not very memory efficient, but straightforward.

fn pack_rows(a: &[Vec<f32>]) -> Vec<Vec<Lane>> {
    a.iter()
        .map(|row| {
            // split the row into chunks of N values; the final chunk is zero-padded
            row.chunks(N)
                .map(|chunk| {
                    let mut buf = [0.0; N];
                    buf[..chunk.len()].copy_from_slice(chunk);
                    Lane::from_array(buf)
                })
                .collect()
        })
        .collect()
}

fn pack_cols(b: &[Vec<f32>]) -> Vec<Vec<Lane>> {
    let rows = b.len();
    let cols = b[0].len();
    (0..cols)
        .map(|j| {
            let mut col = Vec::new();
            // walk down column j in steps of N rows, zero-padding the final chunk
            for chunk_start in (0..rows).step_by(N) {
                let mut buf = [0.0f32; N];
                for i in 0..N {
                    let idx = chunk_start + i;
                    if idx < rows {
                        buf[i] = b[idx][j];
                    }
                }
                col.push(Lane::from_array(buf));
            }
            col
        })
        .collect()
}
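As a quick sanity check (an illustrative snippet, not part of the original program), the shapes come out as expected: for a 256x256 matrix, each packed row or column consists of 256 / N = 32 lanes. Something like this could be dropped into main:

    let a = gen_rand_matrix(256, 256);
    let a_rows = pack_rows(&a);
    assert_eq!(a_rows.len(), 256);        // one packed row per matrix row
    assert_eq!(a_rows[0].len(), 256 / N); // each row becomes 32 lanes of 8 f32 values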

With that in place, we can now create the matrix multiplication function that takes the two “packed” matrices and uses SIMD to perform the operations:

fn simd_matrix_mult(a_rows: &[Vec<Lane>], b_cols: &[Vec<Lane>]) -> Vec<Vec<f32>> {
    let mut result = vec![vec![0.0f32; b_cols.len()]; a_rows.len()];

    for i in 0..a_rows.len() {
        for j in 0..b_cols.len() {
            let a_vecs = &a_rows[i];
            let b_vecs = &b_cols[j];
            let mut acc = 0.0f32;

            for idx in 0..a_vecs.len() {
                // multiply eight pairs of values at once, then horizontally sum the products
                acc += (a_vecs[idx] * b_vecs[idx]).reduce_sum();
            }

            result[i][j] = acc;
        }
    }

    result
}

I also wrote a function that compares two matrices to check that they are equivalent, with a tolerance value because we are dealing with floating point arithmetic:

fn matrices_equivalent(a: &[Vec<f32>], b: &[Vec<f32>], tol: f32) -> bool {
    if a.len() != b.len() || a[0].len() != b[0].len() {
        return false;
    }

    for i in 0..a.len() {
        for j in 0..a[0].len() {
            if (a[i][j] - b[i][j]).abs() > tol {
                println!("Mismatch at ({}, {}): {} vs {}", i, j, a[i][j], b[i][j]);
                return false;
            }
        }
    }

    true
}

I also altered the main function to run both versions, time them, and compare the results:

fn main() {
    let a = gen_rand_matrix(256, 256);
    let b = gen_rand_matrix(256, 256);

    // packing is timed separately and not counted toward the SIMD multiplication time
    let start_packing = Instant::now();
    let a_rows = pack_rows(&a);
    let b_cols = pack_cols(&b);
    let duration_packing = start_packing.elapsed().as_millis();
    println!("Packing took: {duration_packing} ms");

    let start_classical = Instant::now();
    let scalar_result = scalar_matrix_mult(&a, &b);
    let duration_classical = start_classical.elapsed().as_millis();
    println!("scalar matrix multiplication: {duration_classical} ms");

    let start_simd = Instant::now();
    let simd_result = simd_matrix_mult(&a_rows, &b_cols);
    let duration_simd = start_simd.elapsed().as_millis();
    println!("SIMD matrix multiplication: {duration_simd} ms");

    if matrices_equivalent(&scalar_result, &simd_result, 0.01) {
        println!("Results are equivalent.");
    } else {
        println!("Results differ significantly.");
    }
}

Because rearranging the data takes time, and that step would not exist on real hardware, I decided not to count the time it takes to pack the matrices when comparing execution times; instead, I show it separately.

Here is the output

Packing took: 2 ms
scalar matrix multiplication: 354 ms
SIMD matrix multiplication: 74 ms
Results are equivalent.

The absolute values don’t mean much, because every machine will produce different execution times, but the difference between them is substantial. That becomes more evident as the matrix size increases; here is a comparison going from 256 up to 2048.

Matrix size | Scalar | SIMD | Diff
256x256 | ~350ms | ~75ms | ~4.5x
512x512 | ~3s | ~0.6s | ~5x
1024x1024 | ~27s | ~4.5s | ~6x
2048x2048 | ~245s | ~37s | ~6.5x

We can observe a comparative speedup that grows almost linearly with the matrix size. Of course, it will eventually be capped by memory bottlenecks, but nonetheless, I find this incredible.

Using multiple cores

Since we have already done all of this, why not go a step further and use all available cores, each using SIMD, to see what results we get?

To do that I decided to use the rayon crate. The idea is simple: each CPU core (in my case, 8 in total) uses SIMD to perform multiple operations in parallel, and the result matrix is divided into bigger chunks so that each chunk can be processed by an individual core. Since each SIMD register can work with up to 8 f32 values at once, each bigger chunk is 64 rows (eight squared).

Here is the function:

use rayon::prelude::*; // provides par_chunks_mut

fn parallel_simd_matrix_mult(
    a_rows: &[Vec<Lane>],
    b_cols: &[Vec<Lane>],
) -> Vec<Vec<f32>> {
    let m = a_rows.len();
    let n = b_cols.len();
    let tile_size = 64;

    let mut result = vec![vec![0.0f32; n]; m];

    result
        .par_chunks_mut(tile_size)
        .enumerate()
        .for_each(|(tile_row_idx, tile_chunk)| {
            let i0 = tile_row_idx * tile_size;

            for (ii, row) in tile_chunk.iter_mut().enumerate() {
                let i = i0 + ii;

                for j0 in (0..n).step_by(tile_size) {
                    let j_max = (j0 + tile_size).min(n);
                    for j in j0..j_max {
                        let a_vecs = &a_rows[i];
                        let b_vecs = &b_cols[j];

                        let mut acc = 0.0f32;
                        for (a_lane, b_lane) in a_vecs.iter().zip(b_vecs.iter()) {
                            acc += (*a_lane * *b_lane).reduce_sum();
                        }

                        row[j] = acc;
                    }
                }
            }
        });

    result
}

To be 100% honest, this is the part where I lost the plot and just used ChatGPT to get to the final function. At this point I was just trying to get even more impressive numbers and was not caring as much about how I got there.

Anyway, here is the comparison

Matrix size | SIMD | SIMD + Rayon | Diff
256x256 | ~70ms | ~10ms | ~7x
512x512 | ~600ms | ~45ms | ~13x
1024x1024 | ~4.65s | ~0.25s | ~18.5x
2048x2048 | ~37.5s | ~1.75s | ~21x
4096x4096 | ~285s | ~13s | ~22x

I did not compare all three at once because running the largest multiplication with the scalar implementation would take too much time.

Conclusion

To me, all of this was a great learning experience. I don’t think any of these things are “useful” in the context of professional software development, but it made me appreciate even more the work that goes into solving a “simple problem” such as making matrix multiplication go faster. It took me a lot of time to understand how things were working in this software “simulation”; imagine what goes into creating the real hardware.

In the end, I’m excited to see what other awesome technologies will be created, enhanced, or reused to support the widespread use of AI models.