<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=145304570664993&amp;ev=PageView&amp;noscript=1">

91ÊÓƵAPP

Exploiting tile level parallelism on IPU

Dec 03, 2021

Exploiting Tile-Level Parallelism on 91ÊÓƵAPP IPUs

Written By:

Mark Pupilli

We're Hiring

Join us and build the next generation AI stack - including silicon, hardware and software - the worldwide standard for AI compute

Join our team

In a previous post on Medium, we studied a that demonstrates how to write a Poplar/C++ program which exploits the large number of cores (tiles) available on 91ÊÓƵAPP’s intelligence processing units (IPUs). The IPU is not only massively parallel due to its large number of independent cores but there is also significant parallelism available within each tile. In this post we are going to optimise a simple Poplar codelet to take advantage of this tile-level parallelism.


Tile-Level Parallelism on IPU

An IPU consists of many cores called tiles, a single GC200 chip contains 1472 tiles that can run completely independent programs. Within each tile there are two other ways we can parallelise work. Each tile has 6 hardware worker threads that are scheduled in a round-robin (barrel) schedule with one instruction being executed per context in turn. Whilst this is a serial issue of instructions, one per thread, any memory accesses continue in parallel with the other worker contexts so that reads/writes as seen from each context are single cycle. Each one of these worker threads executes their own instruction stream giving the IPU the ability to execute 8832 MIMD programs at once. The other form of parallelism within a tile is vector SIMD floating point arithmetic: each worker on each tile can issue 2 x fp32 or 4 x fp16 vectorised arithmetic operations per cycle.

Optimising Codelets to use Worker Threads

There are two ways we can create work for the threads on each tile. The first is at the graph programming level: if we create vertices in the same compute set that have no data dependencies then Poplar’s graph compiler will schedule those vertices to run in parallel across workers. This is how we parallelised the path tracing codelet in the (graph construction code ). The advantage of this technique is that we can describe it at the compute graph level which may simplify the codelet/kernel implementation. The disadvantage is that we might end up duplicating vertex state six times which could have implications for per-tile memory consumption.

Fortunately Poplar provides an alternative Vertex class for use in C++ codelets: MultiVertex. In a MultiVertex the worker ID is passed directly to the compute method of the codelet. The advantage of this is that vertex state can now be shared amongst the workers and that the worker level parallelism is explicit in the low level code. The disadvantage is increased code complexity because work needs to be explicitly split in the codelet and the programmer needs to be aware of a few thread safety issues (see the for details).


MultiVertex

In the original path tracing code we did not bother to parallelise ray generation over workers as it was not a bottleneck. Even though this is a poor candidate for optimisation effort it is a good illustrative example for how to use MultiVertex due to the simplicity of the codelet so lets see how to change the GenerateCameraRays Vertex to a MultiVertex.

Here is our original vertex:

Note that commented code fragments between brackets <<>> are either listed elsewhere to avoid clutter or are omitted entirely for succinctness. The vertex state, for example, is going to be common to both the Vertex and MultiVertex and is listed here separately:

Changing the Vertex into a MultiVertex is the easy part. We simply inherit from a different base and change the signature of the compute method ending up with the following:

You can see we now have access to the workerId directly in the compute method. We can also call numWorkers() to get the number of worker threads available (future designs of the IPU are free to change the number of worker threads so we avoid hard coding six workers for forward compatibility). Now we need to decide how to distribute the processing amongst workers. Lets look at the the lambda fragment from the original vertex << Lambda to consume random numbers >>:

This function steps through the noise buffer each time it is called consuming the next entry. The simplest change we can make in our MultiVertex is to start each thread at a different offset (the thread’s workerId) and then increment the index by the worker count each time it is called:

Now we are consuming random numbers 6 times faster. That was easy! The ray generation loop is a little more complicated but the principle will be the same. Lets look at the original Vertex’s code fragment << Ray generation loop >>:

Here we have a bit more choice of how to split up the two loops across workers. Because the IPU tile is directly accessing its private SRAM we do not necessarily need to make sure the worker threads access contiguous memory locations for cache coherence. However, later we are going to try and vectorise the computation so having whole rows processed by each thread will be friendly to those SIMD ops. With this in mind we apply the same principle as before to the row indexing: each worker will process one row, and they will start offset by the worker count. We also need to offset the output indexing in the same way. This leads to the following loop in the MultiVertex:

We didn’t need to modify the inner loop at all and we have finished the conversion to a MultiVertex. With a bit of care to get the indexing right this is a very straightforward way to get a 6x speed up for this type of code.

Vectorising the Inner Loop

Almost all modern architectures have some form of SIMD instruction level parallelism and the IPU is no different. Popc is based on LLVM and will auto-vectorise code where possible but we can also manually vectorise on IPU without resorting to writing assembly code. There are some vectorisation utility functions in the header <ipu_vector_math> and we will also use <ipu_memory_intrinsics> which is going to give us access to some special memory instructions that complement the vector ops.

Manually vectorising the codelet is going to be a more invasive change than parallelising over worker threads (so in reality we should make sure it is worth the effort by profiling first). But for this illustration lets jump straight in and see the new structure of the SIMD version of the MultiVertex:

We start by including the headers we need. Note that up until now we have been using plain C++. However, popc (which compiles C++ codelets) is a multi-target compiler: it produces code for three architectures by default: ipu1, ipu2, and CPU. The CPU target allows use of an . In order to maintain support for IPUModel/ CPU target we need to guard IPU specific code using #ifdef __IPU__ (and note this includes import of the IPU specific headers).

Next we create a new vertex with a different name GenerateCameraRaysSIMD which is not strictly necessary but it will allow the wider application to enable/disable vectorisation by . The SIMD vertex has a modification to the antiAliasNoise and rays vertex state but all the other states remain the same. The most important modification is to guarantee the data is aligned to 8-byte boundaries (this will allow us to use special load/store instructions). The other specifies a more compact pointer format (which we can do because we know the data sizes from other state).

The final structural changes above are to inline the lambda function and library call to light::pixelToRay and to unroll the inner loop by a factor of two (so you can see above the column index now increments by two not one). These changes will hurt readability but we would expect manually vectorised code to be quite verbose.

Now we have seen the new structure we can look at the specific fragments. First in << Setup pointers >> lets see how we intend to access the noise buffer and output rays:

The first line creates a pointer to the start row for each worker: this is a pointer to a vector type of 4 half (fp16) values. To offset it we index into the noise buffer using the workerId and multiply by four (because we cast the pointer to a vector type each worker will consume noise in chunks of four values). On the next line we setup a constant half4 value (which is just a broadcast of the noise scale factor to a four-wide half vector). Setting up the row index and output pointer are relatively simple too: as before startRow is offset by workerId while the new output pointer is offset by workerId multiplied by the number of output ray components (noting there are two per pixel so numRayComponentsInRow is the number of columns multiplied by two).

The next setup code fragment before we get to the inner loop is << Pre-compute values related to camera model >>. Because we have inlined light::pixelToRay we see many of the values used there are constant (related to the camera’s field-of-view). Hence we can pre-compute some vector values for use in the inner loop:

We will not go through the details of this simple FOV camera model. The important thing to note is that we can pre-compute many vector values here and there are functions in the ipu:: namespace that operate on vector types (e.g. tan will apply element-wise across the half2 vector). The other thing to note is that we pre-cast the startCol index to half. Casts can be relatively expensive so its better to do them in outer loops and keep inner loops doing nothing but float arithmetic and memory operations. We also pre-compute unrollInc that we will use to increment the first element of a half2 vector by one but leave the second untouched.

Finally we are ready to look at the inner loop. Due to all the setup we have done this is actually pretty simple:

You can see that we can use standard arithmetic operators with vector types so the code is quite readable. The only notable parts of this loop are the special load store functions. The IPU is capable of issuing a single load store instruction which simultaneously increments a pointer. We use these to read and write 8-bytes (two half2 vectors) at a time which is why we needed to specify 8-byte alignment in the vertex state earlier.

And that’s it we are done. You can see the complete code in 91ÊÓƵAPP’s Github examples: . Note that we could have used the half4 vector type to vectorise the inner loop by another factor of 2 but we will leave this as an exercise for the reader.

Checking the Assembly

The last thing we might want to do when optimising at this level is to look at the code generated for this inner loop to sanity check it is efficient. We can do this by calling Popc with the following options: popc -O3 -S --target=ipu2 codelets.cpp -o simd.S and then we can find the inner loop in the output file which looks like this:

.LBB1_5:                     # =>This Inner Loop

Header: Depth=1
        {
                add $m2, $m2, 2
                f16v2add $a0, $a5, $a5
        }
        ld32 $a7, $m11, $m15, 14           # 4-byte Folded

Reload
        {
                ld64step $a2:3, $m15, $m0+=, 6
                f16v2add $a5, $a5, $a7
        }
        ld32 $a6, $m11, $m15, 11 # 4-byte Folded

Reload
        f16v2sub $a0, $a0, $a6
        {
                st32 $a0, $m11, $m15, 15
                mov $a4, $a1
        }
        ld64 $a0:1, $m11, $m15, 6 # 8-byte Folded

Reload
        f16v4mul $a2:3, $a0:1, $a2:3
        {
                ld32 $a0, $m11, $m15, 15
                mov $a1, $a4
        }
        f16v2mul $a0, $a1, $a0
        f16v2add $a4, $a5, $a5
        f16v2sub $a4, $a4, $a6
        f16v2mul $a4, $a1, $a4
        f16v2add $a5, $a5, $a7
        f16v2add $a2, $a0, $a2
        f16v2add $a3, $a4, $a3
        st64step $a2:3, $m15, $m1+=, 1
        ld32 $m4, $m7, $m15, 5
        ld32 $m4, $m4, $m15, 0
        cmpult $m5, $m2, $m4
        brnz $m5, .LBB1_5

We won’t go into full details of the instruction set architecture here but there are a few interesting things to note. First, we can see there are a number of instructions prefixed with f16v2 these are the 2 x fp16 vector arithmetic instructions we were hoping to see. Second, we can see that the 8-byte (64-bit) load/stores we used appear like this:

        {
                ld64step $a2:3, $m15, $m0+=, 6
                f16v2add $a5, $a5, $a7
        }

This is actually a single instruction packet: IPU tiles can dual issue a floating point and memory instruction in the same cycle. So if we are counting issued instructions in the inner loop there are 21 (not 25).

Conclusions

We have seen how we can optimise IPU codelets using two orthogonal techniques. First, with very minimal changes, we can get a 6x speed-up simply by utilising all the worker threads on a tile. Second, if necessary, we can use manual vectorisation and specialised memory operations using built in vector types and utility functions (without resorting to coding in assembly). Full details of Vertex programming can be found in 91ÊÓƵAPP’s .