Why we need SIMD

(parallelprogrammer.substack.com)

181 points by atan2124 days ago

13 comments

Remnant44121 days ago
I'm just happy that finally, with the popularity of zen4 and 5 chips, AVX512 is around ~20% of the running hardware in the steam hardware survey. It's going to be a long while before it gets to a majority - Intel still isn't shipping its own instruction set in consumer CPUs - but its going the right direction.Compared to the weird, lumpy lego set of avx1/2, avx512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.Personal example: The double width byte shuffles (_mm512_permutex2var_epi8) that takes 128 bytes as input in two registers. I had a critical inner loop that uses a 256 byte lookup table; running an upper/lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on zen5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.
- Rarebox120 days ago
 Interesting example! I've been learning AVX512 by using it to optimize Huffman coding. I found _mm512_permutexvar_epi8 and used it to do byte-indexed lookups, but _mm512_permutex2var_epi8 means I can get by with 2 shuffles and 1 comparison instead of 4 shuffles and 3 comparisons. In the end, on my CPU (AMD 9950x), changing to _mm512_permutex2var_epi8 only sped up compression by ~2%.Compared to Huff0[1] (used by Zstd), my AVX512 code is currently ~40% faster at both compression and decompression. This requires using 32 datastreams instead of 4 used by Huff0.[1] <a href="https://github.com/Cyan4973/FiniteStateEntropy" rel="nofollow">https://github.com/Cyan4973/FiniteStateEntropy</a>
 - camel-cdr120 days ago
 Oh, this is cool, I wanted to look into using SIMD for huffman as well.For decode, do you use AVX512 to speedup the decode via caching the decode of small codewords?Do you decode serially or use the self syncronizing nature of huffman codes to decode the stream from multiple offsets in parallel? I haven't seen the later done in SIMD before.Are there any new SIMD instructions you'd like to see in future ISA extensions?OpenPower has proposed a scalar instruction to speedup prefix-code decoding: <a href="https://libre-soc.org/openpower/prefix_codes/" rel="nofollow">https://libre-soc.org/openpower/prefix_codes/</a>
- shihab121 days ago
 Could you please elaborate on your example? Thanks.
 - Remnant44121 days ago
 Sure.. in detail and abstracted slightly, the byte table problem:Maybe you're remapping RGB values [0..255] with a tone curve in graphics, or doing a mapping lookup of IDs to indexes in a set, or a permutation table, or .. well, there's a lot of use cases, right? This is essentially an arbitrary function lookup where the domain and range is on bytes.It looks like this in scalar code:transform_lut(byte* dest, const byte* src, int size, const byte* lut) { for (int i = 0; i < size; i++) { dest[i] = lut[src[i]]; } }The function above is basically load/store limited - it's doing negligible arithmetic, just loading a byte from the source, using that to index a load into the table, and then storing the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 per cycle there)Here's a snippet of the AVX512 version.You load the lookup table into 4 registers outside the loop:<pre><code> __m512i p0, p1, p2, p3; p0 = _mm512_load_epi8(lut); p1 = _mm512_load_epi8(lut + 64); p2 = _mm512_load_epi8(lut + 128); p3 = _mm512_load_epi8(lut + 192); </code></pre> Then, for each SIMD vector of 64 elements, use each lane's value as an index into the lookup table, just like the scalar version. Since we only can use 128 bytes, we DO have to do it twice, once for the lower and again for the upper half, and use a mask to choose between them appropriately on a per-element basis.<pre><code> auto tLow = _mm512_permutex2var_epi8(p0, x, p1); auto tHigh = _mm512_permutex2var_epi8(p2, x, p3); </code></pre> You can use _mm512_movepi8_mask to load the mask register. That instruction sets each lane is active if its high bit of the byte is set, which perfectly sets up our table. You could use the mask register directly on the second shuffle instruction or a later blend instruction, it doesn't really matter.For every 64 bytes, the avx512 version has one load&store and does two permutes, which Zen5 can do at 2 a cycle. So 64 elements per cycle.So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but the size of the lookup table is too small to really be useful. Being able to do an arbitrary super-fast byte-byte transform is incredibly useful.
 - vincenthwt121 days ago
 I love lookup tables. Thanks for sharing!
 - kbolino121 days ago
 Here's a non-parallel and unoptimized implementation of that operation in Go:<pre><code> func _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 { var dst [64]uint8 for j := 0; j < 64; j++ { i := idx[j] src := a if i&0b0100_0000 != 0 { src = b } dst[j] = src[i&0b0011_1111] } return dst } </code></pre> Basically, for a lookup table of 8-bit values, you need only 1 instruction to perform up to 64 lookups simultaneously, for each 128 bytes of table.
lordnacho121 days ago
When I optimize stuff, I just think of the SIMD instructions as a long sandwich toaster. You can have a normal toaster that makes one sandwich, or you can have a 4x toaster that makes 4 sandwiches as once. If you have a bunch of sandwiches to make, obviously you want to align your work so that you can do 4 at a time.If you want to make 4 at a time though, you have to keep the thing fed. You need your ingredients in the cache, or you are just going to waste time finding them.
jasonthorsness121 days ago
Compared to GPU programming the gains from SIMD are limited but it's a small-multiple boost and available pretty much everywhere. C# makes it easy to use through Vector classes. WASM SIMD still has a way to go but even with the current 128-bit you can see dramatic improvements in some buffer-processing cases (I did a little comparison demo here showing a 20x improvement in bitwise complement of a large buffer: <a href="https://www.jasonthorsness.com/2" rel="nofollow">https://www.jasonthorsness.com/2</a>)
- corysama121 days ago
 > a small-multiple boostQuick reminder that a 20x boost is better than going from O(n) to O(log n) for up to a million items. And, that log n algorithms often are simply not possible for many problems.
 - ccapitalK121 days ago
 Am I getting the math wrong here? Going from O(n) to O(log n) (with no change in constant factor) for a million items would be going from ~1000000c ops to ~20c ops, which would be a 50000x improvement?
 - DarkNova6121 days ago
 Big O Notation omis constant factors that tend to be significantly larger for log-n algorithms.I think he talked from personal experience.
 - f1shy121 days ago
 Could be. But very poorly stated if so.Anyway, I do not think that even "typically" such statement can remotely be truth. It is 2 orders of magnitude away (20 to 5000).
 DarkNova6121 days ago
 True, but it wouldn’t be unthinkable. Especially if the O(n) algorithm accesses data sequentially and the O(log n) has indirections.But maybe the author simply made it up.
 corysama120 days ago
 You are all right. I was wrong. Should not post math just before going to sleep… So, let’s make the same mistake first thing in the morning:x=20log2(x) at 143<a href="https://www.wolframalpha.com/input?i=x%3D20*log2%28x%29" rel="nofollow">https://www.wolframalpha.com/input?i=x%3D20*log2%28x%29</a>
- zozbot234121 days ago
 The WASM folks should just include an arbitrary-length vector compute extension. We should also explore automatically compiling WASM to GPU compute as appropriate, the hardware independence makes it a rather natural fit for that.
 - galangalalgol120 days ago
 Aren't they loosely modeled on riscV? RiscV has an arbitrary length vector extension. So did cray. They seem like a good idea. In practice, I usually structure intermediate data products in memory with the largest simd size I expect the vectorizer will ever optimize for, so compiling it for something longer won't get me anything.
- adrian_b121 days ago
 You mean the gains from SIMD on the CPU, because all GPU programs use even wider SIMD than the CPUs, with width of 1024 bits or 2048 bits, for even greater die area and power consumption savings in comparison with non-SIMD.The width of the SIMD instructions is not visible when programming with NVIDIA CUDA or with the similar compilers for Intel CPUs (ispc or oneAPI with SYCL targeting CPUs or OpenMP with appropriate pragmas), but only because the compiler takes care of that.
- ncruces121 days ago
 I merged a few PRs to SIMD optimize Wasm WASI libc, but it all got stalled in str(c)spn (which is slightly more sophisticated than the rest).There wasn't much appetite for any of it on Emscripten.<a href="https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+optional+simd" rel="nofollow">https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+opt...</a>
 - jasonthorsness121 days ago
 subscribed to the str(c)spn thread for the eventual explanation of why the non-simd version seemed to give the wrong answer
 - ncruces121 days ago
 Very likely the issue is with the test infrastructure and not with the code under test.My own fuzzing doesn't report any inconsistencies. But fuzzing is always necessarily incomplete.
- fulafel121 days ago
 The high arithmetic bandwidth on GPUs is of course SIMD based as well. They just tend to have a ISPC style compilation model that doesn't expose the SIMD lanes in the source code. (Whereas SIMD even after decades is very lightly utilized by compilers on the CPU side).
 - zozbot234121 days ago
 It's SIMD-based at the lowest level, but there's also the use of very high hardware multithreading (the threads are called, AIUI, "wavefronts" or "warps") on each compute unit/stream processor to hide memory access latency. Recent SPARC CPU's have 8-way hardware multithreading on the individual CPU core, GPU's can easily go even higher than that.
 - fulafel120 days ago
 Yep, this also reflects the design target of GPUs targeting much larger working sets, so have higher main memory bandwidth and rely less on caches. CPUs rather have few fast threads of execution working on hot cached data than many slow ones talking to main memory (because N-way thread level parallelism often splits your cache N ways, to N working sets)
p0nce121 days ago
4 lanes of SIMD (like in say SSE) is not necessarily 4x faster because of the memory access, sometimes it's better than that (and often it's less).PSHUFB wins in case of unpredictable access patterns. Though I don't remember how much it typically wins.PMOVMSKB can replace several conditionals (up to 16 in SSE2 for byte operands) with only one, winning in terms of branch prediction.PMADDWD is in SSE2, and does 8 byte multiplies not 4. SSE4.1 FP rounding that doesn't require changing the rounding mode, etc. The weird string functions in SSE4.2. Non-temporal moves and prefetching in some cases.The cool thing with SIMD is that it's a lot less stress for the CPU access prediction and branch prediction, not only ALU. So when you optimize it will help unrelated parts of your code to go faster.
chasil121 days ago
The author has neglected the 3DNow! SIMD instructions from AMD.They were notable for several reasons, although they are no longer included in modern silicon.<a href="https://en.wikipedia.org/wiki/3DNow" rel="nofollow">https://en.wikipedia.org/wiki/3DNow</a>!
- srean120 days ago
 I remember having their manual around. I vaguely remember that the Athlon series used to be faster with floating points than the then new Pentium-4s. For that matter even the Pentium-3s had better floating point performance than Pentium-4s.
dpifke121 days ago
Related: Go is looking to add SIMD intrinsics, which should provide a more elegant way to use SIMD instructions from Go code: <a href="https://go.dev/issue/73787" rel="nofollow">https://go.dev/issue/73787</a>
vardump121 days ago
Wider SIMD would be useful, especially with AVX-512 style improvements. 1024 or even 2048 bits wide operations.Of course memory bandwidth should increase proportionally otherwise the cores might have no data to process.
- TinkersW121 days ago
 I wouldn't mind, but might need to increase cache line size on x86, as avx512 has reached the current size.
 - p_l121 days ago
 AVX-512 size is explicitly a derivative of P54C having 64 byte cache line in fact
- TimorousBestie121 days ago
 I would love to be able to fit small matrices (4x4 or 16x16 depending on precision) in SIMD registers together with intrinsics for matrix arithmetic.
- owlbite121 days ago
 Much better to burn the area for multiple smaller units, its a bit more area for frontend handling, but worth it for the flexibility (see Apple's M-series chips vs intel avx*).
 - Remnant44121 days ago
 Yes and no. I think neon is undersized for today at 128bit registers -- if you're working with doubles for example, that's only two values per register, which is pretty anemic. Things like shuffles and other tricky bitops benefit from wider widths as well (see my other reply)
 - adgjlsfhk1121 days ago
 Agreed that 128 bit is undersized, but 512 feels pretty good for the time being. We're unlikely to see further size increases since going to 1024 would require doubling the cache line, register file, and ram bandwidth, while just adding an extra fma port is far less hardware.
 - Remnant44121 days ago
 totally - especially given how bandwidth constrained CPUs still are, going wider than 512 doesn't make much sense. 512 itself was a stretch for quite a long time (and all the negative press on the original implementations was a consequence of being not-quite-ready for primetime), but for current hardware I think it's perfect.But 128bit is just ancient. If you're going to go to significant trouble to rewrite your code in SIMD, you want to at least get a decent perf return on investment!
 adgjlsfhk1121 days ago
 128 bit is already really nice for things like Int8 comparison (e.g lots of string operations and Swiss Dict key search)
- api121 days ago
 This would start looking a lot like a GPU.
 - Veliladon121 days ago
 GPUs are literal SIMD devices. Usually 32 or 64 ALU lanes.
 - tubs121 days ago
 They are sim-t. It’s slightly different. I made them for a living for quite some time.
 - adrian_b120 days ago
 Nope.That is just a software view provided by the CUDA compiler and the NVIDIA device driver, which take care of distributing the computation over SIMD lanes and real hardware threads (called warps in the NVIDIA obfuscating jargon).The GPU processor is just a processor with SIMD and FGMT (fine-grained multi-threading). There is nothing special from this point of view.For any CPU you can write a compiler that provides the "SIMT" software model and there are such compilers for the Intel/AMD CPUs.SIMT is just one of many useless synonyms created by NVIDIA for traditional terms previously used for decades in computing publications, i.e. it is the same with what Hoare called "array of processes" (in 1978-08, later implemented in the language Occam) and actually the same with "parallel for" of OpenMP, just with a slightly different syntax (i.e. CUDA separates in the source file the header and the body of the parallel FOR).
 msla118 days ago
 Maybe you should listen to the expert instead of thinking you know better?
 - Bolwin121 days ago
 What's 't'? Textures? Triangles? Tensors?
 craigacp121 days ago
 SIMT - Single Instruction, Multiple Threads
 kmeisthax121 days ago
 Threads.It's less a difference in instruction set capability and more a difference in mentality.Like, for SIMD, you have to say "ok, we're working in vector land now" and start doing vector loads into vector registers to do vector ops on them. Otherwise, the standard variables your program uses are scalars and you get less parallelism. On a GPU this is flipped: the regular registers are vector, and the scalar ones (if you have any) are the weird ones. And because of this the code you write is (more or less) scalar code where everything so happens to magically get done sixteen times at once.As you can imagine, this isn't foolproof and there's a lot of other things that have to change on GPU programming in order to be viable. Like, conditional branching has to be scalar, since the instruction pointer register is still scalar. But you can have vectors of condition flags (aka "predicates"), and make all the operations take a predicate register to tell which specific lanes should and shouldn't execute. Any scalar conditional can be compiled into predicates, so long as you're OK with having to chew through all instructions on both branches[0].[0] A sufficiently smart shader compiler could check if the predicate is all-false or all-true and do a scalar jump over the instructions that won't execute. Whether or not that's a good idea is another question.
 jabl121 days ago
 One way to think of SIMT is that instead of vector instructions you have a 'fork' instruction which turns on vector mode where the scalar instructions execute on all the vector lanes. Your SIMT code must then include a 'lane index' variable somewhere (of course e.g. in CUDA it's more complicated with blocks, warps etc etc., but in principle it's just a more detailed way of doing lane indexing) so that all the threads work on different data. There is traditionally a shared program counter (PC) (in reality on GPU's, something like per-warp PC's so you still have multiple PC's), where in case of divergent control flow lanes are masked off (though post-Volta Nvidia HW has per-lane PC's). Then finally when you're done with your parallel algorithm you execute a 'join' instruction which blocks until all the lanes have reached that point, and then turns off all the lanes except one so you're now in scalar mode again.Now whether this is actually how the hardware operates or whether the compiler in the GPU driver turns the SIMT code into something like SIMD code for the actual HW is another question.
 HelloNurse121 days ago
 The different "mentality" is whether sequential execution or lockless parallelism prevails. SIMD instructions in a CPU are very small islands of parallelism that do a strictly limited number of elementary small jobs simultaneously within a single opcode, with trivial synchronization (the next opcode is executed next); SIMT jobs in a GPU are an arbitrary number and arbitrarily long and complex, you don't know whether they are waiting or running or done without explicit synchronization primitives, and in-order execution doesn't extends beyond the boundary of a shader.
 taktoa121 days ago
 I think what you're describing is SPMD, which is a compilation strategy, not a hardware architecture. I am not sure but I think SIMT is SIMD but with multiple program counters (1 per N lanes) to enable some limited control flow divergence between lane groups.
 AlotOfReading121 days ago
 The PC is shared in traditional SIMT, but diverging branches are masked out until they execute. Nvidia introduced per-thread PCs with Volta. I think AMD still uses a shared PC across each wavefront?
- account4mypc121 days ago
 AMX registers are 1024 *bytes*
dang121 days ago
Recent and related:Why do we even need SIMD instructions? - <a href="https://news.ycombinator.com/item?id=44850991">https://news.ycombinator.com/item?id=44850991</a> - Aug 2025 (8 comments)
kristianp121 days ago
No mention of branches, which is a complementary concept. If you unwind your loop, you can get part of the way to SIMD performance by keeping the CPU pipeline filled.
imtringued120 days ago
While the blog post clearly highlights the core benefit of SIMD.>Yet this new family of instructions was able to describe a great deal more work per instruction – the key benefit of SIMD.It doesn't really explain why it is a benefit. From a ALU perspective, each SIMD lane could in theory execute a different operation. So why would simultaneous instructions as in SIMD ever be a thing? You could just keep cranking up the ILP (instruction level parallelism) by adding more ALUs, naturally extending to MIMD. Each SIMD lane would be equivalent to a full CPU core, so why is everyone stopping short of unlocking even more performance?Because instruction memory (SRAM) is incredibly expensive.By using SIMD you are reducing the number of instructions needed to describe repeating calculations. If you work with 32-bit floats and 512-bit instructions representing 16 lanes, you've made a 16-fold reduction to the required quantity of instruction memory and subsequent fetches from that memory to express the same parallel computation compared to using ILP.
- d_tr120 days ago
 Wouldn't decoding also be an issue?
Panzerschrek121 days ago
The main problem with SIMD instructions is that regular code doesn't use them. Almost always someone need to write SIMD code manually to achieve good performance, which is rarely done and if so, only in some tight loops and niche cases. Like cryptography-related code in a browser may be SIMD-based, but other code uses almost no SIMD.Modern compilers are able sometimes to vectorize regular code, but this is done only occasionally, since compilers can't often prove that read/write operations will access valid memory regions. So one still needs to write his code in such a way that compiler can vectorize it, but such approach isn't reliable and it's better to use SIMD instruction directly to be sure.
- Tuna-Fish121 days ago
 And this is one of the biggest reasons why AVX-512 is so important. It doesn't just double the register width, it adds masking, and crucially suppresses faults from memory accesses when the corresponding element is masked. This makes it much more practical for compilers to automatically vectorize generated code.
erichocean121 days ago
ISPC is a great way to target this. C-like, generates the SIMD code, can be called from C/C++/etc.
aboardRat4121 days ago
Why does such an abbreviation still exist in 2025?They have been in the CPUs for so long that I expected them to be inseparable to the degree that people wouldn't even remember they were a separate thing in the past.