8 comments

  • ridiculous_fish 8 hours ago
    This paper missed the state of the art!!!

    This concerns unsigned division by a 32-bit constant divisor. Compilers routinely optimize this to a high-multiply by a "magic number", but this number may be up to 33 bits (worst-case, divisor dependent; 7 is a problem child). So you may need a 32x33-bit high multiply.

    What compilers do today is some bit tricks to perform a 32x33-bit high multiply through a 32x32 multiply, handling the last bit specially through additions and bitshifts. That's the "GM Method" in the paper; the juice to be squeezed out is the extra stuff to handle the 33rd bit.

    What the paper observes is that the 32x33 high multiply can be performed via a 64x64 high multiply, and then the extra stuff goes away. Well yes, of course.

    But amazingly, in ALL of these worst-case situations you can compute a different magic number, perform a (saturating) increment of the dividend at runtime, and ONLY need a 32-bit high multiply.

    That is, these are the two algorithms for unsigned division by constants where the magic number overflows:

    - This paper: promote to twice the bit width, then high multiply (32x64 => 64), then bitshift

    - SOTA: saturating increment of the dividend, then high multiply (32x32 => 32), then bitshift

    Probably the paper's approach is a little better on wide multipliers, but they missed this well-known technique published 15 years ago (and before that, I merely rediscovered it):

    https://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
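    For the problem child d = 7, the increment scheme can be sketched like this in C (a sketch under my reading of the technique, not code from the blog post; the round-down magic is m = floor(2^33 / 7) = 1227133513, with a post-shift of 1):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Unsigned divide by 7 via a round-down magic number plus a
       saturating increment of the dividend -- only a 32x32->64 multiply.
       m = floor(2^33 / 7); q = (m * sat_inc(x)) >> 33. */
    static uint32_t div7_satinc(uint32_t x) {
        const uint32_t m = 1227133513u;       /* floor(2^33 / 7) */
        uint32_t xp = x + (x != UINT32_MAX);  /* saturating increment */
        return (uint32_t)(((uint64_t)m * xp) >> 33);
    }
    ```

    The saturating increment is what rescues x = 2^32 - 1: a plain x + 1 would wrap to 0 and give the wrong quotient.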
    • purplesyringa 8 hours ago
      The paper doesn't require a bitshift after multiplication -- it directly uses the high half of the product as the quotient, so it saves at least one tick over the solution you mentioned. And on x86, saturating addition can't be done in a tick and 32->64 zero-extension is implicit, so the distinction is even wider.
      • aleph_minus_one 5 hours ago
        > And on x86, saturating addition can't be done in a tick

        Perhaps I misunderstand your point, but I am rather sure that in SSE.../AVX... there do exist instructions for saturating addition:

        * (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW

        * (V)PHADDSW, (V)PHSUBSW
        • dzaima 3 hours ago
          Unfortunately, that's only vector, and ≤16-bit ints at that, no 32-bit ints; and as the other reply says, nearly non-existent multiply-high, which generally makes vectorized div-by-const its own mini-hell (but doing a 2x-width multiply with fixups is still better than the OP 4x-width method).

          (...though, x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)

          (And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP method of high-half-of-4x-width-multiply works out nicely for vectorizing dividing 8-bit ints too)
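          A sketch of that 16-bit case with SSE2 intrinsics (divisor 7; the constants here -- m = floor(2^18 / 7) = 37449 with a post-shift of 2 -- are my own worked example, not from the thread; note the unsigned high multiply is PMULHUW rather than the signed pmulhw):

          ```c
          #include <assert.h>
          #include <emmintrin.h>
          #include <stdint.h>

          /* Divide eight uint16_t lanes by 7: saturating +1 (PADDUSW),
             unsigned high multiply (PMULHUW), then the remaining 2-bit shift.
             m = floor(2^18 / 7) = 37449; q = (m * sat_inc(x)) >> 18. */
          static __m128i div7_u16x8(__m128i x) {
              const __m128i m = _mm_set1_epi16((short)37449u);
              __m128i xp = _mm_adds_epu16(x, _mm_set1_epi16(1)); /* saturating increment */
              __m128i hi = _mm_mulhi_epu16(xp, m);               /* (xp * m) >> 16 */
              return _mm_srli_epi16(hi, 2);                      /* >> 18 total */
          }
          ```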
        • oxxoxoxooo 3 hours ago
          On x86, there is no vector instruction to get the upper half of a 64-bit x 64-bit integer product. ARM SVE2 and RISC-V RVV have one; x86 unfortunately does not (and probably won't for a long time, as AVX10 does not add it either).
          • namibj 3 hours ago
            There is one for bignum libraries, recycling the f64 FMA hardware: IFMA from AVX-512. It's a 52-bit unsigned multiply that accumulates either the low or the high half of the output into a 64-bit accumulator.

            It's surely no 64 bit, but it's much more than 32 bit. And it gives you access to the high halves, so you can use it to compute 32x32->64 on vector, even if only half as packed as that could be.
    • vrighter 5 hours ago
      From my research you can always fit it in a 32x32 multiply; you only need the extra bit at compile time. The extra bit tells you how to adjust the result in the end, but the adjustment is also a constant.
    • dooglius 8 hours ago
      Any idea why no one applied this to llvm in the interim?
  • foltik 9 hours ago
    Compilers already optimize division of a 32-bit integer x by a constant d like:

        c = 2^a / d    so:    d = 2^a / c    so:    x / d = (x * c) / 2^a

    And /2^a is equivalent to >>a, so we have:

        x / d = (x * c) >> a

    Which is just a multiply and shift, cheaper than integer division.

    Problem is, with some divisors like 7, 14, 21, ... the smallest c that works to produce an exact result is 33 bits. Awkward on 32-bit CPUs; it just barely doesn't fit, but you can account for that extra 1 bit manually with an extra sequence of sub, shift, add.

    What the paper points out is twofold:

    1. On 64-bit CPUs the extra sequence isn't required: you can just do a full 64x64-bit multiply and a shift of the 128-bit product. Back to just a multiply and shift like the original optimization, and already faster than the workaround.

    2. Even better, the shift itself can be optimized away. A 64x64-bit multiply stores the 128-bit product in a pair of 64-bit registers, meaning you basically get >> 64 for free by just looking at the upper-half register in the pair. That means if you pre-shift the constant to:

        c' = c << (64 - a)

    Now (x * c') >> 64 is equivalent to (x * c) >> a, and the result is just waiting for you in a 64-bit register, no shift needed.

    Just one multiply to divide a 32-bit integer by 7. Nice.
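    A minimal sketch of that endgame in C (assuming GCC/Clang, which provide `unsigned __int128` for the 64x64->128 multiply; the pre-shifted magic constant here is ceil(2^64 / 7)):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Divide a 32-bit unsigned integer by 7 with a single 64x64 multiply.
       magic = ceil(2^64 / 7); the quotient is the high 64 bits of
       x * magic, i.e. the upper register of the 128-bit product --
       no explicit shift instruction needed. */
    static uint32_t div7(uint32_t x) {
        const uint64_t magic = 0x2492492492492493ULL; /* ceil(2^64 / 7) */
        return (uint32_t)(((unsigned __int128)x * magic) >> 64);
    }
    ```

    The `>> 64` compiles to simply reading the high-half register of the widening multiply, so the whole division is one mul plus a register move.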
  • dmcq2 20 minutes ago
    At this rate it might be worthwhile turning integer division by an unknown that is constant in a loop into a call to a routine that generates the inverse, which is then used within the loop. Though I guess that would be pretty infrequent compared to the floating-point equivalent.
  • purplesyringa 8 hours ago
    I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either.

    I *guess* everyone just assumed that this is so well-known now that compilers have certainly integrated it, but no one actually bothered to submit a patch until now, when it was reinvented?
  • gopalv 9 hours ago
    The linked clang PR is also very readable.

    https://github.com/llvm/llvm-project/pull/181288/files

    As the PR clearly points out, you can do this in a register but not inside vectors.

    I don't think fastdiv has had an update in years, which is what I've used because compilers can't do "this is a constant for the next loop of 1024" like columnar SQL needs.
  • pkaye 8 hours ago
    Look up the book Hacker's Delight if you like these low-level algorithms and bit manipulations.