<p><pre><code> vpcmpestri xmm2, xmm3, BYTEWISE_CMP
test cx, 0x10 ; ZF=1 if ecx != 16 (a match was found)
</code></pre>
I see this test/cmp right after the instruction all the time and I don't understand it. pcmpestri already sets ZF if edx < 16 and SF if eax < 16, and CF tells you whether anything matched, so it is already giving you the status you need. Also, testing a sub-register of the larger register is very slow and is a pipeline hazard.<p>You've got this monster of an instruction and then people wrap all this paranoid slowness around it. Am I reading the x86 manual wrong?
I think people started doing that after one of the Intel SSE examples did it, and everyone just copied it.<p>On any modern CPU there should be essentially no penalty for it, though. Reading a sub-register like cx is basically free; the partial-register stall only happens when you write part of a register and then read the whole thing (write AH, then read AX), and I don't think this code can stall that way on anything newer than a Core 2 era processor. Still, replacing the test with a "jnc", or a branch on whichever flag you're actually interested in, would at least be fewer instructions. I'd love to see benchmarks, though, if someone has dug deeper into this than I have.
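<p>For what it's worth, here is a rough sketch of the "just use the flag" version, written with C intrinsics instead of asm so it is easy to compile and check. Everything in it is my own assumption rather than anything from the Intel example: the helper name find_first_of is made up, I'm guessing that BYTEWISE_CMP means unsigned bytes in "equal any" mode, and the target attribute assumes GCC or Clang on x86. _mm_cmpestrc returns the CF that pcmpestri produces, and _mm_cmpestri returns ECX.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_cmpestri, _mm_cmpestrc */
#include <stddef.h>
#include <string.h>

/* Assumed meaning of BYTEWISE_CMP: unsigned bytes, "equal any",
   report the lowest matching index. The thread never defines it. */
#define BYTEWISE_CMP \
    (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical helper: index of the first byte of hay[0..hay_len) that also
   occurs in set[0..set_len), or -1 if there is none. set_len must be 1..16.
   The target attribute is GCC/Clang-specific. */
__attribute__((target("sse4.2")))
long find_first_of(const char *hay, size_t hay_len,
                   const char *set, int set_len)
{
    char set_buf[16] = {0};
    memcpy(set_buf, set, (size_t)set_len);
    const __m128i needles = _mm_loadu_si128((const __m128i *)set_buf);

    for (size_t off = 0; off < hay_len; off += 16) {
        size_t left = hay_len - off;
        int block_len = left < 16 ? (int)left : 16;

        /* Copy into a padded buffer so we never read past the input.
           Real code would load directly when 16 bytes are readable. */
        char block_buf[16] = {0};
        memcpy(block_buf, hay + off, (size_t)block_len);
        const __m128i block = _mm_loadu_si128((const __m128i *)block_buf);

        /* CF is set if and only if some byte in this block matched. */
        if (_mm_cmpestrc(needles, set_len, block, block_len, BYTEWISE_CMP)) {
            /* Same operands and imm8, so this is the ECX result of the
               same comparison: the index of the first match. */
            return (long)off +
                   _mm_cmpestri(needles, set_len, block, block_len, BYTEWISE_CMP);
        }
    }
    return -1;
}
```

Called as find_first_of("hello, world", 12, ",!", 2), this should return 5. In my experience compilers fold the two intrinsics into a single pcmpestri followed by a branch on CF, but that is worth confirming in the disassembly before relying on it.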