Hi,
I want to check the range of a vector of double-precision variables, in order to branch to a slow path on exceptional out-of-range cases. My code looks like the following:
// if(any(!(x < 4.) || (x < 2))) { ... }
__mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
__mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
if(!_mm512_kortestz(toobig, toosmall)) {
// do something with out-of-range numbers (slow path)
}
// do something with in-range numbers (fast path)
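To make the intended semantics explicit, here is a minimal scalar sketch of the same range test. It is only a reference model, not the vector code: the names `Masks`, `range_masks` and `any_out_of_range` are hypothetical, and bit i of each mask is assumed to correspond to lane i. Note that `!(x < 4.)` is true for NaN, so NaN lanes also take the slow path.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar model of the two vector compares (not an intrinsic API).
// Bit i of each mask corresponds to lane i of the 8-wide double vector.
struct Masks {
    std::uint8_t toobig;    // lanes where !(x < 4.0); this includes NaN
    std::uint8_t toosmall;  // lanes where x < 2.0
};

Masks range_masks(const double x[8]) {
    Masks m = {0, 0};
    for (int i = 0; i < 8; ++i) {
        if (!(x[i] < 4.0)) m.toobig   |= static_cast<std::uint8_t>(1u << i);
        if (x[i] < 2.0)    m.toosmall |= static_cast<std::uint8_t>(1u << i);
    }
    return m;
}

// Models !kortestz(toobig, toosmall): true if any lane is out of range.
bool any_out_of_range(const Masks& m) {
    return (m.toobig | m.toosmall) != 0;
}
```

The fast path is valid only when `any_out_of_range` returns false, that is, when every lane satisfies 2 <= x < 4.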
I expect it to map to a 3-instruction sequence (two compares and a kortest). However, icc (13.1) seems to generate extra data movement and masking between the comparisons and the test:
### __mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
vcmpnltpd k2, zmm0, QWORD PTR .L_2il0floatpacket.5[rip]{1to8} #20.23 c1
### __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
vcmpltpd k3, zmm0, QWORD PTR .L_2il0floatpacket.6[rip]{1to8} #21.25 c5
kmov eax, k2 #20.23 c9
mov dl, dl #21.25 c9
kmov edx, k3 #21.25 c13
### if(!_mm512_kortestz(toobig, toosmall)) {
movzx eax, al #22.9 c13
movzx edx, dl #22.9 c17
kmov k0, eax #22.9 c17
kmov k1, edx #22.9 c21
kortest k0, k1 #22.9 c25
je ..B3.3 # Prob 50% #22.9 c25
It seems the compiler generates instructions to clear the high-order bits of the mask. As I understand it, vcmppd already clears the upper part of the mask, so the zero-extend instructions do not seem to serve any useful purpose. Since the code before the branch is on the critical path, I would rather avoid the overhead.
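The reasoning can be sketched with a small model. This is an illustration under the assumption stated above (that the compare leaves bits 8..15 of the mask register at zero), not a description of the hardware; `compare_result` and `zero_extend_low_byte` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical model of a 16-bit mask register written by an 8-lane compare,
// assuming the compare zeroes bits 8..15.
std::uint16_t compare_result(std::uint8_t lane_bits) {
    return static_cast<std::uint16_t>(lane_bits);  // upper 8 bits are zero
}

// Models the kmov / movzx / kmov round trip: keep only the low byte.
std::uint16_t zero_extend_low_byte(std::uint16_t k) {
    return static_cast<std::uint16_t>(k & 0xFFu);
}
```

Under that assumption, `zero_extend_low_byte(compare_result(b))` equals `compare_result(b)` for every 8-bit value `b`, so the round trip through the general-purpose registers changes nothing.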
I am attaching a self-contained reproducer, compiled with: icpc -mmic -fsource-asm -masm=intel -S mmask8.cpp
If I am not using the proper idiom, what is the recommended way to test __mmask8 variables?