Efficient branching on double vector comparison using intrinsics?

Hi,

I want to check the range of a vector of double-precision variables, in order to branch to a slow path on exceptional out-of-range cases. My code looks like the following:

    // if(any(!(x < 4.) || (x < 2))) { ... }
    __mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
    __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
    if(!_mm512_kortestz(toobig, toosmall)) {
        // do something with out-of-range numbers (slow path)
    }
    // do something with in-range numbers (fast path)
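
For readers without the target hardware, the mask logic above can be modeled in portable scalar C++. This is only a sketch of the intended semantics, not the intrinsics themselves: `Mask8`, `Vec8d`, `cmpnlt_mask`, `cmplt_mask`, and `any_out_of_range` are hypothetical stand-ins introduced here for illustration. It assumes the "not less than" compare sets a lane's bit whenever `x < bound` is false, which includes NaN lanes, so NaNs are routed to the slow path.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>

using Mask8 = std::uint8_t;           // hypothetical stand-in for __mmask8
using Vec8d = std::array<double, 8>;  // hypothetical stand-in for 8 packed doubles

// Models a "not less than" compare: bit i is set when !(x[i] < bound).
// A NaN lane compares false under <, so its bit is set.
Mask8 cmpnlt_mask(const Vec8d& x, double bound) {
    Mask8 m = 0;
    for (int i = 0; i < 8; ++i)
        if (!(x[i] < bound)) m |= static_cast<Mask8>(1u << i);
    return m;
}

// Models a "less than" compare: bit i is set when x[i] < bound.
Mask8 cmplt_mask(const Vec8d& x, double bound) {
    Mask8 m = 0;
    for (int i = 0; i < 8; ++i)
        if (x[i] < bound) m |= static_cast<Mask8>(1u << i);
    return m;
}

// Models !kortestz(toobig, toosmall): true when any bit of either mask is set.
bool any_out_of_range(const Vec8d& x) {
    const Mask8 toobig = cmpnlt_mask(x, 4.0);
    const Mask8 toosmall = cmplt_mask(x, 2.0);
    return (toobig | toosmall) != 0;
}

// Small self-check of the model; call it from main() if desired.
void self_check() {
    Vec8d v;
    v.fill(3.0);                       // every lane inside [2, 4)
    assert(!any_out_of_range(v));
    v[7] = std::numeric_limits<double>::quiet_NaN();
    assert(any_out_of_range(v));       // NaN takes the slow path
}
```

In this model the accepted range is the half-open interval [2, 4): a lane equal to 2.0 stays on the fast path, while a lane equal to 4.0 sets its bit in `toobig`.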

I expect it to map to a 3-instruction sequence (two vcmppd compares and a kortest) ahead of the branch. However, icc (13.1) seems to generate extra data movement and masking between the comparisons and the test:

###     __mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
        vcmpnltpd k2, zmm0, QWORD PTR .L_2il0floatpacket.5[rip]{1to8} #20.23 c1
###     __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
        vcmpltpd  k3, zmm0, QWORD PTR .L_2il0floatpacket.6[rip]{1to8} #21.25 c5
        kmov      eax, k2                                       #20.23 c9
        mov       dl, dl                                        #21.25 c9
        kmov      edx, k3                                       #21.25 c13
###     if(!_mm512_kortestz(toobig, toosmall)) {
        movzx     eax, al                                       #22.9 c13
        movzx     edx, dl                                       #22.9 c17
        kmov      k0, eax                                       #22.9 c17
        kmov      k1, edx                                       #22.9 c21
        kortest   k0, k1                                        #22.9 c25
        je        ..B3.3        # Prob 50%                      #22.9 c25

It seems the compiler moves each mask into a general-purpose register (kmov), zero-extends it (movzx) to clear the high-order bits, and then moves it back into a mask register. As I understand it, vcmppd already clears the upper part of the mask, so the zero-extend instructions do not seem to serve any useful purpose. Since the code before the branch is on the critical path, I would rather avoid the overhead.

I am attaching a self-contained repro case, compiled with: icpc -mmic -fsource-asm -masm=intel -S mmask8.cpp

If I am not using the proper idiom, what is the recommended way to test __mmask8 variables?

