Intrinsics for Load Operations

variable	definition
`src`	source element to use based on writemask result
`k`	writemask used as a selector
`mem_addr`	pointer to base address in memory
`base_addr`	pointer to base address in memory to begin load or store operation

_mm_mask_expandloadu_pd

__m128d _mm_mask_expandloadu_pd(__m128d src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandpd

Load as many contiguous double-precision (64-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 2 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).
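
The expand-load behaviour described above can be sketched with a portable scalar model. This is not the intrinsic itself: the function name expandload_merge_pd and the lanes parameter are hypothetical and only mirror the description (lanes would be 2 for __m128d and 4 for __m256d).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of a merge-masked expand-load.
   Returns the number of contiguous elements consumed from mem. */
static size_t expandload_merge_pd(double *dst, const double *src,
                                  unsigned mask, const double *mem,
                                  size_t lanes)
{
    assert(lanes <= 8);           /* an 8-bit mask covers at most 8 lanes */
    size_t next = 0;              /* index of the next contiguous memory element */
    for (size_t i = 0; i < lanes; i++) {
        if ((mask >> i) & 1u)
            dst[i] = mem[next++]; /* mask bit set: take the next memory element */
        else
            dst[i] = src[i];      /* mask bit clear: keep the element from src */
    }
    return next;
}
```

In this model, memory is read only for set mask bits, so a mask of binary 1010 over four lanes consumes two elements and places them in result positions 1 and 3.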

_mm_maskz_expandloadu_pd

__m128d _mm_maskz_expandloadu_pd(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandpd

Load as many contiguous double-precision (64-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 2 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm256_mask_expandloadu_pd

__m256d _mm256_mask_expandloadu_pd(__m256d src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandpd

Load as many contiguous double-precision (64-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm256_maskz_expandloadu_pd

__m256d _mm256_maskz_expandloadu_pd(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandpd

Load as many contiguous double-precision (64-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm_mask_expandloadu_ps

__m128 _mm_mask_expandloadu_ps(__m128 src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandps

Load as many contiguous single-precision (32-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm_maskz_expandloadu_ps

__m128 _mm_maskz_expandloadu_ps(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandps

Load as many contiguous single-precision (32-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm256_mask_expandloadu_ps

__m256 _mm256_mask_expandloadu_ps(__m256 src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandps

Load as many contiguous single-precision (32-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 8 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm256_maskz_expandloadu_ps

__m256 _mm256_maskz_expandloadu_ps(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vexpandps

Load as many contiguous single-precision (32-bit) floating-point elements from unaligned memory at mem_addr as there are ones in the low 8 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm_mmask_i32gather_pd

__m128d _mm_mmask_i32gather_pd(__m128d src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherdpd

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.
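
As an illustration of the addressing and merging described above, here is a scalar sketch. The name gather_i32_pd_model and the lanes parameter are hypothetical. The sketch assumes each index is sign-extended and multiplied by scale to form a byte offset from base_addr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a merge-masked gather of doubles
   addressed by 32-bit indices. */
static void gather_i32_pd_model(double *dst, const double *src, unsigned mask,
                                const int32_t *vindex, const void *base_addr,
                                int scale, int lanes)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *base = (const unsigned char *)base_addr;
    for (int i = 0; i < lanes; i++) {
        if ((mask >> i) & 1u) {
            /* assumed addressing: base + sign-extended index * scale, in bytes */
            ptrdiff_t offset = (ptrdiff_t)vindex[i] * scale;
            memcpy(&dst[i], base + offset, sizeof(double));
        } else {
            dst[i] = src[i];      /* mask bit clear: keep the element from src */
        }
    }
}
```

With scale set to 8, the indices in this model act as ordinary element indices into an array of doubles.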

_mm256_mmask_i32gather_pd

__m256d _mm256_mmask_i32gather_pd(__m256d src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherdpd

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm_mmask_i32gather_ps

__m128 _mm_mmask_i32gather_ps(__m128 src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherdps

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm256_mmask_i32gather_ps

__m256 _mm256_mmask_i32gather_ps(__m256 src, __mmask8 k, __m256i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherdps

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm_mmask_i64gather_pd

__m128d _mm_mmask_i64gather_pd(__m128d src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherqpd

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm256_mmask_i64gather_pd

__m256d _mm256_mmask_i64gather_pd(__m256d src, __mmask8 k, __m256i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherqpd

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm_mmask_i64gather_ps

__m128 _mm_mmask_i64gather_ps(__m128 src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherqps

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm256_mmask_i64gather_ps

__m128 _mm256_mmask_i64gather_ps(__m128 src, __mmask8 k, __m256i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vgatherqps

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.
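
The gather signatures above pair result and index vectors of different widths. One way to reason about how many elements a given form processes is sketched below. The helper gather_lane_count is hypothetical, and it assumes the element count is limited by whichever of the two vectors holds fewer elements.

```c
#include <assert.h>

/* Hypothetical helper: number of elements a gather form processes,
   assuming it is bounded by both the result and the index vector. */
static int gather_lane_count(int result_bits, int elem_bits,
                             int vindex_bits, int index_bits)
{
    assert(elem_bits > 0 && index_bits > 0);
    int result_lanes = result_bits / elem_bits;   /* elements in the result */
    int index_lanes = vindex_bits / index_bits;   /* indices in vindex */
    return result_lanes < index_lanes ? result_lanes : index_lanes;
}
```

Under that assumption, _mm256_mmask_i64gather_ps (a 128-bit result of 32-bit elements with 256 bits of 64-bit indices) processes four elements, which matches its __m128 return type.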

_mm_mask_load_pd

__m128d _mm_mask_load_pd(__m128d src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovapd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
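
The writemask and alignment requirements above can be modelled in scalar C. The helpers is_aligned and mask_load_pd_model are hypothetical, and the assert only stands in for the exception that a misaligned address may raise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check: is p a multiple of boundary bytes? */
static int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}

/* Hypothetical scalar model of a 128-bit merge-masked aligned load. */
static void mask_load_pd_model(double dst[2], const double src[2],
                               unsigned mask, const double *mem_addr)
{
    assert(is_aligned(mem_addr, 16));   /* models the 16-byte requirement */
    for (int i = 0; i < 2; i++)
        dst[i] = ((mask >> i) & 1u) ? mem_addr[i] : src[i];
}
```

In C11, a buffer declared as _Alignas(16) double mem[2] satisfies the alignment check in this sketch.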

_mm_maskz_load_pd

__m128d _mm_maskz_load_pd(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovapd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
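
For comparison with the writemask form, a zeromask load can be sketched the same way. The name maskz_load_pd_model is hypothetical; the only behavioural difference in this model is that unselected elements become 0.0 instead of being copied from a src operand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit zero-masked aligned load. */
static void maskz_load_pd_model(double dst[2], unsigned mask,
                                const double *mem_addr)
{
    assert(((uintptr_t)mem_addr % 16) == 0);  /* models the 16-byte requirement */
    for (int i = 0; i < 2; i++)
        dst[i] = ((mask >> i) & 1u) ? mem_addr[i] : 0.0;
}
```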

_mm256_mask_load_pd

__m256d _mm256_mask_load_pd(__m256d src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovapd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm256_maskz_load_pd

__m256d _mm256_maskz_load_pd(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovapd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm_mask_load_ps

__m128 _mm_mask_load_ps(__m128 src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovaps

Load packed single-precision (32-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

_mm_maskz_load_ps

__m128 _mm_maskz_load_ps(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovaps

Load packed single-precision (32-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

_mm256_mask_load_ps

__m256 _mm256_mask_load_ps(__m256 src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovaps

Load packed single-precision (32-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm256_maskz_load_ps

__m256 _mm256_maskz_load_ps(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovaps

Load packed single-precision (32-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm_mask_loadu_pd

__m128d _mm_mask_loadu_pd(__m128d src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovupd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_maskz_loadu_pd

__m128d _mm_maskz_loadu_pd(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovupd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_mask_loadu_pd

__m256d _mm256_mask_loadu_pd(__m256d src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovupd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_maskz_loadu_pd

__m256d _mm256_maskz_loadu_pd(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovupd

Load packed double-precision (64-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_mask_loadu_ps

__m128 _mm_mask_loadu_ps(__m128 src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovups

Load packed single-precision (32-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.
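
Because mem_addr may have any alignment here, a scalar sketch of this load has to avoid dereferencing a possibly misaligned float pointer. The hypothetical model below, mask_loadu_ps_model, copies each selected element with memcpy instead.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar model of a 128-bit merge-masked unaligned load. */
static void mask_loadu_ps_model(float dst[4], const float src[4],
                                unsigned mask, const void *mem_addr)
{
    assert(mem_addr != NULL);
    const unsigned char *bytes = (const unsigned char *)mem_addr;
    for (int i = 0; i < 4; i++) {
        if ((mask >> i) & 1u)
            /* memcpy is valid for any alignment of the source address */
            memcpy(&dst[i], bytes + i * sizeof(float), sizeof(float));
        else
            dst[i] = src[i];
    }
}
```

The model accepts, for example, an address one byte past the start of a byte buffer.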

_mm_maskz_loadu_ps

__m128 _mm_maskz_loadu_ps(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovups

Load packed single-precision (32-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_mask_loadu_ps

__m256 _mm256_mask_loadu_ps(__m256 src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovups

Load packed single-precision (32-bit) floating-point elements from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_maskz_loadu_ps

__m256 _mm256_maskz_loadu_ps(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovups

Load packed single-precision (32-bit) floating-point elements from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_mask_load_epi32

__m128i _mm_mask_load_epi32(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa32

Load packed 32-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

_mm_maskz_load_epi32

__m128i _mm_maskz_load_epi32(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa32

Load packed 32-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

_mm256_mask_load_epi32

__m256i _mm256_mask_load_epi32(__m256i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa32

Load packed 32-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm256_maskz_load_epi32

__m256i _mm256_maskz_load_epi32(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa32

Load packed 32-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm_mask_load_epi64

__m128i _mm_mask_load_epi64(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa64

Load packed 64-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

_mm_maskz_load_epi64

__m128i _mm_maskz_load_epi64(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa64

Load packed 64-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

_mm256_mask_load_epi64

__m256i _mm256_mask_load_epi64(__m256i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa64

Load packed 64-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm256_maskz_load_epi64

__m256i _mm256_maskz_load_epi64(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqa64

Load packed 64-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

_mm_mask_loadu_epi16

__m128i _mm_mask_loadu_epi16(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu16

Load packed 16-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_maskz_loadu_epi16

__m128i _mm_maskz_loadu_epi16(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu16

Load packed 16-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_mask_loadu_epi16

__m256i _mm256_mask_loadu_epi16(__m256i src, __mmask16 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu16

Load packed 16-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_maskz_loadu_epi16

__m256i _mm256_maskz_loadu_epi16(__mmask16 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu16

Load packed 16-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm512_mask_loadu_epi16

__m512i _mm512_mask_loadu_epi16(__m512i src, __mmask32 k, void const* mem_addr)

CPUID Flags: AVX512BW

Instruction(s): vmovdqu16

Load packed 16-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.
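
The mask types in these signatures (__mmask8, __mmask16, __mmask32) follow from the number of elements in the vector. The hypothetical helpers below sketch that relationship, assuming the mask type is the smallest of 8, 16, 32, or 64 bits that provides one bit per element.

```c
#include <assert.h>

/* Hypothetical helper: number of elements in a vector. */
static int lane_count(int vector_bits, int element_bits)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}

/* Hypothetical helper: smallest mask width (8, 16, 32, or 64 bits)
   that has at least one bit per element. */
static int mask_type_bits(int lanes)
{
    assert(lanes >= 1 && lanes <= 64);
    int bits = 8;
    while (bits < lanes)
        bits *= 2;
    return bits;
}
```

For a 512-bit vector of 16-bit integers, this gives 32 elements and a 32-bit mask, consistent with the __mmask32 parameter of _mm512_mask_loadu_epi16.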

_mm512_maskz_loadu_epi16

__m512i _mm512_maskz_loadu_epi16(__mmask32 k, void const* mem_addr)

CPUID Flags: AVX512BW

Instruction(s): vmovdqu16

Load packed 16-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_mask_loadu_epi32

__m128i _mm_mask_loadu_epi32(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu32

Load packed 32-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_maskz_loadu_epi32

__m128i _mm_maskz_loadu_epi32(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu32

Load packed 32-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_mask_loadu_epi32

__m256i _mm256_mask_loadu_epi32(__m256i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu32

Load packed 32-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_maskz_loadu_epi32

__m256i _mm256_maskz_loadu_epi32(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu32

Load packed 32-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_mask_loadu_epi64

__m128i _mm_mask_loadu_epi64(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu64

Load packed 64-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_maskz_loadu_epi64

__m128i _mm_maskz_loadu_epi64(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu64

Load packed 64-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_mask_loadu_epi64

__m256i _mm256_mask_loadu_epi64(__m256i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu64

Load packed 64-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_maskz_loadu_epi64

__m256i _mm256_maskz_loadu_epi64(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vmovdqu64

Load packed 64-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_mask_loadu_epi8

__m128i _mm_mask_loadu_epi8(__m128i src, __mmask16 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu8

Load packed 8-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_maskz_loadu_epi8

__m128i _mm_maskz_loadu_epi8(__mmask16 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu8

Load packed 8-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_mask_loadu_epi8

__m256i _mm256_mask_loadu_epi8(__m256i src, __mmask32 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu8

Load packed 8-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm256_maskz_loadu_epi8

__m256i _mm256_maskz_loadu_epi8(__mmask32 k, void const* mem_addr)

CPUID Flags: AVX512BW, AVX512VL

Instruction(s): vmovdqu8

Load packed 8-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm512_mask_loadu_epi8

__m512i _mm512_mask_loadu_epi8(__m512i src, __mmask64 k, void const* mem_addr)

CPUID Flags: AVX512BW

Instruction(s): vmovdqu8

Load packed 8-bit integers from memory into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.
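
With 64 elements, every bit of a 64-bit mask is significant. The scalar sketch below uses the hypothetical name mask_loadu_epi8_model and keeps the mask in a uint64_t so that bit 63 is tested correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a 512-bit merge-masked byte load. */
static void mask_loadu_epi8_model(int8_t dst[64], const int8_t src[64],
                                  uint64_t mask, const void *mem_addr)
{
    assert(mem_addr != NULL);
    const int8_t *mem = (const int8_t *)mem_addr;
    for (int i = 0; i < 64; i++)
        /* shift the 64-bit mask, not a 32-bit int, to reach bits 32..63 */
        dst[i] = ((mask >> i) & UINT64_C(1)) ? mem[i] : src[i];
}
```

When building such a mask in C, a constant like UINT64_C(1) << 63 is needed; shifting a plain int by 32 or more is undefined.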

_mm512_maskz_loadu_epi8

__m512i _mm512_maskz_loadu_epi8(__mmask64 k, void const* mem_addr)

CPUID Flags: AVX512BW

Instruction(s): vmovdqu8

Load packed 8-bit integers from memory into the return value using zeromask k (elements are zeroed out when the corresponding mask bit is not set). mem_addr does not need to be aligned on any particular boundary.

_mm_mask_expandloadu_epi32

__m128i _mm_mask_expandloadu_epi32(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandd

Load as many contiguous 32-bit integers from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm_maskz_expandloadu_epi32

__m128i _mm_maskz_expandloadu_epi32(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandd

Load as many contiguous 32-bit integers from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm256_mask_expandloadu_epi32

__m256i _mm256_mask_expandloadu_epi32(__m256i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandd

Load as many contiguous 32-bit integers from unaligned memory at mem_addr as there are ones in the low 8 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm256_maskz_expandloadu_epi32

__m256i _mm256_maskz_expandloadu_epi32(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandd

Load as many contiguous 32-bit integers from unaligned memory at mem_addr as there are ones in the low 8 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).
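
A typical use of the zeroing expand-load is to unpack a stream of densely stored values into fixed-size blocks. The scalar sketch below models that loop for 8-element blocks of 32-bit integers. The function expand_stream_epi32 and the one-mask-byte-per-block layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: expand a packed stream into blocks of eight
   32-bit integers, one mask byte per block, zeroing unselected elements.
   Returns the number of packed elements consumed. */
static size_t expand_stream_epi32(int32_t *out, const uint8_t *masks,
                                  size_t blocks, const int32_t *packed)
{
    assert(out != NULL && masks != NULL && packed != NULL);
    size_t next = 0;                       /* next unread packed element */
    for (size_t b = 0; b < blocks; b++) {
        for (int i = 0; i < 8; i++) {
            int set = (masks[b] >> i) & 1;
            out[b * 8 + i] = set ? packed[next++] : 0;
        }
    }
    return next;
}
```

Each block advances the read position by the number of set bits in its mask, which is how far a caller of the intrinsic would advance mem_addr between calls.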

_mm_mask_expandloadu_epi64

__m128i _mm_mask_expandloadu_epi64(__m128i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandq

Load as many contiguous 64-bit integers from unaligned memory at mem_addr as there are ones in the low 2 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm_maskz_expandloadu_epi64

__m128i _mm_maskz_expandloadu_epi64(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandq

Load as many contiguous 64-bit integers from unaligned memory at mem_addr as there are ones in the low 2 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm256_mask_expandloadu_epi64

__m256i _mm256_mask_expandloadu_epi64(__m256i src, __mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandq

Load as many contiguous 64-bit integers from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are copied from src when the corresponding mask bit is not set).

_mm256_maskz_expandloadu_epi64

__m256i _mm256_maskz_expandloadu_epi64(__mmask8 k, void const* mem_addr)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpexpandq

Load as many contiguous 64-bit integers from unaligned memory at mem_addr as there are ones in the low 4 bits of mask k, and place them in the result element positions corresponding to the positions of the ones in the mask (elements are zeroed out when the corresponding mask bit is not set).

_mm_mmask_i32gather_epi32

__m128i _mm_mmask_i32gather_epi32(__m128i src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherdd

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.
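
To make the role of scale concrete, the scalar sketch below gathers from the same table twice: once with element indices and scale 4, and once with byte offsets and scale 1. The name gather_i32_epi32_model is hypothetical, and the sketch assumes scale multiplies each index to give a byte offset from base_addr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 128-bit merge-masked gather of
   32-bit integers addressed by 32-bit indices. */
static void gather_i32_epi32_model(int32_t dst[4], const int32_t src[4],
                                   unsigned mask, const int32_t vindex[4],
                                   const void *base_addr, int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *base = (const unsigned char *)base_addr;
    for (int i = 0; i < 4; i++) {
        if ((mask >> i) & 1u)
            /* assumed addressing: byte offset = index * scale */
            memcpy(&dst[i], base + (ptrdiff_t)vindex[i] * scale, sizeof dst[i]);
        else
            dst[i] = src[i];
    }
}
```

In this model, indices {7, 0, 3, 5} with scale 4 select the same elements of an int32_t table as byte offsets {28, 0, 12, 20} with scale 1.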

_mm256_mmask_i32gather_epi32

__m256i _mm256_mmask_i32gather_epi32(__m256i src, __mmask8 k, __m256i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherdd

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm_mmask_i32gather_epi64

__m128i _mm_mmask_i32gather_epi64(__m128i src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherdq

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm256_mmask_i32gather_epi64

__m256i _mm256_mmask_i32gather_epi64(__m256i src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherdq

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm_mmask_i64gather_epi32

__m128i _mm_mmask_i64gather_epi32(__m128i src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherqd

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm256_mmask_i64gather_epi32

__m128i _mm256_mmask_i64gather_epi32(__m128i src, __mmask8 k, __m256i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherqd

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm_mmask_i64gather_epi64

__m128i _mm_mmask_i64gather_epi64(__m128i src, __mmask8 k, __m128i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherqq

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.

_mm256_mmask_i64gather_epi64

__m256i _mm256_mmask_i64gather_epi64(__m256i src, __mmask8 k, __m256i vindex, void const* base_addr, const int scale)

CPUID Flags: AVX512F, AVX512VL

Instruction(s): vpgatherqq

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into the return value using writemask k (elements are copied from src when the corresponding mask bit is not set). scale should be 1, 2, 4 or 8.
