image/svg+xml V4FMADDPS/V4FNMADDPS — Packed Single-Precision Floating-Point Fused Multiply-Add (4-iterations) Instruction Operand Encoding Description This instruction computes 4 sequential packed fused single-precision floating-point multiply-add instructions with a sequentially selected memory operand in each of the four steps. In the above box, the notation of “+3” is used to denote that the instruction accesses 4 source registers based on that operand; sources are consecutive, start in a multiple of 4 boundary, and contain the encoded register operand. This instruction supports memory fault suppression. The entire memory operand is loaded if any of the 16 lowest significant mask bits is set to 1 or if a “no masking” encoding is used. The tuple type Tuple1_4X implies that four 32-bit elements (16 bytes) are referenced by the memory operation portion of this instruction. Rounding is performed at every FMA (fused multiply and add) boundary. Exceptions are also taken sequentially. Pre- and post-computational exceptions of the first FMA take priority over the pre- and post-computational excep- tions of the second FMA, etc. Opcode/ Instruction Op/ En 64/32 bit Mode Support CPUID Feature Flag Description EVEX.512.F2.0F38.W0 9A /r V4FMADDPS zmm1{k1}{z}, zmm2+3, m128 AV/VAVX512_4FMAPSMultiply packed single-precision floating-point values from source register block indicated by zmm2 by values from m128 and accumulate the result in zmm1. EVEX.512.F2.0F38.W0 AA /r V4FNMADDPS zmm1{k1}{z}, zmm2+3, m128 AV/VAVX512_4FMAPSMultiply and negate packed single-precision floating-point values from source register block indicated by zmm2 by values from m128 and accumulate the result in zmm1. Op/EnTupleOperand 1Operand 2Operand 3Operand 4 ATuple1_4XModRM:reg (r, w)EVEX.vvvv (r)ModRM:r/m (r)NA image/svg+xml Operation src_reg_id is the 5 bit index of the vector register specified in the instruction as the src1 register. define NFMA_PS(kl, vl, dest, k1, msrc, regs_loaded, src_base, posneg): tmpdest := dest // reg[] is an array representing the SIMD register file. FOR j := 0 to regs_loaded-1: FOR i := 0 to kl-1: IF k1[i] or *no writemask*: IF posneg = 0: tmpdest.single[i] := RoundFPControl_MXCSR(tmpdest.single[i] - reg[src_base + j ].single[i] * msrc.single[j]) ELSE: tmpdest.single[i] := RoundFPControl_MXCSR(tmpdest.single[i] + reg[src_base + j ].single[i] * msrc.single[j]) ELSE IF *zeroing*: tmpdest.single[i] := 0 dest := tmpdst dest[MAX_VL-1:VL] := 0 V4FMADDPS and V4FNMADDPS dest{k1}, src1, msrc (AVX512) KL, VL = (16,512) regs_loaded := 4 src_base := src_reg_id & ~3 // for src1 operand posneg := 0 if negative form, 1 otherwise NFMA_PS(kl, vl, dest, k1, msrc, regs_loaded, src_base, posneg) Intel C/C++ Compiler Intrinsic Equivalent V4FMADDPS __m512 _mm512_4fmadd_ps( __m512, __m512x4, __m128 *); V4FMADDPS __m512 _mm512_mask_4fmadd_ps(__m512, __mmask16, __m512x4, __m128 *); V4FMADDPS __m512 _mm512_maskz_4fmadd_ps(__mmask16, __m512, __m512x4, __m128 *); V4FNMADDPS __m512 _mm512_4fnmadd_ps(__m512, __m512x4, __m128 *); V4FNMADDPS __m512 _mm512_mask_4fnmadd_ps(__m512, __mmask16, __m512x4, __m128 *); V4FNMADDPS __m512 _mm512_maskz_4fnmadd_ps(__mmask16, __m512, __m512x4, __m128 *); SIMD Floating-Point Exceptions Overflow, Underflow, Invalid, Precision, Denormal. Other Exceptions See Type E2; additionally #UDIf the EVEX broadcast bit is set to 1. #UDIf the MODRM.mod = 0b11. This UNOFFICIAL reference was generated from the official Intel® 64 and IA-32 Architectures Software Developer’s Manual by a dumb script. There is no guarantee that some parts aren't mangled or broken and is distributed WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE .