VDPBF16PS - Dot Product of BF16 Pairs Accumulated into Packed Single Precision

| Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description |
|---|---|---|---|---|
| EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst | A | V/V | AVX512VL AVX512_BF16 | Multiply BF16 pairs from xmm2 and xmm3/m128, and accumulate the resulting packed single precision results in xmm1 with writemask k1. |
| EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst | A | V/V | AVX512VL AVX512_BF16 | Multiply BF16 pairs from ymm2 and ymm3/m256, and accumulate the resulting packed single precision results in ymm1 with writemask k1. |
| EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst | A | V/V | AVX512F AVX512_BF16 | Multiply BF16 pairs from zmm2 and zmm3/m512, and accumulate the resulting packed single precision results in zmm1 with writemask k1. |

Instruction Operand Encoding

| Op/En | Tuple | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
|---|---|---|---|---|---|
| A | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA |

Description

This instruction performs a SIMD dot-product of two BF16 pairs and accumulates into a packed single precision register.

“Round to nearest even” rounding mode is used when doing each accumulation of the FMA. Output denormals are always flushed to zero and input denormals are always treated as zero. MXCSR is not consulted nor updated.

NaN propagation priorities are described in Table 5-1.

Table 5-1. NaN Propagation Priorities

| NaN Priority | Description | Comments |
|---|---|---|
| 1 | src1 low is NaN | Lower part has priority over upper part, i.e., it overrides the upper part. |
| 2 | src2 low is NaN | |
| 3 | src1 high is NaN | Upper part may be overridden if lower has NaN. |
| 4 | src2 high is NaN | |
| 5 | srcdest is NaN | Dest is propagated if no NaN is encountered by src2. |

Operation

    Define make_fp32(x):
        // The x parameter is bfloat16. Pack it into the upper 16b of a dword.
        // The bit pattern is a legal fp32 value. Return that bit pattern.
        dword := 0
        dword[31:16] := x
        RETURN dword
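The make_fp32 helper from the Operation section can be sketched in C as a shift plus a bit reinterpretation. This is a minimal illustration, not Intel's code: the function names `make_fp32_bits` and `bf16_to_float` are hypothetical, and the sketch assumes the C `float` type is IEEE-754 binary32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Place a bfloat16 bit pattern in the upper 16 bits of a 32-bit word,
   mirroring make_fp32 from the Operation section. The name is hypothetical. */
static uint32_t make_fp32_bits(uint16_t bf16)
{
    return (uint32_t)bf16 << 16;
}

/* Reinterpret the widened bit pattern as a C float.
   Assumption: float is IEEE-754 binary32 on the target platform. */
static float bf16_to_float(uint16_t bf16)
{
    uint32_t bits = make_fp32_bits(bf16);
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f); /* well-defined bit reinterpretation */
    return f;
}
```

For example, the bfloat16 pattern 0x3F80 widens to 0x3F800000, which is 1.0 in single precision. No rounding is involved, because bfloat16 shares the sign and exponent layout of fp32 and only drops the low 16 fraction bits.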

    VDPBF16PS srcdest, src1, src2
    VL = (128, 256, 512)
    KL = VL/32
    origdest := srcdest
    FOR i := 0 to KL-1:
        IF k1[ i ] or *no writemask*:
            IF src2 is memory and evex.b == 1:
                t := src2.dword[0]
            ELSE:
                t := src2.dword[ i ]
            // FP32 FMA with daz in, ftz out and RNE rounding. MXCSR neither consulted nor updated.
            srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+1]) * make_fp32(t.bfloat16[1])
            srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+0]) * make_fp32(t.bfloat16[0])
        ELSE IF *zeroing*:
            srcdest.dword[ i ] := 0
        ELSE: // merge masking, dest element unchanged
            srcdest.dword[ i ] := origdest.dword[ i ]
    srcdest[MAXVL-1:VL] := 0

Intel C/C++ Compiler Intrinsic Equivalent

    VDPBF16PS __m128 _mm_dpbf16_ps(__m128, __m128bh, __m128bh);
    VDPBF16PS __m128 _mm_mask_dpbf16_ps(__m128, __mmask8, __m128bh, __m128bh);
    VDPBF16PS __m128 _mm_maskz_dpbf16_ps(__mmask8, __m128, __m128bh, __m128bh);
    VDPBF16PS __m256 _mm256_dpbf16_ps(__m256, __m256bh, __m256bh);
    VDPBF16PS __m256 _mm256_mask_dpbf16_ps(__m256, __mmask8, __m256bh, __m256bh);
    VDPBF16PS __m256 _mm256_maskz_dpbf16_ps(__mmask8, __m256, __m256bh, __m256bh);
    VDPBF16PS __m512 _mm512_dpbf16_ps(__m512, __m512bh, __m512bh);
    VDPBF16PS __m512 _mm512_mask_dpbf16_ps(__m512, __mmask16, __m512bh, __m512bh);
    VDPBF16PS __m512 _mm512_maskz_dpbf16_ps(__mmask16, __m512, __m512bh, __m512bh);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Table 2-49, “Type E4 Class Exception Conditions”.
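The per-element loop in the Operation section can be approximated in scalar C, which may help when checking results on a machine without AVX512_BF16. The sketch below is a simplified model under stated assumptions, not a bit-exact emulation: the function `dpbf16ps_model` and its parameters are hypothetical, a plain multiply-add stands in for the FMA (the product of two bfloat16 values fits in fp32 exactly, except near the limits of the exponent range), denormals are flushed by hand, the accumulator is also treated as a DAZ input, and neither the NaN priorities of Table 5-1 nor the m32bcst broadcast form is modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static float bits_to_float(uint32_t b) { float f; memcpy(&f, &b, sizeof f); return f; }
static uint32_t float_to_bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }

/* Replace a denormal (zero exponent field) with a zero of the same sign.
   Used here to approximate both "daz in" and "ftz out". */
static float flush_denormal(float f)
{
    uint32_t b = float_to_bits(f);
    if ((b & 0x7F800000u) == 0)
        b &= 0x80000000u;
    return bits_to_float(b);
}

/* make_fp32 followed by the input denormal flush. */
static float bf16_in(uint16_t x)
{
    return flush_denormal(bits_to_float((uint32_t)x << 16));
}

/* Hypothetical scalar model of VDPBF16PS for kl fp32 elements (4, 8 or 16).
   src1 and src2 each hold 2*kl bfloat16 bit patterns; k1 is the writemask
   (pass 0xFFFF for "no writemask"); zeroing selects {z} over merge masking. */
static void dpbf16ps_model(float *srcdest, const uint16_t *src1,
                           const uint16_t *src2, int kl,
                           uint16_t k1, int zeroing)
{
    assert(kl == 4 || kl == 8 || kl == 16);
    for (int i = 0; i < kl; i++) {
        if ((k1 >> i) & 1) {
            /* Assumption: the accumulator is a DAZ input like the sources. */
            float acc = flush_denormal(srcdest[i]);
            /* Upper pair first, then lower pair, as in the pseudocode. */
            acc = flush_denormal(acc + bf16_in(src1[2 * i + 1]) * bf16_in(src2[2 * i + 1]));
            acc = flush_denormal(acc + bf16_in(src1[2 * i + 0]) * bf16_in(src2[2 * i + 0]));
            srcdest[i] = acc;
        } else if (zeroing) {
            srcdest[i] = 0.0f;
        }
        /* Merge masking: element left unchanged. */
    }
}
```

As a usage example, with every src1 element set to 0x4000 (2.0) and every src2 element set to 0x4040 (3.0), each unmasked destination element grows by 2*3 + 2*3 = 12. The sketch relies on the default round-to-nearest-even mode of the C environment, so it should not be run after changing the rounding mode.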

This UNOFFICIAL reference was generated from the official Intel® 64 and IA-32 Architectures Software Developer’s Manual by a dumb script. There is no guarantee that some parts aren't mangled or broken, and it is distributed WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.