Newest questions tagged [fma]

How should I implement a generic FMA/FMAF instruction in software?

FMA is a fused multiply-add instruction. The fmaf (float x, float y, float z) function in glibc calls the vfmadd213ss instruction. I want to know how ...

Fastest way to multiply and sum/add two arrays (dot product) - unaligned surprisingly faster than FMA

Hi, I have the following code: public unsafe class MultiplyAndAdd : IDisposable { float[] rawFirstData = new float[1024]; float[] rawSecondDat ...

Terminology: why "floating multiply-add" instead of "fused multiply-add"?

C11 (and newer): 7.12.13 Floating multiply-add. IEEE 754-2008: fused multiply add, fusedMultiplyAdd. Wikipedia: fused multiply-add ...

How to find magic multipliers for divisions by constant on a GPU?

I was looking at implementing the following computation, where divisor is nonzero and not a power of two: unsigned multiplier(unsigned divisor) { ...

CUDA half float operations without explicit intrinsics

I am using CUDA 11.2 and I use the __half type to do operations on 16-bit floating-point values. I am surprised that the nvcc compiler will not prope ...

incompatible types when assigning to type ‘__m256d’ from type ‘int’

I'm working on a project to optimize Matrix Multiplication and I'm trying to use intrinsics. Here's a bit of the code I'm using : All the lines us ...

How to refine floating-point division on FMA-capable GPUs?

When writing computational code for GPUs using APIs where compute shaders are translated via SPIR-V (in particular, Vulkan), I am guaranteed that ULP ...

GCC inclusion of AVX512's “Fused Multiply Add” instructions when compiling for Cascade-Lake processors

According to gcc's documentation, compiling with "-march=cascadelake" does not enable the flag -AVX512IFMA (which, if I understand correctly, enables su ...

More aggressive optimization for FMA operations

I want to build a datatype that represents multiple (say N) arithmetic types and provides the same interface as an arithmetic type using operator over ...

How to disable fma3 instructions in gcc

I need to disable FMA3 instructions on a 64-bit system (for backward-compatibility reasons). I've used _set_FMA3_enable(0) in my Windows environment. A ...

How advantageous is using fused multiply-accumulate for double-precision?

I am trying to understand whether it is advantageous to use std::fma with double arguments by looking at the assembly code that is generated. I am using the fl ...

Using FMA instructions for an FFT algorithm

I have a bit of C++ code that has become a somewhat useful FFT library over time, and it has been made to run decently fast using SSE and AVX instruct ...

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports AVX2 intrinsics (Intel(R ...

Difference between FMA and naive a*b+c?

In the BSD Library Functions Manual of FMA(3), it says "These functions compute x * y + z." So what's the difference between FMA and a naive code wh ...
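
One way to see the difference this excerpt asks about is an input where rounding the intermediate product loses the answer entirely. The following is a minimal C++ sketch, not code from the question: the names `mul_then_add` and `fused` are hypothetical, and it assumes IEEE 754 double arithmetic with no extended intermediate precision.

```cpp
#include <cmath>

// Unfused: the product a*b is rounded to double before c is added.
// Storing it in a volatile keeps the compiler from contracting the
// expression into a single FMA on targets that have one.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused: a*b + c is evaluated as if exactly, then rounded once.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 in double. `mul_then_add` therefore returns 0, while `fused` returns -2^-54.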

How to use fused multiply and add in AVX for 16-bit packed integers

I know it is possible to do multiply-and-add using a single instruction in AVX2. I want to use a multiply-and-add instruction where each 256-bit A ...

How to solve “illegal instruction” for vfmadd213ps?

I tried AVX intrinsics, but they caused "Unhandled exception at 0x00E01555 in test.exe: 0xC000001D: Illegal Instruction." I used Visual Studio 201 ...

Is there a way to use OpenCL C mad function in Vulkan SPIR-V?

As we know, there are at least two ways to calculate a * b + c: (1) ret := a*b; ret := ret + c; (2) ret := fma(a, b, c); But in OpenCL C, there's a thir ...

clang/gcc only generates fma with -ffast-math; why?

On icc 19, a dot product compiles down to a loop over an fma instruction. On clang and gcc, the fma is only generated with -ffast-math. However, -ffa ...
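
A dot-product loop like the one this excerpt describes can also request fusion explicitly rather than depend on compiler flags. This is a minimal scalar sketch under that assumption; `dot_fma` is a hypothetical name, not code from the question. On targets without a hardware FMA instruction, `std::fma` may fall back to a slower library routine.

```cpp
#include <cmath>
#include <cstddef>

// Accumulates x[i] * y[i] into acc with one rounding per step,
// independent of whether the compiler would contract a*b + c itself.
float dot_fma(const float* x, const float* y, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc = std::fma(x[i], y[i], acc);
    }
    return acc;
}
```

Because each step depends on the previous accumulator value, this form is a serial chain; it shows the semantics, not a tuned kernel.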

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here: https://software.intel.com/sites/landingpage/Intrin ...

Throughput of FMA and multiplication on x86 Broadwell

I suspect that the latest Intel architecture executes the MUL mnemonic like an FMA with a null addition (on the Broadwell architecture). In detail, I am