Newest questions tagged [neon]

Is there a way to treat the register file as an array in ARMv8 (scalar or Neon)?

Suppose I have a short array v of say 8 int64_t. I have an algorithm that needs to access different elements of that array, which are not compile-time ...

Fastest way to search an array on M1 Mac

I am trying to load an array of u16s from memory and find the first element that is less than some number, as fast as possible on an M1 Mac. I have be ...

Looking for performance improvement of NEON code to match clipping area on the screen

Here is my test code to find the first clipping area on the screen. The code contains two subroutines and dummy loops to compare their performance. point_i ...

Bit-pack an ASCII string into a 7-bit binary blob using ARMv8 Neon SIMD

Following my x86 question, I would like to know how to efficiently vectorize the following code on ARMv8: static inline uint64_t Co ...

Detailed documentation on arm intrinsics support versions

I am trying to build an infrastructure (and database) so that people can detect the available SIMD intrinsics without connecting to the actual hardwar ...

Convert vector compare mask into bit mask in AArch64 SIMD or ARM NEON?

Let's take the example of "ABAA". I can use result = vceqq_u8(input, vdupq_n_u8('A')) to get FF 00 FF FF (or 0xFFFF00FF). Sometimes I only need to kno ...

Cycle count neon for M2?

Is there a resource on how many cycles SIMD instructions take on Apple M1/M2, like https://uops.info/table.html or Agner Fog's tables for x86? I wish I could give a bigger bounty ...

Semantics of the VMLA ARM instruction

Am I right in saying that the VMLA.F32 instruction is fully equivalent to an F32 multiplication (complete with rounding step) followed by an F32 additio ...

ARM v7: SIMD lookup table for 32-bit floats

I have a vector of float32 numbers. For each element I have to find cos and sin. I want to use a lookup table instead of the default library. Is there an ...

How to swap the byte order for individual words in a vector in ARM/ACLE

I usually write portable C code and try to adhere to a strictly standard-conforming subset of the features supported by compilers. However, I'm writing ...

How to properly do multiply-accumulate with NEON intrinsics

I need to do a simple multiply-accumulate of two signed 8-bit arrays. This routine runs every millisecond on an ARM7 embedded device. I am trying to ...

Do these aarch64 intrinsics have alignment requirements?

Do the core::arch::aarch64 functions vld1q_u8 and vst1q_u8 have any alignment requirements? The documentation doesn't mention any, but the documentati ...

Memory copying: ARM STM vs. ARM NEON

I need to copy large amounts of memory (on the order of 47k), for example from a USB buffer to a more permanent buffer. This is using an ARM Cortex A8. ...

What is the difference between sse2neon and arm_neon.h?

I am trying to build software to run on AWS Graviton3. To get the most out of the performance, AWS advises using sse2neon to port code with SSE intri ...

GCC flag for emulating floating point operations in software on ARMv8 platform with neon FPU

Background: I am trying to compile and run this on a Raspberry Pi 4 Model B Rev 1.4 running Ubuntu 20.04 LTS aarch64. The output of the lscpu command ...

Efficiently creating a list of pointers to a character in a buffer using ARM NEON SIMD

I've been rewriting some performance-sensitive parts of my code in AArch64 NEON. For some things, like population count, I've managed to get a 12x spe ...

Why are ARM NEON intrinsics slower than C++ on a simple vector multiplication task?

I am trying to multiply the data behind two float pointers and store the result through a third pointer. Here is the C++ code: Optimized with NEON intrinsics: ...

Software optimization guide for AArch64 Neon and SVE

There is an ARM software optimization guide (e.g., https://developer.arm.com/documentation/swog309707/latest for Neoverse N1). This guide doesn't seem t ...

Neon: Perform vector multiplication with a scalar value

Hi, I am new to Neon programming. I am looking for vector multiplication with a scalar value. For adding two vectors, I was able to do it using the following co ...

What kind of assembly instruction is this ld1 {v4.16b - v7.16b}, [x10]?

The assembly instruction below is AArch64 NEON / ASIMD code. I found some related pages about the ld1 instruction, but there are no reference ...