Newest questions tagged [memory-bandwidth]

Analysing performance of transpose function

I've written naive and "optimized" transpose functions for order-3 tensors containing double-precision complex numbers and I would like to analyz ...

Question about bandwidth ceilings in roofline models

I don't quite understand the bandwidth factor in roofline models described in Wikipedia (like the pic and its caption shown below): why the inters ...
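For reference, the basic roofline bound that this question refers to can be sketched as below. This is a minimal illustration, not taken from the question: the function names and the machine numbers (100 GFLOP/s peak, 25 GB/s bandwidth) are hypothetical. Attainable performance is the lower of the compute roof and the bandwidth roof, and the two roofs meet at the ridge point.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) at which the two roofs intersect."""
    return peak_gflops / bandwidth_gb_s


# Hypothetical machine: 100 GFLOP/s peak compute, 25 GB/s memory bandwidth.
PEAK_GFLOPS = 100.0
BANDWIDTH_GB_S = 25.0

if __name__ == "__main__":
    for intensity in (0.5, 2.0, 4.0, 8.0):
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
        print(f"intensity {intensity} FLOP/byte -> at most {bound} GFLOP/s")
```

Below the ridge point a kernel is bandwidth-bound, so lowering the bandwidth ceiling (for example, from cache bandwidth to DRAM bandwidth) moves the sloped part of the roof down and the ridge point to the right.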

Can hardware threads access main memory at the same time?

I am trying to understand microarchitecture. When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), ca ...

Can all of L2/L3 cache be used by data? If so, why does the Graviton 3 bandwidth plot drop off after half the L2/L3 size, but only gradually?

Consider Graviton3, for example. It's a 64-core CPU with per-core caches of 64 KiB L1d and 1 MiB L2, and a shared L3 of 64 MiB across all cores. The RAM ban ...

How to calculate the L3 cache bandwidth using Linux performance counters?

I am trying to use Linux perf to profile the L3 cache bandwidth for a Python script. I see that there are no available commands to measure that direct ...

Measure memory write bandwidth using C

I'm trying to measure the write bandwidth of my memory. I created an 8G char array and call memset on it with 128 threads. Below is the code snippet. ...

C++ Optimize Memory Read Speed

I'm creating an int (32 bit) vector with 1024 * 1024 * 1024 elements like so: which holds 4 GB of random data at that point. And then I'm simply su ...

Why accessing an array of int8_t is not faster than int32_t, due to cache?

I have read that when accessing with a stride both loops should perform similarly, as memory accesses are in a higher order than multiplication. I ...

Is memory a bottleneck in matrix addition (SIMD Instructions)?

I'm trying to optimize 2d matrix addition in C using SIMD instructions (_mm256_add_pd, store, load, etc.). However, I'm not seeing a large speedup at ...

OpenCL Memory Bandwidth/Coalescing

Summary: I'm trying to write a memory bound OpenCL program that comes close to the advertised memory bandwidth on my GPU. In reality I'm off by a fac ...

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible. Exa ...

Does NUMA impact memory bandwidth, or just latency?

I have a problem that is memory bandwidth limited -- I need to read a lot (many GB) of data sequentially from RAM, do some quick processing and write ...

In this NVIDIA blog post, why was copying faster via shared memory?

A few years back, NVIDIA's Mark Harris posted this: An Efficient Matrix Transpose in CUDA C/C++ in which he described how to perform matrix transpos ...

Why is the performance gain of C# SIMD lower with larger arrays than with tiny arrays?

I've been working on writing a deep learning library on my own. In matrix operations, getting the best performance is key for me. I've been research ...

MOVSD performance depends on arguments

I just noticed that pieces of my code exhibit different performance when copying memory. A test showed that memory copying performance degraded if the ...

Random memory write is slower than random memory read?

I'm trying to figure out memory access time of sequential/random memory read/write. Here's the code: #include <assert.h> #include <stdio.h> ...

memory bandwidth for many channels x86 systems

I'm testing the memory bandwidth on a desktop and a server. The peak bandwidth of the system is I'm using my own triad function from STREAM to m ...
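As a rough aid for this kind of measurement, the byte accounting behind a STREAM-style triad (`a[i] = b[i] + q * c[i]`) can be sketched as below. This is an illustrative helper, not the asker's code: the function names, the array length, and the timing value are hypothetical. STREAM counts two reads and one write per element, so it reports 3 × element size × N bytes per pass.

```python
def triad_bytes(n_elements, elem_size=8):
    """Bytes counted for one triad pass: read b, read c, write a."""
    return 3 * elem_size * n_elements


def triad_bandwidth_gb_s(n_elements, seconds, elem_size=8):
    """Reported bandwidth in GB/s (10^9 bytes/s) for one timed triad pass."""
    return triad_bytes(n_elements, elem_size) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: 10 million doubles per array, 12 ms for one pass.
    n = 10_000_000
    elapsed = 0.012
    print(f"{triad_bytes(n)} bytes counted")
    print(f"{triad_bandwidth_gb_s(n, elapsed):.1f} GB/s reported")
```

This counts only the traffic the benchmark reports. On hardware that uses write-allocate for ordinary stores, the destination array may also be read before it is written, so the actual DRAM traffic can be higher than the counted figure.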

What does STREAM memory bandwidth benchmark really measure?

I have a few questions on the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules). Below is the comment from stream.c. What is the ...

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better) when the val ...

Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level). ...