Memory Bandwidth Napkin Math


31 bookmarks. First posted by cpswan 16 days ago.


Nice blog post on CPU memory bandwidth, comparing sequential and random access, by
from twitter_favs
6 days ago by demon386
Reading random values from RAM is slow, 0.46 GB/s. Reading random values from L1 cache is very very fast, 13 GB/s. This is faster than the 11 GB/s performance we got sequentially reading int32 from RAM.
...
Random access from RAM is slow. Catastrophically slow. Less than 1 GB/s slow for int32. Random access from the cache is remarkably quick. It's comparable to sequential RAM performance.
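The sequential-versus-random gap quoted above can be reproduced with a small experiment. What follows is only a minimal sketch, not the post's actual benchmark: `sum_in_order` and `make_indices` are hypothetical helper names, the fixed seed is arbitrary, and the index array is itself read sequentially, so the reported GB/s is an approximation that counts only the int32 payload.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: builds the index list 0..n-1, optionally shuffled.
// Indices are 32-bit, so n must fit in a uint32_t.
std::vector<std::uint32_t> make_indices(std::size_t n, bool shuffled) {
    assert(n <= std::numeric_limits<std::uint32_t>::max());
    std::vector<std::uint32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0u);
    if (shuffled) {
        std::mt19937 rng(42);  // fixed seed (assumption) for repeatable runs
        std::shuffle(idx.begin(), idx.end(), rng);
    }
    return idx;
}

// Hypothetical helper: visits `data` in the order given by `indices`,
// returns the sum, and writes the measured payload throughput to `gb_per_s`.
// Every index must be smaller than data.size().
std::int64_t sum_in_order(const std::vector<std::int32_t>& data,
                          const std::vector<std::uint32_t>& indices,
                          double& gb_per_s) {
    const auto start = std::chrono::steady_clock::now();
    std::int64_t sum = 0;
    for (std::uint32_t i : indices) {
        sum += data[i];
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double bytes =
        static_cast<double>(indices.size()) * sizeof(std::int32_t);
    gb_per_s = bytes / elapsed.count() / 1e9;  // 1 GB taken as 1e9 bytes
    return sum;
}
```

To see the effect, call `sum_in_order` twice on the same buffer, once with `make_indices(n, false)` and once with `make_indices(n, true)`. A buffer far larger than the last-level cache approximates the "large block" case, and a 16 KB buffer approximates the L1 case. Build with optimizations and repeat each measurement several times, since a single pass over a small buffer is too short to time reliably.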

Single Threaded Comparison
Let this sink in. Random access into the cache has comparable performance to sequential access from RAM. The drop off from sub-L1 16 KB to L2-sized 256 KB is 2x or less.

I think this has profound implications.

Linked Lists Considered Harmful

Pointer chasing is bad. Really, really bad. Just how bad is it? I made an extra test that wraps matrix4x4 in std::unique_ptr. Each access has to go through a pointer. Here's the terrible, horrible, no good, very bad result.

1 Thread           | matrix4x4  | unique_ptr | diff |
-------------------|------------|------------|------|
Large Block - Seq  | 14.8 GB/s  | 0.8 GB/s   | 19x  |
16 KB - Seq        | 31.6 GB/s  | 2.2 GB/s   | 14x  |
256 KB - Seq       | 22.2 GB/s  | 1.9 GB/s   | 12x  |
Large Block - Rand | 2.2 GB/s   | 0.1 GB/s   | 22x  |
16 KB - Rand       | 23.2 GB/s  | 1.7 GB/s   | 14x  |
256 KB - Rand      | 15.2 GB/s  | 0.8 GB/s   | 19x  |

6 Threads          | matrix4x4  | unique_ptr | diff |
-------------------|------------|------------|------|
Large Block - Seq  | 34.4 GB/s  | 2.5 GB/s   | 14x  |
16 KB - Seq        | 154.8 GB/s | 8.0 GB/s   | 19x  |
256 KB - Seq       | 111.6 GB/s | 5.7 GB/s   | 20x  |
Large Block - Rand | 7.1 GB/s   | 0.4 GB/s   | 18x  |
16 KB - Rand       | 95.0 GB/s  | 7.8 GB/s   | 12x  |
256 KB - Rand      | 58.3 GB/s  | 1.6 GB/s   | 36x  |

Sequentially summing values behind a pointer runs at less than 1 GB/s. Random access, which misses the cache twice, runs at just 0.1 GB/s.

Pointer chasing is 10 to 20 times slower. Friends don't let friends use linked lists. Please, think of the ~~children~~ cache.
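The two layouts compared in the tables can be sketched as follows. This is an illustration under assumptions, not the post's code: `Matrix4x4` is taken to be sixteen floats (64 bytes), and `sum_inline`, `sum_boxed`, and `box_all` are hypothetical names. Note that `box_all` allocates back to back, so the heap blocks may land close together; a long-running program with a fragmented heap would likely scatter them further.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Assumed stand-in for the post's matrix4x4: 16 floats, 64 bytes.
struct Matrix4x4 {
    float m[16];
};
static_assert(sizeof(Matrix4x4) == 64, "sketch assumes a 64-byte matrix");

// Adds up all 16 elements so the whole 64-byte object is touched.
inline double sum_elements(const Matrix4x4& mat) {
    double s = 0.0;
    for (float x : mat.m) {
        s += x;
    }
    return s;
}

// Contiguous layout: the matrices sit back to back in one allocation.
double sum_inline(const std::vector<Matrix4x4>& items) {
    double total = 0.0;
    for (const Matrix4x4& mat : items) {
        total += sum_elements(mat);
    }
    return total;
}

// Boxed layout: each access first loads a pointer, then follows it
// to a separately allocated matrix somewhere on the heap.
double sum_boxed(const std::vector<std::unique_ptr<Matrix4x4>>& items) {
    double total = 0.0;
    for (const std::unique_ptr<Matrix4x4>& ptr : items) {
        total += sum_elements(*ptr);
    }
    return total;
}

// Copies every matrix into its own heap allocation.
std::vector<std::unique_ptr<Matrix4x4>> box_all(
    const std::vector<Matrix4x4>& items) {
    std::vector<std::unique_ptr<Matrix4x4>> boxed;
    boxed.reserve(items.size());
    for (const Matrix4x4& mat : items) {
        boxed.push_back(std::make_unique<Matrix4x4>(mat));
    }
    return boxed;
}
```

Both functions return the same sum; only the memory traffic differs. Timing `sum_inline` against `sum_boxed` on the same data is one way to observe the kind of gap the tables report, though the exact ratio will depend on the machine and allocator.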
hardware  intel  memory  benchmark  performance  test  cpu 
13 days ago by some_hren
Memory Bandwidth Napkin Math
fb 
13 days ago by jubois
An exploration into C++ memory throughput performance.
hardware  memory  benchmark  cpu 
14 days ago by krzak
An exploration into C++ memory throughput performance.
bandwidth  benchmark  cpu  hardware  memory  benchmarking  cache  perf  performance  scale 
15 days ago by xer0x
Conclusion

A co-worker shared a way of thinking about programming problems I hadn't considered. That sent me on a journey to explore modern memory performance.

For the purpose of napkin math, here are some ballpark figures for a modern desktop PC.

RAM Performance
Upper Limit: 45 GB/s
Napkin Estimate: 5 GB/s
Lower Limit: 1 GB/s
Cache Performance — L1/L2/L3 (per core)
Upper Limit (w/ simd): 210 GB/s / 80 GB/s / 60 GB/s
Napkin Estimate: 25 GB/s / 15 GB/s / 9 GB/s
Lower Limit: 13 GB/s / 8 GB/s / 3.5 GB/s
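As a worked example of using these figures, the RAM numbers can be turned into rough time estimates. This is a minimal sketch: the constant names and `seconds_to_read` are hypothetical, and 1 GB is assumed to mean 1e9 bytes.

```cpp
#include <cstdint>

// Ballpark RAM figures from the list above, in GB/s.
// Assumption: 1 GB = 1e9 bytes.
constexpr double kRamUpperGBs = 45.0;
constexpr double kRamNapkinGBs = 5.0;
constexpr double kRamLowerGBs = 1.0;

// Estimated seconds needed to read `bytes` at a rate of `gb_per_s`.
constexpr double seconds_to_read(std::uint64_t bytes, double gb_per_s) {
    return static_cast<double>(bytes) / (gb_per_s * 1e9);
}
```

For a 1 GB working set, `seconds_to_read(1'000'000'000, kRamNapkinGBs)` comes to about 0.2 seconds, the lower limit gives about 1 second, and the upper limit gives roughly 0.02 seconds. The spread between the best and worst case is what makes the access pattern worth estimating up front.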
CPU  hardware  memory  performance 
15 days ago by euler
Memory Bandwidth Napkin Math
from twitter_favs
15 days ago by randallr
An exploration into C++ memory throughput performance.
perf 
15 days ago by nham
RT : Memory Bandwidth Napkin Math
from twitter
15 days ago by mht