14 Comments
Apr 29 · edited Apr 29 · Liked by Horace He

Benchmark performance by input type chart for H100, H200, MI300X pls. Make it something I gotta pay for lol. It would be cool to see FLOPS/Watt across all that hardware too, cause right now all we see is some random MFU across a whole model, which isn't apples to apples, or theoretical peaks.

Apr 30 · Liked by Horace He

I also second this request… though my expectations are:

For MI300X, you can almost guarantee it’ll be worse flops/W, due to it using a slightly worse process (5nm vs 4nm) and, more importantly, the crazy use of chiplets.

H200 will be similar to the H100 for such a test, where BW doesn’t matter.


I don't think the MI300X has a worse process node; that's just marketing from Nvidia. The H100 uses a customized version of TSMC N5P, which they branded as 4N. AMD uses a customized version of N5P for the GPU compute dies, stacked on top of an N6-based active interposer. The chiplets/packaging may lead to worse efficiency, as you stated.

As far as H200 vs H100, I heard there was a new stepping, but I was unable to verify whether that's true, so it may be slightly more efficient. Same with AMD's MI350X: there is a new stepping there (which I was able to verify), so it is more efficient than their MI300X.

Apr 30 · Liked by Horace He

Very nice article, Horace.

It’s something many of us have long suspected, but this is the first time I am seeing actual experimental results showing that peak flops are really just pie in the sky, even if memory bandwidth were infinite.

author

Thanks! Indeed, I was surprised how constrained modern GPUs are by power.


Thanks, this is interesting. You haven't tested all the way to 700 Watts, though: how do you know that it won't hit the 1.83 GHz clock frequency?

author

Haha, that's a fair question - I knew it from previous testing. I'll update the post with results on a proper H100 soon.
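If anyone wants to check on their own card, here's a minimal sketch (assuming PyTorch and nvidia-ml-py are installed; the matrix size and iteration counts are arbitrary) that watches the sustained SM clock under a matmul load:

```python
import torch
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Keep the tensor cores busy with a large bf16 matmul.
a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
for i in range(500):
    a @ b
    if i % 100 == 0:
        torch.cuda.synchronize()  # make sure the queued kernels actually ran
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        print(f"iter {i}: SM clock = {sm_mhz} MHz")
pynvml.nvmlShutdown()
```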

May 1 · Liked by Horace He

Horace, great article!

Power has two elements for flops/circuits, like you pointed out: static power and dynamic power. Dynamic power = C·V²·f, where V is the supply voltage, which is also changed along with clock frequency to reduce power consumption. You stated the voltage dependence, but I wanted to highlight that it has more power over power than clock frequency does. Pun intended. I would expect Nvidia to play with voltage too, since major hardware engineers from Transmeta, who invented the LongRun and LongRun2 technology, joined their team 24 years ago. LongRun basically throttles the clock frequency along with the core voltage (the supply voltage for the circuits). The reason to point this out: when the power is throttled to 100 Watts, the performance drop could be because clock frequency throttling is happening at that point, while until then it was voltage scaling. Do we have any evidence of which kicks in first?

Secondly, static power = leakage power, as you note in the article. At slower clock speeds, the balance of power shifts to static power (as dynamic power becomes a smaller percentage of the total), so it doesn't matter how much you are switching, or what.

The anomaly I see is between the all-zero and all-one input multiplications. To have all zeros at the input of the multiplier, there are other circuits - almost half of the total, I would expect - being held at 1, and vice versa for all 1's at the input. So the difference is because of either: (1) the threshold of all cells is not at the center of the supply rails (Vdd and ground, i.e. 0 Volts) - which it cannot be precisely, since I suspect they are changing the supply voltage dynamically. (The threshold is the voltage level above which an input is construed as 1 and below which it is construed as 0.) Or (2) there is a particular circuit in the matmul pathway that was computationally burdensome - often latency-critical - and was skewed to favor a 1 or 0 input to have better overall performance.

[Total power = static power + C·V²·f × activity factor (how often each bit switches per clock cycle)]
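To make the magnitudes concrete, here's a toy Python sketch of that model; all the constants are made up for illustration, but it shows why voltage dominates - frequency-only scaling drops dynamic power linearly, while DVFS (voltage scaled with frequency) drops it roughly cubically:

```python
# Toy model of P = P_static + C * V^2 * f * alpha.
# All constants are invented for illustration only.
P_STATIC = 80.0    # W, leakage
C_ALPHA = 1.0e-7   # lumped capacitance * activity factor

def power(v, f_hz):
    return P_STATIC + C_ALPHA * v**2 * f_hz

V0, F0 = 0.9, 1.83e9  # nominal voltage (V) and clock (Hz)

print(f"nominal:       {power(V0, F0):7.1f} W")
print(f"half f only:   {power(V0, F0 / 2):7.1f} W")      # dynamic part halves
print(f"half f + DVFS: {power(V0 / 2, F0 / 2):7.1f} W")  # dynamic part / 8
```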

author

> when the power is throttled to 100 Watts, the performance drop could be because clock frequency throttling is happening at that point, while until then it was voltage scaling

This is a good point - and probably explains why I see the reversal of the trend at 100W. Will add to the article.

> The anomaly I see is between all zero and all one input multiplications.

Is your question why there's a performance discrepancy between all-zero and all-one input multiplications?


Re: article change, that will be awesome.

Re: the discrepancy between all-zero and all-one inputs: yes. If you have independent voltage control and clock control, I can suggest experiments that can suss out which of my guesses as to why this happens may be true; see the sketch below. Thanks for the article once again.
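For example, one direction (a sketch assuming nvidia-ml-py and root access; the clock and power values are placeholders): pin the SM clock, then sweep the power limit, so that any remaining performance change has to come from voltage rather than frequency:

```python
import pynvml  # pip install nvidia-ml-py; setting clocks/limits needs root

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Pin the SM clock so frequency throttling is taken out of the picture.
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1400, 1400)  # MHz, placeholder

for limit_w in (700, 500, 300, 100):  # placeholder sweep, in Watts
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit_w * 1000)  # mW
    # ... run the matmul benchmark here and record achieved flops ...

pynvml.nvmlDeviceResetGpuLockedClocks(handle)
pynvml.nvmlShutdown()
```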


Fundamentally, minimising load operations would save power and speed up the process. But how can you tell that energy throttling is what's causing the performance decrease, rather than caching, which would also result in the same behavior?

author

The fact that `randn` results in worse performance than `rand` seemed pretty indicative to me that the primary effect is power and not some kind of caching effect. Just like `randn`, `rand` has fully unique elements at all points. The only difference is that `randn` includes both positive and negative values, while `rand` only has positive values. And I'm not sure what kind of plausible caching mechanism could result in a performance change in that case :P
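For anyone who wants to reproduce the comparison, here's a minimal sketch (assuming PyTorch on a CUDA GPU; the size and iteration count are arbitrary):

```python
import time
import torch

def bench_tflops(make_input, n=8192, iters=100):
    # Two n x n bf16 operands built by the given initializer (rand or randn).
    a = make_input(n, n, device="cuda", dtype=torch.bfloat16)
    b = make_input(n, n, device="cuda", dtype=torch.bfloat16)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    return 2 * n**3 * iters / dt / 1e12  # achieved TFLOPS

print(f"rand  (positive only): {bench_tflops(torch.rand):.0f} TFLOPS")
print(f"randn (mixed signs):   {bench_tflops(torch.randn):.0f} TFLOPS")
```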


So can aggressive cooling allow one to access previously inaccessible performance domains, or does the GPU flat-out limit the power regardless of the cooling rate?

author

My understanding is that this is BIOS-limited - you can't set a higher power limit with `nvidia-smi`, at least. The power limit is typically set based on cooling considerations, though.

For example, if you look at https://www.nvidia.com/en-us/data-center/a100/, you'll see

> 400W TDP for standard configuration. HGX A100-80GB custom thermal solution (CTS) SKU can support TDPs up to 500W
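
If you want to see what your own card allows, here's a minimal sketch using nvidia-ml-py (assuming device 0):

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device 0

# Software-settable power limit range; the max is the board/BIOS cap.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
cur_mw = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)
print(f"settable: {min_mw // 1000}-{max_mw // 1000} W, "
      f"enforced: {cur_mw // 1000} W")

pynvml.nvmlShutdown()
```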
