14 Comments
Apr 29 · edited Apr 29 · Liked by Horace He

Benchmark performance by input type chart for H100, H200, MI300X pls. Make it something I gotta pay for lol. It would be cool to see FLOPS/Watt across all that hardware too, cause right now all we see is some random MFU across a whole model, which isn't apples to apples, or theoretical peaks.

Apr 30 · Liked by Horace He

I also second this request… though my expectations are:

For MI300X, you can almost guarantee it’ll be worse flops/W, due to it using a slightly worse process (5nm vs 4nm) and, more importantly, the crazy use of chiplets.

H200 will be similar to the H100 for such a test, where BW doesn’t matter.


I don't think the MI300X has a worse process node; that's just marketing from Nvidia. The H100 uses a customized version of TSMC N5P, which they branded as 4N. AMD uses a customized version of N5P for the GPU compute dies, stacked on top of an N6-based active interposer. The chiplets/packaging may lead to worse efficiency, as you stated.

As far as H200 vs H100, I heard there was a new stepping, but I was unable to verify whether that's true, so it may be slightly more efficient. Same with AMD's MI350X: there is a new stepping there (which I was able to verify), so it is more efficient than their MI300X.

Apr 30 · Liked by Horace He

Very nice article, Horace.

It’s something many of us have long suspected, but this is the first time I am seeing actual experimental results showing that peak flops are really just pie in the sky, even if memory bandwidth were infinite.

author

Thanks! Indeed, I was surprised how constrained modern GPUs are by power.


Thanks, this is interesting. You haven't tested all the way to 700 Watts, though: how do you know that it won't hit the 1.83 GHz clock frequency?

author

Haha, that's a fair question - I knew it from previous testing. I'll update the post with results on a proper H100 soon.
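If anyone wants to check on their own card, here's a minimal sketch (assuming PyTorch and nvidia-ml-py are installed; the matrix size and iteration counts are arbitrary) that watches the sustained SM clock under a matmul load:

```python
import torch
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Keep the tensor cores busy with a large bf16 matmul.
a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
for i in range(500):
    a @ b
    if i % 100 == 0:
        torch.cuda.synchronize()  # make sure the queued kernels actually ran
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        print(f"iter {i}: SM clock = {sm_mhz} MHz")
pynvml.nvmlShutdown()
```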

May 1 · Liked by Horace He

Horace, great article!

Power has two elements for flops/circuits, like you pointed out: static power and dynamic power. Dynamic power = C·V²·f, where V is the supply voltage, which is also changed along with clock frequency to reduce power consumption. You stated the voltage dependence, but I wanted to highlight that it has more power over power than clock frequency does. Pun intended. I would expect Nvidia to play with voltage too, since major hardware engineers from Transmeta, who invented the LongRun and LongRun2 technology, joined their team 24 years ago. LongRun basically throttles the clock frequency along with the core voltage (the supply voltage for the circuits). The reason to point this out: when the power is throttled to 100 Watts, the performance drop could be because clock frequency throttling is happening at that point, while until then it was voltage scaling. Do we have any evidence of which kicks in first?

Secondly, static power = leakage power, as you note in the article. At slower clock speeds, the balance of power shifts to static power (as dynamic power becomes a smaller percentage of the total), so it doesn't matter how much you are switching, or what.

The anomaly I see is between the all-zero and all-one input multiplications. To have all zeros at the input of the multiplier, there are other circuits - almost half of the total, I would expect - being held at 1, and vice versa for all 1's at the input. So the difference is because of either: (1) the threshold of all cells is not at the center of the supply rails (Vdd and ground, i.e. 0 Volts) - which it cannot be precisely, since I suspect they are changing the supply voltage dynamically. (The threshold is the voltage level above which an input is construed as 1 and below which it is construed as 0.) Or (2) there is a particular circuit in the matmul pathway that was computationally burdensome - often latency-critical - and was skewed to favor a 1 or 0 input to have better overall performance.

[Total power = static power + C·V²·f × activity factor (how often each bit switches per clock cycle)]
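To make the magnitudes concrete, here's a toy Python sketch of that model; all the constants are made up for illustration, but it shows why voltage dominates - frequency-only scaling drops dynamic power linearly, while DVFS (voltage scaled with frequency) drops it roughly cubically:

```python
# Toy model of P = P_static + C * V^2 * f * alpha.
# All constants are invented for illustration only.
P_STATIC = 80.0    # W, leakage
C_ALPHA = 1.0e-7   # lumped capacitance * activity factor

def power(v, f_hz):
    return P_STATIC + C_ALPHA * v**2 * f_hz

V0, F0 = 0.9, 1.83e9  # nominal voltage (V) and clock (Hz)

print(f"nominal:       {power(V0, F0):7.1f} W")
print(f"half f only:   {power(V0, F0 / 2):7.1f} W")      # dynamic part halves
print(f"half f + DVFS: {power(V0 / 2, F0 / 2):7.1f} W")  # dynamic part / 8
```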

author

> when the power is throttled to 100 Watts, the performance drop could be because clock frequency throttling is happening at that point, while until then it was voltage scaling

This is a good point - and probably explains why I see the reversal of the trend at 100W. Will add to the article.

> The anomaly I see is between all zero and all one input multiplications.

Is your question why there's a performance discrepancy between all-zero and all-one input multiplications?


Re: article change, that will be awesome.

Re: the discrepancy between all-zero and all-one inputs: yes. If you have independent voltage control and clock control, I can suggest experiments that can suss out which of my guesses as to why this happens may be true; see the sketch below. Thanks for the article once again.
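For example, one direction (a sketch assuming nvidia-ml-py and root access; the clock and power values are placeholders): pin the SM clock, then sweep the power limit, so that any remaining performance change has to come from voltage rather than frequency:

```python
import pynvml  # pip install nvidia-ml-py; setting clocks/limits needs root

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Pin the SM clock so frequency throttling is taken out of the picture.
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1400, 1400)  # MHz, placeholder

for limit_w in (700, 500, 300, 100):  # placeholder sweep, in Watts
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit_w * 1000)  # mW
    # ... run the matmul benchmark here and record achieved flops ...

pynvml.nvmlDeviceResetGpuLockedClocks(handle)
pynvml.nvmlShutdown()
```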


Fundamentally, minimising load operations would save power and speed up the process. But how can you tell that energy throttling is what's causing the performance decrease, rather than caching, which would also result in the same behavior?

author

The fact that `randn` results in worse performance than `rand` seemed pretty indicative to me that the primary effect is power and not some kind of caching effect. Just like `randn`, `rand` has fully unique elements at all points. The only difference is that `randn` includes both positive and negative values, while `rand` only has positive values. And I'm not sure what kind of plausible caching mechanism could result in a performance change in that case :P
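For anyone who wants to reproduce the comparison, here's a minimal sketch (assuming PyTorch on a CUDA GPU; the size and iteration count are arbitrary):

```python
import time
import torch

def bench_tflops(make_input, n=8192, iters=100):
    # Two n x n bf16 operands built by the given initializer (rand or randn).
    a = make_input(n, n, device="cuda", dtype=torch.bfloat16)
    b = make_input(n, n, device="cuda", dtype=torch.bfloat16)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    return 2 * n**3 * iters / dt / 1e12  # achieved TFLOPS

print(f"rand  (positive only): {bench_tflops(torch.rand):.0f} TFLOPS")
print(f"randn (mixed signs):   {bench_tflops(torch.randn):.0f} TFLOPS")
```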


So can aggressive cooling allow one to access previously inaccessible performance domains, or does the GPU flat-out limit the power regardless of the cooling rate?

author

My understanding is that this is BIOS-limited - you can't set a higher power limit with `nvidia-smi`, at least. The power limit is typically set based on cooling considerations, though.

For example, if you look at https://www.nvidia.com/en-us/data-center/a100/, you'll see

> 400W TDP for standard configuration. HGX A100-80GB custom thermal solution (CTS) SKU can support TDPs up to 500W
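
If you want to see what your own card allows, here's a minimal sketch using nvidia-ml-py (assuming device 0):

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device 0

# Software-settable power limit range; the max is the board/BIOS cap.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
cur_mw = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)
print(f"settable: {min_mw // 1000}-{max_mw // 1000} W, "
      f"enforced: {cur_mw // 1000} W")

pynvml.nvmlShutdown()
```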
