<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Thonk From First Principles]]></title><description><![CDATA[ML Systems from first principles. Aims to be better than a ChatGPT summary.]]></description><link>https://www.thonking.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!WNJs!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55e3b22f-cc6b-438a-be3d-8d17cc97c2f9_750x750.png</url><title>Thonk From First Principles</title><link>https://www.thonking.ai</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 10:21:45 GMT</lastBuildDate><atom:link href="https://www.thonking.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Horace He]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[thonking@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[thonking@substack.com]]></itunes:email><itunes:name><![CDATA[Horace He]]></itunes:name></itunes:owner><itunes:author><![CDATA[Horace He]]></itunes:author><googleplay:owner><![CDATA[thonking@substack.com]]></googleplay:owner><googleplay:email><![CDATA[thonking@substack.com]]></googleplay:email><googleplay:author><![CDATA[Horace He]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why PyTorch is an amazing place to work... 
and Why I'm Joining Thinking Machines]]></title><description><![CDATA[In which I convince you to join either PyTorch or Thinking Machines!]]></description><link>https://www.thonking.ai/p/why-pytorch-is-an-amazing-place-to</link><guid isPermaLink="false">https://www.thonking.ai/p/why-pytorch-is-an-amazing-place-to</guid><dc:creator><![CDATA[Horace He]]></dc:creator><pubDate>Tue, 04 Mar 2025 10:31:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bUrf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><em>Note: This is a revised (and extended) version of the note I posted internally when I decided to leave PyTorch. I&#8217;ve gotten a lot of questions, and I hope that this post will both 1. answer those questions, and 2. convince you that working at PyTorch or Thinking Machines would be a great idea :)</em></p><p>Well, there&#8217;s no easy way to say this. After about 4 years of working at PyTorch, I&#8217;ve decided to leave PyTorch to be a founding engineer at <a href="https://thinkingmachines.ai/">Thinking Machines</a>. In that sentence, I would emphasize &#8220;be a founding engineer at Thinking Machines&#8221; far more than &#8220;leave PyTorch&#8221; - I have enjoyed (and continue to enjoy) working on PyTorch, and would have gladly worked here another 4 years.</p><p>At several points over the last several years, I&#8217;ve talked to folks who have expressed surprise that I&#8217;m still at PyTorch. Not to brag, but it certainly wasn&#8217;t for lack of opportunity - I&#8217;ve been offered roles at OpenAI/Anthropic, I was recruited to be a founding engineer at {xAI, SSI, Adept, Inflection, etc.}, and I&#8217;ve been offered roles at many other startups you likely know. 
With hindsight, many of these opportunities would have led to much greater compensation, but I&#8217;ve never regretted staying at PyTorch.</p><p>Let&#8217;s talk about why I&#8217;ve enjoyed working at PyTorch for 4 years, and what compelled me to go to Thinking Machines. </p><p>Apologies in advance for this note - it&#8217;s a self-indulgent and personal post. But I only get to write this once! </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bUrf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bUrf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 424w, https://substackcdn.com/image/fetch/$s_!bUrf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 848w, https://substackcdn.com/image/fetch/$s_!bUrf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!bUrf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bUrf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png" 
width="586" height="781.1991758241758" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:586,&quot;bytes&quot;:2427573,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.thonking.ai/i/158277004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd99a48-d622-41da-8b2f-32712110ce82_2048x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!bUrf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 424w, https://substackcdn.com/image/fetch/$s_!bUrf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 848w, https://substackcdn.com/image/fetch/$s_!bUrf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!bUrf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5289e993-9fe1-47f7-8f86-a3e97a30ced1_1536x2048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"></button></div></div></div></a><figcaption class="image-caption">It&#8217;s tradition at Meta to post a picture of your badge when leaving. Sadly, I couldn&#8217;t find my official badge, so a temp badge will have to do.</figcaption></figure></div><h2>How I ended up at PyTorch</h2><p>I think it&#8217;s fair to call me an AI &#8220;true believer&#8221;. Ever since I saw AlphaGo in high school and read the <a href="https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html">WaitButWhy AI post</a> (not totally sure it holds up 10 years later), I was convinced that AI would be the most important technology of my lifetime. Correspondingly, from the time that I started college in 2016, most of what I did was AI-related. 
I took ML classes, I founded an <a href="https://cuai.github.io/">undergraduate ML research club</a>, I published papers, I even met my girlfriend (now fiancee!) through doing ML research together!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yovf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yovf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 424w, https://substackcdn.com/image/fetch/$s_!yovf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 848w, https://substackcdn.com/image/fetch/$s_!yovf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!yovf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yovf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png" width="508" height="414.9651162790698" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1376,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yovf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 424w, https://substackcdn.com/image/fetch/$s_!yovf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 848w, https://substackcdn.com/image/fetch/$s_!yovf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!yovf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029a8090-a302-4e3e-bca2-d6e0a27d48e3_1376x1124.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">The figure that convinced a 17-year-old me. I also gave a deeply embarrassing presentation to one of my high school classes about AI risk entitled &#8220;The Robots are Coming&#8221; that included this slide.</figcaption></figure></div><p>However, there were several things that didn&#8217;t totally satisfy me about just doing ML research.</p><p>For one, although I was publishing papers and such, even back then, it wasn&#8217;t super clear to me whether any research I did was actually &#8220;meaningful&#8221;. In research, one demoralizing aspect is that, looking back, 99% of papers don&#8217;t end up on the &#8220;critical path&#8221; to what actually works. Cynically, any PhD spending their time on n-gram models wasted their time - their papers and theses relegated to the dustbin of history. Although it&#8217;s true that even papers that aren&#8217;t on the critical path can still be useful and necessary (e.g. 
demonstrating limitations of existing approaches, setting up baselines for new approaches to surpass), this worry constantly nagged at me.</p><p>Second, I was never able to adapt well to the &#8220;running experiments&#8221; mode of ML experimentation - my working style is somewhat irregular, with lots of thinking punctuated by lots of coding. On the other hand, I think tremendous discipline is required to be a great ML experimentalist - it&#8217;s a constant loop of &#8220;come up with hypothesis&#8221; =&gt; &#8220;run experiment&#8221; =&gt; &#8220;get result of last experiment back&#8221; =&gt; &#8220;come up with new hypothesis&#8221;, often pipelined several stages deep. In ML research, you have a physical constraint (GPUs), and to be a good researcher you must get good experiment utilization out of them.</p><p>Overall, I ended up gravitating much more to &#8220;systems&#8221;. Not only was it an area that I thought played well to my strengths, but I had also always admired the impact of systems. Instead of needing to deliver the impact directly, you could improve the impact of thousands or millions of people by 5%! 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oe_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Oe_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 424w, https://substackcdn.com/image/fetch/$s_!Oe_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 848w, https://substackcdn.com/image/fetch/$s_!Oe_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 1272w, https://substackcdn.com/image/fetch/$s_!Oe_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oe_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png" width="1456" height="931" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Oe_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 424w, https://substackcdn.com/image/fetch/$s_!Oe_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 848w, https://substackcdn.com/image/fetch/$s_!Oe_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 1272w, https://substackcdn.com/image/fetch/$s_!Oe_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99f2af4-3fdf-481e-8fdb-e7f6a14a159e_1600x1023.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">This logic was also how I ended up maintaining the VSCodeVim extension for some time&#8230;</figcaption></figure></div><p>So, I ended up with my career plan - instead of directly working on advancing ML development, I would work on building infra to help <em>others</em> advance ML development. A bunch of other things happened in between, but that&#8217;s how I ended up on PyTorch.</p><h1>Why PyTorch is an amazing place to work</h1><h2>PyTorch&#8217;s Impact on the Industry</h2><p>As the field (and money!) has exploded over the last 10 years, I think it&#8217;s easy to lose track of just how much impact PyTorch has had. 
Perhaps the most obvious tracker of the money in the field is Nvidia&#8217;s stock price, primarily driven by growth in server-side GPU sales.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gV9Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gV9Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 424w, https://substackcdn.com/image/fetch/$s_!gV9Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 848w, https://substackcdn.com/image/fetch/$s_!gV9Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!gV9Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gV9Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png" width="1456" height="1049" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1049,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gV9Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 424w, https://substackcdn.com/image/fetch/$s_!gV9Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 848w, https://substackcdn.com/image/fetch/$s_!gV9Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!gV9Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5227980-3fab-4da5-90ac-669813475ce4_1532x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a></figure></div><p>I think it&#8217;s reasonable to guess that at least 75% of these GPUs are running some kind of PyTorch code.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> That&#8217;s insane. Nvidia has gained something like $3 trillion of market cap, and PyTorch was crucial for much of it.</p><p>Moreover, among the ML community broadly, PyTorch remains the lingua franca. <a href="https://paperswithcode.com/trends">59% of research papers</a> tracked by Papers With Code use PyTorch (with 29% not using any ML framework), <a href="https://huggingface.co/models?sort=trending">the vast majority of models on Hugging Face</a> (90%+?) 
are built on top of PyTorch, and the most popular inference servers are also built on top of PyTorch (<a href="https://github.com/vllm-project/vllm">vLLM</a> and <a href="https://github.com/sgl-project/sglang">sglang</a>).</p><p>Even at the leading AI labs, pretty much all of the companies using GPUs use PyTorch. OpenAI, Mistral, DeepSeek, and Meta primarily use PyTorch (and GPUs). Anthropic also primarily uses PyTorch for GPUs, and xAI (which uses JAX for training on GPUs) also uses PyTorch for inference (through <a href="https://github.com/sgl-project/sglang">sglang</a>)!</p><p>In high school, one of the things I feared most was that I would work on some project for 10 years and eventually realize that I&#8217;d wasted my life improving something that nobody cared about. One of the greatest things about working on PyTorch is the certainty that I haven&#8217;t.</p><h2>PyTorch&#8217;s Impact on Me</h2><p>I&#8217;ve spent my entire career (thus far) on PyTorch, and so, outside of the overall impact of PyTorch, I want to talk about why I&#8217;ve enjoyed the day-to-day so much.</p><h3>Mission Alignment</h3><p>One of the greatest things about startups is &#8220;mission alignment&#8221;. Because so much of your comp is tied to equity upside, there&#8217;s no difference between &#8220;my coworker wildly succeeded&#8221; and &#8220;we all wildly succeeded&#8221;. On the other hand, at a large tech company, people&#8217;s compensation is primarily tied to their individual performance rating (and promos). So, if you start working on an approach, and somebody else comes up with a different approach that&#8217;s wildly successful (and supersedes yours), your performance rating is likely to suffer, and you probably won&#8217;t get promoted.</p><p>At PyTorch, however, many of the people on the project <em>are</em> mission aligned - they do genuinely care about the overall success of PyTorch and its impact on the ML ecosystem. 
I certainly wouldn&#8217;t say it&#8217;s 100% of people on the team, but it&#8217;s enough (especially among more senior folk) to make it a much more pleasant experience.</p><h3>A True Commitment to Open-Source</h3><p>Soumith (and other folks in leadership) have done an exceptional job in cultivating a culture where OSS is valued at PyTorch. There are many other projects that happen to be OSS, but where you can often only get promoted and have impact through prioritizing internal projects. This isn&#8217;t true at PyTorch - I would say that I&#8217;ve spent my entire time here primarily focused on OSS impact, and I&#8217;ve been successful in terms of ratings and promotions. (Of course, there are other folks primarily focused on internal impact who have also been very successful.)</p><p>There are other aspects in which valuing OSS leads to a much healthier project.</p><h3>Ungameable Impact</h3><p>One phenomenon at large tech companies that I don&#8217;t really like is what I call &#8220;roadmap-driven adoption&#8221;. This is where two managers/directors/VPs get together, agree that X should be used (potentially killing some other project Y), and then the adoption of a project is laid out in the roadmaps of several teams.</p><p>While this certainly has its advantages (and in some cases is entirely necessary), I find that projects adopted this way are often &#8230; subpar. Moreover, it&#8217;s not uncommon for the success of these projects to be a facade - they continue as long as some VP is sponsoring the project, but eventually people get fed up with it, the VP loses a political fight, or the VP simply changes their mind. Basically, in roadmap-driven development, the most important component is convincing some &#8220;key stakeholders&#8221; that your project should be adopted.</p><p>On the other hand, OSS impact is truly a free market. Open source users couldn&#8217;t care less if <strong>Mark Zuckerberg</strong> is throwing his entire support behind a project. 
OSS users only care that 1. you&#8217;re solving a problem they have, and 2. they like using your software.</p><p><a href="https://x.com/schrep/status/1781905719983530395">Mike Schroepfer (former CTO at Meta)</a> expressed a similar sentiment here. I can&#8217;t even imagine how hard it is to get &#8220;real&#8221; feedback as a CTO, where everybody you talk to knows that you can be single-handedly responsible for their promotion or bonus. OSS is refreshingly ungameable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MyUL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MyUL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 424w, https://substackcdn.com/image/fetch/$s_!MyUL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 848w, https://substackcdn.com/image/fetch/$s_!MyUL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 1272w, https://substackcdn.com/image/fetch/$s_!MyUL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MyUL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png" width="1088" height="982" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1088,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MyUL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 424w, https://substackcdn.com/image/fetch/$s_!MyUL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 848w, https://substackcdn.com/image/fetch/$s_!MyUL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 1272w, https://substackcdn.com/image/fetch/$s_!MyUL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ffc831-702c-4775-8bc6-a6a49acdd1ce_1088x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"></button></div></div></div></a></figure></div><h3>Amazing Opportunities for Career Growth</h3><p>Even from a purely selfish perspective, it&#8217;s hard to imagine a place with more opportunities for career growth. Folks can often get tunnel vision on internal promotions. <em>But</em>, a much bigger driver of career growth is often &#8220;what do folks across the industry think of you&#8221;. Objectively speaking, I&#8217;ve gotten incredible offers as a founding engineer at startups (SSI, xAI, Thinking Machines) as well as great opportunities for (fairly) high-profile talks (<a href="https://www.youtube.com/watch?v=139UPjoq7Kw">Jane Street</a>, a couple of lectures at universities, etc.).
Even among big tech companies, at some point a director offered me a 2.5-level jump to go work at Google.</p><p>Frankly, it&#8217;s embarrassing to write the above paragraph - there are definitely other folks on PyTorch who are just as good as or better than me, but due to my focus on OSS and public presence (twitter/blogs/etc.), I&#8217;ve become fairly well known. But&#8230; this is something of a &#8220;sell-post&#8221; on PyTorch, so I&#8217;ll allow it :) </p><p>I would say that just working <strong>on</strong> PyTorch alone is a great opportunity.<a href="https://www.alignmentforum.org/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers"> Andy Jones (at Anthropic) once wrote an essay</a> whose top line was &#8220;If you think you could write a substantial pull request for a major machine learning library, then major AI safety labs want to interview you today.&#8221; Obviously, if you work on PyTorch, you qualify for this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2xqZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2xqZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 424w, https://substackcdn.com/image/fetch/$s_!2xqZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 848w, 
https://substackcdn.com/image/fetch/$s_!2xqZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 1272w, https://substackcdn.com/image/fetch/$s_!2xqZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2xqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png" width="1456" height="837" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:837,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2xqZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 424w, https://substackcdn.com/image/fetch/$s_!2xqZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 848w, 
https://substackcdn.com/image/fetch/$s_!2xqZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 1272w, https://substackcdn.com/image/fetch/$s_!2xqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cc54cc-683f-46c9-bd77-8149ab670d21_1600x920.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>But where PyTorch truly excels is the opportunity for engineers to have industry-wide impact (and to get recognized for it!).
It&#8217;s very rare to build software that most people in the most important field of our time use - I&#8217;d recommend taking advantage of it.</p><p>Last note - even for internal work, I find broad OSS impact is very helpful. Especially when it comes to cross-team collaboration, one of the most important currencies is &#8220;legitimacy&#8221;, and OSS impact is far more likely to confer legitimacy than internal impact. I&#8217;ve benefited significantly from this.</p><h3>Interesting Technical Work</h3><p>One fear of many engineers is that they won&#8217;t be able to solve interesting technical problems - there&#8217;s no shortage of those on PyTorch. There are projects that have implemented a Python bytecode interpreter JIT for machine learning (<a href="https://pytorch.org/docs/stable/torch.compiler_dynamo_overview.html">TorchDynamo</a>), projects on reaching speed-of-light for matrix multiplications, projects where you <a href="https://github.com/yifuwang/symm-mem-recipes/blob/main/triton_all_gather_matmul.py#L68">regularly need to dive into PTX documentation</a> (perhaps this one isn&#8217;t appealing hahaha), projects <a href="https://github.com/pytorch/pytorch/blob/f4e4cfcb91d9db966633fdb1828ada725369296b/torch/fx/experimental/symbolic_shapes.py#L3680">all about reasoning over symbolic shapes (sympy, z3, etc.)</a>, and so many more.</p><p>There&#8217;s also no shortage of problems to be solved &#128539;</p><h2>Consider Working on PyTorch!</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z0wJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!z0wJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 424w, https://substackcdn.com/image/fetch/$s_!z0wJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 848w, https://substackcdn.com/image/fetch/$s_!z0wJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!z0wJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z0wJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg" width="444" height="406.4032258064516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:496,&quot;resizeWidth&quot;:444,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;How lucky I am to have something that makes saying goodbye so hard\&quot; Winnie  the Pooh/A. A. 
Milne [496 x 454] : r/QuotesPorn&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How lucky I am to have something that makes saying goodbye so hard&quot; Winnie  the Pooh/A. A. Milne [496 x 454] : r/QuotesPorn" title="How lucky I am to have something that makes saying goodbye so hard&quot; Winnie  the Pooh/A. A. Milne [496 x 454] : r/QuotesPorn" srcset="https://substackcdn.com/image/fetch/$s_!z0wJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 424w, https://substackcdn.com/image/fetch/$s_!z0wJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 848w, https://substackcdn.com/image/fetch/$s_!z0wJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!z0wJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9836c3f3-6044-4c0c-8139-e882a7c61985_496x454.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 
9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button></div></div></div></a></figure></div><p>Perhaps a year ago I was talking with Adam Paszke about working on ML frameworks, and we agreed that, given how sweet the gig is, it was surprising to us that more people didn&#8217;t want to work on ML frameworks. This post is (partially) my attempt to remedy that.</p><p>If any of the above sounds appealing to you, consider working on the PyTorch Core<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> team (primarily under Gregory Chanan)! I would recommend contacting/emailing Soumith Chintala at <a href="mailto:soumith@meta.com">soumith@meta.com</a>.
Although I will (sadly) no longer be on the team, there are many other amazing people to work with (and I plan to continue to be in regular contact with the team).</p><p>From my perspective, an ideal candidate would:</p><ol><li><p><strong>Have an inherent curiosity in the field of machine learning</strong>: Machine learning moves very quickly, and one characteristic I&#8217;ve found most valuable is a general knowledge of what&#8217;s going to be useful next. For me, I found the years of just doing pure ML research (using PyTorch) to be extremely helpful in understanding what should be built.</p></li><li><p><strong>Be highly agentic: </strong>Although having ideas on what should be done is an important characteristic, it&#8217;s perhaps even more valuable to just &#8220;do it&#8221;. PyTorch, in many ways, is a very &#8220;bottom-up&#8221; org. Although there&#8217;s always roadmapping and such, the most impactful projects were never on the roadmap.</p></li></ol><h1>Why I&#8217;m Excited About Thinking Machines</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8jwG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8jwG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 424w, https://substackcdn.com/image/fetch/$s_!8jwG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 848w, 
https://substackcdn.com/image/fetch/$s_!8jwG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 1272w, https://substackcdn.com/image/fetch/$s_!8jwG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8jwG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png" width="654" height="282.77161862527714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:390,&quot;width&quot;:902,&quot;resizeWidth&quot;:654,&quot;bytes&quot;:93067,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.thonking.ai/i/158277004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8jwG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 424w, 
https://substackcdn.com/image/fetch/$s_!8jwG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 848w, https://substackcdn.com/image/fetch/$s_!8jwG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 1272w, https://substackcdn.com/image/fetch/$s_!8jwG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febea000e-fc92-4bdc-a075-dccc350fe54a_902x390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Given that I just wrote way too much about why I loved working on PyTorch, why did I join Thinking Machines? Moreover, why was <em>Thinking Machines</em> the opportunity that convinced me?</p><h3>A Group of People I Would Very Much Like to Work With</h3><p>As everyone knows, a startup is nothing without its people. And Thinking Machines sure has some pretty good people!</p><p>You have the folks responsible for <strong>the</strong> <a href="https://openai.com/index/chatgpt/">&#8220;research preview&#8221;</a> that kickstarted this situation (John Schulman, Barrett Zoph, Luke Metz), you have folks leading pretraining efforts at Meta, OpenAI, Character.AI, etc. (Sam Schoenholz, Naman Goyal, Myle Ott, Jacob Menick), you have folks leading multimodal efforts at OpenAI/Mistral (Alexander Kirillov, Rowan Zellers, Devendra Chaplot), you have extremely good infra people (Andrew Tulloch, Yinghai Lu, Ian O&#8217;Connell, etc.), and of course you have the former CTO (and brief CEO :^)) of the biggest AI company in the world (Mira Murati).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>However, perhaps even more than the team&#8217;s strength, I was impressed by its <em>friendliness</em>. Certainly, the fact that I had previously enjoyed working with several of these people helped.
Of the 4 people at Meta I&#8217;ve been saddest about leaving, 2 of them (Andrew Tulloch<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> and Yinghai Lu) are at Thinking Machines.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h3>An Amazing (and Asymmetrical) Opportunity</h3><p>One unfair advantage of being a founding engineer at a startup (especially one that&#8217;s so clearly a &#8220;good&#8221; opportunity) is the asymmetrical opportunity cost. <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>For example, if I join Thinking Machines as a founding engineer, and then in a year decide I was massively mistaken and go to another lab, it&#8217;s not likely that my role would change all that much! I would be joining an established company, and the role would likely be fairly similar to what it is today. Somewhat perversely, just like the prodigal son, it might even be <em>beneficial</em> to have left.</p><p>However, if I declined now but then joined Thinking Machines in a year, my role would be drastically different. Of course, my compensation would change, but more importantly, I would have far less <em>legitimacy</em> and influence. The culture and direction of a company are largely set by the founding team, and that&#8217;s something I don&#8217;t have the opportunity for at OpenAI or Anthropic.</p><h3>An Approach to Positive AI Outcomes that Resonated with Me</h3><p>Perhaps most important, however, was that Thinking Machines&#8217; approach to positive AI outcomes - research/product codesign and open science - resonated with me. As mentioned above, I&#8217;ve been convinced since high school that AI would be the most <em>important</em> technology of our lifetime.
However, this is not the same thing as saying it&#8217;ll be the most <em>beneficial</em>.</p><p>Note: Don&#8217;t take the below statements as speaking on behalf of Thinking Machines - although I&#8217;ve talked to members of the team about these various topics, this is entirely &#8220;my stance&#8221; and &#8220;why does my stance lead to me joining Thinking Machines&#8221;. I also don&#8217;t have the space to give this topic a more in-depth treatment, so view it more as an &#8220;explanation of my beliefs&#8221; as opposed to a watertight argument.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SDjs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SDjs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 424w, https://substackcdn.com/image/fetch/$s_!SDjs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 848w, https://substackcdn.com/image/fetch/$s_!SDjs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 1272w, https://substackcdn.com/image/fetch/$s_!SDjs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SDjs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png" width="606" height="605.1675824175824" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1454,&quot;width&quot;:1456,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:2988822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.thonking.ai/i/158277004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SDjs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 424w, https://substackcdn.com/image/fetch/$s_!SDjs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 848w, https://substackcdn.com/image/fetch/$s_!SDjs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 1272w, https://substackcdn.com/image/fetch/$s_!SDjs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad10c696-20d5-4e03-9ffa-213d1b947060_1472x1470.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Generally speaking, I would consider myself a <a href="https://www.noahpinion.blog/p/thoughts-on-techno-optimism">techno-optimist</a>. That is, I believe that humans&#8217; lives have gotten drastically better over the last 1000 years, and that this has been largely driven by technological innovation. 
I&#8217;m thankful for the <a href="https://x.com/boywaif/status/1602301368705978369">modern supermarket</a>, I&#8217;m astounded by <a href="https://www.youtube.com/watch?v=dX9CGRZwD-w">semiconductor manufacturing</a>, and I&#8217;m especially grateful for <a href="https://x.com/robkhenderson/status/1655221465409544192">air conditioning</a>. In many ways, AI is the most techno-accelerationist technology the world has ever seen - a single technology that has the potential to solve every other technical challenge we face. Because of this, the potential positive impacts of AI are worth pursuing - I like Dario&#8217;s piece &#8220;<a href="https://darioamodei.com/machines-of-loving-grace#5-work-and-meaning">Machines of Loving Grace</a>&#8221; for a more in-depth look on what that might look like.</p><p>Of course, bad outcomes <em>are</em> possible, and due to the potential impact of AI, bad outcomes seem far worse than with other technologies.</p><p>In general, I&#8217;ve categorized bad AI outcomes as:</p><ul><li><p>Misuse: Bad people use AI to do something bad<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li><li><p>Misalignment: Good people use AI, but the AI itself ends up doing something bad.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></li><li><p>Societal Impacts: People are good, AI is good, but we end up in a bad outcome anyways.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p></li></ul><p>Although all 3 are reasonable concerns, I would say I&#8217;m most concerned with &#8220;Societal Impacts&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p><p>The primary reason for this is that society naturally has a 
strong &#8220;immune response&#8221; to Misalignment and Misuse. When it comes to potentially harmful technologies, society has a clear playbook - if something bad happens, tighten restrictions (e.g. regulate GPUs) or add regulations (e.g. mandate more safety oversight).</p><p>Of course, AI is not a normal technology, but concretely speaking, I think there will be plenty of warning signs before truly catastrophic misuse or misalignment. Even if the AI bides its time before misalignment (e.g. deceptive misalignment or a <a href="https://www.lesswrong.com/w/treacherous-turn">treacherous turn</a>), I find it unlikely that the first AI system to do so will succeed - it would need to be drastically more powerful than humans and other AIs.</p><p>On the other hand, negative Societal Impacts seem much more straightforwardly plausible. Imagine a world where ChatGPT was never released, but instead hypothetical AI lab ClosedAI releases AI Job Replacement Agent 3000 in 2030, instantly replacing 50% of all human jobs and turning ClosedAI into a quadrillion-dollar company.</p><p>Even in today&#8217;s world, the secrecy of the top AI labs definitely rubs me the wrong way (although I understand why it&#8217;s done) - I don&#8217;t know how much more vagueposting I can take. 
Moreover, the ideological and geographical concentration of AI knowledge doesn&#8217;t seem ideal - as AI expertise becomes more and more in demand, the fact that the vast majority of AI secrets are contained in a 50-mile radius around San Francisco leads to both a power imbalance and a monoculture.</p><p>If we need to align AI to human values, should all of those humans live in San Francisco?</p><h3>Why I&#8217;m Compelled by Thinking Machines&#8217; Mission</h3><p>Broadly speaking, there are 2 main aspects of Thinking Machines&#8217; mission that were compelling to me.</p><h4>A Focus on Product and Broad AI Diffusion</h4><p>In my opinion, one of the most important aspects for broader societal stability is how smoothly society transitions to using AI systems. Just as important as the outcome is how people feel we arrived at that outcome.</p><p>For example, ChatGPT did not really blow many ML researchers&#8217; minds - they&#8217;d seen GPT-3, they&#8217;d seen what GPT-3 prompting could do, and ChatGPT was just a convenience feature. However, ChatGPT absolutely blew the rest of society away. This was the first time broader society became aware of all the things a SOTA LLM could do, and society was shocked. Since then, however, ChatGPT has become much more normalized among broader society - folks have a bit of a hedonic treadmill.</p><p>But there is much more that can be done. Even today, there is a vast gap between what a layman encountering ChatGPT for the first time can do and what those who have deeply integrated AI into their workflow can do.</p><p>Moreover, I believe there is a lot of potential in building AI products that can help out people collaboratively, as opposed to fully autonomous AI agents. One cute way I thought about it was &#8220;Maximize the value of labor instead of capital&#8221;.</p><h4>Open Science and Systems</h4><p>As mentioned above, it doesn&#8217;t seem good for society for the knowledge of how these AI systems are built to be so secretive. 
Not only does it create resentment towards these AI labs, it also makes it much more difficult for society to build on top of these AI systems! For example, DeepSeek&#8217;s recent releases (both their papers and code) have helped the broader community develop a much better understanding of what will be useful moving forward (e.g. Online RL).</p><p>Personally, of course, this was a big part of my motivation for PyTorch. Good open-source systems help out the entire ecosystem, enabling many more people to participate in building AI systems.</p><p>I also want to note that although open science/systems is certainly a nice ideal, there are obviously also economic realities at play. In my opinion, this is where the focus on product is useful. Companies like Meta or Google don&#8217;t need to be very secretive about the actual techniques being used - more or less, most of their core systems/approaches are widely known by the community. On the other hand, if your only product is just an API endpoint with tokens in and tokens out, your only edge is your model&#8217;s exact capabilities.</p><p>The culture and defaults of a company also matter a ton. There are many things at these AI labs that <em>could</em> be OSS&#8217;ed without affecting their competitive edge - they just don&#8217;t do it because the default is closed and they&#8217;d need to argue why it <em>should</em> be open.</p><p>For example, PyTorch is the opposite here. All of our code is open-source, our roadmaps are open, and some of our design meetings are open too. So, one must argue why something should be <em>closed</em> if you don&#8217;t want it to be open.</p><p>Based on Sam Altman&#8217;s comment <a href="https://www.reddit.com/r/LocalLLaMA/comments/1if43uf/sam_altman_openai_has_been_on_the_wrong_side_of/">here</a>, he thinks that OpenAI <em>should</em> be open-sourcing more things. However, it&#8217;s &#8220;not [the] current highest priority&#8221;. 
</p><h3>Overall Thoughts on Positive AI Outcomes</h3><p>Overall, I think Thinking Machines&#8217; mission of broad AI diffusion and collaborative open science seems like a compelling strategy to help address Societal Impacts. Of course, there are other essential approaches (like Policy), but Thinking Machines&#8217; mission personally resonates and is an area I believe I can contribute to.</p><h2>Final Thoughts</h2><p>The opportunity to join Thinking Machines as a founding engineer hit basically all of my checkboxes.</p><ol><li><p>An extremely strong team, with folks I&#8217;ve personally enjoyed working with before and other folks I think I&#8217;ll enjoy working with.</p></li><li><p>The opportunity to be there from the beginning and have a say in the direction and culture of a very exciting company.</p></li><li><p>A mission (product-focus + open science) that was uniquely compelling to me as a path to better AI outcomes.</p></li><li><p>Finally, as a bit of a gut-instinct thing, the open science/systems aspect allows me to continue some of what I enjoy about my role at PyTorch - talking to people about AI systems and having broad impact with open-source code.</p></li></ol><p>Almost none of my previous opportunities hit even 2 of these boxes, let alone all 4. One point I distinctly remember when considering this was: &#8220;If even <em>this</em> opportunity doesn&#8217;t compel me to leave PyTorch, I should/would probably work on PyTorch forever&#8221;.</p><p>Although it was a very difficult decision, I&#8217;m very excited to build some cool stuff at Thinking Machines!</p><h2>Come work at Thinking Machines!</h2><p>If any of the above sounds compelling to you (or if you want to work with me :P), please email me at horace@thinkingmachines.ai or DM me on twitter!</p><p>At the very least, I think it&#8217;ll be a lot of fun!</p><h3>Acknowledgements</h3><p>I&#8217;d like to thank everyone I talked to while making this decision, all of whom provided me with a lot of insight. 
I&#8217;d also like to thank my coworkers at PyTorch for making my last 4 years so enjoyable. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The rest would be a combination of A. other ML frameworks, B. workloads running raw CUDA (like HPC often), and C. pure C++ offerings like TensorRT</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There are other fun teams under the broader PyTorch org, but the one I worked on is PyTorch Core (specifically PyTorch Core Compilers). While some aspects apply on other teams as well, I think the OSS presence is strongest on the PyTorch Core team.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>There are other folks at Thinking Machines who I hear are also extremely strong (e.g. Alex Gartrell was lead of Operating Systems at Meta and Sam Schleifer was one of the first employees at Character.ai) - this is mostly just the folks that I&#8217;d heard about prior to joining.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>A funny story about Andrew Tulloch leaving Meta to go to OpenAI is that in a weird way, it actually reassured me about staying at Meta. I thought Andrew Tulloch was one of the strongest engineers at Meta, so if he went to OpenAI and just ended up being a run-of-the-mill engineer there, it would have really made me feel like Meta was a small pond. 
However, by all accounts, he was one of the best engineers at OpenAI too.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>For what it&#8217;s worth, one of the other ones (Natalia Gimelshein!) is back at PyTorch :)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Of course, this is not the first time I&#8217;ve heard (or thought) about this argument. There are 3 main counterarguments for why this argument may not be compelling:</p><ol><li><p>Although the project (PyTorch) will almost certainly still be around in a year, the specific projects and directions I&#8217;m pushing for will not. Concretely, one of the things that made me most hesitant to leave PyTorch was that there were a couple of directions I was strongly pushing for whose success I&#8217;m now more uncertain about with me leaving.</p></li><li><p>Although moving around may certainly lead to <em>short-term</em> career (and knowledge) gains, ownership and legitimacy can only be obtained through sticking with a project for a long period of time. Sometimes I think about how it would be fun to spend 6 months at each AI company to understand how they operate. However, this wouldn&#8217;t work out well in the long run - mercenaries may be valued but never respected.</p></li><li><p>Perhaps most sappily, it does feel a bit like a relationship breakup. I know people say that you should just treat a job like a job, but I can&#8217;t help it haha. I really like the people I&#8217;ve worked with.</p></li></ol><p>Basically, for me, this argument was most compelling as a reason to join Thinking Machines now rather than OpenAI/Anthropic/xAI. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Misuse is probably the most straightforward bad outcome from AI - a bad person uses AI to do something bad. Just like guns enable a single human to cause far more harm than if they only had knives, a very capable AI system may enable a single human (or terrorist group) to enact far more damage than they otherwise would have. </p><p>Others also worry about authoritarian countries leveraging AI to make a super-surveillance state.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Misalignment is when the folks using AI may have good intentions, but due to reward hacking/goodharting/instrumental convergence, the AI ends up achieving some other goal that might be bad. This is also often called &#8220;<a href="https://www.decisionproblem.com/paperclips/index2.html">paperclipping</a>&#8221;.</p><p>Although there are certainly versions of this that may seem extremely &#8220;sci-fi&#8221;, there are plenty of more prosaic examples. For example, humans certainly didn&#8217;t <em>intend</em> to pollute the environment, but nevertheless, it occurred as a side-effect of other things we <em>did</em> intend (factories).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This is essentially any negative impact on society that doesn&#8217;t come from the other two - job loss, power-imbalance, economic disempowerment, monocultures, etc. 
Perhaps you could call this &#8220;human misalignment&#8221; as opposed to the above &#8220;AI misalignment&#8221; :)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>If you forced me to give a number, I would say 50% Societal Impacts, 25% Misuse, and 25% Misalignment.</p></div></div>]]></content:encoded></item><item><title><![CDATA[FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention [external]]]></title><description><![CDATA[Freeing users from the software lottery tyranny of fused attention implementations.]]></description><link>https://www.thonking.ai/p/pytorch-blog-flexattention-the-flexibility</link><guid isPermaLink="false">https://www.thonking.ai/p/pytorch-blog-flexattention-the-flexibility</guid><dc:creator><![CDATA[Horace He]]></dc:creator><pubDate>Wed, 07 Aug 2024 21:59:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SMw5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have a blog post up on the PyTorch blog (as part of my day job) on FlexAttention: <a href="https://pytorch.org/blog/flexattention/">https://pytorch.org/blog/flexattention/ </a>(work done with <a href="https://twitter.com/drisspg">Driss Guessous</a>, <a href="https://twitter.com/yanboliang">Yanbo Liang</a>, and <a href="https://joydddd.github.io/">Joy Dong</a>). And here&#8217;s a <a href="https://twitter.com/cHHillee/status/1821253769147118004">tweet thread</a>.</p><p>The beginning is excerpted here:</p><blockquote><p>In theory, Attention is All You Need. 
In practice, however, we also need optimized attention implementations like FlashAttention.</p><p>Although these fused attention implementations have substantially improved performance and enabled long contexts, this efficiency has come with a loss of flexibility. You can no longer try out a new attention variant by writing a few PyTorch operators - you often need to write a new custom kernel! This operates as a sort of &#8220;software lottery&#8221; for ML researchers - if your attention variant doesn&#8217;t fit into one of the existing optimized kernels, you&#8217;re doomed to slow runtime and CUDA OOMs.</p><p>For some examples of attention variants, we have Causal, <a href="https://paperswithcode.com/method/relative-position-encodings">Relative Positional Embeddings</a>, <a href="https://paperswithcode.com/method/alibi">Alibi</a>, <a href="https://mistral.ai/news/announcing-mistral-7b/">Sliding Window Attention</a>, <a href="https://twitter.com/andersonbcdefg/status/1800907703688339569">PrefixLM</a>, <a href="https://github.com/pytorch/torchtune/pull/875">Document Masking/Sample Packing/Jagged Tensors</a>, <a href="https://twitter.com/LysandreJik/status/1807779471891538199">Tanh Soft-Capping</a>, <a href="https://arxiv.org/abs/2309.06180">PagedAttention</a>, etc. Even worse, folks often want combinations of these! Sliding Window Attention + Document Masking + Causal + Context Parallelism? Or what about PagedAttention + Sliding Window + Tanh Soft-Capping?</p><p>The left picture below represents the state of the world today - some combinations of masking + biases + setting have existing kernels implemented. But the various options lead to an exponential number of settings, and so overall we end up with fairly spotty support. 
Even worse, new attention variants researchers come up with will have <em>zero</em> support.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SMw5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SMw5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 424w, https://substackcdn.com/image/fetch/$s_!SMw5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 848w, https://substackcdn.com/image/fetch/$s_!SMw5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!SMw5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SMw5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg" width="1456" height="1328" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1328,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Attention variant support diagram&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Attention variant support diagram" title="Attention variant support diagram" srcset="https://substackcdn.com/image/fetch/$s_!SMw5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 424w, https://substackcdn.com/image/fetch/$s_!SMw5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 848w, https://substackcdn.com/image/fetch/$s_!SMw5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!SMw5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d634a46-d108-4237-a68a-887109a072c3_1600x1459.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>To solve this hypercube problem once and for all, we introduce <strong>FlexAttention</strong>, a new PyTorch API.</p><ol><li><p>We provide a flexible API that allows implementing many attention variants (including all the ones mentioned in the blog post so far) in a few lines of idiomatic PyTorch code.</p></li><li><p>We lower this into a fused FlashAttention kernel through <code>torch.compile</code>, generating a FlashAttention kernel that doesn&#8217;t materialize any extra memory and has performance competitive with handwritten ones.</p></li><li><p>We also automatically generate the backwards pass, leveraging PyTorch&#8217;s autograd machinery.</p></li><li><p>Finally, we can also take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.</p></li></ol><p>With FlexAttention, we hope that trying new attention variants will 
only be limited by your imagination.</p></blockquote><p>My next post (~70% done) will also be about attention! This one will be a historical retrospective on why Tri Dao was the one to invent FlashAttention, and not any of the large tech companies.</p><p>There are some other interesting topics I&#8217;m considering writing about.</p><p>In particular, some potential titles:</p><ol><li><p>What&#8217;s the point of ML compilers when Attention is All You Need? (how I think about building ML systems in a world where everybody uses one ML architecture)</p></li><li><p>Performance Metrics Were Made for Man, not Man for Performance Metrics (How do you choose the right performance metric? In particular, flop counting)</p></li><li><p>My ML framework isn&#8217;t Obeying Mathematics! (a primer on floating point &#8220;nondeterminism&#8221; for machine learning settings)</p></li></ol><div class="poll-embed" data-attrs="{&quot;id&quot;:201032}" data-component-name="PollToDOM"></div>]]></content:encoded></item><item><title><![CDATA[Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! 
[short]]]></title><description><![CDATA[Great minds discuss flops per watt.]]></description><link>https://www.thonking.ai/p/strangely-matrix-multiplications</link><guid isPermaLink="false">https://www.thonking.ai/p/strangely-matrix-multiplications</guid><dc:creator><![CDATA[Horace He]]></dc:creator><pubDate>Mon, 29 Apr 2024 18:51:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s 2022. I check out this cool new project, <a href="https://github.com/NVIDIA/cutlass">CUTLASS</a>, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.</p><pre><code>python mm_bench.py
&gt; CuBLAS: 258 Teraflops</code></pre><p>Not bad, 83% flop utilization. Now let&#8217;s check out CUTLASS&#8217;s performance using its profiler.</p><pre><code>./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
&gt; CUTLASS: 288 Teraflops</code></pre><p>!!! 10% higher perf? That&#8217;s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and somehow CUTLASS + autotuning is outperforming it by 10%? We gotta start using these matmuls yesterday.</p><p>The next step is to bind the CUTLASS kernels into Python and compare against CuBLAS using my previous script.</p><pre><code>python cutlass_mm_bench.py
&gt; CuBLAS: 258 Teraflops
&gt; CUTLASS: 257 Teraflops</code></pre><p>Somehow, in the light of Python, all of CUTLASS&#8217;s performance gains disappear. This in and of itself is not shocking - it&#8217;s notoriously difficult to ensure consistent benchmarking across setups.</p><p>I tediously ablate the two benchmark scripts, until finally, I find that CUTLASS&#8217;s profiler, by default, actually initializes the values in a fairly strange way - it only initializes the inputs with integers. Confused about whether this matters, I try:</p><pre><code>zero_inputs = torch.zeros(N, N)
randn_inputs = torch.randn(N, N)
benchmark(zero_inputs) # 295 Teraflops
benchmark(randn_inputs) # 257 Teraflops</code></pre><p>What? How could the <em>values</em> of the matrix affect the runtime of the kernel? I know Nvidia has some weird <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compressible-memory">data compression</a> thing on A100s, but I wouldn&#8217;t have expected that to be on in matmuls. Let&#8217;s try some other data distributions, like a uniform distribution [0,1].</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q_rq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q_rq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 424w, https://substackcdn.com/image/fetch/$s_!q_rq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 848w, https://substackcdn.com/image/fetch/$s_!q_rq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 1272w, https://substackcdn.com/image/fetch/$s_!q_rq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!q_rq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin" width="1397" height="1132" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1132,&quot;width&quot;:1397,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Output image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Output image" title="Output image" srcset="https://substackcdn.com/image/fetch/$s_!q_rq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 424w, https://substackcdn.com/image/fetch/$s_!q_rq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 848w, https://substackcdn.com/image/fetch/$s_!q_rq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 1272w, https://substackcdn.com/image/fetch/$s_!q_rq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75646311-1680-4c0c-b35e-d7736683e34f_1397x1132.bin 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This was &#8230; confusing, to say the least. Somehow, the actual content of the tensors being multiplied is leading to different matmul performance.</p><p>There certainly are cases where the runtime depends on the content of the tensor &#8212; indirect indexing (e.g. <code>A[b]</code>), or things like sparsity.</p><p>But matrix multiplications have nothing like that at all! No matter what the matrices contain, the matrix multiplication kernel will 1. perform the same number of computations, 2. perform the same computations in the same order, 3. access the same memory addresses, and 4. 
access the same memory addresses in the same order.</p><p>Nowhere did my mental model of matrix multiplications and GPU hardware allow for the values in the matrix to influence matmul performance. And yet, here we are.</p><p>As it turns out, the culprit is &#8230;&#8230;. dynamic/switching power in semiconductors!</p><h1>Power Usage in Semiconductors</h1><p>An Nvidia A100 GPU has a power limit of 400W<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. However, as the phrase &#8220;power limit&#8221; may hint, the GPU doesn&#8217;t always use all 400W. 
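<p>As a cartoon of what a &#8220;power limit&#8221; means, here is a toy numeric sketch. Everything in it is a made-up constant for illustration (only the 400W limit comes from this post): total draw is a fixed always-on baseline plus a load-dependent term, and the GPU must keep their sum under the cap.</p>

```python
# Toy model of GPU power draw -- illustrative constants only.
# Total draw = baseline (paid whenever the chip is powered) + a load-dependent term.

POWER_LIMIT_W = 400.0   # the A100's power limit
BASELINE_W = 90.0       # hypothetical always-on draw

def total_draw(load: float, k: float = 350.0) -> float:
    """load in [0, 1]; k (extra watts at full load) is a made-up constant."""
    return BASELINE_W + k * load

print(total_draw(0.0))   # idle: just the baseline
print(total_draw(1.0))   # full load: exceeds the cap, so something has to give
print(total_draw(1.0) > POWER_LIMIT_W)
```

In this cartoon, full load would draw more than the cap allows, which is exactly the situation the rest of this section is about.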
For example, when the GPU is fully idle, nvidia-smi tells me that it&#8217;s only pulling 88W of power.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J_3Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J_3Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 424w, https://substackcdn.com/image/fetch/$s_!J_3Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 848w, https://substackcdn.com/image/fetch/$s_!J_3Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 1272w, https://substackcdn.com/image/fetch/$s_!J_3Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J_3Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png" width="1456" height="419" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88756,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J_3Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 424w, https://substackcdn.com/image/fetch/$s_!J_3Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 848w, https://substackcdn.com/image/fetch/$s_!J_3Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 1272w, https://substackcdn.com/image/fetch/$s_!J_3Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23857db1-9818-4b69-ab6f-0a799563dab8_1614x464.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But when the GPU is running under load, that power usage will spike considerably, typically to around the power limit.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B0kO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B0kO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 424w, 
https://substackcdn.com/image/fetch/$s_!B0kO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 848w, https://substackcdn.com/image/fetch/$s_!B0kO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 1272w, https://substackcdn.com/image/fetch/$s_!B0kO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B0kO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png" width="1456" height="435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B0kO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 424w, 
https://substackcdn.com/image/fetch/$s_!B0kO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 848w, https://substackcdn.com/image/fetch/$s_!B0kO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 1272w, https://substackcdn.com/image/fetch/$s_!B0kO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f209576-4ea4-4de7-918f-c146e99a3065_1662x496.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In order to stay under the power limit, a component called the voltage regulator module reduces the voltage supplied to the GPU &#8212; throttling the clock frequency and reducing its performance.</p><p>In other words, if our GPU ends up using enough power to hit the power limit, our performance becomes capped.</p><p>Most of us take it for granted that &#8220;GPU does something, power consumption goes up&#8221;. But there are actually two distinct mechanisms through which power gets consumed.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D6Kx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D6Kx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 424w, https://substackcdn.com/image/fetch/$s_!D6Kx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 848w, https://substackcdn.com/image/fetch/$s_!D6Kx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!D6Kx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D6Kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg" width="589" height="220.72745490981964" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:187,&quot;width&quot;:499,&quot;resizeWidth&quot;:589,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D6Kx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 424w, https://substackcdn.com/image/fetch/$s_!D6Kx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 848w, https://substackcdn.com/image/fetch/$s_!D6Kx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!D6Kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8379a7fe-2edc-4030-97c1-b81c919a45ef_499x187.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Dynamic/switching power on the left, static/leakage power on the right. 
Taken from https://semiengineering.com/knowledge_centers/low-power/low-power-design/power-consumption/</figcaption></figure></div><p>The first one is static/leakage power. You can think of this as the power that inevitably gets lost simply by keeping the circuits powered. The amount of static power used is proportional to the amount of silicon that is powered. As GPUs don&#8217;t do much <a href="https://en.wikipedia.org/wiki/Power_gating">power gating</a>, this is essentially the amount of power used at idle (88W in the above photo).</p><p>However, the second one, <strong>dynamic</strong> <strong>(or switching) power, </strong>is the culprit. Specifically, a small amount of power is consumed whenever a transistor <em>switches states</em>. If the transistor never needs to switch states, it doesn&#8217;t consume any extra power. On the other hand, if it&#8217;s rapidly flipping, then it consumes a ton of dynamic/switching power. Multiply that by the billions of transistors in your GPU, and you get the overall increase in power consumption.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sxRm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sxRm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sxRm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sxRm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sxRm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sxRm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg" width="647" height="242.4629258517034" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:187,&quot;width&quot;:499,&quot;resizeWidth&quot;:647,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sxRm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 424w, https://substackcdn.com/image/fetch/$s_!sxRm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!sxRm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!sxRm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e6764f5-44d9-4cc9-a32b-70c56647a81b_499x187.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>In other words, the reason why matrix multiplications are faster when passed zeros is that <strong>zeros reduce the &#8220;flipping&#8221; of enough transistors in the chip to stay under the power limit! </strong></p><p>So, this (mostly) explains what we saw previously<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. All zeros are probably the fastest since every single bit of each computation is a zero and the accumulator remains at zero. All ones is probably still quite fast since every single tensor-core instruction results in exactly the same values. The uniform distribution is probably a little faster than the normal distribution since the accumulator never needs to flip-flop between positive and negative. The normal distribution probably has the worst performance since it leads to pretty high randomness among all transistors involved in the computation(?).</p><p>Here are the results for a number of fun distributions I tried:</p><ol><li><p>Randn: Normally distributed</p></li><li><p>Checkerboard: Normal distribution, but with zeros in a checkerboard pattern.</p></li><li><p>Rand: Uniform distribution</p></li><li><p>Sparse: Normal distribution, but a (random) 75% of the elements are masked. 
</p></li><li><p>Ternary: Every value is 1, -1, or 0.</p></li><li><p>One Bit: Only one bit is set in every value (the 4th bit)</p></li><li><p>All Pies: Every single value is the mathematical constant PI.</p></li><li><p>Twos: Every value in the matrices is 2</p></li><li><p>Zeros: Every value in the matrices is 0</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2cYi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2cYi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 424w, https://substackcdn.com/image/fetch/$s_!2cYi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 848w, https://substackcdn.com/image/fetch/$s_!2cYi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 1272w, https://substackcdn.com/image/fetch/$s_!2cYi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2cYi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png" width="1392" height="844" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:615669,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2cYi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 424w, https://substackcdn.com/image/fetch/$s_!2cYi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 848w, https://substackcdn.com/image/fetch/$s_!2cYi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 1272w, https://substackcdn.com/image/fetch/$s_!2cYi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e6999-afc8-47c6-8cc9-c81aa145ab98_1392x844.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Who says unstructured sparsity isn&#8217;t efficient with tensor-cores? :)</p><h1>How power limit and clock speed affects this</h1><p>Here&#8217;s another piece of evidence that dynamic/switching power is responsible.</p><p>Roughly, the power we&#8217;re using is proportional to the clock speed multiplied by amount of transistor flips we&#8217;re doing.</p><p> <code>power ~= clock speed * &#8220;transistor flips per clock&#8221;.</code> </p><p>We run into throttling when the power we use surpasses the power limit we&#8217;ve set. Thus:</p><ol><li><p>If we reduce our power limit we exacerbate this effect.</p></li><li><p>If we reduce the clock speed we reduce this effect.</p></li></ol><p>Let&#8217;s show that in action! 
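<p>Before looking at the measurements, the relation <code>power ~= clock speed * flips per clock</code> can be sanity-checked as a toy calculation. All constants here are invented (the nominal clock, and the &#8220;flips&#8221; cost for predictable vs. unpredictable inputs); the point is only the shape of the result, not the numbers:</p>

```python
# Toy version of: power ~= clock speed * "transistor flips per clock".
# All constants below are invented for illustration.

NOMINAL_CLOCK = 1.4          # GHz, hypothetical frequency limit
FLIPS_PREDICTABLE = 150.0    # "watts per GHz" for zeros-like inputs (made up)
FLIPS_RANDOM = 280.0         # "watts per GHz" for randn-like inputs (made up)

def achievable_clock(power_limit_w: float, flips: float) -> float:
    # The clock is capped either by the frequency limit or by the power budget.
    return min(NOMINAL_CLOCK, power_limit_w / flips)

# Lowering the power limit widens the relative gap between the two inputs.
for limit in (400.0, 300.0, 200.0):
    ratio = achievable_clock(limit, FLIPS_PREDICTABLE) / achievable_clock(limit, FLIPS_RANDOM)
    print(limit, round(ratio, 2))
```

At a generous power limit both inputs run at the full clock, so the ratio is 1; as the limit drops, the unpredictable input throttles first and the ratio grows, which is exactly prediction 1 above.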
To do so, I&#8217;ll compare the relative performance of a very predictable input (zeros) vs a very unpredictable input (randn).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CgZB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CgZB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 424w, https://substackcdn.com/image/fetch/$s_!CgZB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 848w, https://substackcdn.com/image/fetch/$s_!CgZB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!CgZB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CgZB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin" width="1387" height="947" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:947,&quot;width&quot;:1387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Output image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Output image" title="Output image" srcset="https://substackcdn.com/image/fetch/$s_!CgZB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 424w, https://substackcdn.com/image/fetch/$s_!CgZB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 848w, https://substackcdn.com/image/fetch/$s_!CgZB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!CgZB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5effdde-4b30-4ebb-a910-537c459d8e6b_1387x947.bin 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As expected, we see that as the power limit decreases from 330W down to 100W (the minimum), the relevant performance improvement from using predictable inputs increases. Interestingly, at the lowest power limit (100W), the trend reverses. I&#8217;m guessing that the GPU is so power constrained that even using all zeros still results in too much power usage. [<a href="https://www.thonking.ai/p/strangely-matrix-multiplications/comment/55271443">Uma Kelkar in the comments</a> suggests that this is because throttling the power limit also just directly throttles the clock speeds]. Remember that the switching power is coming from <em>every transistor in the GPU</em>, not just the ones holding data! 
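<p>As a toy illustration of why random data causes more switching, you can count bit transitions between successive values flowing through a register. This is only a loose proxy (real tensor cores are vastly more complicated), with a hypothetical 16-bit datapath:</p>

```python
import random

def toggles(values, bits=16):
    """Count bit flips between successive values on a `bits`-wide bus."""
    mask = (1 << bits) - 1
    flips = 0
    for prev, cur in zip(values, values[1:]):
        flips += bin((prev ^ cur) & mask).count("1")
    return flips

rng = random.Random(0)
constant_stream = [0x3C00] * 1000                   # the same value every cycle
random_stream = [rng.getrandbits(16) for _ in range(1000)]

print(toggles(constant_stream))  # 0 -- no transistor on this bus ever flips
print(toggles(random_stream))    # thousands of flips -> dynamic power
```

A constant stream produces zero transitions, while a random stream flips roughly half the bus bits every cycle; multiplied across billions of transistors, that difference is the whole story of this post.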
So that includes the transistors for, say, keeping track of the current program counter, keeping track of how many loop iterations you need to perform, the ones signaling other transistors to perform operations, pretty much everything that a GPU can possibly be doing.</p><p>Now, to test the effect of GPU clocks on using predictable vs. unpredictable inputs, I&#8217;ll use a power limit of 200W and vary the GPU clock limit.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-nnn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-nnn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 424w, https://substackcdn.com/image/fetch/$s_!-nnn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 848w, https://substackcdn.com/image/fetch/$s_!-nnn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!-nnn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-nnn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Output image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Output image" title="Output image" srcset="https://substackcdn.com/image/fetch/$s_!-nnn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 424w, https://substackcdn.com/image/fetch/$s_!-nnn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 848w, https://substackcdn.com/image/fetch/$s_!-nnn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!-nnn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0441dd3d-137c-410d-bba2-f081e70300b1_1697x947.bin 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We see that at the top range of the clock frequency, there is nearly no change in the ratio, as presumably even with predictable inputs, we&#8217;re still getting throttled. Then, as we decrease the clock speed, the relative gap shrinks as predictable inputs are affected by the clock speed limit, but not unpredictable inputs. 
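One way to build intuition for this chart is a toy model in which the GPU runs at the lower of the manual clock limit and whatever clock its power budget can sustain, with unpredictable inputs burning more power per cycle. All of the numbers below are invented purely for illustration:

```python
POWER_LIMIT_W = 200.0
# Hypothetical watts consumed per GHz of sustained clock; unpredictable
# (random) inputs toggle more transistors, so they cost more per cycle.
WATTS_PER_GHZ = {"predictable": 110.0, "unpredictable": 140.0}

def sustained_clock(clock_limit_ghz, inputs):
    # The GPU runs at the manual clock limit unless the power budget forces it lower.
    power_bound_ghz = POWER_LIMIT_W / WATTS_PER_GHZ[inputs]
    return min(clock_limit_ghz, power_bound_ghz)

for limit in [2.0, 1.8, 1.6, 1.4, 1.2]:
    ratio = sustained_clock(limit, "predictable") / sustained_clock(limit, "unpredictable")
    print(f"clock limit {limit:.1f} GHz -> predictable is {ratio:.2f}x faster")
```

In this model, above both power-bound clocks the ratio is flat; in the middle range only predictable inputs benefit from a higher clock limit, so the gap shrinks; and once the clock limit drops below both power-bound clocks, the ratio collapses to 1.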
Finally, at the very left of the chart, both using predictable and unpredictable inputs have identical performance, as both become completely limited by our manual clock speed limit and don&#8217;t do any power throttling.</p><p>Another interesting thing we can test is, for a given input and power limit, what is the maximum clock speed the GPU can sustain for a matmul?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8wPi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8wPi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 424w, https://substackcdn.com/image/fetch/$s_!8wPi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 848w, https://substackcdn.com/image/fetch/$s_!8wPi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!8wPi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8wPi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin" width="1418" height="947" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:947,&quot;width&quot;:1418,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Output image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Output image" title="Output image" srcset="https://substackcdn.com/image/fetch/$s_!8wPi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 424w, https://substackcdn.com/image/fetch/$s_!8wPi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 848w, https://substackcdn.com/image/fetch/$s_!8wPi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!8wPi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882c3bb6-610b-41f0-a134-97b6780f781d_1418x947.bin 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Marketing vs &#8220;Real&#8221; Performance</h1><p>This observation that GPUs are unable to sustain their peak clock speed due to power throttling is one of the primary factors that separates &#8220;real&#8221; matmul performance from Nvidia&#8217;s marketed specs.</p><p>The figure that Nvidia provides for marketing is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{FLOPS} = \\text{Tensor Cores on GPU} \\cdot \\text{Max Clock Speed} \\cdot \\text{FLOP per Tensor Core Instruction}&quot;,&quot;id&quot;:&quot;QUCZOPKZJT&quot;}" data-component-name="LatexBlockToDOM"></div><p>For example, on an H100, there are <a href="https://resources.nvidia.com/en-us-tensor-core">528 tensor cores per GPU</a> (4 per SM), the max clock speed for these is 1.830 GHz, and the FLOP per tensor-core instruction is 1024.
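Plugging those three numbers into the formula is a one-liner:

```python
tensor_cores = 528        # 4 tensor cores per SM, 132 SMs on an H100
max_clock_hz = 1.830e9    # peak tensor core clock
flop_per_instr = 1024     # FLOP per tensor core instruction

peak_flops = tensor_cores * max_clock_hz * flop_per_instr
print(f"{peak_flops / 1e12:.0f} TFLOPS")  # -> 989 TFLOPS
```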
Thus, we have <code>1.830e9 * 528 * 1024 = 989 TFLOPS</code>, exactly Nvidia&#8217;s listed number.</p><p>However, you can only achieve this number by sustaining 1.83 GHz clocks, and as we&#8217;ve seen above, the GPU just doesn&#8217;t have enough power to do that!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m2rl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m2rl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 424w, https://substackcdn.com/image/fetch/$s_!m2rl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 848w, https://substackcdn.com/image/fetch/$s_!m2rl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!m2rl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m2rl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin" width="1456" height="906"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Output image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Output image" title="Output image" srcset="https://substackcdn.com/image/fetch/$s_!m2rl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 424w, https://substackcdn.com/image/fetch/$s_!m2rl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 848w, https://substackcdn.com/image/fetch/$s_!m2rl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 1272w, https://substackcdn.com/image/fetch/$s_!m2rl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85158cd4-ff77-4420-a60b-6bb639d69309_1522x947.bin 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Do note that in both of these cases, the GPUs have a higher power limit than I can test (400W on A100 and 700W on H100 respectively), so they&#8217;re able to sustain higher clock speeds than what&#8217;s charted here. But we can see that, especially on the H100, the max sustainable clock speed is much lower than the theoretical one! In other words, matmuls on the H100 are primarily not compute- or bandwidth-limited; they are <strong>power limited.</strong></p><p>As <a href="https://twitter.com/dwarkesh_sp/status/1780990840179187715">many have noted</a>, power is increasingly a crucial constraint.
So, although the H100 theoretically has roughly 3x the FLOPS of the A100, its &#8220;real&#8221; performance has usually been closer to 2x due to the power throttling we&#8217;ve been discussing, and its gain in &#8220;flops per watt&#8221; is even smaller.</p><p><strong>Update (6/25): </strong>Nvidia recently released their <a href="https://developer.nvidia.com/blog/nvidia-sets-new-generative-ai-performance-and-scale-records-in-mlperf-training-v4-0/">MLPerf submission</a>, which largely confirms the hypothesis in this blog post. Thanks Sophia Wisdom for the pointer!</p><blockquote><p>This leads to high Tensor Core utilization and can result in scenarios where Tensor Core throughput is constrained by the power available to the GPU.&nbsp;</p><p>In the submission with 512 H100 GPUs, we improved end-to-end performance by redirecting power from the L2 cache memory on each H100 GPU to the streaming multiprocessor (SM), which houses, among other units, NVIDIA Hopper fourth-generation Tensor Cores.</p></blockquote><h1>Conclusion</h1><p>All of this should make you exceedingly curious to see the actual performance improvement on the B100, which has a 1.75x theoretical increase in FLOPS with the same power usage as the H100.
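The flops-per-watt point is worth making concrete. Using the round numbers from this post (3x theoretical speedup, roughly 2x realized, and 400W vs. 700W power limits), the back-of-envelope arithmetic is:

```python
# H100 vs. A100, using the round numbers cited in the post
theoretical_speedup = 3.0   # marketed FLOPS ratio
real_speedup = 2.0          # typically realized, due to power throttling
power_ratio = 700 / 400     # H100 vs. A100 power limits

print(f"theoretical flops/watt gain: {theoretical_speedup / power_ratio:.2f}x")
print(f"realized flops/watt gain:    {real_speedup / power_ratio:.2f}x")
```

In other words, a generation that markets ~3x the FLOPS at 1.75x the power delivers only a modest real efficiency gain once throttling is accounted for.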
I have a <a href="https://manifold.markets/chilli/what-will-be-the-maximum-achievable">Manifold market about guessing the max FLOPS utilization on a B100</a>.</p><p>I&#8217;ll leave you with a slightly modified version of this tweet from roon.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dKL3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dKL3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 424w, https://substackcdn.com/image/fetch/$s_!dKL3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 848w, https://substackcdn.com/image/fetch/$s_!dKL3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 1272w, https://substackcdn.com/image/fetch/$s_!dKL3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dKL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png" width="1210" height="438" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dKL3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 424w, https://substackcdn.com/image/fetch/$s_!dKL3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 848w, https://substackcdn.com/image/fetch/$s_!dKL3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 1272w, https://substackcdn.com/image/fetch/$s_!dKL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6037e109-8dbf-4d8e-9570-5c8dc23eefc5_1210x438.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Thanks to Sophia Wisdom, Vijay Thakkar, Philippe Tillet, Dylan Patel, and Natalia Gimelshein, who have helped me understand this phenomenon.</p><p>Also, be aware that you need to set the `scale` parameter on the CUTLASS profiler if you want to compare its FLOPS numbers to other benchmarks!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The A100 GPU I&#8217;m testing this on happens to have a power limit of 330W.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>It&#8217;s hard for me to say for sure, since I can&#8217;t count how many times each individual transistor flips :)</p></div></div>]]></content:encoded></item><item><title><![CDATA[Solutions: What Shapes Do Matrix Multiplications Like?]]></title><description><![CDATA[Companion to https://www.thonking.ai/p/what-shapes-do-matrix]]></description><link>https://www.thonking.ai/p/answer-key-what-shapes-do-matrix</link><guid isPermaLink="false">https://www.thonking.ai/p/answer-key-what-shapes-do-matrix</guid><dc:creator><![CDATA[Horace He]]></dc:creator><pubDate>Mon, 08 Apr 2024 14:43:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OfQb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Note: The answer to question 1 is publicly available, but the answers to the rest are paywalled.
However, if you write up your solutions to each question and message me, I&#8217;ll send you the answer key for free. </p><h1><strong>Question 1</strong></h1><p>Let's say I have a <code>[M x K] @ [K x N]</code> matmul. Which one of these configurations will have the best perf? Think about the actual ramifications of tiling! Both matrices are in row-major layout (i.e. K and N are the innermost dimensions).<br>A: M=2047, K=N=2048<br>B: K=2047, M=N=2048<br>C: N=2047, M=K=2048</p><h2>Answer</h2><p>The correct answer is <strong>A</strong>. The key to this question is understanding the point about tiling and memory layouts. </p><p>When it comes to memory layouts, <em>not all dimensions are created equal</em>. Matrix multiplications are not inherently allergic to odd numbers - the poor performance is for a very specific reason.</p><p>In this case, that specific reason, as described <a href="https://www.thonking.ai/i/142904770/memory-layout-of-tiling">here</a>, is that odd shapes lead to <strong>unaligned memory layouts</strong>. Crucially, however, it must be an odd innermost dimension that leads to unaligned memory layouts.</p><p>Before explaining why, let&#8217;s have a practical demonstration that A (i.e. M=2047, K=N=2048) is over 2x faster than either alternative. Code can be found <a href="https://gist.github.com/Chillee/abc38703f88fcb64683b6ccb0ae9d8ba">here</a>.</p><pre><code>import torch
from triton.testing import do_bench
torch.set_default_device('cuda')

# Options A, B, C: only the position of the odd (2047) dimension differs
for M, K, N in [(2047, 2048, 2048), (2048, 2047, 2048), (2048, 2048, 2047)]:
    A = torch.randn(M, K, dtype=torch.bfloat16)
    B = torch.randn(K, N, dtype=torch.bfloat16)
    print(f"M={M}, K={K}, N={N}")
    print(do_bench(lambda: torch.mm(A, B)))</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LF9W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LF9W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 424w, https://substackcdn.com/image/fetch/$s_!LF9W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 848w, https://substackcdn.com/image/fetch/$s_!LF9W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 1272w, https://substackcdn.com/image/fetch/$s_!LF9W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LF9W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png" width="394" height="208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/822a1929-fd92-486e-bae9-19148dc5c333_394x208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:394,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LF9W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 424w, https://substackcdn.com/image/fetch/$s_!LF9W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 848w, https://substackcdn.com/image/fetch/$s_!LF9W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 1272w, https://substackcdn.com/image/fetch/$s_!LF9W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F822a1929-fd92-486e-bae9-19148dc5c333_394x208.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The first option (A) is &gt;2x faster!</figcaption></figure></div><p>To understand why this occurs, let&#8217;s understand how the logical layout and the physical layout look like.</p><p>First off, let&#8217;s say we have an 8x8 matrix, nicely aligned. 
Pretend that each cache line is 4 elements long.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OfQb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OfQb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 424w, https://substackcdn.com/image/fetch/$s_!OfQb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 848w, https://substackcdn.com/image/fetch/$s_!OfQb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 1272w, https://substackcdn.com/image/fetch/$s_!OfQb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OfQb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png" width="727" height="238.6717032967033" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1456,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:800760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OfQb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 424w, https://substackcdn.com/image/fetch/$s_!OfQb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 848w, https://substackcdn.com/image/fetch/$s_!OfQb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 1272w, https://substackcdn.com/image/fetch/$s_!OfQb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This is our ideal situation. 
The physical layout lines up with the cache lines, each load perfectly uses up every element in our cache line, and the world is at harmony.</p><p>However, let&#8217;s say we introduce an extra element per row, resulting in an 8x9 matrix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ee6m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ee6m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 424w, https://substackcdn.com/image/fetch/$s_!Ee6m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 848w, https://substackcdn.com/image/fetch/$s_!Ee6m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!Ee6m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ee6m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png" width="1456" height="1013" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1013,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1661205,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ee6m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 424w, https://substackcdn.com/image/fetch/$s_!Ee6m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 848w, https://substackcdn.com/image/fetch/$s_!Ee6m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!Ee6m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a3ad05-2be8-462f-b01e-aa8b96a0563f_1822x1268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
<g><title></title></g></svg></button></div></div></div></a></figure></div><p>This one measly element per row throws everything out of balance. Each row no longer starts on a cache line, and when issuing our loads, we can no longer simply load from a single cache line to obtain all the elements we need. 
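</p><p><em>The per-row accounting above can be sanity-checked with a few lines of Python. This is a toy model of the figures (4-element cache lines, row-major storage), not a measurement of real hardware:</em></p>

```python
def cache_lines_per_row(n_rows, n_cols, line_size=4):
    """Count the distinct cache lines touched by each row of a row-major matrix."""
    counts = []
    for r in range(n_rows):
        start = r * n_cols  # flat offset of this row's first element
        counts.append(len({(start + c) // line_size for c in range(n_cols)}))
    return counts

# 8x8: every row starts on a cache-line boundary and touches exactly 2 lines.
print(cache_lines_per_row(8, 8))  # [2, 2, 2, 2, 2, 2, 2, 2]
# 8x9: rows no longer start on line boundaries, and each row touches 3 lines.
print(cache_lines_per_row(8, 9))  # [3, 3, 3, 3, 3, 3, 3, 3]
```

<p>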
(This is all just a restatement of the explanation from <a href="https://www.thonking.ai/i/142904770/memory-layout-of-tiling">here</a>)</p><p>However, what happens if instead of adding an extra element per row, we simply add an extra row, resulting in a 9x8 matrix?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e9A6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e9A6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 424w, https://substackcdn.com/image/fetch/$s_!e9A6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 848w, https://substackcdn.com/image/fetch/$s_!e9A6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 1272w, https://substackcdn.com/image/fetch/$s_!e9A6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e9A6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png" width="1456" height="1483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2305182,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e9A6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 424w, https://substackcdn.com/image/fetch/$s_!e9A6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 848w, https://substackcdn.com/image/fetch/$s_!e9A6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 1272w, https://substackcdn.com/image/fetch/$s_!e9A6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a992ea-f47c-4ef9-b99b-6bea162a5202_1824x1858.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
<g><title></title></g></svg></button></div></div></div></a></figure></div><p>Unlike before, this does <em>not</em> affect the &#8220;alignedness&#8221; of each row! We do have an extra row at the bottom, and this may lead to some extra computation, but if our matrix was sufficiently large, that computation would be negligible. The important point, however, is that this extra row <em>does not affect the memory layout of the rest of the matrix</em>.</p><p>In other words, as long as the <strong>innermost size</strong> of your matrix is divisible by the cache line size, you&#8217;re good to go!</p><p>So, armed with our refined understanding of how shapes affect matrix multiplication performance (e.g. only the &#8220;evenness&#8221; of the innermost dimension matters for memory layouts), let&#8217;s look at the question again.</p><p>In a matmul, we have <code>A: [M x K] and B: [K x N]. 
</code>These are both row-major, which means that K and N are the innermost dimensions of A and B respectively.</p><p><strong>A: M=2047, K=N=2048 </strong>(the right answer!)<br><s>B: K=2047, M=N=2048 </s>(Ruled out because K is the innermost dimension of A)<br><s>C: N=2047, M=K=2048</s> (Ruled out because N is the innermost dimension of B)</p><p>Interestingly, when I originally asked this on Twitter, most people got it wrong. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2HTZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2HTZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 424w, https://substackcdn.com/image/fetch/$s_!2HTZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 848w, https://substackcdn.com/image/fetch/$s_!2HTZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!2HTZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2HTZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png" width="473" height="410.8888888888889" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1032,&quot;width&quot;:1188,&quot;resizeWidth&quot;:473,&quot;bytes&quot;:150729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2HTZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 424w, https://substackcdn.com/image/fetch/$s_!2HTZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 848w, https://substackcdn.com/image/fetch/$s_!2HTZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!2HTZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fb09bf-29b2-4ae7-9dbe-4a04011b6f83_1188x1032.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The substack readers fared much better.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZfYb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZfYb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZfYb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 848w, https://substackcdn.com/image/fetch/$s_!ZfYb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfYb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZfYb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png" width="507" height="302.6610455311973" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:1186,&quot;resizeWidth&quot;:507,&quot;bytes&quot;:67054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZfYb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZfYb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 848w, https://substackcdn.com/image/fetch/$s_!ZfYb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfYb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10b7eed7-64a0-4f75-ba84-7dc047a5589c_1186x708.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One friend answered that they chose B because &#8220;A and C seemed symmetrical, and so B must be the right option, since A and C couldn&#8217;t both be right&#8221;. Sadly, especially in the world of systems, things that seem identical may not be identical in practice&#8230;</p><p>I added Question 4 because some people were getting the right answer for Question 1 but for the wrong reasons, so let&#8217;s jump ahead and see how our newfound knowledge applies to a slightly modified version.</p><h1>Question 4</h1><p>Similar to Question 1, let&#8217;s say we have an A: [M x K] @ B: [K x N] matmul. However, now, A is in column-major (i.e. <code>torch.randn(K, M).t()</code>) while B is still row-major. What is the best configuration now?<br><br>A: M=2047, K=N=2048<br>B: K=2047, M=N=2048<br>C: N=2047, M=K=2048</p><h2>Answer</h2>
      <p>
          <a href="https://www.thonking.ai/p/answer-key-what-shapes-do-matrix">
              Read more
          </a>
      </p>
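<p><em>For Question 1 specifically (both operands row-major), the elimination logic can be written down as a tiny checker. The function name and the 16-element line size are illustrative assumptions, not anything from a real library:</em></p>

```python
def both_inner_dims_even(M, K, N, line_elems=16):
    # A: [M x K] row-major -> K is A's innermost dimension.
    # B: [K x N] row-major -> N is B's innermost dimension.
    # M only indexes rows of A and of the output, so it never breaks alignment.
    return K % line_elems == 0 and N % line_elems == 0

options = {
    "A": dict(M=2047, K=2048, N=2048),
    "B": dict(M=2048, K=2047, N=2048),
    "C": dict(M=2048, K=2048, N=2047),
}
for name, dims in options.items():
    print(name, both_inner_dims_even(**dims))  # A True, B False, C False
```

<p>Only option A keeps both innermost dimensions &#8220;even&#8221;, matching the answer given above for Question 1.</p>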
   ]]></content:encoded></item><item><title><![CDATA[What Shapes Do Matrix Multiplications Like? [medium]]]></title><description><![CDATA[Divining order from the chaos]]></description><link>https://www.thonking.ai/p/what-shapes-do-matrix-multiplications</link><guid isPermaLink="false">https://www.thonking.ai/p/what-shapes-do-matrix-multiplications</guid><dc:creator><![CDATA[Horace He]]></dc:creator><pubDate>Mon, 01 Apr 2024 16:01:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!v5MG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A while back, Karpathy tweeted that <em>increasing</em> the size of his matmul made it run faster. Surprisingly, it&#8217;s not just <em>relatively</em> faster, it takes less <em>absolute</em> time. In other words, despite doing more work, it is executing in less time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v5MG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v5MG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 424w, https://substackcdn.com/image/fetch/$s_!v5MG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 848w, 
https://substackcdn.com/image/fetch/$s_!v5MG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 1272w, https://substackcdn.com/image/fetch/$s_!v5MG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v5MG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png" width="1204" height="426" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:1204,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119338,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v5MG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 424w, https://substackcdn.com/image/fetch/$s_!v5MG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 848w, 
https://substackcdn.com/image/fetch/$s_!v5MG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 1272w, https://substackcdn.com/image/fetch/$s_!v5MG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b4467f5-8525-4701-a2ab-330bf4c58ec9_1204x426.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><a 
href="https://twitter.com/karpathy/status/1621578354024677377">https://twitter.com/karpathy/status/1621578354024677377</a></figcaption></figure></div><p>This may seem intuitively quite strange. Is cuBLAS just messing up somehow? Why doesn&#8217;t the matrix multiplication kernel just pad it to a larger shape? </p><p>It has become tribal knowledge that the particular shapes chosen for matmuls have a surprisingly large effect on their performance. But &#8230; why? Can this be understood by mere mortals?</p><p>Let&#8217;s take a crack at it.</p><p>First, let&#8217;s plot FLOPs achieved for square matmuls. By the end of this article, I will aim to explain all the strange squiggly lines. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!de3t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!de3t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 424w, https://substackcdn.com/image/fetch/$s_!de3t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 848w, https://substackcdn.com/image/fetch/$s_!de3t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!de3t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!de3t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg" width="1456" height="965" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:965,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:&quot;Image&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!de3t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 424w, https://substackcdn.com/image/fetch/$s_!de3t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 848w, https://substackcdn.com/image/fetch/$s_!de3t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!de3t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2b6db5-95ca-47c9-adb5-6f5ca85c92f0_2418x1602.jpeg 1456w" sizes="100vw"></picture></div></a></figure></div><p>There are 3 general concepts to understand that explain the majority of performance variation among matmul shapes.</p><ol><li><p>Compute Intensity/Parallelization: This explains the general upward trend</p></li><li><p>Tiling: This explains the multiple tiers of lines.</p></li><li><p>Wave Quantization: This explains the strange striped lines.</p></li></ol><div 
class="subscription-widget-wrap-editor"></div><h2>Compute Intensity and More Parallelism</h2><p>First of all, as we move along the x-axis, the matrix multiplications generally get more performant. There are two primary reasons for this.</p><p>The first one is simply &#8220;more work/more parallelism&#8221;. Launching a kernel comes with a large number of fixed overheads (e.g. scheduling work onto the SMs, waiting for all SMs to finish, etc.), and so, the more work we have to do, the less important those fixed overheads are. 
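To see how this amortization plays out, here is a back-of-the-envelope model; the ~5 microsecond launch overhead and 312 TFLOPS peak are illustrative A100-ish numbers I'm assuming, not measurements:

```python
# Toy model: wall time = fixed launch overhead + ideal compute time.
# Both constants below are assumed, illustrative numbers.
LAUNCH_OVERHEAD_S = 5e-6      # assumed fixed cost of launching a kernel
PEAK_FLOPS = 312e12           # A100-ish bf16 tensor-core peak

def effective_tflops(n: int) -> float:
    """Achieved TFLOPS for an n x n x n matmul under this toy model."""
    flops = 2 * n**3
    time_s = LAUNCH_OVERHEAD_S + flops / PEAK_FLOPS
    return flops / time_s / 1e12

for n in [256, 1024, 4096]:
    print(f"N={n:5d}: {effective_tflops(n):6.1f} TFLOPS")
```

Under this model, the small matmul spends nearly all of its time on the fixed overhead, while the large one approaches peak throughput.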
Along with more work comes more parallelism, and since GPUs have a ton of parallel cores, you need a surprising amount of work in order to fill a GPU up with enough parallelism.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6S0a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6S0a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 424w, https://substackcdn.com/image/fetch/$s_!6S0a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 848w, https://substackcdn.com/image/fetch/$s_!6S0a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 1272w, https://substackcdn.com/image/fetch/$s_!6S0a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6S0a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png" width="478" height="363.09615384615387" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1106,&quot;width&quot;:1456,&quot;resizeWidth&quot;:478,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6S0a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 424w, https://substackcdn.com/image/fetch/$s_!6S0a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 848w, https://substackcdn.com/image/fetch/$s_!6S0a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 1272w, https://substackcdn.com/image/fetch/$s_!6S0a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5b2ae02-19be-4af7-8fc4-037f7476761b_2196x1668.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data movement is expensive!</figcaption></figure></div><p>The second one is &#8220;arithmetic intensity&#8221;. As I&#8217;ve <a href="https://horace.io/brrr_intro.html">written about before</a>, memory accesses are much more expensive than compute. So, since a square matmul performs 3N^2 memory accesses but 2N^3 FLOPs, N needs to be at least in the hundreds before we start spending more time on compute than on memory!</p><p>The demands for sufficient arithmetic intensity and parallelism also compound. For example, let&#8217;s say your output matrix is <code>1024 x 1024</code>. If you let each SM compute a <code>128 x 128</code> slice of the output, that&#8217;s only 64 pieces of &#8220;work&#8221; for your GPU, not even enough for each one of an A100&#8217;s 108 SMs! 
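We can sketch this tradeoff for the <code>1024 x 1024</code> example directly. The intensity formula here assumes each output tile loads its rows of A and columns of B exactly once, which is a simplification:

```python
# Tile size trades parallelism against arithmetic intensity for a
# 1024 x 1024 x 1024 matmul. 108 is the A100's SM count.
N, SMS = 1024, 108

for tile in [128, 64]:
    n_tiles = (N // tile) ** 2                   # independent output tiles ("work items")
    flops_per_tile = 2 * tile * tile * N         # FLOPs to compute one tile
    loads_per_tile = 2 * tile * N                # A-rows + B-columns loaded per tile
    intensity = flops_per_tile / loads_per_tile  # FLOPs per element loaded
    print(f"tile={tile}: {n_tiles} tiles for {SMS} SMs, intensity={intensity:.0f}")
```

Halving the tile edge quadruples the available parallelism but halves the arithmetic intensity, exactly the tension described above.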
If you decrease your output slice size to 64 x 64, we now have 256 pieces of &#8220;work&#8221; for our GPU, but our arithmetic intensity has also decreased by a factor of 2.</p><p>With smaller matrix sizes, you need to worry about problems like this that don&#8217;t show up with larger matrices.</p><h2>Tiling</h2><p>Now that we understand the overall structure of the plot, the next question is: why is the plot all over the place? Why, even for very large matrices, do the TFLOPS jump between &gt;250 and &lt;100?</p><p>To give a hint, let&#8217;s color-code each dot by the highest power of 2 it&#8217;s divisible by.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qMoV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qMoV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qMoV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 848w, https://substackcdn.com/image/fetch/$s_!qMoV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qMoV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qMoV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg" width="1456" height="1087" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1087,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!qMoV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qMoV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 848w, https://substackcdn.com/image/fetch/$s_!qMoV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qMoV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245e1a67-e758-4561-9521-72191ef5992f_1513x1130.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As it turns out, the multiple &#8220;levels&#8221; of FLOPS are due to their shapes&#8217; divisibility. For example, when the shape is odd, the matmul performs significantly worse than when the shape is even. The matmul performs even better when the shape is divisible by 8, with even more performance gains when it&#8217;s divisible by 16 or 32.</p><p>Now, merely knowing about this effect is very practically useful, but what actually causes this effect? As it turns out, the answer is tiling. But, what even <em>is</em> tiling? 
And why does it cause such substantial performance issues?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qsho!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qsho!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 424w, https://substackcdn.com/image/fetch/$s_!qsho!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 848w, https://substackcdn.com/image/fetch/$s_!qsho!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 1272w, https://substackcdn.com/image/fetch/$s_!qsho!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qsho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png" width="1456" height="454" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110115,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qsho!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 424w, https://substackcdn.com/image/fetch/$s_!qsho!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 848w, https://substackcdn.com/image/fetch/$s_!qsho!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 1272w, https://substackcdn.com/image/fetch/$s_!qsho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5939a8fd-8356-45a6-8e83-db6fd48c014d_2116x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Taken from https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#tile-quant</figcaption></figure></div><p>Some online have mentioned <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#tile-quant">tile quantization</a> as the culprit. Tile quantization certainly can impact performance, but <em>only at tile boundary sizes</em>. Basically, tile quantization occurs when the size of your matrix multiplication increases such that the GPU needs to launch another &#8220;chunk&#8221; of work. For example, imagine that you could multiply 8 elements at a time with a SIMD instruction. Now, if you went from 32 elements to 33 elements (a 3% increase in problem size), you go from needing 4 SIMD instructions to 5 (a 25% increase). 
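The 8-wide SIMD example above is just ceiling division, which we can write out:

```python
import math

# Tile quantization in miniature: with an 8-wide SIMD unit, the instruction
# count only grows at multiples of 8, so crossing a boundary wastes a whole
# partially-filled instruction.
SIMD_WIDTH = 8

def simd_instructions(n_elements: int) -> int:
    """Number of 8-wide SIMD instructions needed to cover n_elements."""
    return math.ceil(n_elements / SIMD_WIDTH)

print(simd_instructions(32))  # 4 instructions, fully utilized
print(simd_instructions(33))  # 5 instructions for a ~3% larger problem
```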
Note that crucially, when tile quantization is the culprit, your absolute runtime still grows monotonically, although your efficiency may drop.</p><p>However, in our above plot, we see much more drastic performance drops! Moreover, like in Karpathy&#8217;s original example, we see that the <em>absolute runtime decreases despite problem size increasing</em>. So, tile quantization cannot be the explanation here.</p><p>The true cause is that tiling is just fundamentally worse for certain memory layouts. In other words, by the time we&#8217;re trying to execute the matmul, we&#8217;ve already lost. The memory layout is poor, and our performance will suffer.</p><p>Let&#8217;s look at some examples!</p><h3>Memory Layout of Tiling</h3><p>First, let&#8217;s think about what our matrix&#8217;s memory layout looks like when our size is a multiple of the cache line (pretend a cache line holds 4 elements). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9AzH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9AzH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 424w, https://substackcdn.com/image/fetch/$s_!9AzH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!9AzH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 1272w, https://substackcdn.com/image/fetch/$s_!9AzH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9AzH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png" width="520" height="371.55555555555554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/819f8836-c521-4363-b170-49ee039569c6_1170x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1170,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:624255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9AzH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 424w, https://substackcdn.com/image/fetch/$s_!9AzH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!9AzH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 1272w, https://substackcdn.com/image/fetch/$s_!9AzH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F819f8836-c521-4363-b170-49ee039569c6_1170x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">I choose to show 3 &#8220;cache lines&#8221; per row because our matrix logically has 12 elements per 
row.</figcaption></figure></div><p>We see that each row starts on a cache line<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Among other advantages, this means that we don&#8217;t need to perform any &#8220;unnecessary&#8221; loads to load all yellow elements. We can just load the 3 cache lines that the yellow elements are part of.</p><p>However, what happens if we increase the number of elements per row from 12 to 13? </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9BJj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9BJj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 424w, https://substackcdn.com/image/fetch/$s_!9BJj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 848w, https://substackcdn.com/image/fetch/$s_!9BJj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 1272w, https://substackcdn.com/image/fetch/$s_!9BJj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9BJj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png" width="520" height="317.14701601164484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1374,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:759750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9BJj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 424w, https://substackcdn.com/image/fetch/$s_!9BJj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 848w, https://substackcdn.com/image/fetch/$s_!9BJj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 1272w, https://substackcdn.com/image/fetch/$s_!9BJj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57fffe15-4659-44e1-b516-0944867b9f96_1374x838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Each logical row (which now has 13 elements) no longer starts aligned with a cache line.</figcaption></figure></div><p>With this unaligned layout, most rows start partway through a cache line. 
In other words, if we start loading the beginning of the green row, we <em>must</em> redundantly load the last element of the blue row as well.</p><p>Now, let&#8217;s look at what happens when we actually try to load an entire &#8220;tile&#8221; from these memory layouts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jb1C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jb1C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jb1C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jb1C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jb1C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jb1C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg" width="1456" height="623" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!jb1C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jb1C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jb1C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jb1C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf2a2033-e9bf-44b8-8961-5b50c25eda91_2032x870.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Shaded regions = elements we&#8217;re trying to load. Crossed out regions = elements we don&#8217;t need but must load due to it being in the same cache line.</figcaption></figure></div><p>With the aligned layout, this is very clean! We issue one load per row. One for the 4 blue elements, one for the 4 green elements, one for the 4 yellow elements, and one for the 4 pink elements. </p><p>With the unaligned layout, things are much messier. For example, in order to load the first 4 green elements, we must issue 2 loads! One that gets the last blue element + the first 3 green elements, and one that gets the 4th green element. A similar pattern occurs with loading the 4 yellow elements as well as the 4 pink elements.</p><p>So, when our matrix size is divisible by the cache line (which is 32 elements on a GPU), tiling fits nicely within the cache line, and our memory loads are maximally efficient. 
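To make this concrete, here&#8217;s a tiny counting model (a sketch: it assumes fp32 elements, a 32-element cache line, and a 32-wide tile; the lines_touched helper is made up for illustration, not a real profiling tool):

```python
# Count how many 32-element cache lines a [tile x tile] load touches when the
# tile starts at element (0, 0) of a row-major matrix of width N.
CACHE_LINE = 32  # elements per cache line (128 bytes of fp32)

def lines_touched(N, tile=32):
    lines = set()
    for r in range(tile):
        for c in range(tile):
            addr = r * N + c               # flat offset of element (r, c)
            lines.add(addr // CACHE_LINE)  # which cache line that element lives in
    return len(lines)

print(lines_touched(4096))  # aligned width:   32 lines, exactly one per row
print(lines_touched(4097))  # unaligned width: 63 lines, nearly every row straddles two
```

With the unaligned width, almost every row of the tile straddles a cache-line boundary, so we fetch close to twice the memory we actually need.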
When it&#8217;s not&#8230; the kernel needs many more workarounds in order to end up with the proper alignment.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>This is why even very small changes in our matrix size can lead to substantially worsened performance.</p><h2>Wave Quantization</h2><p>Ok, so we&#8217;ve understood most of the variation in matmul performance. But what about these strange stripes up here? All of these points are matmuls whose sizes are already divisible by 32. Seeing that the peaks are separated by 256, our first guess might be that this is also memory-layout related, just at a larger scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qSyi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qSyi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 424w, https://substackcdn.com/image/fetch/$s_!qSyi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 848w, https://substackcdn.com/image/fetch/$s_!qSyi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qSyi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qSyi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png" width="844" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:844,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!qSyi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 424w, https://substackcdn.com/image/fetch/$s_!qSyi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 848w, https://substackcdn.com/image/fetch/$s_!qSyi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qSyi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040453ca-4ac6-4275-bc84-7765a0bc70c4_844x466.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Some truly mysterious patterns&#8230;.</figcaption></figure></div><p>However, as it turns out, these peaks (2944 and 3120) do <em>not </em>occur when the matrix shapes are divisible by 256, but instead they&#8217;re at 128 mod 256! 
</p><p>As it turns out, these peaks are not caused by poor memory layouts; they&#8217;re instead caused by a (neatly-named) phenomenon called <em>wave quantization</em>.</p><p>The main idea behind wave quantization is quite simple. </p><p>Let&#8217;s say we have N parallel tasks (which each take a second) and N CPUs. <br>Q: How long does it take to perform all tasks?<br>A: 1 second</p><p>Q: What about if we have (N+1) parallel tasks, and N CPUs?<br>A: 2 seconds(!) Now, one CPU must perform two tasks, taking a total of 2 seconds.</p><p>So, despite adding just one task, we&#8217;ve doubled our overall latency.</p><p>This is exactly what wave quantization is, except with CPUs =&gt; SMs and tasks =&gt; thread blocks.</p><p>As your matrix size increases, the total number of tiles/blocks increases. When this crosses a multiple of the # of SMs, your perf drops since you need to execute an additional "wave".</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fQa1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fQa1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 424w, https://substackcdn.com/image/fetch/$s_!fQa1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 848w, https://substackcdn.com/image/fetch/$s_!fQa1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 
1272w, https://substackcdn.com/image/fetch/$s_!fQa1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fQa1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png" width="482" height="247.17948717948718" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:390,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!fQa1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 424w, https://substackcdn.com/image/fetch/$s_!fQa1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 848w, https://substackcdn.com/image/fetch/$s_!fQa1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fQa1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efc0b0a-93c1-477f-9e88-f59a74cfe6a5_390x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Taken from <a href="https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/">here</a></figcaption></figure></div><p>Now, let's apply our newfound knowledge to actually explain these curves! Let&#8217;s try looking at this sudden drop in performance around 1792 first.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jeez!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jeez!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 424w, https://substackcdn.com/image/fetch/$s_!jeez!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 848w, https://substackcdn.com/image/fetch/$s_!jeez!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 1272w, https://substackcdn.com/image/fetch/$s_!jeez!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jeez!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png" width="244" height="429.5299539170507" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:764,&quot;width&quot;:434,&quot;resizeWidth&quot;:244,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!jeez!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 424w, https://substackcdn.com/image/fetch/$s_!jeez!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 848w, https://substackcdn.com/image/fetch/$s_!jeez!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 1272w, https://substackcdn.com/image/fetch/$s_!jeez!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d21d15-54c6-44ad-a522-1fa775badc59_434x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Since wave quantization depends a lot on the actual kernel parameters, we must look at what kernels are actually being run.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ceIo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ceIo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 424w, https://substackcdn.com/image/fetch/$s_!ceIo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 848w, https://substackcdn.com/image/fetch/$s_!ceIo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ceIo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ceIo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png" width="1456" height="173" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:173,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" 
srcset="https://substackcdn.com/image/fetch/$s_!ceIo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 424w, https://substackcdn.com/image/fetch/$s_!ceIo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 848w, https://substackcdn.com/image/fetch/$s_!ceIo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ceIo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f276b8-3a39-418f-b586-8c016abb4285_2532x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using the profiler, we see that we&#8217;re running a CUTLASS-based matmul with a tile size of 256x128. Note that our matmul kernel <em>doesn&#8217;t change at all</em>, but our perf drops from 60+ TF/s at N=1791 to 43 TF/s at N=1793. </p><p>Now, some basic arithmetic. Our tile grid has dimensions 1792/256 = 7 and 1792/128 = 14. That gives us 7 * 14 = 98 tiles. Since an A100 has 108 SMs, that&#8217;s still one wave. However, with N=1793 we need to increase the size of our grid: (7+1)*(14+1) = 120 tiles, or 2 waves!</p><p>Now, let&#8217;s look at the previous (mysterious) stripes. 
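As a sanity check, that wave arithmetic fits in a couple of lines (a sketch using the 256x128 tile and 108 SMs from the profile; the waves helper is a made-up name):

```python
import math

# Number of "waves" needed for an [N x N] output split into tile_m x tile_n
# tiles, scheduled across the given number of SMs (108 on an A100).
def waves(N, tile_m=256, tile_n=128, sms=108):
    tiles = math.ceil(N / tile_m) * math.ceil(N / tile_n)
    return math.ceil(tiles / sms)

print(waves(1792))  # 7 * 14 = 98 tiles  -> 1 wave
print(waves(1793))  # 8 * 15 = 120 tiles -> 2 waves
```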
Specifically, we&#8217;ll look at N=3200.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pblj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pblj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 424w, https://substackcdn.com/image/fetch/$s_!pblj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 848w, https://substackcdn.com/image/fetch/$s_!pblj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 1272w, https://substackcdn.com/image/fetch/$s_!pblj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pblj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png" width="476" height="262.81516587677726" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:844,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!pblj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 424w, https://substackcdn.com/image/fetch/$s_!pblj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 848w, https://substackcdn.com/image/fetch/$s_!pblj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 1272w, https://substackcdn.com/image/fetch/$s_!pblj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c110a2b-9845-4305-b9f0-40c2d61b7831_844x466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Mysterious no more!</figcaption></figure></div><p>Profiling it, we see that the <em>proximal</em> cause is not actually wave quantization. Instead, CuBLAS decided to  change algorithms. 
But, why did CuBLAS decide to change algorithms?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zEMy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zEMy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 424w, https://substackcdn.com/image/fetch/$s_!zEMy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 848w, https://substackcdn.com/image/fetch/$s_!zEMy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 1272w, https://substackcdn.com/image/fetch/$s_!zEMy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zEMy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:&quot;Image&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!zEMy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 424w, https://substackcdn.com/image/fetch/$s_!zEMy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 848w, https://substackcdn.com/image/fetch/$s_!zEMy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 1272w, https://substackcdn.com/image/fetch/$s_!zEMy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ccbbbd2-2f3f-46a1-b542-e575decf2f6d_1792x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Well, (3200/128) * (3200/128) = 625. 625/108 = 5.8 waves. Thus, at N=3232 we would create another wave.</p><p>In this case, though, it seems that 160x128 still isn't a great tile size. 
The resulting grid (26x21) still comes out to 5.05 waves...</p><p>Well, CuBLAS isn't perfect!</p><p>Beyond the obvious matrix multiplication shape issues, performance loss due to wave quantization often ends up being tricky to find, since it depends upon things like the batch size as well. However, if you take a closer look at each matmul, you might find that there&#8217;s another 10-15% of performance you can squeeze out by choosing the shapes more carefully!</p><p>I will note that it&#8217;s possible that wave quantization effects may soon be a thing of the past. New matrix multiplication techniques like <a href="https://arxiv.org/abs/2301.03598">stream-k</a> allow us to completely bypass wave quantization. Perhaps I&#8217;ll explain the basic idea behind matmul implementation strategies someday.</p><h2>Why doesn&#8217;t torch.compile just fix my problems so I don&#8217;t have to think about this?</h2><p>As it turns out, torch.compile does try to pad your matmuls to have the right shape! See the code <a href="https://github.com/pytorch/pytorch/blob/6b1f13ea2f3b1bcd575620eecd7d84a4d2e3eb76/torch/_inductor/fx_passes/pad_mm.py#L90">here</a>, or try this benchmark.</p><pre><code>import torch
torch.set_default_device('cuda')
from triton.testing import do_bench

def f(a, b):
    return torch.mm(a, b)

a = torch.randn(4096, 4096, dtype=torch.bfloat16)
b = torch.randn(4096, 4095, dtype=torch.bfloat16)
print("eager: ", do_bench(lambda: f(a, b)))
cf = torch.compile(f)
print("compiled: ", do_bench(lambda: cf(a, b)))
&gt;&gt; eager: 1.4077268838882446
&gt;&gt; compiled: 0.6021425127983093</code></pre><p>However, there are still limitations that mean it makes sense for users to manually pad their shapes. </p><p>For one, padding requires a full copy! Although torch.compile can often fuse this into a preceding op, in the case where the matrix being padded comes from the input (like a weight matrix), there&#8217;s no way to avoid this copy.</p><p>Second, resolving wave quantization is far more difficult.</p><h2>Conclusion</h2><p>Overall, I hope the topic of "how do I squeeze the most out of my matmuls" is an interesting one. There are still many more intricacies in matmul perf that I didn&#8217;t have the time to get to, as well as (I&#8217;m sure) many more intricacies that I don&#8217;t know! Here&#8217;s the <a href="https://gist.github.com/Chillee/f86675147366a7a0c6e244eaa78660f7#file-4-matmul-bench-py">main code</a> to replicate the results.</p><p>Also, here are some quiz questions to test your understanding! I will publish a brief explanation of the answers at some later point.</p><h3>Quiz Questions</h3><p><strong>1:</strong> Let's say I have a <code>[M x K] @ [K x N]</code> matmul. Which one of these configurations will have the best perf? Think about the actual ramifications of tiling! Both matrices are in row-major layout (i.e. K and N are the innermost dimensions).<br>A: M=2047, K=N=2048 <br>B: K=2047, M=N=2048 <br>C: N=2047, M=K=2048</p><div class="poll-embed" data-attrs="{&quot;id&quot;:161737}" data-component-name="PollToDOM"></div><p><strong>2: </strong>Let&#8217;s say I have an A100 with 108 SMs, and I want to benchmark a number of matmuls with no wave quantization. 
How would I go about constructing the shapes for these matmuls?</p><p><strong>3: </strong>Based on this post, would you expect that making your batch size a power of 2 leads to more efficient performance?</p><div class="poll-embed" data-attrs="{&quot;id&quot;:161740}" data-component-name="PollToDOM"></div><p><strong>4: </strong>Similar to Question 1, let&#8217;s say we have an A: [M x K] @ B: [K x N] matmul. However, now A is in column-major (i.e. <code>torch.randn(K, M).t()</code>) while B is still row-major. What is the best configuration now?<br><br>A: M=2047, K=N=2048 <br>B: K=2047, M=N=2048 <br>C: N=2047, M=K=2048</p><p><strong>5: </strong>Let&#8217;s say that we have this code.</p><pre><code>A = torch.randn(4096, 4096)
B = torch.randn(4096, 4096)
B = B[:, :4095] # B now has shape [4096, 4095]</code></pre><p>Would you expect that we have good performance on a matmul between A and B?</p><p>Solutions can be found below<br></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1b75e369-7d31-4ed7-858b-2db589bdc13c&quot;,&quot;caption&quot;:&quot;Note: The answer to question 1 is publicly available, but the answers to the rest are paywalled. However, if you write up your solutions to each question and message me, I&#8217;ll send you the answer key for free. Question 1 Let's say I have a [M x K] @ [K x N]&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Answer Key: What Shapes Do Matrix Multiplications Like?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:1514868,&quot;name&quot;:&quot;Horace He&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc8ac060-6949-4a91-b6b5-88460efd08bc_144x144.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-08T14:43:31.466Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bd02621-bbe2-44c9-8fa3-4809a5087400_1828x600.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.thonking.ai/p/answer-key-what-shapes-do-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143205705,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Thonk From First 
Principles&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55e3b22f-cc6b-438a-be3d-8d17cc97c2f9_750x750.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.thonking.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thonk From First Principles is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>A cache line is a block of memory that&#8217;s usually something like 128 bytes long, although in our examples, we&#8217;re pretending that it&#8217;s 4 elements long. You can pretend that a cache line is the &#8220;minimum memory access size&#8221;. 
In other words, in order to load any of the elements from a cache line, you must load the entire cache line.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In fact, earlier versions of CuBLAS (back when tensor cores were new) didn&#8217;t even <em>use</em> tensor-cores unless the shapes were divisible by 8.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Supporting Mixtral in gpt-fast through torch.compile [short] ]]></title><description><![CDATA[Long-form version of this tweet thread: https://twitter.com/cHHillee/status/1762269069351461196]]></description><link>https://www.thonking.ai/p/short-supporting-mixtral-in-gpt-fast</link><guid isPermaLink="false">https://www.thonking.ai/p/short-supporting-mixtral-in-gpt-fast</guid><dc:creator><![CDATA[Horace He]]></dc:creator><pubDate>Mon, 26 Feb 2024 21:09:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fQsW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>About 2 months after the work was actually done, we finally merged mixtral support into gpt-fast! Check the code out here: <a href="https://github.com/pytorch-labs/gpt-fast/tree/main/mixtral-moe">https://github.com/pytorch-labs/gpt-fast/tree/main/mixtral-moe </a></p><p>Featuring:</p><ul><li><p>(!) 
no custom kernels</p></li><li><p>int8 and tensor-parallelism support</p></li><li><p>still very simple (&lt;150 LOC to support)</p></li><li><p><strong>faster decoding</strong> than any (non-Groq) API endpoint, at up to 220 tok/s/user with A100s.</p></li></ul><p>Bonus: Running on a H100 node, we can get closer to 300 tok/s/user!</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;00e7d7bc-f403-4110-b8ee-0e75fe51195b&quot;,&quot;duration&quot;:null}"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fQsW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fQsW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 424w, https://substackcdn.com/image/fetch/$s_!fQsW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 848w, https://substackcdn.com/image/fetch/$s_!fQsW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 1272w, https://substackcdn.com/image/fetch/$s_!fQsW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fQsW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png" width="1456" height="811" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a01ed310-0512-49a6-8555-904455e36202_3212x1790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1290872,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fQsW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 424w, https://substackcdn.com/image/fetch/$s_!fQsW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 848w, https://substackcdn.com/image/fetch/$s_!fQsW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 1272w, https://substackcdn.com/image/fetch/$s_!fQsW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01ed310-0512-49a6-8555-904455e36202_3212x1790.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Faster decoding than any non-Groq endpoint! (at Feb 26th 2024)</figcaption></figure></div><p>We thought it might be interesting to talk a bit about why this is a bit tricky and how we solved it.</p><h3>Mixture of Experts (MoE) vs. Dense Transformers</h3><p>Unlike Llama, Mixtral is a sparse architecture. The main idea here is that instead of a single dense layer with eight thousand parameters, we split it into 8 &#8220;experts&#8221;, where each expert is a dense layer with only one thousand parameters. Then, depending on some dynamic information, we only &#8220;activate&#8221; 2 out of the 8 experts for each layer.</p><p>Morally, this is easy to implement for each token. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Fqb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Fqb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 424w, https://substackcdn.com/image/fetch/$s_!6Fqb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 848w, https://substackcdn.com/image/fetch/$s_!6Fqb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 1272w, https://substackcdn.com/image/fetch/$s_!6Fqb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Fqb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png" width="1302" height="894" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228990,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Fqb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 424w, https://substackcdn.com/image/fetch/$s_!6Fqb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 848w, https://substackcdn.com/image/fetch/$s_!6Fqb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 1272w, https://substackcdn.com/image/fetch/$s_!6Fqb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9505e92e-752a-471c-9b5c-856cbf74c760_1302x894.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Crucially, note that out of the 8 experts that constitute our weights, 6 of them do nothing for any token, making mixture of experts a &#8220;sparse&#8221; model.</p><p>For those who have experience with performant PyTorch, you may be wincing. Using a tensor to index into a Python list is a cardinal performance sin (it induces a CUDA sync, where the CPU waits for the GPU).</p><p>This is the difficulty with running Mixtral efficiently.</p><p>The runtime advantage of MoE comes from dynamic sparsity. But if this dynamism isn't handled efficiently, you might end up slower than if you had no sparsity to begin with.</p><h3>Moving the Dynamism &#8220;onto&#8221; the GPU</h3><p>Luckily, there&#8217;s another option that works well for BS=1 decoding. Instead of doing the indexing in <em>Python, </em>let&#8217;s do the indexing on the GPU. 
In other words, let&#8217;s do the indexing using a &#8220;gather&#8221; operation.</p><p>A gather operation occurs when you decide to load from a tensor using another tensor. In PyTorch, this is often done using what&#8217;s called &#8220;advanced indexing&#8221;. For example:</p><pre><code>primes = torch.tensor([2, 3, 5, 7, 11])
b = torch.tensor([1, 3, 0])
primes[b] # tensor([3, 7, 2])</code></pre><p>So, putting it all together, our full MoE layer looks morally like this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4fyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4fyL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 424w, https://substackcdn.com/image/fetch/$s_!4fyL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 848w, https://substackcdn.com/image/fetch/$s_!4fyL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 1272w, https://substackcdn.com/image/fetch/$s_!4fyL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4fyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png" width="1302" height="744" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:206874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4fyL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 424w, https://substackcdn.com/image/fetch/$s_!4fyL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 848w, https://substackcdn.com/image/fetch/$s_!4fyL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 1272w, https://substackcdn.com/image/fetch/$s_!4fyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f131d9a-e7d8-42bc-b88f-8d96367174c0_1302x744.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The full FFN layer looks a bit different since we need to handle multiple tokens as well as multiple FFNs in a row, but it&#8217;s morally the same idea. 
Note: this is the primary implementation difference between regular dense transformers and Mixture of Experts!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hvd6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hvd6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 424w, https://substackcdn.com/image/fetch/$s_!hvd6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 848w, https://substackcdn.com/image/fetch/$s_!hvd6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 1272w, https://substackcdn.com/image/fetch/$s_!hvd6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hvd6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png" width="1456" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hvd6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 424w, https://substackcdn.com/image/fetch/$s_!hvd6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 848w, https://substackcdn.com/image/fetch/$s_!hvd6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 1272w, https://substackcdn.com/image/fetch/$s_!hvd6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80a2e11a-5315-4f3c-b2a2-81cc0b58fb5b_1782x688.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This implementation approach has two main advantages - it doesn't require any synchronizations and only uses weights required for computation. However, if we run this normally, we have another issue. Both the gather operation and the actual linear layer itself require us to touch DRAM with all of the weights we&#8217;re using. This is a factor of 3 slowdown.</p><p>Luckily, PyTorch now has a compiler.</p><h3>Torch.compile to the rescue!</h3><p>Torch.compile can <em>fuse</em> the gather + gemv into one kernel, allowing us to obtain our theoretical speedups.</p><p>If you&#8217;re curious to look at the Triton kernel generated by torch.compile, you can see it <a href="https://pastebin.com/EM8k7bUG">here</a>.</p><p>Concretely, this is the indirect access/gather. Bolding added to emphasize the main operations involved.</p><pre><code><strong>tmp0 = tl.load(in_ptr0 + (r2 + (4096*x1)), None, eviction_policy='evict_last').to(tl.float32)</strong>
tmp1 = tmp0.to(tl.float32)
tmp3 = tmp2 + 8
tmp4 = tmp2 &lt; 0
<strong>tmp5 = tl.where(tmp4, tmp3, tmp2)</strong>
tmp6 = tl.load(in_ptr2 + (r2 + (4096*(x0 % 14336)) + (58720256*tmp5)), None, eviction_policy='evict_first')
tmp7 = tmp6.to(tl.float32)
tmp8 = tl.load(in_ptr3 + (<strong>(14336*tmp5)</strong> + (x0 % 14336)), None, eviction_policy='evict_first').to(tl.float32)</code></pre><p>Let&#8217;s also validate the performance with a <a href="https://gist.github.com/Chillee/f86675147366a7a0c6e244eaa78660f7#file-5-moe-poc-py">benchmark</a>.</p><pre><code>def cuda_indexing(W, score_idxs, x):
    return W[score_idxs] @ x

def python_indexing(W, score_idxs, x):
    return W[score_idxs[0]] @ x, W[score_idxs[1]] @ x

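# cuda_indexing selects both experts' weights with a single (GPU-side) gather
# and one batched matmul; python_indexing issues a separate indexed matmul per
# expert, driven from Python.
# E and D are not defined in the original snippet; hypothetical example values:
E, D = 8, 4096  # e.g. 8 experts, hidden dimension 4096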
W = torch.randn(E, D, D)
x = torch.randn(D)
score_idxs = torch.tensor([3, 5])</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1DpT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1DpT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 424w, https://substackcdn.com/image/fetch/$s_!1DpT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 848w, https://substackcdn.com/image/fetch/$s_!1DpT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 1272w, https://substackcdn.com/image/fetch/$s_!1DpT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1DpT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin" width="1456" height="868" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Output image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Output image" title="Output image" srcset="https://substackcdn.com/image/fetch/$s_!1DpT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 424w, https://substackcdn.com/image/fetch/$s_!1DpT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 848w, https://substackcdn.com/image/fetch/$s_!1DpT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 1272w, https://substackcdn.com/image/fetch/$s_!1DpT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb529e91b-ffdb-49bc-96ef-efb95caaf7bf_1980x1180.bin 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Note that at smaller dimension sizes, CUDA indexing + torch.compile substantially outperforms Python indexing. At larger dimensions the overhead matters less and less, but torch.compile still stays quite performant.</figcaption></figure></div><h3>End to End Benchmarks</h3><p>Combining all of this into an E2E benchmark, we see that with int8 quantization on a single A100, we run at 98 tok/s. Note that if this were a dense model, we would effectively be running at 4.55 TB/s of bandwidth, which is higher than the theoretical limit!</p><p>Of course, combining it with tensor-parallelism, we can get up to 280 tok/s!
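</p><p><em>As a sanity check, here is the back-of-envelope arithmetic behind the single-GPU bandwidth claim. Only the 98 tok/s and 4.55 TB/s figures come from the measurements above; the weight size and the A100 HBM peak below are assumptions (the former back-solved from those two figures):</em></p><pre><code># Back-of-envelope check on the "dense-equivalent" bandwidth claim.
# At BS=1, decoding is memory-bound: each generated token must stream its
# weights through HBM, so effective bandwidth = bytes_per_token * tok/s.

tok_per_s = 98               # measured: int8, single A100
weight_bytes = 46.4e9        # assumption: ~46.4 GB of int8 weights
effective_bw = tok_per_s * weight_bytes

print(f"{effective_bw / 1e12:.2f} TB/s")   # ~4.55 TB/s dense-equivalent

# An A100's HBM peak is roughly 2 TB/s (assumption: 80GB SXM variant), so a
# dense model could never reach this number. A MoE only reads the routed
# experts' weights per token, which is how the dense-equivalent figure can
# exceed the hardware's actual bandwidth.
a100_hbm_peak = 2.0e12
print(effective_bw > a100_hbm_peak)        # True: only possible via sparsity</code></pre><p>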
Note that this is all for BS=1, so this is &#8220;tok/s/user for only output tokens&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wKPH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wKPH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 424w, https://substackcdn.com/image/fetch/$s_!wKPH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 848w, https://substackcdn.com/image/fetch/$s_!wKPH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 1272w, https://substackcdn.com/image/fetch/$s_!wKPH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wKPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png" width="946" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:946,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45275,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wKPH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 424w, https://substackcdn.com/image/fetch/$s_!wKPH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 848w, https://substackcdn.com/image/fetch/$s_!wKPH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 1272w, https://substackcdn.com/image/fetch/$s_!wKPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08c1597-376c-4ae7-b82e-c631e502c4bb_946x260.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Considering that this is a higher tok/s than any (non-Groq) API provider serves, we think this is pretty impressive.</p><p>Moreover, since we codegen into Triton, we should be able to run on AMD as well. We&#8217;ll update this post when we get those results.</p><p>Of course, we will also mention the typical gpt-fast caveats. This is optimized for latency and not throughput. In this particular case, the strategy we use for BS=1 codegen scales <em>very poorly</em> to larger batch sizes.</p><p>Nevertheless, we think that this continues to demonstrate the gpt-fast ethos.
Simple, native PyTorch, and very fast!</p>]]></content:encoded></item></channel></rss>