GTC 2024 Keynote Irrational Recap
Jensen wipes out countless competitors, from startups to cloud hyperscaler internal projects.
IMPORTANT:
Irrational Analysis is heavily invested in the semiconductor industry.
Please check the ‘about’ page for a list of active positions.
Positions will change over time and are regularly updated.
Opinions are the author's own and do not represent past, present, or future employers.
All content published on this newsletter is based on public information and independent research conducted since 2011.
This newsletter is not financial advice and readers should always do their own research before investing in any security.
The only reason I am not buying more NVDA is that it is already half my portfolio. Yes, I know this is not healthy. I still refuse to sell a single share.
Everyone has biases. You know mine. Make up your own mind.
I believe that on March 18th, 2024 Jensen Huang (praise be upon him) obliterated the hopes and dreams of nearly every AI chip startup and several cloud hyperscaler internal projects.
Timestamps relative to the above video.
[31:00] “No memory locality issues. 10 TB/s D2D bandwidth”
He is claiming that there are no NUMA problems. Software sees one logical chip with unified memory and latency models. Big if true. Need to see independent benchmarks to verify.
[35:00] "Recast numerical formats dynamically in the new transformer engine.”
Interesting detail but need to wait for a uarch whitepaper to understand fully. Website blurbs don’t say much.
[36:00] “We now have a 5th generation NVLink, twice as fast as Hopper, and has [new] computation in the network.”
Current-generation Hopper has all-reduce within InfiniBand switches. Looks like NVLink is getting that plus expanded functionality: significantly more compute in the network. Wish there were more details, but oh well.
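To make the idea concrete, here is a toy, purely conceptual sketch of what in-network reduction means (the lists and numbers are illustrative, not anything Nvidia published): the switch sums gradient contributions in flight and hands every GPU the same result, instead of the GPUs shuffling data among themselves.

```python
# Conceptual view of in-network all-reduce: the "switch" reduces each
# element across all GPU contributions once, then every GPU receives the
# same summed tensor. Toy data; real systems reduce in switch hardware.
gpu_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one list per GPU

# Element-wise reduction performed "in the network"...
reduced = [sum(vals) for vals in zip(*gpu_gradients)]

# ...and multicast back, so all GPUs end up with identical results.
all_reduced = [list(reduced) for _ in gpu_gradients]
print(reduced)  # [9.0, 12.0]
```

The win is that each GPU sends its data once and receives one result, rather than participating in multiple ring or tree exchange steps.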
[37:00] "Detect weak chip/node. Reliability. New RAS (reliability) engine does 100% self-test, 100% bit test of memory [SRAM] on the Blackwell chip and all the [HBM] memory connected to it. Almost as if we ship each chip with its own advanced tester. First time we have done this."
Jensen is claiming that a new on-die block, the RAS engine, can continuously monitor the health of every single bit of SRAM and HBM, just like a full ATE probe system.
ATE probing machines exist to check the health of individual silicon die on the wafer before packaging. Packaging is expensive, so you need to use known-good die.
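The known-good-die economics are easy to see with a little yield arithmetic. The numbers below are illustrative, not actual Blackwell yields: with two dies per package, skipping die-level test multiplies each die's defect risk into the package.

```python
# Why known-good die (KGD) testing matters before advanced packaging.
# Illustrative numbers only; actual yields are not public.
die_yield = 0.85          # assumed probability a given die is good
dies_per_package = 2      # Blackwell packages two compute dies

# Packaging untested dies: the whole expensive package is scrap if
# either die is bad, so yields multiply.
blind_package_yield = die_yield ** dies_per_package
print(round(blind_package_yield, 4))  # 0.7225

# With KGD testing, only good dies enter packaging, so (ignoring
# assembly defects) package yield approaches 1.0.
```

The worse the per-die yield or the more dies per package, the more the blind-assembly yield collapses, which is why die-level test is non-negotiable for multi-die products.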
This is incredible. Almost too good to be true. RAS is the coolest, most technically impressive innovation from the entire keynote.
So… is each of the 72 ports two lanes of 212G-class SerDes? Is Nvidia first to market with 212G?
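A back-of-envelope check on that question. The assumptions here are mine, not Nvidia's: 18 NVLink links per GPU, and ~200 Gb/s effective per lane for a 212G-class SerDes after encoding overhead. Under those assumptions, two lanes per link per direction ties out exactly with the quoted 1.8 TB/s per GPU.

```python
# Does "two lanes of 212G-class SerDes per link" tie out with the quoted
# 1.8 TB/s per GPU? Assumed: 18 NVLink links per GPU, ~200 Gb/s
# effective per lane (212G-class raw, minus encoding overhead).
LINKS_PER_GPU = 18
LANES_PER_LINK = 2
EFFECTIVE_LANE_GBPS = 200

link_gbytes_one_way = LANES_PER_LINK * EFFECTIVE_LANE_GBPS / 8  # 50 GB/s
gpu_tbytes_bidir = LINKS_PER_GPU * link_gbytes_one_way * 2 / 1000
print(gpu_tbytes_bidir)  # 1.8 (TB/s, bidirectional)
```

If the assumptions hold, each link is 50 GB/s per direction and the 72-port switch figure follows from four link-ports per GPU-facing connection; but without a uarch whitepaper this is just arithmetic, not confirmation.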
[45:00-49:00] Direct drive to copper. No DSP/re-timers. No optics. 5,000 NVLink cables, two miles in total. Passive copper cables save 20 kW per Blackwell DGX. Rack is 120 kW. Water goes in at 25C and the hot outlet is at 45C, at 2 liters per second.
So this is very impressive, but I have to be a party pooper on this one. A 20C delta between inlet and outlet water is insane. 120 kW in a single rack is insane from a thermal-density perspective. Power delivery is doable, but basically no existing facility can handle this system: too thermally dense, and the water heat gradient is too high for the backend datacenter cooling system to handle. DGX Blackwell is very cool but needs fully custom-built greenfield datacenters.
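The coolant numbers can be sanity-checked with basic calorimetry. The flow rate and temperatures are from the keynote; the water properties are standard physical constants.

```python
# Can 2 L/s of water with a 25C-in / 45C-out delta actually carry away a
# 120 kW rack's heat? Q = m_dot * c_p * delta_T.
FLOW_L_PER_S = 2.0
DELTA_T_K = 45 - 25            # 20 K rise, per the keynote
WATER_DENSITY_KG_PER_L = 1.0
WATER_CP_J_PER_KG_K = 4186     # specific heat of water

mass_flow_kg_s = FLOW_L_PER_S * WATER_DENSITY_KG_PER_L
heat_kw = mass_flow_kg_s * WATER_CP_J_PER_KG_K * DELTA_T_K / 1000
print(round(heat_kw, 1))  # ~167.4 kW of heat removal capacity
```

So the loop can carry roughly 167 kW, comfortably above the 120 kW rack load; the numbers are self-consistent. The problem is not the math but the facility: few existing datacenter water loops are designed for a 45C return temperature.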
Jensen claims they have a compiler that can intelligently search through a massive configuration space (tensor/pipeline/expert/data slice) and optimize model inference for throughput and user latency.
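A toy sketch of the kind of search being described: enumerate every (tensor, pipeline, expert, data) parallelism split of a fixed GPU count and pick the best under a cost model. Everything below is made up for illustration: the cost model, its coefficients, and the candidate split sizes. The real compiler would search with measured kernel and network costs, not a three-term formula.

```python
# Brute-force the (tp, pp, ep, dp) parallelism configuration space for a
# fixed GPU count. The cost model is entirely illustrative.
from itertools import product

NUM_GPUS = 72

def toy_latency(tp, pp, ep, dp):
    # Hypothetical per-token latency: compute shrinks as the model is
    # split across tp*pp GPUs, but tensor parallelism adds all-reduce
    # cost, pipelining adds bubbles, expert parallelism adds all-to-all.
    return 10.0 / (tp * pp) + 0.2 * tp + 0.3 * (pp - 1) + 0.1 * ep

candidates = [
    (tp, pp, ep, dp)
    for tp, pp, ep, dp in product(
        (1, 2, 4, 8), (1, 2, 3, 6), (1, 2, 4, 8), range(1, NUM_GPUS + 1)
    )
    if tp * pp * ep * dp == NUM_GPUS  # every GPU must be used exactly once
]
best = min(candidates, key=lambda c: toy_latency(*c))
print(best)  # (4, 3, 1, 6) under this toy cost model
```

Even this toy version shows why a compiler is needed: the valid splits interact (the product must equal the GPU count) and the optimum moves as soon as any cost coefficient changes, so hand-picking a configuration per model and per latency target does not scale.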
Ball is in your court, Groq! 🤡
This chart is frankly misleading. The uplift (vertical green line) should be on the far left. Measure from the peak of the blue line (~30 tokens/sec/GPU) to the green line (~135 tokens/sec/GPU) at comparable latency: that works out to a ~4.5x performance uplift from Hopper to Blackwell. Curious what the performance of B200 FP4 would be. Somewhere between the red and green lines, presumably.
Ah yes, the moat just got 10 feet deeper.
I believe there was a line somewhere in the presentation identifying the Bluefield module as having RAS responsibilities, which totally makes sense.
Looking at the rack design, no one in their right mind is going to spend serious money putting B100 into H100 installations unless Nvidia puts them on some miserable allocation. Which they won't, because Nvidia wants to own the design space for the servers. And they have crushed it with that networking design, putting 72 chips on an all-to-all connection of the same quality that just 8 chips get in the DGX architecture for H100.
Besides, clouds hate to upgrade servers in the field. It causes chaos, potentially destabilizing a productive resource still earning big bucks. That upgrade model mostly makes sense for quickly piloting some Blackwell platforms to get the software tuned, not full production, while the DC guys (who most likely have been under NDA with the general requirements for planning) finish building the real thing.
That networking is amazing. 72 chips all-to-all over 1m of micro-coax with 2 sockets in the line and no retimers, at 1.8 TB/s (900 GB/s each way). Competing with that for leading-edge training will be tough. Nvidia has set a really high bar.
I understood the RAS thing to more likely be a self-test mode than continuous monitoring. Likely you offline the GPU for a short interval (let others handle the work), then bring it back when all checks out. Only IBM gets close to continuous monitoring, in their Z-series mainframes, and it carries significant overhead.
AI data centers are mostly greenfield anyway. The new DC capacity required is larger than the opportunity to recycle old ones (remember how much smaller the cloud was when those few old centers were built?). The clouds have been working on high power densities for a while. They can even retrofit some older facilities, though 2 MW is a unit size for a section of a data center, set largely by equipment in the industrial electrical industry: things like redundant distribution switches and the size of diesel generators. Jensen cited 2000 B200s needing 4 MW to run training (not just the GPU chips, but all-in as a supercomputer with networking, storage, etc.), but individual racks at 120 kW have been on the roadmap for years at data centers, and getting to clusters up to 8 MW will likely fit existing infrastructure patterns, just with a more concentrated floor plan.