Discussion about this post

Tanj:

I believe there was a line somewhere on the presentation identifying the Bluefield module as having RAS responsibilities, which totally makes sense.

Looking at the rack design, no one in their right mind is going to spend serious money putting B100 into H100 installations unless Nvidia puts them on some miserable allocation. Which they won't, because Nvidia wants to own the design space for the servers. And they have crushed it with that networking design, putting 72 chips on an all-to-all connection of the same quality that just 8 chips get in the DGX architecture for H100.

Besides, clouds hate to upgrade servers in the field. It causes chaos, potentially destabilizing a productive resource still earning big bucks. That upgrade model mostly makes sense for quickly piloting some Blackwell platforms to get the software tuned, not full production, while the DC guys (who most likely have been under NDA with the general requirements for planning) finish building the real thing.

That networking is amazing. 72 chips on an all-to-all over 1m of microcoax and 2 sockets in the line, with no retimers, at 1.8 TB/s per GPU (900 GB/s each way). Competing with that for leading-edge training will be tough. Nvidia has set a really high bar.
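The bandwidth figures above can be sanity-checked with quick arithmetic. This is a rough sketch using the numbers as stated in the comment (900 GB/s each way per GPU, 72 GPUs); the aggregate figure is an implication of those numbers, not a quote from the presentation.

```python
# Back-of-envelope check of the NVL72 link numbers (figures as assumed above).
gpus = 72
per_direction_gb_s = 900                      # GB/s each way per GPU, as stated

bidir_tb_s = 2 * per_direction_gb_s / 1000    # per-GPU bidirectional: 1.8 TB/s
aggregate_tb_s = gpus * bidir_tb_s            # rack-wide all-to-all: ~129.6 TB/s

print(f"per GPU: {bidir_tb_s} TB/s, rack aggregate: {aggregate_tb_s:.1f} TB/s")
```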

Tanj:
I understood the RAS thing to more likely be a self-test mode than continuous monitoring. Likely you take the GPU offline for a short interval (let others handle the work), then bring it back when everything checks out. Only IBM gets close to continuous monitoring, in its Z-series mainframes, and it carries significant overhead.

AI data centers are mostly greenfield anyway. The new DC capacity required is larger than the opportunities to recycle old ones (remember how much smaller the cloud was when those few old centers were built?). The clouds have been working on high power densities for a while. They can even retrofit some of the older facilities, though 2MW is a unit size for a section of a data center set largely by equipment in the industrial electrical industry, things like redundant distribution switches and the size of diesel generators. Jensen cited 2000 B200s needing 4MW to run training (not just the GPU chips, but all-in as a supercomputer with networking, storage, etc.), but individual racks at 120kW have been on the road map for years at data centers, and getting to clusters up to 8MW will likely fit existing infrastructure patterns, just with a more concentrated floor plan.
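The power figures above imply a few quick ratios worth making explicit. This sketch just works the arithmetic on the numbers as cited (4MW for 2000 B200s all-in, 120kW racks, 8MW clusters); the derived per-GPU and per-cluster figures are implications, not quoted specs.

```python
# Rough power arithmetic from the figures cited above (all assumed/rounded).
cluster_kw = 4_000          # 4 MW, all-in, for the cited training cluster
gpus = 2_000                # B200 count in that cluster
rack_kw = 120               # per-rack power on the road map
section_kw = 8_000          # 8 MW cluster/section target

kw_per_gpu = cluster_kw / gpus        # ~2 kW per GPU, all-in (network, storage, ...)
racks_per_section = section_kw // rack_kw   # ~66 racks in an 8 MW section

print(f"{kw_per_gpu} kW/GPU all-in, {racks_per_section} racks per 8 MW section")
```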
