Oh boy, I have so many thoughts ... Let me see if I can organize it sensibly.
To start out, Jim Keller is one reason not to count out Tenstorrent too soon, at least for technical reasons (business/market issues are different). As you said he saved AMD. Not once but twice IMO. K8 and AMD64 (x86-64) could have almost killed Intel in the early 2000s but for the resurrection of P6/Pentium M arch as Core 2.
As for the Tenstorrent architecture, I think there are similarities with the Intel Larrabee approach, which had many simple x86 P54C-type cores and 512-bit vector units. Although it's possible the baby RISC-V core is a simple MMU-less, in-order, single-pipeline CPU core, which would certainly take less area than an original Pentium dual-pipelined in-order core with the x86 decode overhead. Even if the area is a low 1%, that doesn't quite tell you the runtime implications. The whole point of a GPU was to eschew the whole complex control structure of CPUs, take advantage of memory streaming on very fast HBM-type memory by gathering data there, and just compute as fast as possible for data-parallel workloads.
If a Larrabee like approach didn't work for graphics (and even HPC applications later) then will Tenstorrent work now? Either Larrabee was ahead of its time or there are some fundamental issues. Surely during the graphics wars some might have tried such an approach or even AMD and nVidia might have thought of it?
A GPU can be thought of as a throughput compute device. A crude analogy is that a GPU is like a big Amtrak train. You gather lots of folks at a station with flexible point-to-point cars/vans (CPUs), then transport them en masse on dedicated tracks (SM cores/HBM), and then scatter them back at the destination station. To torture the analogy further, does it become better if light vans hook up together in a train-like formation, or if each train car has a small engine? I don't know... I suspect a lot will boil down to a detailed analysis of the software framework and how much the Tenstorrent architecture can allow it to keep up with the yearly cadence of nVidia hardware releases and flops increase per year.
Full Disclosure: I may or may not have positions in AI hardware startups (no easy way to know) due to a semi-index like VC investing approach via Vested Inc.
Yep, he did save it twice. Should have mentioned both AMD64 and Zen.
Very interesting link to Larrabee! Great connection. My understanding is the baby RISC-V cores are much weaker than the Larrabee x86 cores even when adjusting to modern process nodes. Runtime implications are something I can't really figure out myself. TT uses the Tracy open-source debug tool and they showed me some in-progress optimization projects. Looked like there were a lot of bottleneck opportunities.
Great train analogy!
I believe TT has a strong chance of surviving the upcoming AI hardware downturn, but personally, I'm more excited about its architecture and its potential for advancing spatial computing (not the Apple VR version...). In my opinion, a true breakthrough will come from distributed computing that mirrors the brain, with long-distance asynchronous messaging and local compute + storage serving as the nuclei of intelligence. TT is particularly well-positioned for this, unlike other architectures.
I like that you are willing to post such a non-consensus view. Personally, don't agree with it but interesting perspective.
Tenstorrent probably chose GCC over LLVM because GCC has global + local register variable support (6.50.5) and LLVM's support for that is more limited (non-allocatable registers only).
Can you please explain this to someone who only knows basic MATLAB and Python scripting? (asking for a friend)
Sure.
LLVM is very well structured as a pipeline of compiler passes (think big loops) and it is very hackable by mortal graduate students. You wanna try something cheesy? You can probably get it done in LLVM with a simple IR pass.
OTOH, GCC is monolithic. It wasn't really designed to be hacked or retargeted. But it can be with significant effort. Arguably it was designed (and, cough cough, licensed) not to be hacked+retargeted. I digress.
Monolithic means brittle, yet monolithic can still mean fast. And IBM, Intel, … have been pouring solid engineering into GCC for decades. GCC has a ton of functionality that LLVM has yet to match. Which brings us to ...
Registers. Registers are a LOT faster than L1 cache. So if you go back to your "register spilling" section, TT basically crashes the GCC compiler if you spill. Dude, don't spill. I told you once, nicely, don't spill and now I'm just going to kill myself. That change cost TT all of about 1 line of code.
Back in the old school daze, C had the register keyword but that's pretty much ignored nowadays by compilers. But with GCC you can order it to use this register for that variable:
register int *foo asm ("a5");
You can do this globally (QEMU dyngen (2003!) has used this feature to great effect) or locally within a function. (Fellow compiler nerds, global+local here are GCC's words and not mine.) LLVM doesn't really offer as much control over this. They have been talking about it for quite some time, but it's still not there, whereas GCC has it.
So if you're writing a kernel and you absolutely positively want control over the registers (but still not write in assembly) then you use GCC's Variables in Specified Registers.
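To make the register pinning above a bit more concrete, here is a minimal GNU C sketch. Assumptions, loudly: the function name sum_words and the PINNED_REG macro are made up for illustration, "a5" is just the register from the one-liner above, and the x86-64 and AArch64 register names are only there so the thing builds on a desktop. GCC's manual only guarantees that a local register variable sits in the named register when it is used as an operand of extended asm, so the sketch passes the variable through an empty asm statement. This is not Tenstorrent's kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register name per target. "a5" comes from the example above; the other
   names are assumptions picked only so the sketch compiles elsewhere. */
#if defined(__riscv)
#define PINNED_REG "a5"
#elif defined(__x86_64__)
#define PINNED_REG "r12"
#elif defined(__aarch64__)
#define PINNED_REG "x19"
#endif

/* Hypothetical helper: sums n words while keeping the accumulator in a
   register chosen by the programmer instead of by the allocator. */
uint32_t sum_words(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
#ifdef PINNED_REG
    /* GCC "local register variable": bind acc to a named register. */
    register uint32_t acc __asm__(PINNED_REG) = 0;
#else
    uint32_t acc = 0; /* unknown target: let the compiler choose */
#endif
    for (size_t i = 0; i < n; i++) {
        acc += data[i];
        /* Empty extended asm with acc as an in/out operand. This is the
           documented use of a local register variable: when PINNED_REG is
           defined, acc is in that register at this point. */
        __asm__ volatile("" : "+r"(acc));
    }
    return acc;
}
```

Calling sum_words on the four words 1, 2, 3, 4 returns 10 with or without the binding; pinning changes where the accumulator lives, not the result. The global form is the same declaration at file scope, which per the comment above is where LLVM is more limited.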
What are your thoughts on d-matrix and their chip architecture? They recently announced they have samples of their first chip.
Thomas Sohmers... An annual tax on land to improve liquidity would unleash a lot of growth. Also, Trade Wars Are Class Wars by Klein and Pettis convinced me that capital flows can drive trade flows.
Besides that, I'm bullish on Tenstorrent. Didn't realise they were the only startup left in the inference game.
While I wouldn’t call myself a Georgist, a land value tax is one of the few taxes I would say I would be generally in support of.
I'm not a full Georgist either. I think the world has tried to boost growth through housing wealth effects for eighty years and it's starting to creak, so let's try something new. Been watching a lot of MeatEater recently and been learning about how sprawl is the biggest problem for wildlife and nature conservation. LVT + YIMBYism solves more problems than I thought!
How do you do matmuls efficiently with many separate cores and no L2? You need to do a ton of caching of shared data for matmuls and unless either A) the cores can talk to each other or B) you have an L2, your arithmetic intensity is terrible.
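A rough way to put numbers on that question is the usual back-of-envelope tile model, sketched below in C. Assumptions: each core keeps one tile x tile block of the output resident in its local memory, streams the matching panels of A and B through once, and writes the block out once. The function name and parameters are made up for illustration, and this says nothing about how Tenstorrent actually moves data.

```c
#include <assert.h>

/* Hypothetical model: FLOPs per byte moved from off-core memory when one
   core computes a tile x tile block of C = A * B with inner dimension k. */
double matmul_tile_intensity(double tile, double k, double bytes_per_elem)
{
    assert(tile > 0 && k > 0 && bytes_per_elem > 0);
    /* Each of the tile*tile outputs needs k multiply-adds (2 FLOPs each). */
    double flops = 2.0 * tile * tile * k;
    /* Read a tile x k panel of A and a k x tile panel of B once each,
       then write the tile x tile block of C once. */
    double bytes = bytes_per_elem * (2.0 * tile * k + tile * tile);
    return flops / bytes;
}
```

With 2-byte elements and k = 1024, a 1 x 1 tile (no reuse at all) comes out just under 0.5 FLOP per byte, while a 64 x 64 tile comes out around 31. In this model intensity grows roughly linearly with tile size, so what matters is how large a tile each core can hold locally, or effectively share with its neighbours, which is the same tradeoff as options A and B in the question.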
Regarding IP: don't count out other RISC-V vendors like Andes; they are getting good SIMD performance and advancing fast.
I like their super conservative approach of keeping the chip as flexible as possible, but IMO I really don't think the Groq compiler issues are as much of a problem as you may think. Nearly all model architectures are the same. Even if Groq needed to put an engineer on babying the compiler for a week for each arch, they'd still easily keep up with every SOTA arch out there.
One thing I've noticed in industry is CPU guys really are biased towards pushing CPUs into ML workloads even when there is no good reason to. Keller is obviously a legend, but his background and previous experience seem like they might bias him.
TBH we haven't seen even academic ML workloads that would use general purpose compute in a highly integrated way. Most workloads involving CPUs are like tool-calling ones where a round trip to a separate CPU is completely acceptable since the round trips are very rare (super small percent of execution time).
Sohmers: "you shouldn't worry about [ROI]"
You're not talking about capitalism if you're ignoring ROI.
My point was that the short term ROI calculation that people have been asking (Sequoia’s “$600B question” article) is flawed, and that the ROI bet that the big players are making right now is that getting to “infinite labor” is the largest ROI opportunity ever.
How are you guys thinking about the dynamic vs static tradeoff? TT and Nvidia fall very much on the dynamic side (future-proof against future dynamic architectures), whereas Groq and TPUs are taking the static side (better perf due to more aggressive scheduling, simpler control logic).
I like the focus you have on memory bandwidth. It's likely going to be an even bigger issue as model parameters become more saturated (higher perf in the same # of params), where the avg entropy of the parameters is higher, making the models less amenable to aggressive quantization.