Gotta admit - as a casual reader, I do not have the mental context to appreciate the scope of what, "REAL CLOCK FOWARDING OVER OPTICS!!!!!!!!!!!!!!!!!!!!!!!!!", implies.
Fun enthusiasm, though
Thanks for sharing. There is another startup called Tensordyne, which is about to reveal its tape-out in April. Tensordyne uses logarithmic math, turning expensive multiplications into simple additions and thereby simplifying the dataflow, making it more efficient and faster. Tensordyne is positioned as a purer, more radical version of the dataflow future Nvidia is buying into. While Nvidia is integrating Groq’s deterministic networking to protect its dominant position, Tensordyne is betting that even a "Groq-ified" Nvidia will eventually be too power-hungry compared to a system that combines dataflow with logarithmic math.
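For readers unfamiliar with the trick: in the log domain, a multiply becomes an add. A minimal sketch of the idea (real logarithmic-number-system hardware uses fixed-point log encodings and approximations for addition; the floats here are just for illustration, and nothing below is based on Tensordyne's actual design):

```python
import math

# Multiply two numbers by adding their base-2 logarithms.
# log2(a * b) = log2(a) + log2(b), so in a log representation
# the expensive multiplier circuit is replaced by a simple adder.

def to_log(x: float) -> float:
    return math.log2(x)

def from_log(lx: float) -> float:
    return 2.0 ** lx

a, b = 3.0, 5.0
product_via_add = from_log(to_log(a) + to_log(b))  # an add, not a multiply
print(product_via_add)  # ~15.0
```

The hard part in such systems is addition, which is no longer trivial in the log domain and is typically handled with lookup tables or piecewise approximations.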
does this clock forwarding over optics thing have any implications for LITE and SITM?
Two birds with one stone?
It sure sounds like it!
Amazing post, ambitious in a very good way. I’m looking forward to the full article, and thank you for all the links; I've found some of them very useful.
I also think the future of AI compute is more dataflow-style, and NVIDIA probably knows it too; otherwise they would not have bought Groq's license and talent.
My main concern about Groq is silicon area and its utilization. I can believe Groq can be a leader in latency per token (except maybe Taalas, but that is another league entirely: not programmable at all, and we have yet to see how it scales). Good utilization of the chip looks hard, though, especially with MoE and sparsity; I think utilization may end up very low. Purely aesthetically, I like Tenstorrent’s approach more: it feels more PPA-balanced and much better suited for MoE and sparsity (including activation sparsity!). I would also be very interested to read your perspective on Tenstorrent in the final article.
Hennessy and Patterson must be spinning in their graves indeed. But I thought inference is memory bound, not compute bound, so why does adding more compute units with a shallow memory hierarchy deliver the performance?
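A back-of-the-envelope roofline calculation shows why decode is memory bound on DRAM-backed chips, and why an SRAM-heavy design changes the picture. During autoregressive decoding, every generated token must stream each weight from memory roughly once, costing about 2 FLOPs per parameter read (a multiply and an add). The model size and precision below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Arithmetic intensity of LLM decode, batch size 1 (illustrative numbers).

params = 70e9            # hypothetical 70B-parameter dense model
bytes_per_param = 2      # fp16/bf16 weights
flops_per_token = 2 * params              # one multiply-add per parameter
bytes_per_token = params * bytes_per_param

intensity = flops_per_token / bytes_per_token  # FLOPs per byte read
print(intensity)  # 1.0 FLOP/byte, far below typical accelerator ridge points
```

At ~1 FLOP/byte, a chip with hundreds of FLOPs of compute per byte of DRAM bandwidth sits idle waiting on memory. Keeping weights entirely in on-chip SRAM raises the available bandwidth by orders of magnitude, so adding compute units is not wasted: the shallow hierarchy is the point, not a defect.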
Truly excellent post with great explanations!
Some questions and speculations of mine:
1. With SRAM being so area-intensive, I wonder why no one in the accelerator space (that I am aware of) has yet in-licensed IBM's DRAM-based cache technology. IBM has used eDRAM instead of SRAM at pretty much all levels of cache in a number of their CPU designs, and, according to the numbers they shared (at Hot Chips and other venues), both throughput and latency are quite competitive with SRAM caches. Why use DRAM? Because it's much less area-intensive, and area is the key limit on adding ever more SRAM cache to chip designs.
2. A blast from the past: does anyone remember Transmeta's VLIW processors, like the Crusoe CPU?
They were quite disappointing at the time (I had one 🙈), but the idea they pitched is still one I find fascinating.
Groq's need for very high-quality code to avoid faltering made me think of code-morphing software again, and how that approach might help here. And I know I am probably utterly wrong about this 😜.