Mon Jan 13 2025
Comparing the AWS Trainium 2 chip/server with Nvidia's GB200 NVL72 in detail; some findings not yet discussed by the Street:
NVDA GB200 NVL72 vs AWS Trainium 2 Ultra server
· Total computing power TFLOPS: 4.3X
· Power content value: 9X (a traditional ~50 MW datacenter can host only about 400 GB200 NVL72 units)
· Thermal content value: 10X (NVDA liquid cooling vs Trainium 2 air cooling)
· Total rack cost: 15~37X
--> Price per TFLOPS: 3.5~8.7X (NVDA more expensive)
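The price-per-TFLOPS range above follows directly from dividing the rack cost ratio by the compute ratio. A quick back-of-the-envelope check, using only the ratios stated in this note (absolute dollar and TFLOPS figures are assumptions not given here):

```python
# Sanity check: price-per-TFLOPS ratio = rack cost ratio / compute ratio.
# Inputs are the GB200 NVL72 vs Trainium 2 Ultra ratios quoted above.

compute_ratio = 4.3          # total computing power (TFLOPS), NVDA vs AWS
rack_cost_ratio = (15, 37)   # total rack cost range, NVDA vs AWS

low, high = (c / compute_ratio for c in rack_cost_ratio)
print(f"Price per TFLOPS ratio: {low:.1f}X ~ {high:.1f}X")
# -> Price per TFLOPS ratio: 3.5X ~ 8.6X
```

With these inputs the high end rounds to ~8.6X, close to the 8.7X quoted above; the small gap is likely rounding in the underlying cost estimates.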
****
In the Trainium 2 server, AWS also pays tribute to Nvidia, adding an in-house interconnect technology named NeuronLink, based on PCIe. Before mass rollout and mass connection, we can only give the benefit of the doubt on the feasibility of the cluster. But there are three things we can be certain of:
1) A traditional datacenter is about 50 MW and can host only about 400 GB200 NVL72 units, and new greenfield datacenters do not come online that fast. Amazon's large Trainium 2 cluster, Project Rainier, which chooses air cooling in incumbent datacenters, is a practical route.
2) The AWS/Anthropic LLM is weaker than those of the other hyperscalers (Meta, Google, MSFT/OpenAI), but its total infrastructure cost is likely the cheapest among them, and likely the cheapest across all of the US including leasing/colo etc. Second cheapest is likely Google. Last are Microsoft and Meta, who need to rely on Nvidia, especially Microsoft, whose hardware networking technology is weaker than Meta's.
3) As workloads move toward inferencing, growth in power and thermal content will slow and miss original expectations.
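The "~400 units in a 50 MW facility" figure in point 1) can be checked with simple power arithmetic. A minimal sketch, assuming roughly 120 kW of draw per NVL72 rack (a commonly cited figure, not from this note):

```python
# Sanity check of the "50 MW datacenter -> ~400 GB200 NVL72 units" claim.
# Assumption: ~120 kW per NVL72 rack; real facilities lose some capacity
# to cooling and other overhead, so the practical count is lower still.

datacenter_power_mw = 50
rack_power_kw = 120  # assumed per-rack draw for one GB200 NVL72

max_racks = datacenter_power_mw * 1000 // rack_power_kw
print(f"Max NVL72 racks in a 50 MW facility: ~{max_racks}")
# -> Max NVL72 racks in a 50 MW facility: ~416
```

416 racks on raw power alone is consistent with the ~400-unit figure once cooling and distribution overhead are accounted for.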