Research · February 21, 2026 · 3 min read

DeepSeek and the open-source inference shift: where the bottleneck moves next

DeepSeek V3 trained for $5.6M. R1 added another $294K. When frontier-quality models cost single-digit millions and ship under open weights, the real moat moves from the model to the inference fleet — and that fleet has to be elastic.

Aditya Reddy, CTO

DeepSeek has changed the economics of AI. Their V3 base model — a 671B-parameter Mixture of Experts that competes with GPT-4 class systems — was trained for approximately $5.6 million in compute. That's 55 days on 2,048 H800 GPUs, plus about $294K for R1's reinforcement learning phase.

When frontier-quality models cost single-digit millions to train and are released as open weights, the competitive landscape shifts. The question is no longer "can we access a good enough model?" — it's "can we serve it efficiently at scale?"

The architecture behind the efficiency

DeepSeek's approach is worth understanding because it directly shapes the hardware you need to serve it:

  • Mixture of Experts (MoE): 671B total parameters, but only 37B activated per token. This dramatically reduces compute per inference call while maintaining model quality.
  • Multi-head Latent Attention (MLA): Compresses the KV cache, reducing memory requirements for long-context inference.
  • DeepSeek Sparse Attention (DSA): Fine-grained sparse attention that improves long-context efficiency while maintaining output quality.

These choices maximize throughput per GPU dollar — exactly the metric that matters once you're serving production traffic instead of running benchmarks.
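
To make the MoE point concrete, here is a minimal sketch of top-k expert routing in a single MoE layer. The dimensions, expert count, and top-k value are illustrative placeholders, not DeepSeek's actual configuration; the point is simply that each token only touches the weights of the experts it is routed to, which is how a 671B-parameter model can cost roughly 37B parameters per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy top-k MoE layer: each token is routed to k of n_experts feed-forward
    blocks, so only a small slice of the total parameters is touched per token."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: [tokens, d_model]
        scores = self.router(x)                            # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)     # choose k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():       # dispatch tokens to their experts
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

With 64 experts and top-4 routing, each token touches roughly 1/16 of the expert parameters; DeepSeek's production routing is finer-grained than this toy version, but the per-token cost structure is the same idea.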

What this means for hardware

Running open-weight models on your own infrastructure, rather than calling a frontier provider's API, is increasingly a real choice. The economics now favor self-hosting for teams with sustained inference volume.

But self-hosting MoE models at scale demands specific hardware characteristics:

Memory capacity matters. DeepSeek R1's full 671B parameters need to live somewhere. With Blackwell Ultra's 288GB HBM3e per GPU, fewer GPUs are needed for the full model, reducing inter-GPU communication overhead.
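
As a back-of-the-envelope sketch (illustrative assumptions, not a sizing guide), weight memory alone already dictates a minimum GPU count before you account for KV cache or batching:

```python
import math

# Back-of-the-envelope: GPUs needed just to hold the weights. Ignores KV cache,
# activations, and framework overhead; all figures are illustrative assumptions.
total_params = 671e9        # DeepSeek V3/R1 total parameter count
bytes_per_param = 1         # FP8 weights; use 2 for FP16/BF16
hbm_per_gpu_gb = 288        # Blackwell Ultra (B300) HBM3e capacity per GPU

weights_gb = total_params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB of weights -> {math.ceil(weights_gb / hbm_per_gpu_gb)} GPUs minimum at FP8")
# 671 GB of weights -> 3 GPUs minimum at FP8; real deployments add KV cache and headroom on top.
```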

Memory bandwidth matters more. Inference on large models is memory-bandwidth bound, not compute bound. The B300's 8TB/s bandwidth translates directly to higher tokens per second.
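
A similarly rough sketch shows why bandwidth dominates: during decode, every generated token has to stream the active weights out of HBM, so aggregate bandwidth puts a ceiling on single-stream tokens per second. The GPU count here is an illustrative assumption, and the estimate ignores KV-cache reads, inter-GPU communication, and batching, all of which change the real number.

```python
# Crude decode-phase ceiling: the active weights are read from HBM once per token.
active_params = 37e9        # parameters activated per token in the MoE model
bytes_per_param = 1         # FP8
hbm_bw_per_gpu = 8e12       # ~8 TB/s HBM bandwidth per B300 (approx.)
gpus = 4                    # illustrative tensor/expert-parallel group

bytes_per_token = active_params * bytes_per_param
tokens_per_sec = (hbm_bw_per_gpu * gpus) / bytes_per_token
print(f"~{tokens_per_sec:.0f} tokens/s ceiling for a single stream")   # ~865 tokens/s
```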

Interconnect still matters. Expert routing in MoE models creates communication patterns that benefit from NVLink's GPU-to-GPU bandwidth. Running on isolated GPUs without high-speed interconnect creates bottlenecks at the expert dispatch layer.

The benchmark that's actually relevant

NVIDIA's own numbers show the GB300 NVL72 delivering approximately 1,000 tokens per second on DeepSeek R1-671B — a 10x improvement over Hopper-generation hardware. An NVL72 rack achieves 30x more inference performance than a comparable Hopper configuration.

For teams evaluating self-hosted inference, numbers like these move the cost-per-token calculation by an order of magnitude. At that point the question stops being "should we self-host?" and becomes "how do we make self-hosting elastic?"
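
To see why the cost-per-token math moves so much, here is the shape of the calculation with deliberately hypothetical prices and throughput; plug in your own GPU-hour rate and measured numbers.

```python
# Cost per million tokens = fleet $/hour / tokens per hour. All numbers are hypothetical.
gpu_hour_cost = 6.00        # $/GPU-hour (illustrative)
gpus_serving = 8            # GPUs dedicated to the model
tokens_per_sec = 5000       # measured aggregate throughput across batched requests

cost_per_hour = gpu_hour_cost * gpus_serving
tokens_per_hour = tokens_per_sec * 3600
print(f"${cost_per_hour / tokens_per_hour * 1e6:.2f} per million tokens")
# A 10x throughput gain from newer hardware divides this figure by 10 at the same $/hour.
```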

The aggregation argument for inference

Here's the part most "self-host vs. API" debates miss: production inference traffic is bursty, and your fleet has to match it.

The economic argument for self-hosting on a fixed reserved cluster only works if you can saturate that cluster. Most real applications can't. Traffic is diurnal. Demos go viral. A model launch can spike load 50x for 48 hours before traffic settles back down.

That's exactly the workload pattern aggregation is built for: a baseline reserved fleet for steady-state traffic, plus on-demand burst capacity sourced from the cheapest supplier with available B-series or H-series instances at the moment you need them. You self-host the floor, you rent the ceiling, and the per-token economics finally make sense.
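
Here is a sketch of the blended-cost logic, with hypothetical rates and utilization figures: the reserved fleet covers the baseline, and burst capacity is priced only for the hours it actually runs.

```python
# Blended monthly cost for a floor-plus-burst fleet. All rates and figures are hypothetical.
reserved_gpus, reserved_rate = 16, 3.50   # self-hosted / long-term reserved, $/GPU-hour
burst_rate = 7.00                         # on-demand $/GPU-hour from the cheapest available supplier
hours = 720                               # one month

# Suppose bursts add an average of 8 extra GPUs for 10% of the month (launches, viral demos).
burst_gpus, burst_fraction = 8, 0.10

floor_plus_burst = reserved_gpus * reserved_rate * hours \
                 + burst_gpus * burst_rate * hours * burst_fraction
peak_sized_reserve = (reserved_gpus + burst_gpus) * reserved_rate * hours  # sizing for the peak instead

print(f"floor+burst: ${floor_plus_burst:,.0f}/mo vs peak-sized reserved: ${peak_sized_reserve:,.0f}/mo")
```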

The frameworks for this — SGLang, TensorRT-LLM, vLLM, LMDeploy — are mature enough that running DeepSeek-class models on heterogeneous fleets is a tractable engineering problem rather than a research project.
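
As one example of how approachable the serving layer has become, here is a minimal vLLM sketch. The model id, parallelism degree, and sampling settings are illustrative; a full 671B deployment needs a multi-GPU and typically multi-node configuration tuned per the framework's own DeepSeek guidance, and the other frameworks listed above have equivalent entry points.

```python
# Minimal vLLM offline-inference sketch (illustrative config, not a production deployment).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # open-weight checkpoint on Hugging Face
    tensor_parallel_size=8,            # shard weights across 8 GPUs on one node (illustrative)
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain KV-cache compression in two sentences."], params)
print(outputs[0].outputs[0].text)
```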

Where this goes

A guess, openly: by the end of next year, the cheapest place to run an open-weight 600B-class model will not be a hyperscaler API. It will be a self-hosted base fleet, burst-augmented through an aggregation layer, with the model running on a mix of last-gen H200s for cost and current-gen B-series for tail latency. The teams that win at inference economics are the ones that treat the fleet as a market, not a contract.

If you're building toward that, we'd like to compare notes.

Written by
Aditya Reddy, CTO