Engineering · January 28, 2026 · 3 min read

NVIDIA Blackwell Ultra: what the GB300 NVL72 actually changes

1.5x the FP4 compute of B200. 288 GB HBM3e per GPU. 8 TB/s memory bandwidth. The headline specs are real; here's what they mean for your training run, your serving fleet, and your next quarter's GPU bill.

Aditya Reddy, CTO

NVIDIA shipped the B300 in January 2026, and the benchmarks are in. Blackwell Ultra represents a meaningful step forward from the already-impressive B200 — particularly for teams running long-context inference and large-scale training jobs.

Here's what the numbers actually mean for AI infrastructure decisions, with a working assumption that you care about cost-per-token and not just spec sheet bragging rights.

The specs

The B300 GPU features a dual-reticle design with 208 billion transistors and 160 Streaming Multiprocessors across two dies:

  • 288 GB HBM3e memory (up from 192 GB on B200)
  • 8 TB/s memory bandwidth
  • 15 PetaFLOPS dense FP4 compute
  • 1.5x FP4 compute over standard Blackwell
  • 1,400 W TDP

At the rack level, the GB300 NVL72 connects 72 of these GPUs into one NVLink domain, delivering 1.1 ExaFLOPS of dense FP4 compute per rack.
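
A quick roofline check makes those numbers concrete. Here's a back-of-envelope sketch in Python using the spec figures above; the 70B model size is a hypothetical example, not a benchmark:

```python
# Bandwidth-bound ceiling for single-stream decode on one B300.
# Spec figures from the list above; the model size is illustrative.
HBM_BANDWIDTH = 8e12   # bytes/s (8 TB/s)
FP4_DENSE = 15e15      # FLOP/s  (15 PFLOPS dense FP4)

def decode_ceiling_tok_s(model_bytes: float) -> float:
    """Upper bound on single-stream decode: every weight byte must
    stream through HBM once per generated token."""
    return HBM_BANDWIDTH / model_bytes

model_bytes = 70e9 * 0.5   # 70B params at FP4, ~0.5 bytes/param
print(f"{decode_ceiling_tok_s(model_bytes):.0f} tok/s ceiling")   # ~229

# Arithmetic intensity where compute, not bandwidth, becomes the limit:
print(f"{FP4_DENSE / HBM_BANDWIDTH:.0f} FLOPs/byte")              # 1875
```

That 1,875 FLOPs-per-byte crossover is why batching and KV-cache efficiency matter as much as raw TFLOPS: at low batch sizes, you're paying for compute you can't feed.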

Real-world inference performance

The most striking benchmark comes from DeepSeek R1-671B inference. Blackwell Ultra delivers approximately 1,000 tokens per second on this model, compared to Hopper's 100 tokens per second — a 10x increase in throughput.

For long-context workloads specifically, testing by LMSYS shows the GB300 securing a 1.4x to 1.5x lead over the GB200, with a 1.58x latency advantage in long-context inference scenarios.

Multi-Token Prediction (MTP) pushes this further, delivering a 1.87x user-perceived speed improvement through speculative decoding.
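
If you want intuition for where gains like that come from, the standard expected-acceptance model for speculative decoding is a reasonable first approximation. This is the textbook formulation, not NVIDIA's published methodology, and the operating point below is hypothetical:

```python
# Expected tokens emitted per target-model pass under speculative decoding,
# assuming i.i.d. per-token acceptance (a simplification).

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """k drafted tokens, per-token acceptance probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Hypothetical operating point: 3 drafted tokens, 70% acceptance
print(f"{expected_tokens_per_pass(0.7, 3):.2f} tokens per pass")  # ~2.53
```

Draft passes and verification aren't free, so user-visible numbers land below the raw expectation, which is broadly consistent with a ~2.5x theoretical figure showing up as a 1.87x perceived gain.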

The interesting part isn't that Blackwell Ultra is fast. It's that the price-per-token curve just bent again — and most production inference fleets haven't repriced their offerings yet.
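
To see how far the curve bent, push the DeepSeek R1 numbers above through a per-token cost calculation. The hourly rates below are hypothetical placeholders, not GPU.ai pricing:

```python
# Serving cost per million tokens from GPU hourly rate and throughput.
# Both hourly rates are invented placeholders for illustration.

def usd_per_million_tokens(gpu_hour_usd: float, tok_per_s: float) -> float:
    return gpu_hour_usd / (tok_per_s * 3600) * 1e6

hopper = usd_per_million_tokens(2.50, 100)    # ~$6.94
b300   = usd_per_million_tokens(6.00, 1000)   # ~$1.67
print(f"Hopper ${hopper:.2f} vs B300 ${b300:.2f} per million tokens")
```

Even if the new silicon costs more than twice as much per hour, 10x throughput cuts cost-per-token by roughly 4x. That's the repricing most fleets haven't done yet.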

What this means at scale

An NVL72 rack delivers 30x more inference performance than a comparable Hopper configuration. Combined with the 50x throughput-per-megawatt improvement over Hopper platforms, this materially changes the economics of large-scale inference.
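
Here's what that looks like in facility terms, as a rough capacity sketch built from the spec list above. The PUE figure is an assumption, and counting only GPU TDP understates real rack draw (CPUs, NVLink switches, and cooling add more):

```python
# Racks and aggregate compute per megawatt of facility power.
# GPU TDP from the spec list; PUE is an assumed overhead factor.
GPUS_PER_RACK = 72
GPU_TDP_W = 1400
RACK_FP4_EF = 1.1      # ExaFLOPS dense FP4 per NVL72 rack
PUE = 1.3              # assumed facility overhead

rack_kw = GPUS_PER_RACK * GPU_TDP_W / 1e3           # ~100.8 kW GPU TDP/rack
racks_per_mw = int(1e6 // (rack_kw * 1e3 * PUE))    # whole racks per MW
print(f"{rack_kw:.0f} kW/rack, {racks_per_mw} racks/MW, "
      f"~{racks_per_mw * RACK_FP4_EF:.1f} EF FP4 per MW")
```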

For teams running inference-heavy workloads — serving large models to millions of users — the cost-per-token improvement is substantial. For training, the 288 GB per-GPU memory reduces the need for aggressive model parallelism techniques that add communication overhead.
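
The memory math is easy to check. Here's a sketch assuming mixed-precision Adam with ZeRO-3-style sharding of parameter and optimizer state; the 405B parameter count is just an illustrative frontier-scale size, and activations and temporary buffers are excluded, so real footprints run higher:

```python
# Minimum GPUs to hold parameter + optimizer state for training.
# 16 bytes/param: fp16 weights + fp16 grads + fp32 master + Adam m and v.
import math

BYTES_PER_PARAM = 16

def min_gpus(params_billion: float, gpu_gb: float) -> int:
    total_gb = params_billion * BYTES_PER_PARAM   # 1e9 params * B/param / 1e9 B/GB
    return math.ceil(total_gb / gpu_gb)

print(min_gpus(405, 192))   # 192 GB GPUs: 34 just for states
print(min_gpus(405, 288))   # 288 GB GPUs: 23
```

Fewer GPUs to hold state means smaller sharding groups and less all-gather traffic per step, which is where that communication overhead actually goes.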

The agentic AI angle

NVIDIA is explicitly positioning Blackwell Ultra for agentic applications — workloads where models process long contexts, maintain state across extended interactions, and reason through multi-step problems. The architecture's focus on long-context performance and latency reduction directly serves this use case.

In practice, this is the workload pattern most production AI products are converging on: chatbots that need persistent memory, code agents that hold an entire repo in context, research assistants that synthesize long documents. The KV cache compression and memory bandwidth improvements are aimed squarely at that traffic.
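
The per-user arithmetic is worth seeing once. Here's a sketch of KV-cache footprint at long context; the model shape is a generic GQA configuration chosen for illustration, not any specific model:

```python
# KV-cache footprint per concurrent user at a given context length.
# Model shape below is a generic GQA example, not a specific architecture.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2) -> float:
    """Factor of 2 covers K and V; fp16/bf16 cache by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

# 80 layers, 8 KV heads, head_dim 128, 128K-token context
print(f"{kv_cache_gb(80, 8, 128, 131072):.1f} GB per user")   # ~42.9 GB
```

At roughly 43 GB per 128K-context user, a 192 GB GPU fits only a handful of concurrent sessions alongside the weights. The jump to 288 GB, plus cache compression, is what makes that traffic pattern economical.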

The procurement reality

B300 supply is allocated, not bought. NVIDIA's largest customers — hyperscalers, foundation labs, sovereign-AI buildouts — have the first six quarters of rack-scale B300 production locked up. Lead times for a fresh NVL72 order, today, are realistically 9–14 months for any buyer not already in the queue.

For everyone else, the practical answer for the next year is aggregation across operators, not procurement. NovaCore's Hyderabad cluster is one of the earliest non-hyperscaler Blackwell deployments in production, and through our partnership, GPU.ai routes capacity there with the same provisioning workflow as our U.S. inventory. That's the path most teams should expect to use until B300 supply normalizes.

Our take

Blackwell Ultra is the first generation where inference cost stops being the limiting factor on what you can ship to production at frontier model size. Training is still expensive, still hard, and still gated by the short list of buyers in the queue ahead of you; serving, though, has gotten dramatically cheaper, and that's the part that compounds in your product.

If you're planning infrastructure for B300-class workloads, talk to us. We'll tell you when it actually makes sense to wait for new silicon, and when an aggregated H200 fleet running at 60% utilization is the better economic answer for your traffic shape.
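
If you want to run that comparison yourself before calling us, the core of it fits in a few lines. Every figure below is a hypothetical placeholder; substitute your own quotes, measured throughput, and utilization:

```python
# Effective serving cost once utilization is priced in.
# All numbers are invented placeholders for illustration.

def usd_per_m_tokens(gpu_hour_usd: float, peak_tok_s: float, util: float) -> float:
    return gpu_hour_usd / (peak_tok_s * util * 3600) * 1e6

h200_now   = usd_per_m_tokens(3.50, 250, 0.60)   # available today
b300_later = usd_per_m_tokens(9.00, 1000, 0.90)  # after a 9-14 month wait
print(f"H200 now ${h200_now:.2f} vs B300 later ${b300_later:.2f} per M tokens")
```

The point isn't these specific numbers. It's that waiting has a price you can compute against your current traffic shape, and sometimes the boring answer wins.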

Written by Aditya Reddy, CTO