When you're training foundation models across hundreds or thousands of GPUs, every percentage point of overhead matters. Virtualized GPU instances — the default offering from most cloud providers — introduce a measurable performance tax that compounds at scale.
Bare metal eliminates that tax. The trade-off is operational complexity: bare metal nodes don't auto-heal, don't auto-scale, and don't pretend to be elastic. For the right workload, that's exactly what you want. For the wrong workload, it's an expensive mistake.
Here's how to tell the difference.
The virtualization tax
Hypervisor-based GPU passthrough adds 10–15% overhead on memory-intensive workloads. For a single-GPU inference instance, this is negligible. For a 512-GPU training cluster running for months, it translates directly into longer training times and higher costs.
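To put a number on that, here's a back-of-the-envelope sketch in Python. The 512-GPU cluster size, 60-day run length, $2.50 per GPU-hour rate, and 12% overhead figure are illustrative assumptions, not measurements from any specific provider.

```python
# Back-of-the-envelope cost of a fixed throughput overhead at cluster scale.
# All inputs below (cluster size, run length, price, overhead) are
# illustrative assumptions, not measured or quoted figures.

def overhead_cost(gpus: int, days: float, hourly_rate: float, overhead: float):
    """Extra GPU-hours and dollars implied by losing `overhead` of throughput."""
    base_gpu_hours = gpus * days * 24
    slowdown = 1 / (1 - overhead)       # 12% overhead -> ~1.14x longer wall clock
    extra_hours = base_gpu_hours * (slowdown - 1)
    return extra_hours, extra_hours * hourly_rate

extra_hours, extra_dollars = overhead_cost(gpus=512, days=60, hourly_rate=2.50, overhead=0.12)
print(f"~{extra_hours:,.0f} extra GPU-hours, ~${extra_dollars:,.0f} extra spend")
```

Under those assumptions, a 12% tax works out to roughly 100,000 extra GPU-hours — about a quarter of a million dollars on a single run.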
Bare metal eliminates this entirely. Your workloads run directly on the hardware with no abstraction layer between your code and the GPU.
Direct interconnect access
NVLink and InfiniBand were designed for direct hardware communication. Virtualization layers can interfere with RDMA operations and introduce latency in collective communication patterns like all-reduce — the backbone of distributed training.
On bare metal, you get the full bandwidth of NVLink between GPUs within a node and InfiniBand NDR (400 Gb/s) between nodes, exactly as the hardware was designed to operate.
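One way to check what the fabric actually delivers is a bare all-reduce microbenchmark. The sketch below is a minimal probe using torch.distributed with the NCCL backend; the buffer size and iteration counts are arbitrary choices, and the bandwidth formula assumes a ring-style all-reduce, so treat the output as a rough comparison number rather than a vendor-grade benchmark.

```python
# Minimal all-reduce bandwidth probe on the NCCL backend.
# Launch with torchrun, e.g.:  torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    numel = 256 * 1024 * 1024                  # 1 GiB of fp32 per rank
    x = torch.randn(numel, device="cuda")

    for _ in range(5):                         # warm-up to settle NCCL channels
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Ring all-reduce moves roughly 2 * (n - 1) / n of the buffer per rank;
    # this is the "bus bandwidth" convention used by nccl-tests.
    n = dist.get_world_size()
    bytes_moved = x.numel() * x.element_size() * 2 * (n - 1) / n
    if dist.get_rank() == 0:
        print(f"avg all-reduce bus bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run the same script on a virtualized instance and on a bare metal node with the same GPUs; any gap shows up directly in the reported number.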
Predictable performance
Noisy-neighbor effects don't exist on dedicated hardware. Your training runs produce consistent throughput numbers from hour one to hour ten thousand. This predictability matters for capacity planning and cost modeling. It also matters for debugging — a five-percent throughput regression on a virtualized cluster might be your code or might be a co-tenant, and you'll spend a week figuring out which.
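If you want an early warning on throughput drift rather than a week of detective work, a simple watchdog over your step timings goes a long way. The sketch below is illustrative: it freezes a baseline from the first few hundred steps and flags when a short recent window falls more than 5% below it; the window sizes and threshold are assumptions you would tune to your own run.

```python
# Illustrative throughput watchdog: freeze a baseline from the first N steps,
# then flag when a short recent window drops below it. The window sizes and
# the 5% threshold are assumptions to tune per run, not recommended defaults.
from collections import deque

class ThroughputWatchdog:
    def __init__(self, baseline_steps: int = 200, recent_steps: int = 20,
                 threshold: float = 0.05):
        self.baseline = deque(maxlen=baseline_steps)
        self.recent = deque(maxlen=recent_steps)
        self.threshold = threshold

    def update(self, samples_per_sec: float) -> bool:
        """Record one step's throughput; return True when a regression is flagged."""
        self.recent.append(samples_per_sec)
        if len(self.baseline) < self.baseline.maxlen:
            self.baseline.append(samples_per_sec)   # still building the baseline
            return False
        baseline = sum(self.baseline) / len(self.baseline)
        current = sum(self.recent) / len(self.recent)
        return current < baseline * (1 - self.threshold)
```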
When bare metal is the right answer
- Multi-week training runs at >64 GPUs. The overhead savings dwarf the operational tax.
- Workloads that exercise NVLink or InfiniBand at line rate. Tensor parallelism inside nodes, pipeline or data parallelism across nodes.
- Anything you'd describe as a "supercluster." If the failure of a single node is going to cause a checkpoint rollback rather than a graceful reschedule, you're already operating in the bare metal world.
When bare metal is the wrong answer
- Inference serving with horizontal autoscaling. You want elasticity, not raw throughput. Virtualized capacity is the right tool.
- Burst fine-tuning jobs of a few hours to a day. Spin up, run, tear down. The provisioning latency of bare metal isn't worth it for jobs that short.
- Experimentation and prototyping. If you're going to delete the cluster in a week, you do not need to operate it like a fleet.
How we handle this on GPU.ai
We surface bare metal and virtualized capacity in the same availability view, flagged clearly, with the per-second price for each. You don't have to talk to a sales engineer to find out which option a supplier actually offers — every row in our availability feed tells you.
For larger training engagements — typically 64+ GPUs, multi-week runs — we route to bare metal partners with the right interconnect topology and operational track record. NovaCore's Hyderabad Blackwell cluster is one of those partners. Several U.S. east-coast operators are others.
Bare metal isn't a religion. It's a tool that's worth its operational cost on a specific shape of workload. Use it when the math works.