Emma Oosthuizen

Local LLM work changes the hardware conversation fast. A machine that feels luxurious for gaming can still choke on real model loads, stall on prompt throughput, or fall apart the moment you try to keep a 70B model resident in memory. Once you move beyond toy demos and into daily use, the build stops being about average FPS and starts being about how much model you can keep close to the compute, how quickly data can move, and how long the machine stays stable under sustained load.

For a serious local AI rig, the target is simple, even if the price is not. You want enough VRAM to hold the model you actually plan to run, enough bandwidth to keep the GPU fed, enough CPU and system memory to manage the pipeline, and enough power delivery and cooling to survive long sessions without throttling. Anything less forces compromises that show up immediately in response time, load time, and model size.

VRAM decides what you can run

GPU memory is the first hard limit. A large model is not useful if it cannot fit where the inference engine needs it. A 70B parameter model in FP16 can ask for more than 140GB of VRAM, which puts it well outside the reach of a single consumer card. Quantization reduces the footprint, but the target is still large. Even a 70B model reduced to 4-bit can still sit around the 40GB mark, which is why so many local setups hit a wall long before the software does.
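
The footprint math is simple enough to sanity-check with a few lines. The sketch below is a back-of-envelope estimate, not a measurement: it multiplies parameter count by bytes per weight and pads the result with an assumed overhead factor for KV cache and runtime buffers, which in practice varies by inference engine and context length.

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Back-of-envelope VRAM estimate: weights only, padded by a rough
    overhead factor for KV cache and runtime buffers. The 1.2 factor is
    an assumption, not a measured value."""
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9  # report in GB

# 70B in FP16 vs 4-bit: lands around the figures discussed above
for bits in (16, 4):
    print(f"70B @ {bits}-bit: ~{vram_estimate_gb(70, bits):.0f} GB")
```

Run as written, the FP16 case comes out well above 140GB and the 4-bit case lands near 40GB, which is exactly the wall most single-card setups hit.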

The practical result is that a 24GB card like the GeForce RTX 4090 becomes the ceiling for mainstream enthusiast builds. It is the strongest consumer option because it combines 24GB of GDDR6X memory with 1008 GB/s of bandwidth. That combination gives it enough room for many strong 7B and 13B workloads, plus some larger quantized models, without immediately falling into spillover territory.

Once you need more headroom, workstation cards enter the conversation. The RTX 6000 Ada Generation brings 48GB of GDDR6 ECC memory and 960 GB/s of bandwidth. The RTX A6000 also offers 48GB, with 768 GB/s of bandwidth. Those cards exist for the cases where the model, the batch size, or the working set is simply too large for a consumer board to handle cleanly.

Bandwidth is the difference between loaded and usable

Raw capacity alone is not enough. If the card cannot move data quickly enough, the cores wait while the memory subsystem catches up. The 4090’s 1008 GB/s helps it stay responsive under large tensor workloads, and that is one reason it remains the default consumer benchmark for this kind of build.
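
For single-stream decoding, which is largely memory-bound, a crude ceiling is memory bandwidth divided by the bytes of weights streamed per generated token. The model sizes below are illustrative assumptions, not benchmarks, and real throughput sits below this bound.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound for memory-bound decoding: each generated token has
    to stream the (quantized) weights through the memory system once."""
    return bandwidth_gb_s / weights_gb

# RTX 4090 (1008 GB/s) against a ~7 GB 13B 4-bit model and a ~40 GB 70B 4-bit model
print(decode_ceiling_tokens_per_s(1008, 7))   # ~144 tokens/s ceiling
print(decode_ceiling_tokens_per_s(1008, 40))  # ~25 tokens/s ceiling, if it fit at all
```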

PCIe matters too, especially when the model spills outside a single GPU or when multiple cards need to cooperate. PCIe 4.0 x16 provides 64 GB/s of bidirectional bandwidth. PCIe 5.0 x16 doubles that to 128 GB/s bidirectional. In a multi-GPU workstation, that extra lane speed can reduce the pain of shuffling tensors between devices.

Professional NVIDIA platforms add another layer with NVLink. The headline figure of up to 900 GB/s of total bandwidth between GPUs belongs to data-center parts like the H100; workstation cards that still support NVLink, such as the RTX A6000, top out around 112 GB/s per bridged pair, and the RTX 6000 Ada drops the connector entirely. Both figures comfortably beat the PCIe links those cards otherwise rely on, and for distributed inference or larger training jobs that interconnect can be the difference between a practical multi-card system and one that spends too much time waiting on transfers.
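
To see why the link matters, compare how long a given tensor takes to cross each one. The per-direction rates below are rough assumptions taken from the figures above; real throughput is lower once protocol overhead and the software stack are counted.

```python
# Rough transfer-time comparison for shuffling a 10 GB tensor between GPUs.
# Link rates are approximate per-direction figures (assumptions, not measurements).
links_gb_s = {
    "PCIe 4.0 x16": 32,    # ~32 GB/s each direction
    "PCIe 5.0 x16": 64,    # ~64 GB/s each direction
    "NVLink (H100)": 450,  # ~450 GB/s each direction (900 GB/s total)
}
tensor_gb = 10
for name, rate in links_gb_s.items():
    print(f"{name}: ~{tensor_gb / rate * 1000:.0f} ms to move {tensor_gb} GB")
```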

The rest of the system still matters

A powerful GPU cannot fix a weak platform. The CPU handles model loading, prompt preparation, scheduling, and the general bookkeeping that keeps the whole stack moving. High core count and strong clocks help when you run multiple requests, preprocess larger prompts, or juggle several services at once. Parts like the AMD Ryzen 9 7950X or Intel Core i9-13900K fit this role well because they bring both thread count and single-thread speed.

System memory is the next safeguard. 64GB is a sensible floor for a high-end local AI machine, but 128GB makes more sense once the goal is bigger models, heavier multitasking, or any workflow that might lean on RAM when VRAM runs out. DDR5-6000 or faster gives the CPU more breathing room and keeps the rest of the platform from becoming the bottleneck.

Storage is less glamorous, but it matters every time you launch a model. A fast PCIe 4.0 or PCIe 5.0 NVMe SSD can read at more than 7000 MB/s, which keeps load times measured in seconds rather than a long, awkward pause. A SATA drive can turn the same process into a delay you notice every session.
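
The load-time gap is simple division. The sketch below assumes a hypothetical 40GB quantized checkpoint read sequentially at the drive's rated throughput; real loads add parsing and allocation time on top.

```python
# Rough time to stream a 40 GB model file from disk at rated sequential speed
# (drive throughputs and model size are ballpark assumptions).
model_gb = 40
for drive, mb_s in [("PCIe 4.0/5.0 NVMe", 7000), ("SATA SSD", 550)]:
    seconds = model_gb * 1000 / mb_s
    print(f"{drive}: ~{seconds:.0f} s")  # roughly 6 s vs roughly 73 s
```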

Underpowered rigs expose themselves quickly

The difference shows up in tokens per second first. A strong build with multiple RTX 4090s or RTX 6000 Ada cards can push 50 to 100+ tokens per second on large models such as a 70B 4-bit setup. A machine stuck with 8GB or 12GB of VRAM may only manage 5 to 10 tokens per second, and in some cases it will fail before inference starts because the model will not fit cleanly.

Fine-tuning tells the same story in a harsher way. A capable system with 48GB or more of VRAM can finish a 7B tuning job over a few gigabytes of data in roughly 2 to 5 hours. A VRAM-limited box may not complete the task at all without aggressive parameter-efficient tricks, and even then the workflow becomes slow enough to break iteration.
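
A quick memory estimate makes that gap concrete. Full fine-tuning with a standard Adam-style optimizer in mixed precision keeps weights, gradients, and optimizer state resident, a commonly cited rule of thumb of around 16 bytes per parameter, while a LoRA-style run freezes a quantized base model and only carries optimizer state for a small adapter. The byte counts and trainable fraction below are rough assumptions for illustration, not measurements of any specific framework.

```python
def full_finetune_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    """Mixed-precision full fine-tune: fp16 weights and grads plus fp32 master
    weights and Adam moments, commonly ~16 bytes/param before activations."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

def lora_finetune_gb(params_billion: float, trainable_fraction: float = 0.01) -> float:
    """LoRA-style run: frozen 4-bit base weights plus full training state only
    for the small set of adapter parameters (fractions are assumptions)."""
    base = params_billion * 0.5                        # 4-bit frozen weights
    adapters = params_billion * trainable_fraction * 16
    return base + adapters

print(full_finetune_gb(7))  # ~112 GB of model state: full runs need offload or multiple GPUs
print(lora_finetune_gb(7))  # ~5 GB of model state: why parameter-efficient runs fit where full runs do not
```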

Interactive latency is the last test. If replies arrive in 1 to 2 seconds, chat feels usable. Once responses slide into the 5 to 10 second range, the system feels stuck, even if the math is technically working.
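
Those thresholds fall straight out of throughput. With an assumed reply length of 100 tokens:

```python
# Time to produce a reply of a given length at a given decode rate
# (the 100-token reply length is an illustrative assumption).
def reply_seconds(reply_tokens: int, tokens_per_s: float) -> float:
    return reply_tokens / tokens_per_s

print(reply_seconds(100, 60))  # ~1.7 s: feels conversational
print(reply_seconds(100, 10))  # ~10 s: feels stuck
```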

The board, power, and cooling must match the ambition

A serious multi-GPU workstation needs a motherboard with the lane layout to support it. Look for workstation-class boards with multiple PCIe 4.0 or 5.0 x16 slots, ideally able to run at x8/x8 or x16/x16 electrical configurations. That is where platforms built around Threadripper Pro or top-end workstation Intel boards earn their keep.

Power delivery is just as unforgiving. Two to four RTX 4090-class cards plus a high-end CPU can call for a 1600W to 2000W PSU with 80+ Titanium or Platinum efficiency. Anything smaller invites instability under sustained load.
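
The PSU figure comes from adding sustained component draw and leaving headroom for transients. The wattages below are ballpark assumptions for 4090-class cards and a high-end CPU, not measured numbers for any specific build.

```python
# Rough PSU sizing: sum sustained component draw, then add margin for power
# spikes and to keep the PSU in its efficient band (all wattages are assumptions).
gpus = 2 * 450   # two RTX 4090-class cards at ~450 W each
cpu = 250        # high-end desktop/workstation CPU under load
rest = 150       # board, RAM, storage, fans, pumps
headroom = 1.4   # margin for transients and efficiency
print(f"Suggested PSU: ~{(gpus + cpu + rest) * headroom:.0f} W")  # ~1820 W
```

Scale the GPU line up to three or four cards and the same arithmetic lands at the top of the 1600W to 2000W range.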

Cooling finishes the job. Multiple high-end GPUs dump a serious amount of heat into the chassis, so a large full-tower case, strong airflow, and often custom water cooling are the sane choices. A no-compromise local LLM machine should feel boring when it is under pressure. If it is loud, hot, or unstable, it is already failing the job.