Beginner's Guide

Welcome to the complete guide to every setting, slider, feature tag, and color-coded badge inside our AI Hardware Calculator.

Artificial Intelligence lives and dies by big numbers. To keep the platform accessible, we compress those complex parameters into simple badges. In this documentation, we peel back the curtain: we show you exactly what each badge looks like on the site, what it means, and the exact numerical thresholds that trigger it.


Hardware Setup

Video RAM (VRAM) vs System RAM

System RAM is the slow filing cabinet inside your computer (e.g., 32GB DDR5). VRAM is the ultra-fast, dedicated memory physically bolted onto your graphics card (e.g., 24GB on an RTX 4090). To run an AI model fast, its entire "brain" must fit inside your VRAM. If it overflows into System RAM, the model slows to a crawl (a fallback known as CPU Offloading).
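
To make the overflow concrete, here is a minimal Python sketch of that check; the function name and example sizes are ours, purely for illustration:

    # Minimal sketch: how much of the model spills out of VRAM.
    # A result of 0 means the whole "brain" fits and runs at full speed.
    def vram_overflow_gb(model_gb: float, vram_gb: float) -> float:
        return max(0.0, model_gb - vram_gb)

    print(vram_overflow_gb(model_gb=18.0, vram_gb=24.0))   # 0.0 -> full speed
    print(vram_overflow_gb(model_gb=40.0, vram_gb=24.0))   # 16.0 GB offloaded to slow System RAM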

Multi-GPU Arrays

If you own multiple graphics cards (e.g., 2× RTX 3090s), you effectively double your VRAM (24GB + 24GB = 48GB). However, slicing an AI's brain across two separate cards introduces a PCIe Bottleneck penalty. Because the two cards must constantly exchange data over the motherboard's PCIe bus, the overall Speed (Tokens/Second) drops by about 15% to 30%.
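
A quick sketch of the trade-off, using the 15-30% range above (the midpoint and the baseline speed are illustrative assumptions):

    # Sketch: pooled VRAM across two cards, minus the PCIe penalty.
    cards_gb = [24, 24]                    # 2x RTX 3090
    pooled_vram = sum(cards_gb)            # 48 GB total
    single_card_tps = 20.0                 # hypothetical baseline tokens/second
    pcie_penalty = 0.225                   # midpoint of the 15-30% range above
    print(pooled_vram, single_card_tps * (1 - pcie_penalty))   # 48 15.5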

Apple Silicon (Unified Memory)

Apple computers (M1, M2, M3, M4 Max/Ultra) do not use separate VRAM. Instead, they use Unified Memory: their 64GB or 128GB of System RAM is shared directly with the GPU, acting like one giant pool of fast VRAM. This is why Macs are the undisputed kings of running massive AI models on a laptop. If you select a Mac on our site, the "System RAM" slider is disabled to reflect this architecture.
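
One caveat: macOS still reserves a slice of unified memory for the operating system. The sketch below uses a commonly cited ~75% GPU-addressable default, which is our assumption, not a calculator value:

    # Sketch: usable "VRAM" on Apple Silicon is a slice of unified memory.
    # The 0.75 factor is an assumed macOS default, not a calculator constant.
    unified_gb = 128
    print(unified_gb * 0.75)   # 96.0 GB addressable by the GPU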


Models & Features

AI models are labeled with visual tags representing their architecture and capabilities. Below are the graphical badges you will see attached to models on the dashboard.

8B

The size of the brain (8 billion parameters). 8B is excellent for laptops; 70B+ requires heavy servers.

Reasoning

These models are explicitly trained to output a "chain of thought", thinking step-by-step.

Mixture Of Experts

Instead of firing all neurons at once, an MoE model routes each token to a small set of specialized "expert" sub-networks, so only a fraction of the parameters are active at any moment.

Tool Use

The capability to hook securely into external APIs to browse the web or trigger functions.

Coding

These models achieve elite scores on SWE-Bench and can autonomously write, refactor, and debug software.

Mathematics

Highly specialized training data allows the model to solve complex algebra, calculus, and logical proofs with high accuracy.

Vision

Multimodal architecture that allows the AI to natively "see" and interpret uploaded images, charts, and video frames.


Quantization & Compression

Quantization is the act of compressing the model's weights to lower numerical precision so they physically fit inside your GPU. You can estimate the resulting file sizes yourself; see the sketch after this list.

  • FP16
    Uncompressed Perfection. Massive file size. Highest possible quality. Used almost exclusively in enterprise environments where massive data center GPUs are available.
  • Q8_0
    Virtually Lossless. Cuts the file size roughly in half. Human reviewers cannot perceive a difference in reasoning quality compared to FP16. Highly recommended.
  • Q6_K
    The Sweet Spot. Trims the weights to roughly 6 bits each. Arguably the best overall balance of VRAM usage, inference speed, and retained intelligence.
  • Q4_K
    Aggressive Squeeze. Shrinks the model by roughly 70-75%. Perfect for running huge 70B models on heavily limited 24GB hardware. Math capability degrades slightly.
  • Q3_K
    Extreme Compression. Files shrink to roughly a quarter of their original size. Substantial logic degradation begins to occur. Recommended only as a last resort for local hardware.
  • Q2_K
    Minimum Viable. The model is barely functional for complex reasoning but fits on almost any device. Use only for simple chat tasks where VRAM is critically scarce.
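
Here is that estimation sketch. The bits-per-weight figures are approximate community numbers for GGUF-style quants, not official values:

    # Sketch: rough file size for a model at a given quantization level.
    # Bits-per-weight figures are approximate community numbers for GGUF quants.
    BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6,
                       "Q4_K": 4.8, "Q3_K": 3.9, "Q2_K": 3.0}

    def model_size_gb(params_billions: float, quant: str) -> float:
        return params_billions * BITS_PER_WEIGHT[quant] / 8   # billions of bytes = GB

    print(model_size_gb(70, "Q4_K"))   # 42.0 GB -- a 70B model squeezed toward 24-48GB setups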

Performance & Speed

AI performance isn't just about fitting; it's also about memory growth and generation speed.

The KV Cache (Context)

The "Context Window" is the model's short-term memory. As your conversation grows, the KV Cache expands, consuming more VRAM every few thousand words.

ℹ️ A 70B model uses ~1GB VRAM per 8K tokens. A 128K context requires 16GB VRAM just for memory!
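
Here is that rule of thumb as a tiny helper (the 1GB-per-8K figure is the 70B rule quoted above; smaller models use proportionally less):

    # Sketch: KV cache growth for a 70B model, per the rule of thumb above.
    def kv_cache_gb(context_tokens: int) -> float:
        return context_tokens / 8192 * 1.0   # ~1 GB per 8K tokens

    print(kv_cache_gb(131072))   # 16.0 GB just for a 128K context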

Tokens per Second (TPS)

Generation speed is limited by your Memory Bandwidth (GB/s). The GPU must re-read the model's entire 30GB+ brain for every token it writes, so the wider the lane, the faster the output.

ℹ️ Rule: TPS ≈ Bandwidth (GB/s) / Model_Size (GB).
An RTX 4090 (~1,008 GB/s) hits ~33 tok/s on a 30GB Llama-3 model.
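
You can verify the rule yourself; the ~1,008 GB/s figure is the RTX 4090's official memory bandwidth:

    # Sketch: bandwidth-bound generation speed.
    def estimate_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    print(estimate_tps(1008, 30))   # 33.6 tok/s, matching the RTX 4090 figure above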

Quality Scores & Tiers

Our engine parses industry tests (MMLU, SWE-Bench, Chatbot Arena ELO) into five strict graphical badges. Here are the exact thresholds for each tier inside the calculator (a lookup sketch follows the tables):

Excellent

MMLU (General) 83.0+
SWE-B (Coding) 30.0+
LMSYS (ELO) 1250+

Great

MMLU (General) 75.0 - 82.9
SWE-B (Coding) 22.0 - 29.9
LMSYS (ELO) 1200 - 1249

Good

MMLU (General) 67.0 - 74.9
SWE-B (Coding) 15.0 - 21.9
LMSYS (ELO) 1150 - 1199

Fair

MMLU (General) 55.0 - 66.9
SWE-B (Coding) 8.0 - 14.9
LMSYS (ELO) 1100 - 1149

Basic

Models that fall below the Fair threshold. Usually obsolete.
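
As a sketch, here is how those thresholds become a lookup, using the MMLU column (the function is ours; the numbers are copied from the tables above):

    # Sketch: assigning a tier badge from an MMLU score.
    MMLU_TIERS = [("Excellent", 83.0), ("Great", 75.0),
                  ("Good", 67.0), ("Fair", 55.0)]   # thresholds from the tables above

    def tier_for_mmlu(score: float) -> str:
        for name, minimum in MMLU_TIERS:
            if score >= minimum:
                return name
        return "Basic"   # below the Fair threshold

    print(tier_for_mmlu(79.5))   # Great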


Result Categories & Badges

The algorithm buckets model cards into three distinct visual containers, plus a cloud-rental fallback. These are the exact headers and status texts you will see dynamically rendered based on the GPU math:

VRAM-NATIVE
Fits Well

The model's weights plus the expected Context Window buffer fit strictly inside your physical GPU VRAM.

Under the Hood (The Math)

Trigger: Weights_GB + (KV_Cache_GB @ 32K) + 1GB_Overhead < Total_VRAM

When this condition is met, inference happens at Maximum Hardware Bandwidth. No slow system memory is used.

HYBRID-MODE
Tight Fit
⚠️ Offloads 8.4GB to RAM (~35% slower)
ℹ️ 32K on GPU, 16K extended via System RAM

Your VRAM is full, forcing the calculator to spill layers into your System RAM.

Under the Hood (The Math)

Trigger: VRAM_Saturation > 95% AND Weight_Offload < 50%

Tokens will generate at approximately 15-40% of native GPU speed, depending on how much is offloaded and your PCIe lane bandwidth.

FATAL-OOM
Does Not Fit

The model architecture exceeds your total physical memory capacity (VRAM + System RAM). Local execution is mathematically impossible.

Under the Hood (The Math)

Trigger: Weights_GB + KV_Cache_GB > (Total_VRAM + System_RAM)

When this red outline appears, the calculator automatically triggers the Cloud Rental Scanner.
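
Putting the three triggers together, here is a simplified classifier. It collapses the HYBRID-MODE saturation and offload conditions into a plain capacity check, and all names are ours:

    # Sketch: bucketing a model with the three triggers above (simplified).
    def classify(weights_gb, kv_cache_gb_32k, vram_gb, ram_gb, overhead_gb=1.0):
        if weights_gb + kv_cache_gb_32k + overhead_gb < vram_gb:
            return "VRAM-NATIVE"   # Fits Well
        if weights_gb + kv_cache_gb_32k <= vram_gb + ram_gb:
            return "HYBRID-MODE"   # Tight Fit: spills into System RAM
        return "FATAL-OOM"         # Does Not Fit -> Cloud Rental Scanner

    print(classify(weights_gb=42, kv_cache_gb_32k=4, vram_gb=24, ram_gb=32))   # HYBRID-MODE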

EXTERNAL-COMPUTE
Cloud Rental Pill

Cheapest Fit Algorithm

The calculator scans an API of data-center GPUs to find the absolute lowest hourly price for a GPU that has enough VRAM (A6000, A100, or H100) to host the model at full 32K context.
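
The idea reduces to a filter-and-minimum. The A6000 price matches the pill shown below; the A100 and H100 prices are illustrative placeholders, not live market data:

    # Sketch of the Cheapest Fit idea: lowest $/hr among GPUs with enough VRAM.
    GPUS = [("A6000", 48, 0.44), ("A100", 80, 1.29), ("H100", 80, 2.49)]   # (name, GB, $/hr)

    def cheapest_fit(required_gb: float):
        fits = [g for g in GPUS if g[1] >= required_gb]
        return min(fits, key=lambda g: g[2]) if fits else None

    print(cheapest_fit(46))   # ('A6000', 48, 0.44)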

Market Rate Estimate

The blue pill displays the live average market rate on platforms like RunPod or Lambda. This helps you decide if a 44¢ rental is better than a $2,000 upgrade.
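
The comparison boils down to a one-line break-even calculation using the two figures above:

    # Sketch: hours of rental before a $2,000 upgrade pays for itself.
    print(round(2000 / 0.44))   # ~4545 hours of A6000 time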

Live UI Component Replica:
Rent Nvidia A6000 for ~$0.44/hr