Benchmark Methodology

Workload

Single GPU, single stream. Model: Gemma 3 4B in the Arminius .qsam format (SNORM8 weights for FFN, FP16 for attention). Fixed prompt shipped inside the binary. 128 tokens generated per pass, greedy decoding.

Burst & Sustained

Burst is the throughput, in tokens per second, of a single 128-token decode on a freshly loaded GPU. Sustained is the average of four further 128-token decodes run back-to-back (the first is treated as warm-up and discarded). The gap between Burst and Sustained indicates how much thermal or power throttling sets in under continuous load.
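The two figures can be sketched from per-pass wall-clock timings as follows. This is a minimal illustration, not the benchmark's actual code: the function names and the sample timings are hypothetical, and it assumes Sustained is the mean of per-pass throughputs over whichever passes are counted after the warm-up discard (the text does not say whether the average is taken over throughputs or over total time).

```python
from statistics import mean

TOKENS_PER_PASS = 128  # from the workload description


def tokens_per_second(elapsed_s: float, tokens: int = TOKENS_PER_PASS) -> float:
    """Throughput of one decode pass."""
    return tokens / elapsed_s


def burst_and_sustained(burst_elapsed_s: float, sustained_elapsed_s: list[float]):
    """Return (burst_tps, sustained_tps, relative_gap).

    sustained_elapsed_s holds only the timings of the passes that count
    toward the Sustained score (assumption: warm-up already removed).
    """
    burst_tps = tokens_per_second(burst_elapsed_s)
    sustained_tps = mean(tokens_per_second(t) for t in sustained_elapsed_s)
    # Fraction of burst throughput lost once the GPU is under continuous load.
    relative_gap = 1.0 - sustained_tps / burst_tps
    return burst_tps, sustained_tps, relative_gap


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    burst, sustained, gap = burst_and_sustained(2.0, [2.56, 2.56, 2.56, 2.56])
    print(f"burst={burst:.2f} tok/s sustained={sustained:.2f} tok/s gap={gap:.2%}")
```

A gap near zero suggests the card holds its clocks under continuous load; a large gap points to throttling.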

Energy & cost

Power is read from nvidia-smi on NVIDIA and from sysfs hwmon on AMD, with rocm-smi as a fallback. If none is available, the score falls back to TDP × 0.15 and is marked as estimated. Wh/100K is the energy used per 100,000 generated tokens. ¢/100K prices that energy at a flat reference rate of $0.12/kWh so numbers are comparable across regions.
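The arithmetic behind these two figures can be sketched as below. The function names are hypothetical, and the sketch assumes a single average power value over the measured decode and a single throughput figure. The text does not state how power samples are averaged or whether Burst or Sustained throughput is used. As a worked case, 180 W at 50 tokens/s takes 2,000 s for 100,000 tokens, which is 100 Wh, or 1.2 ¢ at the reference rate.

```python
REFERENCE_RATE_USD_PER_KWH = 0.12  # flat rate from the methodology
TOKENS_PER_UNIT = 100_000          # scores are normalised to 100K tokens
TDP_FALLBACK_FACTOR = 0.15         # used when no power sensor is readable


def estimated_power_w(tdp_w: float) -> float:
    """Fallback power estimate; results based on it are marked as estimated."""
    return tdp_w * TDP_FALLBACK_FACTOR


def wh_per_100k(avg_power_w: float, tps: float) -> float:
    """Energy in watt-hours to generate 100,000 tokens."""
    seconds = TOKENS_PER_UNIT / tps
    return avg_power_w * seconds / 3600.0


def cents_per_100k(wh: float) -> float:
    """Cost in US cents of that energy at the reference rate."""
    kwh = wh / 1000.0
    return kwh * REFERENCE_RATE_USD_PER_KWH * 100.0


if __name__ == "__main__":
    # Hypothetical measurement, for illustration only.
    wh = wh_per_100k(avg_power_w=180.0, tps=50.0)
    print(f"{wh:.1f} Wh/100K, {cents_per_100k(wh):.2f} cents/100K")
```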

CUDA baseline

When the benchmark runs with PyTorch enabled, it downloads a pinned reference script from gamedev.tech/download/pytorch-baseline.py and runs the same model on the same GPU through PyTorch + CUDA. The reported speedup is simply burst_tps / cuda_tps, a pure throughput ratio for this one workload.

Reproducibility

Download the binary for your OS, download the pinned model weights, and run arminius-benchmark --tag your_tag. Results upload to the leaderboard automatically; pass --no-upload to keep them local.

Limits

This benchmark measures one model and one workload. Other models, other quantizations and other context lengths will produce different numbers. Driver versions matter. Identical GPUs in different chassis will sustain different scores. Vendor-tuned inference servers (TensorRT-LLM, vLLM) can beat the PyTorch baseline on some hardware; we use plain PyTorch because that is what most developers actually install first.
