Throughput Estimator

Concurrent users + model → estimated TPS and number of GPUs.

MODEL SIZE

GPU

CONCURRENT USERS

TOKENS / REQUEST

GPUS REQUIRED ×

estimated

Single-stream TPS

Aggregate TPS

Response time

Users per GPU

MONTHLY CLOUD COST (24/7) $

GPU × $/hr × 720 hr

Estimates assume optimized serving (vLLM/TGI). PagedAttention, FlashAttention, speculative decoding can yield 30-100% better results. Vanilla Transformers will be slower.