AWS Bedrock vs self-hosted inference — the cost math at sub-1M tokens/day
We rebuilt the inference layer in March and the question we kept running into was: do we self-host an open model on a GPU instance, or stay on AWS Bedrock and pay per token? The answer was unambiguous at our scale, but the threshold where it flips is closer than I expected. Here's the math.
The setup
The pipeline does roughly 800K input tokens and 120K output tokens per day across:
- prospect classification (Haiku-class, ~150 tokens out)
- email generation (Sonnet-class, ~600 tokens out)
- response classification (Haiku-class, ~50 tokens out)
Mixed Haiku/Sonnet usage on Bedrock. For self-hosting we benchmarked Llama-3-70B-Instruct, which is the closest open equivalent to Sonnet for this kind of long-form generation work. Llama-3-8B-Instruct stands in for Haiku.
The Bedrock cost
At June 2025 list prices, per million tokens:
- Claude Haiku: $0.25 input / $1.25 output
- Claude Sonnet: $3.00 input / $15.00 output
Our monthly Bedrock spend at this volume:
Haiku : 600K in/d × 200K out/d × 30d × ($0.25/$1.25 per M)
= $4.50 + $7.50 = $12/mo
Sonnet : 200K in/d × 80K out/d × 30d × ($3.00/$15.00 per M)
= $18.00 + $36.00 = $54/mo
Total Bedrock : ≈ $66/mo
Sixty-six dollars a month. That's the number to beat.
The self-host cost (g5.12xlarge)
For Llama-3-70B you want at least an g5.12xlarge (4× A10G, 96 GiB combined VRAM) to fit the model with vLLM and meaningful batching. List on-demand pricing, us-east-1: $5.672/hr.
Run it 24/7:
g5.12xlarge × 730 hr/mo × $5.672/hr = $4,140/mo
A single instance, no redundancy, no auto-scaling buffer, no load balancer, no health-checks, no spot fallback. Self-host is 62× more expensive than Bedrock at our volume.
Even if you go full spot pricing (~70% discount, but with eviction risk you'd build dual-region failover for) you get to ~$1,240/mo. Still 19× Bedrock.
Where the line actually flips
Bedrock's marginal cost is constant — every extra million tokens of Sonnet costs exactly $18 more. Self-host's marginal cost is roughly zero per token until you saturate the GPU and have to add another instance.
The break-even at full utilization (which means saturating one g5.12xlarge for the entire month, ~25–30 generation tokens/sec sustained):
Bedrock cost at saturation throughput
= 25 tokens/s × 86,400 s/d × 30 d × $15/M
= ~$972/mo (output-only, ignoring input)
Self-host (on-demand)
= $4,140/mo
Break-even on-demand: ~110M output tokens/month
≈ 4M tokens/day sustained
Until you're sustaining about 4M generation tokens per day — i.e. roughly 30× our current volume — Bedrock wins on absolute dollars. With spot instances the break-even drops to about 1.2M tokens/day sustained, but you're now also paying with eviction risk and operational burn.
The other costs Bedrock spares you
Even if you cross the dollar break-even, the operational stack you don't pay for on Bedrock:
- Capacity planning. Auto-scaling on g5 instances is genuinely hard because cold-start of a 70B model is ~3–5 minutes from instance boot to ready-to-serve.
- Region/AZ resilience. Bedrock is multi-AZ by default. Self-host requires you to build it.
- Model upgrades. Anthropic ships new Claude versions. Self-hosting means you re-evaluate, re-fine-tune, re-deploy.
- Compliance. Bedrock inherits AWS's SOC 2, HIPAA, GDPR posture. Self-host inherits yours, which is harder.
- Data residency. Bedrock has EU regions. We need this for our Paris-based GDPR posture.
We priced these in at roughly 0.5 SRE FTE — about $80K/year — which alone would buy ~16,000 hours of g5.12xlarge time. Operationally, even at high volume, the math stays uncomfortable.
Where self-host genuinely wins
Three scenarios where I'd flip:
1. You're at >5M tokens/day sustained and have an SRE willing to own GPU ops full-time. The dollars start to matter at that volume. 2. You need to fine-tune on proprietary data and don't want it touching a third-party API. Bedrock supports custom models but with friction; self-host is cleaner. 3. You need sub-200ms first-token latency and you're willing to pay for it. Bedrock's Sonnet is great but isn't the lowest-latency option in the world.
Otherwise, sub-1M tokens/day on Bedrock costs less than my Paris coffee budget. Done.
The pragmatic mid-path
For teams with mixed needs, we use one trick: Haiku for the broad sweep, Sonnet for the high-confidence prospects. 80% of our generation is Haiku at $1.25/M output; only the top-decile prospects (where the email actually matters) get Sonnet at $15/M.
That's a 12× per-token cost difference for the same piece of email infrastructure. Spend the Sonnet budget where reply probability is highest. Spend the Haiku budget on the broad stroke.
The right cost-control move at our scale isn't moving off Bedrock. It's moving within Bedrock toward the right model for the right slice of work.
Want this on your prospect list? pierre@parisai.click.