AWS Bedrock vs self-hosted inference — the cost math at sub-1M tokens/day

2025-05-28 · technical · 5 min read · Pierre Richard

We rebuilt the inference layer in March and the question we kept running into was: do we self-host an open model on a GPU instance, or stay on AWS Bedrock and pay per token? The answer was unambiguous at our scale, but the threshold where it flips is closer than I expected. Here's the math.

The setup

The pipeline does roughly 800K input tokens and 120K output tokens per day across:

prospect classification (Haiku-class, ~150 tokens out)
email generation (Sonnet-class, ~600 tokens out)
response classification (Haiku-class, ~50 tokens out)

Mixed Haiku/Sonnet usage on Bedrock. For self-hosting we benchmarked Llama-3-70B-Instruct, which is the closest open equivalent to Sonnet for this kind of long-form generation work. Llama-3-8B-Instruct stands in for Haiku.

The Bedrock cost

At list prices as of this writing, per million tokens:

Claude Haiku: $0.25 input / $1.25 output
Claude Sonnet: $3.00 input / $15.00 output

Our monthly Bedrock spend at this volume:

Haiku  : input   600K/d × 30 d × $0.25/M = $4.50
         output  200K/d × 30 d × $1.25/M = $7.50
         subtotal                        = $12/mo

Sonnet : input   200K/d × 30 d × $3.00/M  = $18.00
         output   80K/d × 30 d × $15.00/M = $36.00
         subtotal                         = $54/mo

Total Bedrock : ≈ $66/mo

Sixty-six dollars a month. That's the number to beat.
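For anyone who wants to plug in their own volumes, here is a minimal sketch of the same arithmetic. The prices and per-model daily volumes are the ones used in the breakdown above; the dictionary layout and the `monthly_cost` helper are just illustrative names, not part of any AWS SDK.

```python
# Per-million-token list prices quoted above (USD).
PRICES = {
    "haiku": {"input": 0.25, "output": 1.25},
    "sonnet": {"input": 3.00, "output": 15.00},
}

# Daily token volumes per model, as used in the breakdown above.
DAILY_TOKENS = {
    "haiku": {"input": 600_000, "output": 200_000},
    "sonnet": {"input": 200_000, "output": 80_000},
}

DAYS_PER_MONTH = 30


def monthly_cost(model: str) -> float:
    """Monthly USD cost for one model: tokens/day x days x price per million."""
    return sum(
        DAILY_TOKENS[model][kind] * DAYS_PER_MONTH / 1_000_000 * PRICES[model][kind]
        for kind in ("input", "output")
    )


bedrock_monthly = {model: monthly_cost(model) for model in PRICES}
bedrock_total = sum(bedrock_monthly.values())

for model, cost in bedrock_monthly.items():
    print(f"{model:<7}: ${cost:.2f}/mo")
print(f"total  : ${bedrock_total:.2f}/mo")
```

Swap in your own `DAILY_TOKENS` and the total updates accordingly.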

The self-host cost (g5.12xlarge)

For Llama-3-70B you want at least a g5.12xlarge (4× A10G, 96 GiB combined VRAM) to fit the model with vLLM and meaningful batching. List on-demand pricing, us-east-1: $5.672/hr.

Run it 24/7:

g5.12xlarge × 730 hr/mo × $5.672/hr = $4,140/mo

A single instance, no redundancy, no auto-scaling buffer, no load balancer, no health-checks, no spot fallback. Self-host is 62× more expensive than Bedrock at our volume.

Even on full spot pricing (~70% discount, with eviction risk you'd have to build dual-region failover around), you get to ~$1,240/mo. Still 19× Bedrock.
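The same comparison as a quick sketch, using the hourly rate and the rough 70% spot discount quoted above. Real spot prices move by the hour and by AZ, so treat `SPOT_DISCOUNT` as a placeholder rather than a quote.

```python
HOURS_PER_MONTH = 730
G5_12XLARGE_HOURLY = 5.672  # on-demand, us-east-1, as quoted above
SPOT_DISCOUNT = 0.70        # rough figure from the text; actual spot pricing fluctuates
BEDROCK_MONTHLY = 66.0      # total from the Bedrock section

on_demand_monthly = G5_12XLARGE_HOURLY * HOURS_PER_MONTH
spot_monthly = on_demand_monthly * (1 - SPOT_DISCOUNT)

on_demand_ratio = on_demand_monthly / BEDROCK_MONTHLY
spot_ratio = spot_monthly / BEDROCK_MONTHLY

print(f"on-demand: ${on_demand_monthly:,.0f}/mo ({on_demand_ratio:.1f}x Bedrock)")
print(f"spot     : ${spot_monthly:,.0f}/mo ({spot_ratio:.1f}x Bedrock)")
```

Neither figure includes a second instance for redundancy, so both ratios are the optimistic case for self-hosting.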

Where the line actually flips

Bedrock's marginal cost is constant: every extra million Sonnet tokens in and million out costs exactly $18 more ($3 input + $15 output). Self-host's marginal cost is roughly zero per token until you saturate the GPU and have to add another instance.

Compare the two at full utilization (one g5.12xlarge saturated for the entire month at ~25–30 generation tokens/sec sustained), then solve for the volume where they cross:

Bedrock cost at saturation throughput
  = 25 tokens/s × 86,400 s/d × 30 d × $15/M
  = ~$972/mo  (output-only, ignoring input)

Self-host (on-demand)
  = $4,140/mo

Break-even on-demand: ~110M output tokens/month
                     ≈ 4M tokens/day sustained

Until you're sustaining about 4M generation tokens per day — i.e. roughly 30× our current volume — Bedrock wins on absolute dollars. With spot instances the break-even drops to about 1.2M tokens/day sustained, but you're now also paying with eviction risk and operational burn.
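The break-even is a fixed monthly bill divided by a per-token price, which is easy to sketch. The derivation of the ~110M figure is not spelled out step by step above, so this assumes one reading that lands in the same range: every output token carries input tokens at the pipeline's overall 800K:120K ratio, all billed at Sonnet list prices. The function name and its parameters are illustrative.

```python
HOURS_PER_MONTH = 730
DAYS_PER_MONTH = 30
ON_DEMAND_MONTHLY = 5.672 * HOURS_PER_MONTH  # g5.12xlarge, us-east-1
SPOT_MONTHLY = ON_DEMAND_MONTHLY * 0.30      # assumes the ~70% spot discount

SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00
# Assumption: input tokens scale with output at the pipeline's 800K:120K ratio.
INPUT_PER_OUTPUT = 800_000 / 120_000


def break_even_output_tokens(
    fixed_monthly_usd: float,
    output_price_per_m: float,
    input_price_per_m: float = 0.0,
    input_per_output: float = 0.0,
) -> float:
    """Output tokens/month at which per-token spend equals a fixed monthly bill."""
    usd_per_m_output = output_price_per_m + input_price_per_m * input_per_output
    return fixed_monthly_usd / usd_per_m_output * 1_000_000


on_demand_per_day = break_even_output_tokens(
    ON_DEMAND_MONTHLY, SONNET_OUTPUT_PER_M, SONNET_INPUT_PER_M, INPUT_PER_OUTPUT
) / DAYS_PER_MONTH
spot_per_day = break_even_output_tokens(
    SPOT_MONTHLY, SONNET_OUTPUT_PER_M, SONNET_INPUT_PER_M, INPUT_PER_OUTPUT
) / DAYS_PER_MONTH

print(f"on-demand break-even: {on_demand_per_day / 1e6:.1f}M output tokens/day")
print(f"spot break-even     : {spot_per_day / 1e6:.1f}M output tokens/day")
```

Pricing output tokens alone (leaving `input_per_output` at 0) pushes the on-demand break-even higher, to roughly 276M output tokens/month, so the blended assumption is the more generous case for self-hosting.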

The other costs Bedrock spares you

Even if you cross the dollar break-even, there's an operational stack you don't pay for on Bedrock:

Capacity planning. Auto-scaling on g5 instances is genuinely hard because cold-start of a 70B model is ~3–5 minutes from instance boot to ready-to-serve.
Region/AZ resilience. Bedrock is multi-AZ by default. Self-host requires you to build it.
Model upgrades. Anthropic ships new Claude versions. Self-hosting means you re-evaluate, re-fine-tune, re-deploy.
Compliance. Bedrock inherits AWS's SOC 2, HIPAA, GDPR posture. Self-host inherits yours, which is harder.
Data residency. Bedrock has EU regions. We need this for our Paris-based GDPR posture.

We priced these in at roughly 0.5 SRE FTE (about $80K/year), which alone would buy ~14,000 hours of g5.12xlarge time at the on-demand rate. Operationally, even at high volume, the math stays uncomfortable.

Where self-host genuinely wins

Three scenarios where I'd flip:

1. You're at >5M tokens/day sustained and have an SRE willing to own GPU ops full-time. The dollars start to matter at that volume.
2. You need to fine-tune on proprietary data and don't want it touching a third-party API. Bedrock supports custom models but with friction; self-host is cleaner.
3. You need sub-200ms first-token latency and you're willing to pay for it. Bedrock's Sonnet is great but isn't the lowest-latency option in the world.

Otherwise, sub-1M tokens/day on Bedrock costs less than my Paris coffee budget. Done.

The pragmatic mid-path

For teams with mixed needs, we use one trick: Haiku for the broad sweep, Sonnet for the high-confidence prospects. 80% of our generation is Haiku at $1.25/M output; only the top-decile prospects (where the email actually matters) get Sonnet at $15/M.

That's a 12× per-token cost difference for the same piece of email infrastructure. Spend the Sonnet budget where reply probability is highest. Spend the Haiku budget on the broad stroke.

The right cost-control move at our scale isn't moving off Bedrock. It's moving within Bedrock toward the right model for the right slice of work.
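A minimal sketch of what that routing looks like and what it does to the blended output price. The `reply_score` input and the 0.9 cutoff are hypothetical stand-ins for whatever prospect scoring you already have, and the 80/20 split assumes everything that isn't Haiku goes to Sonnet.

```python
HAIKU_OUTPUT_PER_M = 1.25    # USD per million output tokens, as quoted above
SONNET_OUTPUT_PER_M = 15.00


def pick_model(reply_score: float, sonnet_cutoff: float = 0.9) -> str:
    """Route high-scoring prospects to Sonnet, everyone else to Haiku.

    reply_score is a hypothetical 0..1 estimate of reply probability.
    """
    return "sonnet" if reply_score >= sonnet_cutoff else "haiku"


def blended_output_price(haiku_share: float) -> float:
    """Average USD per million output tokens for a given Haiku share (0..1)."""
    return haiku_share * HAIKU_OUTPUT_PER_M + (1 - haiku_share) * SONNET_OUTPUT_PER_M


blended = blended_output_price(0.80)
print(f"routing check : {pick_model(0.95)}, {pick_model(0.40)}")
print(f"blended price : ${blended:.2f}/M output")
print(f"all-Sonnet    : ${SONNET_OUTPUT_PER_M:.2f}/M output")
```

The cutoff is the knob: raise it and more volume falls to Haiku, lowering the blended price at the cost of fewer Sonnet-written emails.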

Want this on your prospect list? pierre@parisai.click.