From ₹39L to ₹23L/Month: How a Series B AI Startup Cut Their LLM Bill 41% in Two Weeks
Zero feature changes. Zero model quality degradation. Zero additional engineers. Just attribution data and two weeks of surgical fixes. Here is the full breakdown.
About this case study
This is a real TokenFin private beta engagement. The company name and identifiable details have been anonymised at their request. All numbers are real and unrounded. The CFO reviewed this post before publication.
The situation on day 1
| Team size | Monthly LLM spend | Attribution visibility |
|---|---|---|
| 45 people (Series B, B2B SaaS) | ₹39.2L (~$47,000 USD) | 0% (no breakdown by feature) |
The product is a B2B SaaS tool that uses LLMs across three distinct surfaces: document analysis for enterprise customers, a customer support chatbot, and an internal operations layer for their own team.
Their LLM bill had grown 4.3× in 8 months — from ₹9.1L/month at launch to ₹39.2L/month. The CEO was tracking revenue growth (healthy). The CFO was tracking cost growth (alarming). The gap between the two was closing.
When we asked their CTO to explain where the ₹39.2L was going, the honest answer was: "GPT-4o, mostly. Some Anthropic. Our document feature is the biggest driver, we think."
"We think" is not a FinOps posture.
Day 1: Instrumentation — 20 minutes
We added TokenFin instrumentation to every LLM call in their codebase. The team had 3 developers on a call. It took 20 minutes. Every existing call path was wrapped with attribution metadata:
```typescript
// Before: no attribution
const result = await openai.chat.completions.create({ model, messages });

// After: 3 attribution tags added
const result = await track(
  openai.chat.completions.create({ model, messages }),
  { feature: 'document-analysis', env: process.env.NODE_ENV, team: 'product' },
);
```
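TokenFin's actual `track()` implementation is not shown in this post. As a rough illustration of what an attribution wrapper of this shape can do, here is a minimal hypothetical sketch: it awaits the provider call, reads the `usage` token counts that OpenAI chat completion responses include, and stores them alongside the caller's tags. The `records` array, the `CallRecord` shape, and in-memory storage are assumptions for illustration, not TokenFin's SDK.

```typescript
// Hypothetical attribution wrapper (NOT TokenFin's SDK): await the call,
// read token usage from the response, store it with the caller's tags.

interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface LlmResponse {
  model: string;
  usage?: Usage; // chat completion responses carry a usage object like this
}

interface Tags {
  feature: string;
  env: string | undefined;
  team: string;
}

interface CallRecord extends Tags {
  model: string;
  promptTokens: number;
  completionTokens: number;
  timestamp: number;
}

// In-memory sink; a real system would ship records somewhere durable.
const records: CallRecord[] = [];

async function track<T extends LlmResponse>(call: Promise<T>, tags: Tags): Promise<T> {
  const response = await call;
  records.push({
    ...tags,
    model: response.model,
    promptTokens: response.usage?.prompt_tokens ?? 0,
    completionTokens: response.usage?.completion_tokens ?? 0,
    timestamp: Date.now(),
  });
  return response; // the caller gets the original response back unchanged
}
```

Returning the untouched response is what lets a wrapper like this drop into existing call sites without changing any downstream code.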
Within 48 hours of instrumentation, the attribution dashboard had real data on every call path. What it showed was not what anyone expected.
The 4 findings (and how much each cost)
Dev and staging were running production models
Monthly waste: ₹10.9L
The attribution data showed that 28% of total API spend was tagged with env: development or env: staging. These environments were calling gpt-4o, the same model used in production, with no rate limiting.
Chart: Cost breakdown, dev/staging vs production (before)
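A figure like the 28% comes from grouping tagged spend by its `env` tag. A minimal sketch of that aggregation, assuming each call record already carries a computed cost; the `SpendRecord` shape and the sample rows are made up for illustration and are not the company's data.

```typescript
// Hypothetical record shape: one row per LLM call, cost already computed.
interface SpendRecord {
  env: string;
  costInr: number;
}

// Returns each environment's fraction of total spend (0..1).
function spendShareByEnv(rows: SpendRecord[]): Record<string, number> {
  const totals: Record<string, number> = {};
  let grandTotal = 0;
  for (const row of rows) {
    totals[row.env] = (totals[row.env] ?? 0) + row.costInr;
    grandTotal += row.costInr;
  }
  const shares: Record<string, number> = {};
  for (const env of Object.keys(totals)) {
    shares[env] = grandTotal === 0 ? 0 : totals[env] / grandTotal;
  }
  return shares;
}

// Illustrative sample only (arbitrary units).
const sampleRows: SpendRecord[] = [
  { env: 'production', costInr: 72 },
  { env: 'development', costInr: 18 },
  { env: 'staging', costInr: 10 },
];
const shareByEnv = spendShareByEnv(sampleRows);
const nonProdShare = (shareByEnv['development'] ?? 0) + (shareByEnv['staging'] ?? 0);
console.log(nonProdShare.toFixed(2)); // 0.28
```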
```typescript
// The fix: 4 lines of config
const model =
  process.env.NODE_ENV === 'production'
    ? 'gpt-4o'
    : 'gpt-4o-mini'; // 16.7× cheaper
```
GPT-4o-mini is 16.7× cheaper than GPT-4o on both input and output tokens at list prices. Developers testing against it in dev/staging get functionally identical results for 95% of use cases. Saving: ₹10.9L/month.
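The 16.7× ratio follows directly from per-token list prices. Assuming OpenAI's published prices of $2.50 (input) and $10.00 (output) per million tokens for gpt-4o, and $0.15 and $0.60 for gpt-4o-mini (check current pricing before relying on these):

```latex
\frac{\$2.50 \text{ per 1M input tokens}}{\$0.15 \text{ per 1M input tokens}}
= \frac{\$10.00 \text{ per 1M output tokens}}{\$0.60 \text{ per 1M output tokens}}
= 16.\overline{6} \approx 16.7
```

Because the ratio is the same for input and output tokens, moving a call from gpt-4o to gpt-4o-mini cuts its cost by about 94% (1 − 1/16.7), whatever its input/output mix.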
A batch pipeline had been silently retrying for 11 weeks
Monthly waste: ₹4.7L
The document analysis pipeline had a JSON parsing bug. When the LLM returned malformed JSON (roughly 12% of calls due to context truncation), the retry logic fired immediately — up to 5 times — with the full context window each time. Each retry cost the same as the original call. The bug had been live for 11 weeks before anyone noticed, because nobody was looking at call volume vs expected call volume.
Chart: Pipeline retry analysis
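How much a naive retry loop costs depends on whether failures repeat. The sketch below computes expected API calls per logical request under two simplified failure models, both assumptions rather than measurements from this pipeline: independent failures (each attempt fails with probability p) and persistent failures (a request that fails once keeps failing, for example because the same oversized context is truncated the same way on every retry). It plugs in the post's 12% failure rate and 5 retries.

```typescript
// Expected API calls per logical request when a failed parse triggers an
// immediate retry, up to maxRetries extra attempts. Both failure models are
// simplifying assumptions, not measurements.

// Independent: each attempt fails with probability p, regardless of history.
// Attempt k + 1 happens only if the first k attempts failed, so the
// expected count is 1 + p + p^2 + ... + p^maxRetries.
function expectedCallsIndependent(p: number, maxRetries: number): number {
  let total = 0;
  for (let k = 0; k <= maxRetries; k++) {
    total += Math.pow(p, k);
  }
  return total;
}

// Persistent: a request that fails once fails on every retry, so it
// consumes all 1 + maxRetries attempts.
function expectedCallsPersistent(p: number, maxRetries: number): number {
  return (1 - p) * 1 + p * (1 + maxRetries);
}

console.log(expectedCallsIndependent(0.12, 5).toFixed(3)); // 1.136
console.log(expectedCallsPersistent(0.12, 5).toFixed(2)); // 1.60
```

Under the persistent model the loop adds about 60% more calls to the pipeline, each carrying the full context window; under the independent model it adds under 14%. Truncation-driven failures plausibly sit closer to the persistent case, which is why capping retries matters.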
Fixes: (1) Add explicit JSON mode to the API call to constrain the model to JSON output. (2) Replace the immediate retries with exponential backoff, capped at 2 retries. (3) Log the parse failure rate as a dashboard metric so it stays visible. Result: the parse failure rate dropped to 1.2%, and retry cost dropped to near zero.
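A minimal sketch of fixes (2) and (3), assuming the model call is wrapped in a function that returns the raw response text. `callWithJsonRetry`, `ParseStats`, and the 500 ms base delay are illustrative choices, not the company's code; JSON mode (fix 1) would be requested inside `callModel` itself.

```typescript
interface ParseStats {
  attempts: number;
  parseFailures: number;
}

// Running counters; in production these would feed a dashboard metric.
const parseStats: ParseStats = { attempts: 0, parseFailures: 0 };

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Retries only on JSON parse failures; errors thrown by callModel itself
// (network, rate limits) propagate immediately in this sketch.
async function callWithJsonRetry<T>(
  callModel: () => Promise<string>,
  maxRetries = 2,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown = undefined;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      // Exponential backoff: baseDelayMs, then 2 × baseDelayMs, ...
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
    parseStats.attempts++;
    const raw = await callModel();
    try {
      return JSON.parse(raw) as T;
    } catch (err) {
      parseStats.parseFailures++;
      lastError = err;
    }
  }
  throw new Error(`No valid JSON after ${maxRetries + 1} attempts: ${String(lastError)}`);
}

// The value to chart: failed parses as a fraction of all attempts.
function parseFailureRate(s: ParseStats): number {
  return s.attempts === 0 ? 0 : s.parseFailures / s.attempts;
}
```

Capping retries at 2 bounds the worst case at 3 calls per request instead of 6, and spacing them out keeps a burst of bad responses from turning into an immediate burst of full-context calls.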
Internal ops tooling ran GPT-4o for formatting tasks
Monthly waste: ₹1.8L
Their ops team used an internal tool to reformat customer data exports — a task that requires GPT-3.5-level intelligence at best. The tool was configured to use gpt-4o because that was the company default. No engineer had explicitly chosen it — it was inherited from a config file written in month 1. Downgrading this tool to gpt-4o-mini: zero functional change, ₹1.8L/month saving.
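One way to stop an inherited default from choosing the model is to require an explicit entry per feature. In this hypothetical sketch, `Record<Feature, string>` makes the TypeScript compiler reject a config that omits a feature. Apart from `document-analysis`, the feature names are invented, and the model assignments are illustrative, not this company's configuration.

```typescript
// Hypothetical feature names (only 'document-analysis' appears in the post).
type Feature = 'document-analysis' | 'support-chatbot' | 'ops-export-formatter';

// Every feature must be listed: Record<Feature, string> is a compile error
// if a key is missing, so nothing silently inherits a company-wide default.
// Model choices here are illustrative, not the company's actual config.
const MODEL_BY_FEATURE: Record<Feature, string> = {
  'document-analysis': 'gpt-4o',
  'support-chatbot': 'gpt-4o-mini',
  'ops-export-formatter': 'gpt-4o-mini', // formatting task: a cheap model is enough
};

function modelFor(feature: Feature, env: string | undefined): string {
  // Non-production environments always use the cheaper model.
  if (env !== 'production') return 'gpt-4o-mini';
  return MODEL_BY_FEATURE[feature];
}

console.log(modelFor('ops-export-formatter', 'production')); // gpt-4o-mini
console.log(modelFor('document-analysis', 'staging')); // gpt-4o-mini
```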
Customer support was already efficient — leave it alone
Monthly waste: ₹0 (no change needed)
The chatbot, which was the highest-volume feature, was already well-optimised. Short system prompts (280 tokens). Conversation history truncated to last 6 turns. Response caching for repeated intents (covering ~31% of calls). This was their most cost-efficient feature at ₹0.0019/conversation. The right call: document it, set a budget alert, and leave it alone.
This is an important lesson. Attribution data does not just reveal waste — it also reveals what your team is doing right. Knowing that your support feature is efficient lets you invest confidently in scaling it.
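The chatbot's two main cost controls, a bounded conversation history and a response cache for repeated intents, can be sketched as follows. This assumes a "turn" means one user message plus one assistant reply, and that intents are already classified into a string key; how the team actually detects intents is not described in the post, and `IntentCache` is a hypothetical name.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Keep the system prompt(s) plus the last `maxTurns` user/assistant pairs.
// Assumes the dialogue alternates user/assistant.
function truncateHistory(messages: ChatMessage[], maxTurns: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === 'system');
  const dialogue = messages.filter((m) => m.role !== 'system');
  const kept: ChatMessage[] = maxTurns > 0 ? dialogue.slice(-maxTurns * 2) : [];
  return system.concat(kept);
}

// Serves a stored answer when the same intent comes up again,
// skipping the model call entirely on a hit.
class IntentCache {
  private readonly store = new Map<string, string>();
  hits = 0;
  misses = 0;

  async getOrCompute(intent: string, compute: () => Promise<string>): Promise<string> {
    const cached = this.store.get(intent);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const answer = await compute();
    this.store.set(intent, answer);
    return answer;
  }
}
```

A production cache would also need expiry and a rule for which intents are safe to answer from cache; this sketch keeps entries forever.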
The outcome — day 14
| Before | After | Saved |
|---|---|---|
| ₹39.2L/mo | ₹23.0L/mo | ₹16.2L/mo (-41.3%) |
| Finding | Monthly saving | Effort | Time to fix |
|---|---|---|---|
| Dev/staging model swap | ₹10.9L | 4 lines of config | 1 hour |
| Retry loop fix + JSON mode | ₹4.7L | ~50 lines of code | 1 day |
| Internal tooling downgrade | ₹1.8L | 1 config change | 20 minutes |
| Support feature (no change) | — | — | — |
| Total | ₹17.4L | ~1.5 engineer-days | 14 days to full effect |
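Note: the bill fell by ₹16.2L rather than the full ₹17.4L because of natural volume growth during the period. On its own, ₹17.4L of optimisation would have brought the bill to ₹21.8L; roughly ₹1.2L/month of new usage brought the actual figure to ₹23.0L.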
What they built after
With attribution data live, they built three operational practices they now run permanently:
Monthly model review
Every month, pull the cost-per-feature-per-model report. Ask: has this feature's quality requirement changed? Has a cheaper model improved enough to handle it? The LLM market moves fast — what was true in January may not be true in July.
Budget alerts per feature
Set a spend threshold for each feature in TokenFin. If document-analysis exceeds ₹12L in a month, the on-call engineer gets a Slack alert. Not the CFO asking questions after the invoice — an alert when there is still time to act.
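A per-feature budget check can be sketched as comparing projected month-end spend with each feature's threshold. Projecting from the run rate so far (a simple linear assumption) lets the alert fire before the budget is actually exceeded. The ₹12L limit for `document-analysis` is from this post; the `support-chatbot` name, its threshold, and the spend figures are invented, and delivering the alert (for example to Slack) is left to the caller.

```typescript
interface FeatureSpend {
  feature: string;
  spendToDateLakh: number; // ₹ lakh spent so far this month
}

interface BudgetAlert {
  feature: string;
  projectedLakh: number;
  budgetLakh: number;
}

// Linear projection: assumes the rest of the month runs at the same daily rate.
function projectMonthEnd(spendToDate: number, dayOfMonth: number, daysInMonth: number): number {
  return (spendToDate * daysInMonth) / dayOfMonth;
}

function checkBudgets(
  spend: FeatureSpend[],
  budgetsLakh: Record<string, number>,
  dayOfMonth: number,
  daysInMonth: number,
): BudgetAlert[] {
  const alerts: BudgetAlert[] = [];
  for (const s of spend) {
    const budget = budgetsLakh[s.feature];
    if (budget === undefined) continue; // no threshold configured for this feature
    const projected = projectMonthEnd(s.spendToDateLakh, dayOfMonth, daysInMonth);
    if (projected > budget) {
      alerts.push({ feature: s.feature, projectedLakh: projected, budgetLakh: budget });
    }
  }
  return alerts; // the caller decides how to notify (Slack, pager, email)
}

// Day 10 of a 30-day month, invented spend figures.
const monthAlerts = checkBudgets(
  [
    { feature: 'document-analysis', spendToDateLakh: 6 },
    { feature: 'support-chatbot', spendToDateLakh: 1 },
  ],
  { 'document-analysis': 12, 'support-chatbot': 4 },
  10,
  30,
);
// Expect one alert: document-analysis projected at ₹18L against a ₹12L budget.
console.log(monthAlerts);
```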
Cost review in every feature spec
Before any new LLM feature ships, the spec now includes: estimated calls/day, model choice justification, expected monthly cost, and whether dev/staging will use a cheaper model. Takes 10 minutes to write. Prevents months of unattributed spend.
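The "expected monthly cost" line in such a spec is simple arithmetic: calls per day × days × (input tokens × input price + output tokens × output price). A sketch, assuming prices in USD per million tokens and a 30-day month; the traffic numbers are hypothetical, and the prices are assumed gpt-4o and gpt-4o-mini list prices ($2.50/$10.00 and $0.15/$0.60 per million input/output tokens) that should be checked against current pricing.

```typescript
interface CostEstimateInput {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
  inputPricePerMTok: number; // USD per 1M input tokens
  outputPricePerMTok: number; // USD per 1M output tokens
  daysPerMonth?: number; // defaults to 30
}

function estimateMonthlyCostUsd(e: CostEstimateInput): number {
  const days = e.daysPerMonth ?? 30;
  const perCall =
    (e.inputTokensPerCall / 1e6) * e.inputPricePerMTok +
    (e.outputTokensPerCall / 1e6) * e.outputPricePerMTok;
  return perCall * e.callsPerDay * days;
}

// Hypothetical feature: 10,000 calls/day, ~2,000 input and ~500 output tokens each.
const traffic = { callsPerDay: 10000, inputTokensPerCall: 2000, outputTokensPerCall: 500 };

// Same traffic priced on each model, using assumed list prices.
const gpt4oEstimate = estimateMonthlyCostUsd({ ...traffic, inputPricePerMTok: 2.5, outputPricePerMTok: 10 });
const miniEstimate = estimateMonthlyCostUsd({ ...traffic, inputPricePerMTok: 0.15, outputPricePerMTok: 0.6 });

console.log(gpt4oEstimate.toFixed(2)); // 3000.00
console.log(miniEstimate.toFixed(2)); // 180.00
```

Writing both numbers into the spec makes the model choice justification concrete, and the second figure doubles as the expected dev/staging rate if those environments run the cheaper model at similar volume.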
What this actually means
₹16.2L/month saved is ₹1.94 crore per year. That would fund several mid-level engineers in India for a year. It is 3 months of runway for a seed-stage startup.
None of these optimisations required a new engineering hire, a vendor change, or a product decision. They required visibility. The waste was always there. Attribution just made it visible.
Every AI team we onboard finds something in the first 48 hours. The question is not whether the waste exists. The question is how long you want to keep funding it.
Start your free TokenFin trial — attribution in 5 minutes →