The hidden costs of API-based AI: what your CFO needs to know

API-based AI costs can reach $1-2M annually for banks at scale. But on-premise has hidden costs too. A full breakdown to help your CFO compare both paths.

Jarek Glowka
Co-founder, Compliance & Operations

The per-token price on an LLM API pricing page looks cheap. A few dollars per million tokens. At first glance, it seems like the most cost-efficient way to add AI capabilities to your bank or fintech. Then the invoices start arriving.

What most financial institutions discover after three to six months of production use is that API-based AI costs are unpredictable, difficult to forecast, and significantly higher than they appeared during the pilot phase. And the price per token is only the beginning.

But switching to on-premise infrastructure is not free either. It comes with its own set of costs that are easy to underestimate: hardware maintenance, engineering overhead, and the operational burden of running your own AI stack.

This article breaks down the full cost picture on both sides, so your institution can make an informed decision based on your actual workloads, regulatory obligations, and operational capacity.

How does API-based AI pricing actually work?

LLM API providers charge per token, with separate rates for input tokens (your prompts and context) and output tokens (the model's responses). Prices are quoted per million tokens. A token is roughly four characters of English text, or about three-quarters of a word.

The critical detail that most cost estimates miss is that output tokens are three to ten times more expensive than input tokens. A model that costs $3 per million input tokens might charge $15 per million output tokens. For a compliance screening system that generates detailed risk assessments, which are output-heavy by nature, the actual cost per query is far higher than a quick calculation based on the input price suggests.

At current 2026 pricing, flagship models from major providers cost roughly $5-10 per million input tokens and $15-30 per million output tokens. Mid-tier models run $1-3 input and $5-15 output. Budget models come in under $1 input but still charge $2-5 for output.
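To make the input/output split concrete, here is a minimal cost estimator. The function name, the flat 30-day month, and the input-share parameter are our own conventions for illustration, not anything from a provider's API:

```python
def monthly_api_cost(queries_per_day, tokens_per_query,
                     input_share, input_price, output_price, days=30):
    """Estimate monthly API spend in USD.

    input_price / output_price are USD per million tokens;
    input_share is the fraction of total tokens that are input.
    """
    total_tokens = queries_per_day * tokens_per_query * days
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# 100,000 screenings/day at 2,000 tokens each, mid-tier $3/$15 pricing,
# 1:1 input-to-output split:
print(monthly_api_cost(100_000, 2_000, 0.5, 3, 15))  # 54000.0
```

Shifting `input_share` toward output-heavy workloads is what makes real bills diverge from back-of-envelope estimates based on the input price alone.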

What does API-based AI actually cost at banking scale?

The pilot looked affordable. A few hundred queries a day, a manageable monthly bill. But banking workloads operate at a fundamentally different scale than pilot environments.

Transaction monitoring

A mid-size bank processes hundreds of thousands of transactions daily. If an AI system screens each transaction, even with a short prompt and response, the token volume adds up fast. At 2,000 tokens per screening interaction (including context, transaction data, and the model's assessment), processing 100,000 transactions daily generates 200 million tokens per day, or roughly 6 billion tokens per month.

At mid-tier API pricing ($3/$15 per million tokens, assuming a 1:1 input-to-output ratio), that is approximately $54,000 per month for a single monitoring use case. At flagship model pricing, the number can exceed $120,000 per month.
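The monitoring estimate can be reproduced step by step. This is a sketch using the article's own assumptions (100,000 transactions/day, 2,000 tokens per screening, a 30-day month, mid-tier $3/$15 pricing, 1:1 input-to-output split):

```python
transactions_per_day = 100_000
tokens_per_screening = 2_000     # context + transaction data + assessment
tokens_per_month = transactions_per_day * tokens_per_screening * 30  # 6 billion

# Mid-tier pricing, 1:1 input/output split: $3 in / $15 out per million tokens
input_tokens = tokens_per_month / 2
output_tokens = tokens_per_month / 2
cost = (input_tokens * 3 + output_tokens * 15) / 1e6
print(round(cost))  # 54000
```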

Customer advisory

A customer-facing recommendation system handling 10,000 interactions daily, each averaging 3,000 tokens (question, context retrieval, and response), generates 30 million tokens per day. At mid-tier pricing, that runs $8,000-$15,000 per month.

Stacking use cases

Stack transaction monitoring, customer advisory, and an internal knowledge assistant together (which is what any serious AI deployment involves) and the monthly API bill can reach $70,000-$150,000 or more. Annualized, that is $840,000 to $1.8 million in API costs alone.
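The stacking arithmetic is worth doing explicitly for your own portfolio of use cases. A minimal sketch, using the article's monitoring estimate, a midpoint for advisory, and an assumed figure for the knowledge assistant:

```python
# Illustrative monthly costs per use case (USD); the knowledge-assistant
# figure is an assumption, the others come from the estimates above.
monthly = {
    "transaction_monitoring": 54_000,
    "customer_advisory": 12_000,   # midpoint of the $8k-$15k range
    "knowledge_assistant": 10_000, # assumed
}

total_monthly = sum(monthly.values())
annual = total_monthly * 12
print(total_monthly, annual)  # 76000 912000
```

That lands inside the $840k-$1.8M annual range quoted above; a flagship-model deployment pushes toward the upper bound.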

These are real numbers, but they represent high-volume, continuous use. Institutions with lower transaction volumes or intermittent usage patterns may find API costs perfectly manageable. The key is running the calculation for your actual workload, not a generic estimate.

What are the costs that don't appear on the API pricing page?

The per-token charges are only the visible part of the cost structure. Several additional costs emerge once an API-based AI system is in production.

Cost unpredictability

API costs scale linearly with usage. If transaction volumes spike during a busy period, or a new regulatory requirement triggers additional screening, the AI bill spikes with it. There is no ceiling and no fixed monthly cost. For a CFO managing budgets across a financial institution, this unpredictability is a real operational problem.

Vendor dependency

Once your systems are built around a specific provider's API (its token format, prompt conventions, rate limits, and SDK), switching to a competitor is expensive. Prompts need to be rewritten and retested, integration logic needs to be updated, and performance needs to be revalidated. This lock-in gives the provider leverage on future pricing.

Data governance overhead

Every API call sends data to the provider's infrastructure. For financial institutions handling sensitive customer data, this creates a governance burden. Under DORA, each API provider constitutes a third-party ICT dependency that must be continuously monitored, contractually governed, and regularly tested. The compliance cost of managing that dependency (legal review, audit preparation, contract negotiation) is real and ongoing.

Rate limits and latency

API providers impose rate limits. During peak usage, requests may be queued or rejected. API calls also add 200-1,000ms of latency per query due to network round-trips. For real-time applications like transaction monitoring, both constraints directly impact operational throughput.
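In practice, production integrations handle rate limits with retry logic, which itself adds latency and engineering effort. A minimal sketch of exponential backoff with jitter; `RateLimitError` is a hypothetical stand-in for whatever 429-style exception your provider's SDK raises:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit (HTTP 429) exception."""

def call_with_backoff(request, max_retries=5, sleep=time.sleep):
    """Retry a callable on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimitError:
            # Wait 2^attempt seconds plus up to 1s of jitter, capped at 30s.
            sleep(min(2 ** attempt + random.random(), 30))
    raise RuntimeError("rate limit retries exhausted")
```

Note that every retry multiplies the effective latency of the original call, which is why rate limits matter for real-time screening even when requests eventually succeed.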

What does on-premise AI truly cost?

A purpose-built small language model with 1-2 billion parameters, fine-tuned for a specific financial services task, can run on a single GPU. The economics are different from API pricing because the costs are predominantly fixed rather than variable. But "different" does not mean "free."

Infrastructure costs

An NVIDIA L4 GPU costs $2,000-$3,000 to purchase, or $0.44-$0.80 per hour to rent in the cloud. Running an L4 24/7 in a cloud environment costs approximately $320-$580 per month. For higher throughput, an NVIDIA A100 costs roughly $860-$2,830 per month in cloud rental, depending on the provider.

On-premise hardware requires upfront capital expenditure plus ongoing costs for power, cooling, and physical maintenance. These costs vary significantly depending on whether the institution already has data center capacity or needs to build it.

Engineering and maintenance overhead

This is the cost that on-premise advocates often understate. Someone has to set up the inference server, configure model loading, tune batch sizes, manage GPU drivers, handle CUDA compatibility, and keep everything running. Conservatively, a self-hosted deployment requires 10-20 hours per month of engineering time for ongoing maintenance, monitoring, and troubleshooting. At fully loaded engineering costs, that is $1,500-$4,000 per month.

Initial setup is more intensive: expect 40-80 hours of engineering time for the first deployment, including data preparation, model fine-tuning, integration testing, and performance optimization.
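Because on-premise costs are mostly fixed, the useful question is the breakeven volume: at what monthly token count does a fixed-cost deployment undercut per-token API pricing? A sketch under illustrative assumptions (midpoints of the GPU and maintenance ranges above, and an assumed blended API price):

```python
gpu_monthly = 500            # cloud-rented L4, mid-range of $320-$580
maintenance_monthly = 2_750  # mid-range of the $1,500-$4,000 estimate
onprem_fixed = gpu_monthly + maintenance_monthly  # 3250 USD/month

# Blended mid-tier API price: $3 in / $15 out per million tokens, 1:1 mix
blended_api_price = 9.0      # USD per million tokens

breakeven_mtok = onprem_fixed / blended_api_price
print(f"breakeven at roughly {breakeven_mtok:.0f}M tokens/month")
```

Against the 6 billion tokens per month of the transaction-monitoring example, that breakeven (a few hundred million tokens) is crossed many times over; for a low-volume pilot it may never be reached.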

Model retraining

Unlike an API where the provider handles model updates, on-premise models need to be retrained when regulations change, new transaction patterns emerge, or accuracy drifts. This requires engineering time, compute resources, and a validation process before the updated model goes into production. Plan for quarterly retraining cycles at minimum.

The skills gap

Running on-premise AI requires ML engineering skills that many financial institutions do not have in-house. Hiring or contracting these skills is an additional cost that should be factored into the total cost of ownership. An external partner can reduce this burden, but it does not eliminate it.

How do the two approaches compare?

Neither option is universally cheaper. The right choice depends on your institution's workload volume, regulatory obligations, technical capacity, and risk tolerance.

When API-based AI makes more sense

For institutions with lower AI query volumes — say under 5,000 queries per day across all use cases — API-based AI is often the simpler and more cost-effective option. You avoid upfront infrastructure costs, you don't need ML engineering staff, and the provider handles all model updates and maintenance. If your usage is intermittent or unpredictable, the pay-per-use model aligns costs with actual consumption.

API-based AI also makes sense for prototyping and pilot projects. Testing a use case with an API before committing to on-premise infrastructure is a sound strategy.

When on-premise SLMs make more sense

For institutions with high-volume, continuous workloads (transaction monitoring, compliance screening, customer advisory at scale), on-premise SLMs offer significant cost advantages. A single GPU serving a purpose-built 2-billion-parameter model can handle workloads that would cost $70,000-$150,000 per month via API, at a fraction of that amount.

On-premise also becomes the stronger option when regulatory requirements are a primary driver. DORA's third-party oversight obligations and the EU AI Act's audit trail requirements for high-risk systems are significantly easier to meet when you control the entire infrastructure. The governance overhead of managing API providers as critical ICT dependencies can exceed the cost of running your own stack.

The hybrid approach

Many institutions find that a hybrid model works best in practice. Use APIs for low-volume, exploratory, or rapidly changing use cases. Deploy on-premise SLMs for high-volume, production workloads where cost predictability, data sovereignty, and regulatory compliance matter most. This avoids the all-or-nothing decision and lets each use case find its natural cost-optimized path.

What should your institution do first?

Before making any architectural decision, get visibility into your actual costs and projected usage.

Audit your current AI spend

Most institutions lack a centralized view of their total AI costs because usage is distributed across teams and projects. Map every API-based AI system, its token volume, its monthly cost, and its growth trajectory. Without this baseline, any comparison to on-premise alternatives is guesswork.

Model the on-premise alternative for your top use case

Take your highest-volume API workload and calculate what it would cost to serve the same workload with a purpose-built SLM on your own infrastructure — including hardware, engineering time, retraining cycles, and maintenance. Be honest about both sides of the equation.
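A first-year TCO model for that exercise can be sketched in a few lines. Every default here is an illustrative assumption (GPU rental and maintenance hours use midpoints of the ranges in this article; the engineering rate and per-retraining hours are our own placeholders) and should be replaced with your institution's actual figures:

```python
def onprem_first_year_tco(gpu_monthly=500, maint_hours=15, eng_rate=180,
                          setup_hours=60, retrain_cycles=4, retrain_hours=30):
    """First-year total cost of ownership for a single-GPU SLM deployment (USD).

    Recurring: GPU rental plus monthly maintenance engineering.
    One-off: initial setup plus quarterly retraining cycles.
    """
    recurring = 12 * (gpu_monthly + maint_hours * eng_rate)
    one_off = (setup_hours + retrain_cycles * retrain_hours) * eng_rate
    return recurring + one_off

print(onprem_first_year_tco())  # 70800
```

Put that number next to the annualized API cost of the same workload and the comparison stops being guesswork.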

Factor in regulatory costs on both paths

The TCO comparison must include the governance overhead. For API-based AI, this means DORA compliance costs for third-party oversight. For on-premise AI, this means the EU AI Act's documentation, testing, and monitoring requirements. Neither path is free of regulatory burden, but the nature of that burden differs.

Key takeaways

API-based AI pricing appears cheap at pilot scale and can become expensive at production scale. For high-volume banking workloads, annual API costs can reach $1-2 million before accounting for vendor lock-in, data governance overhead, and compliance costs.

On-premise SLMs offer significant cost advantages for high-volume use cases, but they come with their own hidden costs: engineering overhead, hardware maintenance, retraining, and the skills required to operate the stack.

The honest answer is that neither option is universally superior. The right choice depends on your workload, your team's capabilities, and your regulatory environment. The worst choice is making the decision without understanding the full costs on both sides.

If you want to understand what the numbers look like for your specific workloads, we can help you run the comparison.

Ready to Own Your AI?

Stop renting generic models. Start building specialized AI that runs on your infrastructure, knows your business, and stays under your control.