How to choose a base AI model for financial services

Choosing a base model is the first architectural decision in any AI deployment — and one of the most consequential. The model you start with determines your infrastructure requirements, your fine-tuning options, your inference costs, and ultimately the accuracy ceiling of the system you build on top of it.

In 2026, this decision is more complex than it was a year ago — not because there are fewer good options, but because there are more. The open-weight ecosystem now offers dozens of production-quality models across a wide range of sizes, architectures, and licensing terms. Choosing well requires understanding what actually matters for your specific use case — and ignoring several things that look important but are not.

This article walks through the decision framework: what to evaluate, what to ignore, and what mistakes to avoid.

Start with the task, not the model

The most common mistake is starting the selection process by comparing models. Leaderboard rankings, benchmark scores, parameter counts — these are interesting but premature. The first question is not "which model is best?" It is "what exactly does this model need to do?"

A model that will power an internal knowledge assistant answering policy questions has fundamentally different requirements from one that will classify transaction alerts or generate client-facing risk summaries. The knowledge assistant needs strong reading comprehension and summarisation. The alert classifier needs fast inference and high precision on a narrow category set. The risk summary generator needs consistent formatting and domain-appropriate language.

Defining the task precisely — including the input format, the expected output, the accuracy threshold, the acceptable latency, and the volume — narrows the model shortlist dramatically before any benchmarking begins.
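One way to make the task definition concrete is to write it down as a small structured record before any model is evaluated. The sketch below is illustrative only: the `TaskSpec` class, its field names, and the example thresholds are assumptions, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record of the requirements a candidate model must meet."""

    name: str
    input_format: str       # e.g. "free-text alert narrative"
    output_format: str      # e.g. "one label from a fixed category set"
    min_accuracy: float     # fraction of test examples that must be correct
    max_latency_ms: int     # acceptable response time per request
    requests_per_day: int   # expected volume

    def is_met_by(self, accuracy: float, latency_ms: float) -> bool:
        """True if measured results satisfy the accuracy and latency targets."""
        return accuracy >= self.min_accuracy and latency_ms <= self.max_latency_ms


# Illustrative values only; real thresholds come from the task owner.
alert_triage = TaskSpec(
    name="transaction-alert-classification",
    input_format="free-text alert narrative",
    output_format="one label from a fixed category set",
    min_accuracy=0.95,
    max_latency_ms=200,
    requests_per_day=50_000,
)

print(alert_triage.is_met_by(accuracy=0.96, latency_ms=150))
```

Writing the thresholds down first means every later measurement has a pass or fail answer rather than an open-ended comparison.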

Parameter count: why bigger is usually not better

The instinct to choose the largest available model is understandable but misguided for most production use cases. Parameter count correlates with general capability: a 70-billion-parameter model generally knows more about the world than a 7-billion-parameter model. But for a specific, well-defined task, that extra knowledge is mostly irrelevant. Worse, it comes with concrete costs.

Larger models require more expensive hardware. At 16-bit precision, the weights of a 70B model alone occupy about 140 GB, so it needs multiple GPUs or aggressive quantisation to run at acceptable speed. A 7B model needs about 14 GB for its weights and runs comfortably on a single mid-range GPU. The infrastructure cost difference is not marginal: it can be 5-10x.
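The arithmetic behind the hardware gap is simple enough to sketch. The function below estimates the memory needed for the weights alone, assuming the usual storage sizes of 2 bytes per parameter at 16-bit precision, 1 byte at 8-bit, and half a byte at 4-bit. The function name is hypothetical, and the estimate deliberately excludes the KV cache, activations, and framework overhead, so treat it as a lower bound.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes).

    Excludes the KV cache, activations, and framework overhead, which all
    add to the real footprint.
    """
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 70):
    for precision in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(size, precision)
        print(f"{size}B @ {precision}: ~{gb:.1f} GB")
```

At 16-bit precision this gives roughly 14 GB for a 7B model and 140 GB for a 70B model, which is why the larger model cannot fit on a single typical GPU without quantisation.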

Larger models are slower at inference. In a dense model, every additional parameter adds computation per token. For real-time applications where latency matters, such as transaction screening, customer-facing responses, and alert triage, this directly affects the user experience and operational throughput.

Larger models are harder to fine-tune. The compute required for fine-tuning scales with model size. Fine-tuning a 7B model on your institution's data typically takes hours on a single GPU. Fine-tuning a 70B model can take days on multiple GPUs, with significantly higher costs and more opportunities for things to go wrong.

The right approach is to start with the smallest model that achieves the accuracy your use case requires. A purpose-built small language model (SLM) with up to 10 billion parameters, fine-tuned on domain-specific data, can often match or outperform a much larger general model on the narrow tasks it was built for, while running faster, costing less, and being easier to maintain.

Benchmarks: useful but misleading

Published benchmarks — MMLU, HumanEval, GSM8K, HellaSwag — measure performance on standardised tests. They are useful for rough comparison between model families but unreliable as predictors of how a model will perform on your specific task.

Why benchmarks can mislead

Benchmarks test general capabilities. Your use case tests specific ones. A model that scores highest on a coding benchmark may be mediocre at compliance document analysis. A model that leads on reasoning tasks may produce inconsistent formatting in structured output. The correlation between benchmark performance and domain-specific accuracy is weaker than most comparison tables suggest.

There is also the problem of benchmark contamination. Some models have been trained on data that includes benchmark questions, which inflates their scores without reflecting genuine capability improvement. This is difficult to detect from published results.

What to do instead

The most reliable evaluation method is testing candidate models on a sample of your actual data. Prepare 50-100 representative examples of the task the model will perform — real queries, real documents, real expected outputs — and run each candidate model against them. Measure accuracy, consistency, formatting compliance, and failure modes. This evaluation takes a few hours and produces more actionable information than any benchmark comparison.

For financial services use cases where accuracy carries regulatory consequences, this step is not optional. Selecting a model based on benchmarks and discovering in production that it hallucinates on your specific domain terminology is an expensive mistake to correct after deployment.
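The evaluation described above can be sketched as a small harness. Everything here is an assumption made for illustration: `model_fn` stands in for whatever wrapper calls a candidate model, the examples are invented, and `keyword_stub` is a toy replacement for a real model so that the script runs on its own. Consistency is approximated by calling the model more than once on each input and checking that the outputs agree.

```python
from typing import Callable, Dict, List, Set


def evaluate(
    model_fn: Callable[[str], str],
    examples: List[Dict[str, str]],
    allowed_labels: Set[str],
    runs: int = 2,
) -> Dict[str, object]:
    """Score one candidate on accuracy, format compliance, and consistency."""
    correct = well_formed = consistent = 0
    failures = []
    for ex in examples:
        outputs = [model_fn(ex["input"]).strip() for _ in range(runs)]
        first = outputs[0]
        if first in allowed_labels:      # output respects the required format
            well_formed += 1
        if len(set(outputs)) == 1:       # repeated calls gave the same answer
            consistent += 1
        if first == ex["expected"]:
            correct += 1
        else:                            # keep failures for manual review
            failures.append(
                {"input": ex["input"], "expected": ex["expected"], "got": first}
            )
    n = len(examples)
    return {
        "accuracy": correct / n,
        "format_compliance": well_formed / n,
        "consistency": consistent / n,
        "failures": failures,
    }


def keyword_stub(text: str) -> str:
    """Toy stand-in for a real model call."""
    return "escalate" if "sanction" in text.lower() else "dismiss"


sample = [
    {"input": "Payment to sanctioned entity", "expected": "escalate"},
    {"input": "Routine salary transfer", "expected": "dismiss"},
    {"input": "Structuring pattern across accounts", "expected": "escalate"},
]
report = evaluate(keyword_stub, sample, {"escalate", "dismiss"})
print(report["accuracy"], report["failures"])
```

Keeping the failures list, rather than only the aggregate score, is the part that surfaces failure modes: reading the wrong answers usually says more about a candidate than the accuracy figure does.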

Architecture matters: dense vs mixture-of-experts

Most open-weight models now come in two architectural styles, and the choice between them affects inference speed, hardware requirements, and cost.

Dense models activate every parameter for every token. A 7B dense model uses all 7 billion parameters on every query. Performance is consistent and predictable. Hardware requirements are straightforward: the model needs enough GPU memory to hold all parameters.

Mixture-of-experts (MoE) models contain a larger total parameter count but only activate a fraction of those parameters for each token. Mistral Small 4 has 119 billion total parameters but only 6.5 billion active per token. Llama 4 Scout has 109 billion total but 17 billion active. The result is a model that performs like a much larger model while running with the inference cost of a smaller one.

MoE models offer a compelling trade-off for many use cases — near-frontier performance at moderate inference cost. The trade-off is that they require more total memory to load (all parameters must be in memory, even if only a fraction are active per token) and their routing mechanisms can introduce variability in which expert handles which query.
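The trade-off can be made concrete with a rough calculation: memory follows the total parameter count, while per-token compute follows the active count. The sketch below uses the parameter figures quoted above and two simplifying assumptions: 16-bit weights (2 bytes per parameter) and the common rule of thumb of about 2 floating-point operations per active parameter per generated token, which ignores attention cost. The function name is hypothetical.

```python
def inference_profile(
    total_params_b: float,
    active_params_b: float,
    bytes_per_param: float = 2.0,
) -> dict:
    """Rough memory and compute profile for one model.

    Memory depends on ALL parameters, because every expert must be loaded.
    Compute per token depends only on the parameters that are ACTIVE.
    """
    return {
        "weight_memory_gb": total_params_b * bytes_per_param,
        "approx_gflops_per_token": 2.0 * active_params_b,
    }


# (total, active) parameters in billions, as quoted in the text above.
models = {
    "Mistral Small 4 (MoE)": (119.0, 6.5),
    "Llama 4 Scout (MoE)": (109.0, 17.0),
    "7B dense": (7.0, 7.0),
}

for name, (total, active) in models.items():
    profile = inference_profile(total, active)
    print(
        f"{name}: ~{profile['weight_memory_gb']:.0f} GB weights, "
        f"~{profile['approx_gflops_per_token']:.0f} GFLOPs per token"
    )
```

Under these assumptions, the 119B MoE model costs about the same compute per token as a 7B dense model but needs roughly seventeen times the memory to load.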

For most financial services deployments, the practical question is: does the model fit on your available hardware while meeting your latency and accuracy requirements? Both architectures can work. MoE models are often the better choice when you have enough memory to hold the full model and want large-model capability at the per-token compute cost of a much smaller one. When GPU memory is the binding constraint, a dense model of the right size is usually easier to fit.

Licensing: read it before you build

Not all open-weight models carry the same licensing terms. Getting this wrong creates legal exposure that is entirely avoidable.

Apache 2.0 and MIT: fully permissive for commercial use, modification, and redistribution. Many Qwen and Mistral releases use Apache 2.0, but licences vary by model and version within a family, so check the terms of the specific release you plan to use. These are the safest choices for commercial deployment.

Meta's Llama licence: permissive for most commercial use, but products exceeding 700 million monthly active users must obtain a separate licence from Meta. Irrelevant for most financial institutions but worth noting for platform businesses.

Non-commercial or research-only licences: some older or specialised models restrict commercial use entirely. Always verify the licence before investing engineering time in evaluation or fine-tuning.

Derivative work clauses: some licences impose requirements on models derived through fine-tuning. Verify that your planned use, including commercial deployment of a fine-tuned version, is permitted under the base model's licence.

Multilingual capability: check before you assume

If your institution operates across multiple markets, processes documents in multiple languages, or serves customers who communicate in languages other than English, multilingual performance is not a feature you can ignore.

Multilingual capability varies significantly across model families. Some models are trained primarily on English data and handle other languages as a secondary capability — producing lower accuracy, less natural phrasing, and more frequent errors in non-English languages. Others are trained with genuine multilingual balance.

Qwen's training data covers 119 languages with deliberate representation. Mistral has strong multilingual support across 80+ languages. Other model families vary. For a Polish bank or a pan-European fintech, testing candidate models in the actual languages they will encounter is an essential evaluation step — not something to discover after deployment.
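A per-language breakdown of test results makes weak languages visible where an overall accuracy figure would hide them. This is a minimal sketch under stated assumptions: each result is a hypothetical pair of a language code and a correct or incorrect flag, and the 0.9 threshold is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_language(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (language_code, is_correct) pairs and compute accuracy per language."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for lang, is_correct in results:
        totals[lang] += 1
        hits[lang] += int(is_correct)
    return {lang: hits[lang] / totals[lang] for lang in totals}


def languages_below(scores: Dict[str, float], threshold: float) -> List[str]:
    """Languages whose accuracy falls short of the required threshold."""
    return sorted(lang for lang, acc in scores.items() if acc < threshold)


# Invented results for illustration only.
results = [
    ("en", True), ("en", True),
    ("pl", True), ("pl", False),
    ("de", True),
]
scores = accuracy_by_language(results)
print(scores)
print(languages_below(scores, threshold=0.9))
```

In this invented sample the overall accuracy is 80%, which hides the fact that Polish sits at 50%.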

The evaluation process that actually works

Bringing together everything above, a practical model selection process for a financial services AI project follows this sequence:

Define the task precisely. What are the inputs? What are the expected outputs? What accuracy is required? What latency is acceptable? What is the expected query volume?

Set hardware constraints. What GPU infrastructure is available or budgeted? This immediately eliminates models that are too large to run on your hardware at acceptable speed.

Shortlist 3-5 candidates based on parameter count, architecture, licensing, and language requirements. Use published benchmarks as a rough filter, not a final selection criterion.

Test on your data. Prepare 50-100 representative examples. Run each candidate. Measure accuracy, formatting, consistency, and failure modes. Include edge cases and adversarial examples that test hallucination behaviour.

Evaluate fine-tuning potential. Run a small fine-tuning experiment on your top 1-2 candidates using a sample of domain-specific data. This reveals how the model responds to fine-tuning — some models improve dramatically with a small fine-tuning dataset, others plateau quickly.

Factor in ecosystem and support. Which models have the best tooling for your deployment stack? Which have active communities and regular updates? Which model providers have a track record of maintaining backward compatibility?

The entire process — from task definition to final selection — typically takes one to two weeks. That investment saves months of rework from choosing the wrong model.
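The hardware, licensing, and language filters from the shortlisting steps above can be applied mechanically before any model is downloaded. The sketch below is hypothetical throughout: the candidate names and attributes are invented, weights are assumed to be stored at 16-bit precision, and the 0.8 headroom factor is an assumed allowance for the KV cache and activations, not a measured value.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one candidate model."""

    name: str
    total_params_b: float        # total parameters, in billions
    licence: str
    languages: FrozenSet[str]


def shortlist(
    candidates: List[Candidate],
    gpu_memory_gb: float,
    allowed_licences: Set[str],
    required_languages: Set[str],
    bytes_per_param: float = 2.0,   # assumes 16-bit weights
    headroom: float = 0.8,          # assumed share of memory usable for weights
) -> List[Candidate]:
    """Keep candidates that fit in memory, are licensed, and cover the languages."""
    kept = []
    for c in candidates:
        fits = c.total_params_b * bytes_per_param <= gpu_memory_gb * headroom
        licensed = c.licence in allowed_licences
        covers = required_languages <= c.languages
        if fits and licensed and covers:
            kept.append(c)
    return kept


# Invented candidates for illustration only.
pool = [
    Candidate("model-a", 7.0, "apache-2.0", frozenset({"en", "pl"})),
    Candidate("model-b", 70.0, "apache-2.0", frozenset({"en", "pl"})),
    Candidate("model-c", 8.0, "research-only", frozenset({"en", "pl"})),
    Candidate("model-d", 9.0, "mit", frozenset({"en"})),
]
survivors = shortlist(pool, 48.0, {"apache-2.0", "mit"}, {"en", "pl"})
print([c.name for c in survivors])
```

Only the survivors of this filter need to go through testing on your own data, which keeps the expensive part of the process focused on models that could actually be deployed.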

Key takeaways

Choosing a base model is a decision that constrains everything built on top of it. The right choice depends on the task, the hardware, the licensing terms, the language requirements, and the model's actual performance on your data — not on benchmark rankings or parameter counts.

Start small. Test on your real data. Fine-tune before judging. And choose the model that fits the job, not the one that leads the leaderboard.

If you are evaluating base models for a financial services AI deployment and want help structuring the selection process, we can guide you through it.
