RAG vs fine-tuning for financial services: when to use which

RAG retrieves current information. Fine-tuning embeds domain reasoning. Learn when to use which for financial services AI, and when to combine both.

Michal
Co-founder, Growth & Partnerships

Most content about RAG vs fine-tuning treats the decision as a general-purpose engineering choice. It isn't. In financial services, the factors that matter most (data sensitivity, regulatory auditability, latency requirements, and accuracy on domain-specific terminology) shift the calculus significantly compared to a typical enterprise deployment.

This article provides a decision framework for banks and fintechs evaluating how to build AI systems using small language models (SLMs). We cover what RAG and fine-tuning actually do, when each approach works best for financial use cases, and when you should combine both.

What is RAG and how does it work?

Retrieval-augmented generation (RAG) is an architecture that connects a language model to an external knowledge base. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents at query time and includes them in the prompt so the model can generate answers grounded in specific, up-to-date information.

The process follows four steps. The user submits a query. A retrieval system searches a knowledge base (typically using vector embeddings and semantic search) to find the most relevant documents. Those documents are combined with the original query into a single prompt. The language model then generates a response using both its trained knowledge and the retrieved context.

RAG does not change the model itself. The model's weights remain exactly as they were. What changes is the input the model sees before generating each response.
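Here is a minimal sketch of those four steps in Python. It is illustrative only: a crude word-overlap score stands in for embedding-based semantic search, the knowledge base is three hard-coded strings, and the final model call (step 4) is left as a comment rather than a real inference request.

```python
# Toy RAG sketch: word-overlap retrieval stands in for vector search,
# and the model call itself is deliberately omitted.

def score(query: str, doc: str) -> float:
    """Crude relevance score: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Step 2: find the k most relevant documents."""
    return sorted(knowledge_base, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Step 3: combine retrieved context with the original query."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

knowledge_base = [
    "Wire transfers above EUR 10,000 require enhanced due diligence.",
    "The cafeteria opens at 8:00 on weekdays.",
    "Dormant accounts are reviewed quarterly by the compliance team.",
]

query = "What due diligence applies to large wire transfers?"
prompt = build_prompt(query, retrieve(query, knowledge_base))
# The prompt now contains the due-diligence policy line; the model's weights
# are untouched -- only its input changed. Step 4 would send `prompt` to the model.
```

Note that the model itself never appears in the retrieval code: swapping the knowledge base or updating a document changes the answer without any retraining.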

When does RAG work well in financial services?

RAG is the right choice when the information the model needs changes frequently and must always reflect the latest version. In banking, this applies to several common scenarios.

Policy and procedure lookup. Internal policies are updated regularly: compliance procedures, risk frameworks, HR policies, operational manuals. A RAG system connected to the bank's document management system ensures the model always references the current version, not a version from six months ago that was current during training.

Regulatory reference. Financial regulations change constantly. A RAG pipeline pulling from a curated regulatory database ensures the model cites current requirements rather than outdated rules it may have absorbed during pre-training.

Customer-facing Q&A. Product terms, fee schedules, and eligibility criteria change with business cycles. RAG lets the model answer customer questions based on the live product catalog rather than a static snapshot.

The common thread is that RAG excels when the knowledge is external to the model, changes over time, and needs to be traceable back to a specific source document. That traceability matters in financial services because regulators expect institutions to demonstrate where an AI-generated answer came from.

What are the limitations of RAG for financial services?

RAG's effectiveness is bounded by retrieval quality. If the retrieval step returns irrelevant documents (because the chunking strategy is poor, the embedding model doesn't understand financial terminology, or the knowledge base is disorganized), the language model will generate confident-sounding answers based on the wrong context.

RAG also adds latency. Every query requires a retrieval step before generation, which adds 100-500ms depending on the knowledge base size and infrastructure. For real-time transaction monitoring or customer-facing systems where response time matters, this overhead is significant.

Finally, RAG cannot change how the model behaves. It can only change what the model sees. If you need the model to consistently use a specific tone, follow a particular output format, or reason about financial concepts with domain-level fluency, RAG alone won't get you there.

What is fine-tuning and how does it work?

Fine-tuning is the process of continuing a pre-trained language model's training on a smaller, domain-specific dataset. Unlike RAG, fine-tuning actually changes the model's internal weights. The model learns new patterns, terminology, reasoning approaches, and behavioral norms from the fine-tuning data.

For financial services, this typically means training the model on the institution's compliance documents, transaction data patterns, regulatory texts, investigation records, and domain-specific communication formats. The result is a model that doesn't just access financial knowledge; it thinks in financial terms.

Modern fine-tuning techniques like LoRA (Low-Rank Adaptation) and QLoRA make this process efficient. Instead of retraining all model parameters, LoRA freezes most of the model and only updates a small set of adapter weights. This reduces the compute cost dramatically: a 2-billion-parameter model can be fine-tuned on a single GPU in hours, not days.
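The parameter savings behind that claim are easy to verify with back-of-the-envelope arithmetic. For a single d×d weight matrix, LoRA replaces the full update with two low-rank factors, B (d×r) and A (r×d), so the trainable count drops from d² to 2dr. The dimensions below are illustrative, not taken from any specific model:

```python
# Back-of-the-envelope LoRA arithmetic for one d x d weight matrix.
# Full fine-tuning updates all d*d entries; LoRA trains only the two
# low-rank factors B (d x r) and A (r x d), with the base weights frozen.

d = 2048   # hidden dimension (illustrative)
r = 8      # LoRA rank

full_params = d * d       # trainable params if we update W directly
lora_params = 2 * d * r   # trainable params for B and A combined

print(full_params)                # 4194304
print(lora_params)                # 32768
print(lora_params / full_params)  # 0.0078125 -> under 1% of the full count
```

Apply the same ratio across every adapted matrix in the network and the gap between "cluster of GPUs for days" and "one GPU for hours" follows directly.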

When does fine-tuning work best in financial services?

Fine-tuning is the right choice when you need the model to internalize domain-specific behavior that should be consistent across every response, regardless of what documents are retrieved.

Compliance screening and alert scoring. An AML transaction monitoring model needs to consistently evaluate risk signals in the way your compliance team does. This isn't about retrieving a document; it's about the model having internalized the patterns that distinguish genuine suspicious activity from normal behavior. Fine-tuning on historical investigation outcomes teaches the model these patterns.

Document classification and extraction. Financial documents follow specific formats and conventions: loan agreements, regulatory filings, KYC documentation, trade confirmations. A fine-tuned model learns the structure and terminology of these documents, enabling it to classify and extract information with higher accuracy than a general-purpose model prompted with instructions.

Domain-specific language understanding. Financial services has its own vocabulary. Terms like "open-to-buy," "NAV per unit," "mark-to-market," and "waterfall structure" have precise meanings that general-purpose models often misinterpret. Fine-tuning embeds this vocabulary into the model's parameters so it handles financial terminology natively rather than relying on prompt-based explanations.

Consistent output formatting. If your system needs to produce structured outputs (JSON for downstream processing, specific report templates, standardized risk assessments), fine-tuning teaches the model to consistently produce the exact format required, without relying on prompt engineering that can drift over time.

What are the limitations of fine-tuning?

Fine-tuning teaches patterns, not facts. A fine-tuned model can learn how to reason about financial concepts, but it is not a reliable store of specific, frequently changing information. If you need the model to know the current EUR/USD exchange rate, yesterday's regulatory update, or a specific client's account details, that information should come through retrieval, not fine-tuning.

Fine-tuning also requires quality training data. The model learns from what you show it. If the training data contains errors, inconsistencies, or biased investigation outcomes, the model will reproduce those patterns. Data preparation and validation are critical steps that directly impact the quality of the fine-tuned model.

Once fine-tuned, the model's knowledge is static until the next retraining cycle. For financial services, where regulations evolve and market conditions shift, this means establishing a regular retraining cadence: typically quarterly, or whenever significant regulatory changes occur.

How do you decide between RAG and fine-tuning?

The decision comes down to what problem you're solving. Here is a practical framework for financial services teams.

Use RAG when the answer lives in a document

If the correct response depends on retrieving specific, current information from a known source (a policy document, a regulatory text, a product specification), RAG is the right approach. The model's job is to find the right information and present it clearly, not to reason independently.

Typical use cases: internal knowledge assistants, regulatory reference tools, customer FAQ systems, product eligibility lookups.

Use fine-tuning when the answer requires domain reasoning

If the correct response requires the model to apply domain-specific judgment (evaluating risk, classifying transactions, interpreting financial language, generating structured outputs), fine-tuning is the right approach. The model needs to have internalized the patterns and conventions of financial services, not just retrieved a document about them.

Typical use cases: AML alert scoring, document classification, compliance report generation, financial language processing.

Combine both when you need reasoning over current data

Many financial applications require both domain reasoning and access to current information. This is where a hybrid approach delivers the best results: a fine-tuned SLM connected to a RAG pipeline.

The fine-tuned model brings domain fluency: it understands financial terminology, follows the institution's formatting conventions, and reasons about compliance concepts correctly. The RAG layer supplies current, specific information: the latest policy version, the relevant regulatory text, the customer's account context.

Example: compliance advisory system. A compliance officer asks about DORA requirements for a specific third-party vendor. The fine-tuned model understands DORA's framework and how to reason about third-party risk. The RAG layer retrieves the specific vendor's contract terms and the latest regulatory technical standards. The combined system delivers an answer that is both domain-aware and factually grounded in current information.

Example: customer-facing recommendation engine. A client asks about mortgage eligibility. The fine-tuned model understands how to evaluate eligibility criteria, assess risk factors, and communicate recommendations in the institution's tone. The RAG layer retrieves the current product terms, the client's profile data, and the applicable regulatory requirements. The combined system delivers a personalized, accurate, and compliant recommendation.
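The whole framework condenses to a rule of thumb. The function below is our own shorthand for it, not a formal methodology; the two boolean inputs and the return labels are illustrative names we introduce here.

```python
def choose_approach(answer_lives_in_a_document: bool,
                    needs_domain_reasoning: bool) -> str:
    """Rule-of-thumb router for the decision framework described above."""
    if answer_lives_in_a_document and needs_domain_reasoning:
        return "fine-tuned SLM + RAG"   # hybrid: domain reasoning over current data
    if answer_lives_in_a_document:
        return "RAG"                    # e.g. policy lookup, regulatory reference
    if needs_domain_reasoning:
        return "fine-tuning"            # e.g. AML alert scoring, classification
    return "base model + prompting"     # neither: a plain prompt may suffice

# A regulatory-reference tool: the answer sits in a document, little judgment needed.
print(choose_approach(True, False))    # RAG
# AML alert scoring: internalized judgment, no document lookup required.
print(choose_approach(False, True))    # fine-tuning
# Compliance advisory: DORA reasoning applied to a specific vendor's contract.
print(choose_approach(True, True))     # fine-tuned SLM + RAG
```

In practice the two inputs are rarely binary, but asking both questions explicitly for each use case keeps the architecture discussion honest.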

What does this look like on-premise?

For financial institutions running SLMs on their own infrastructure, the combined architecture is straightforward.

The fine-tuned SLM runs on a single GPU, typically a T4, L4, or A100 depending on throughput requirements. The vector database for RAG (Chroma, Milvus, or pgvector) runs alongside it, storing embeddings of the institution's documents. An orchestration layer routes queries, manages retrieval, constructs prompts, and returns responses.

The entire system runs within the institution's infrastructure. No data leaves the environment. The RAG knowledge base is updated as documents change. The model is retrained on a scheduled cadence or when domain conditions shift.
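The orchestration layer can be sketched as a thin class. Everything in this sketch is a hypothetical stand-in: `VectorStore.query` abstracts whatever interface the chosen database (Chroma, Milvus, pgvector) exposes, and `slm_generate` abstracts the call into the locally served fine-tuned model.

```python
# Hypothetical orchestration layer for an on-premise fine-tuned SLM + RAG stack.
# VectorStore and slm_generate are stand-ins for the real vector database
# and the local model-serving endpoint.

class VectorStore:
    def __init__(self) -> None:
        self.docs: list[str] = []

    def add(self, docs: list[str]) -> None:
        """Update the knowledge base as documents change."""
        self.docs.extend(docs)

    def query(self, text: str, k: int = 3) -> list[str]:
        """Stand-in for embedding-based similarity search (word overlap here)."""
        ranked = sorted(self.docs,
                        key=lambda d: len(set(text.lower().split())
                                          & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

def slm_generate(prompt: str) -> str:
    """Stand-in for the locally served fine-tuned SLM."""
    return f"[model response grounded in {prompt.count('SOURCE')} sources]"

class Orchestrator:
    def __init__(self, store: VectorStore) -> None:
        self.store = store

    def answer(self, query: str) -> str:
        docs = self.store.query(query)                     # manage retrieval
        context = "\n".join(f"SOURCE: {d}" for d in docs)  # tag sources for the audit trail
        prompt = f"{context}\n\nQuestion: {query}"         # construct the prompt
        return slm_generate(prompt)                        # return the response

store = VectorStore()
store.add(["Third-party ICT providers must be risk-assessed annually under DORA."])
print(Orchestrator(store).answer("How often are ICT providers risk-assessed?"))
```

Tagging each retrieved passage before it enters the prompt is what makes the audit trail cheap: every generated answer can be stored alongside the exact source documents it was grounded in.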

This architecture satisfies DORA's ICT risk management requirements and the EU AI Act's data governance obligations because the institution maintains complete control over every component: the model, the knowledge base, the retrieval pipeline, and the audit trail.

Key takeaways

RAG and fine-tuning solve different problems. RAG gives a model access to current, specific information without changing the model itself. Fine-tuning changes how a model reasons and behaves by embedding domain knowledge into its parameters.

For financial services, the decision framework is practical: use RAG when the answer lives in a document, use fine-tuning when the answer requires domain reasoning, and combine both when you need domain reasoning over current data.

Most production financial AI systems benefit from the hybrid approach: a fine-tuned SLM providing domain fluency, connected to a RAG pipeline providing current, auditable information. This combination delivers the accuracy, traceability, and regulatory compliance that financial institutions require.

If your team is evaluating how to architect AI for financial services, we can help you design the right approach.

Ready to Own Your AI?

Stop renting generic models. Start building specialized AI that runs on your infrastructure, knows your business, and stays under your control.