Governing open-weight AI models in regulated financial services

Open-weight AI gives institutions control of the model, and with it the documentation burden. Here is what the governance programme actually needs to cover.

Jarek Glowka
Co-founder, Compliance & Operations

A financial institution that downloads an open-weight model owns the artefact and inherits the documentation burden the upstream provider would otherwise carry. For a regulated bank, insurer, or payments business, that is not a side-note — it is the central operational reality of running AI in-house. The governance programme around an open-weight deployment is larger than the model itself, and the institutions that get this right treat the registry, the change-management workflow, and the audit trail as the actual deliverables. The weights are the easy part.

This article walks through what that governance programme needs to cover: why open-weight governance is structurally different from proprietary AI governance, what belongs in the model registry, how approval and change management work, what ongoing monitoring looks like in practice, how to handle retraining and lineage, and which gaps most commonly appear in early open-weight deployments.

Why does open-weight governance look different from proprietary AI governance?

The structural difference is that, with an open-weight model deployed on the institution's own infrastructure, the institution itself becomes both deployer and — for any locally-modified version — provider in the EU AI Act's terms. A proprietary API relationship pushes provider obligations onto the vendor; the bank is the deployer and the vendor handles the supplier-side documentation. With open-weight, the institution wears both hats.

That means three things in practice. First, the documentation that an OpenAI or an Anthropic would publish for their own model — model card, training-data summary, intended-use scope, risk register — has to be assembled by the institution from upstream sources, with the institution's own additions for what it has done locally. Where the upstream provider has been thin, the gap is the institution's to close or to explicitly note.

Second, the lineage of "which exact model is in production right now" becomes a governance question rather than a vendor-management one. Llama 3 became Llama 3.1 became Llama 3.2; an institution running open-weight has to track which version is in the production environment, which is being evaluated, and which has been retired, with the documentation trail that supports each move. The open-weight versus open-source distinction determines how much the institution can document about how each version was built; regardless of that, the institution still has to document which version is running, where, since when, and under which licence.

Third, the audit-surface dependency does not go away with on-premise deployment. The institution can run the model autonomously, but its description of how the model was trained remains bounded by what the upstream provider published. The governance programme has to be honest about this in its risk documentation rather than implying full transparency.

What needs to live in the model registry?

The registry is the spine of the governance programme. For each model in production, evaluation, or retired status, the registry should hold a defined set of fields that supervisors, internal audit, and model-risk committees can pull from without the team having to reconstruct the history.

The minimum useful set for an open-weight deployment, in our experience, covers the following.

Identity and provenance. Model name, version, source URL, file hash. This sounds basic; it is not always present. Two fine-tuned variants of the same base model are not the same model, and the registry should make that distinguishable at a glance.

Licence and openness tier. The full licence text (or a versioned link to it), the openness tier under the framework set out in the open-weight versus open-source piece, and a record of who reviewed the licence for the specific deployment use case. A licence that was acceptable for internal pilots may not be acceptable for white-label use; the registry should make the original review traceable.

Upstream documentation links. Model card, training-data summary, evaluation results published by the upstream provider, and an explicit note where any of these are missing. "Upstream did not publish training-data composition" is a valid registry entry; "we forgot to check" is not.

Local modifications. Fine-tuning datasets used, hyperparameters, evaluation harness, evaluation results on the institution's test set. Where the upstream provided minimal training documentation, the institution's own fine-tuning documentation becomes the primary evidence of model behaviour.

Risk classification. Where the system sits under EU AI Act Annex III — high-risk creditworthiness assessment, the fraud-detection carve-out, or out-of-scope — with the rationale documented. See what regulated businesses need to know before deploying AI for the underlying classification logic.

Deployment scope. Which use cases the model is approved for, which it is explicitly not approved for, and which decisions or outputs require human-in-the-loop review.

A model registry that captures these fields cleanly is also the bulk of what supervisors will ask for. Building it after the regulator asks is materially harder than building it during deployment planning.

How does approval and change management work for open-weight deployments?

Open-weight deployments need a denser approval matrix than API-based ones because more of what changes is under the institution's own control — and therefore counted as institutional action.

The initial approval should bring together model risk, compliance, and technical sign-off. Model risk owns the performance and bias assessment, compliance owns the regulatory classification and human-oversight design, and technical owns the deployment architecture and operational readiness. None of these can be skipped without leaving a hole that the next supervisory review will surface.

Change events that should trigger formal re-approval are the ones most commonly under-managed in early open-weight deployments:

  • A new upstream version. Upgrading from Llama 3.1 to 3.2 is not a patch; it is a model change with downstream behavioural implications.
  • A new fine-tuning round. Even on the same base model, a fresh fine-tune with new data is a new derived model and a new audit trail.
  • A change in deployment scope. Moving a model from internal-only use to customer-facing, or from one product line to another, alters the risk profile and may shift the EU AI Act classification.
  • A material change in infrastructure. Moving from on-premise to a private cloud, or vice versa, affects the DORA third-party position and the data-flow documentation.
  • A retraining cycle. Even routine retraining on rolling production data is a model change that needs to be documented and re-evaluated before promotion.
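
The trigger list above can be sketched as a small promotion gate. This is an illustration under stated assumptions: the names `ChangeEvent`, `REQUIRED_SIGN_OFFS`, `missing_sign_offs`, and `may_promote` are hypothetical, and the rule that every listed event reopens all three sign-offs from the initial approval is an assumption of this sketch, not a regulatory requirement.

```python
from enum import Enum


class ChangeEvent(Enum):
    UPSTREAM_VERSION = "new upstream version"
    FINE_TUNE = "new fine-tuning round"
    SCOPE_CHANGE = "change in deployment scope"
    INFRASTRUCTURE_CHANGE = "material change in infrastructure"
    RETRAINING = "retraining cycle"


# Assumption: re-approval needs the same three sign-offs as the initial approval.
REQUIRED_SIGN_OFFS = frozenset({"model_risk", "compliance", "technical"})


def missing_sign_offs(event: ChangeEvent, received: set[str]) -> set[str]:
    """Return the sign-offs still outstanding before the change may be promoted."""
    if not isinstance(event, ChangeEvent):
        raise TypeError("classify the change as a ChangeEvent before requesting approval")
    return set(REQUIRED_SIGN_OFFS - received)


def may_promote(event: ChangeEvent, received: set[str]) -> bool:
    """A change is promotable only when no sign-off is outstanding."""
    return not missing_sign_offs(event, received)
```

For example, `missing_sign_offs(ChangeEvent.FINE_TUNE, {"compliance"})` returns the model-risk and technical sign-offs as outstanding, so the fine-tuned model stays out of production until both arrive.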

The honest tradeoff with open-weight is that more local control means more local approval load. The institution that wants the data-sovereignty advantages of on-premise deployment is also accepting the operational cost of managing the lifecycle that a vendor would otherwise manage.

What does ongoing monitoring and human oversight look like in practice?

A model in production is not a deployment that is finished; it is a system that needs continuous evidence of behaviour. The monitoring layer is the source of that evidence.

The core monitoring activities — described conceptually rather than as a build guide — are accuracy tracking against a rolling evaluation set, drift detection that flags when the distribution of inputs has shifted materially from the training distribution, output-quality sampling where human reviewers periodically assess a random slice of model outputs, and alerting thresholds that trigger intervention when key metrics deteriorate. The pilot-to-production article covers the operational architecture in more detail; the governance question is who reads the monitoring reports and what they are empowered to do.
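
To make the drift-detection idea concrete, here is a minimal sketch using the population stability index (PSI), which is one common drift measure and not necessarily the one a given institution uses. The function names and the 0.2 alert threshold are assumptions for illustration; the threshold is a widely quoted rule of thumb, not a regulatory value. The sketch handles a categorical feature; a numeric feature would first need to be binned.

```python
import math
from collections import Counter
from typing import Sequence


def psi(reference: Sequence[str], current: Sequence[str], floor: float = 1e-6) -> float:
    """Population stability index between two samples of a categorical feature."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # Floor the shares so a category absent from one sample cannot divide by zero.
        p_ref = max(ref_counts[category] / len(reference), floor)
        p_cur = max(cur_counts[category] / len(current), floor)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total


def drift_alert(reference: Sequence[str], current: Sequence[str],
                threshold: float = 0.2) -> bool:
    """Flag the feature for review when PSI exceeds the (assumed) threshold."""
    return psi(reference, current) > threshold
```

A PSI of zero means the two samples have identical category shares. Whatever measure is chosen, the alert only has governance value if it is routed to someone with the authority to act on it.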

Bias and fairness re-testing on a defined cadence is the part that is most commonly under-managed. A bias profile that was within tolerance at deployment can drift with new customer populations, product changes, or shifts in model behaviour after an upstream update. Quarterly re-testing on a held-out fairness evaluation set is a common cadence for financial services applications, with more frequent re-testing if the model sits in a regulated decision path.
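
One simple re-test that could run on that cadence is a comparison of approval rates across groups. The sketch below is illustrative only: the metric (lowest group rate divided by highest), the function names, and the 0.8 default tolerance are all assumptions, and the appropriate fairness metric and tolerance depend on the use case and the institution's own policy.

```python
from collections import defaultdict
from typing import Iterable


def approval_rates(outcomes: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Share of positive outcomes per group, from (group, approved) pairs."""
    totals: dict[str, int] = defaultdict(int)
    positives: dict[str, int] = defaultdict(int)
    for group, approved in outcomes:
        totals[group] += 1
        positives[group] += int(approved)
    if not totals:
        raise ValueError("the evaluation set is empty")
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(outcomes: Iterable[tuple[str, bool]]) -> float:
    """Lowest group approval rate divided by the highest; 1.0 means parity."""
    rates = approval_rates(outcomes)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group had any approvals, so the rates are equal
    return min(rates.values()) / highest


def within_tolerance(outcomes: Iterable[tuple[str, bool]], tolerance: float = 0.8) -> bool:
    """True when the rate ratio is at or above the (assumed) tolerance."""
    return rate_ratio(outcomes) >= tolerance
```

Running the same function on the held-out fairness set at deployment and at each re-test gives a comparable series, which is what lets a reviewer see drift in the bias profile over time.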

Human oversight design matters for two regulatory reasons. The EU AI Act, for high-risk systems, requires that humans be able to meaningfully interpret and override model outputs — which means the analyst reviewing AI-generated recommendations needs explanations and authority, not just a queue. And GDPR Article 22 restricts solely-automated decisions with significant effects on individuals; the institution needs to provide meaningful information about the logic involved and the right to human review. Both requirements push the monitoring layer toward producing audit-grade records of what the model did, what the human saw, and what the decision was.
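
A record of "what the model did, what the human saw, and what the decision was" might look like the following sketch. The class and field names are hypothetical, and treating any difference between the model output and the final decision as an override is a simplifying assumption.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OversightRecord:
    model_id: str           # registry identifier of the exact model version
    input_ref: str          # pointer to the stored input, not the raw personal data
    model_output: str       # what the model did
    explanation_shown: str  # what the human saw alongside the output
    reviewer: str
    final_decision: str     # what the decision was
    overridden: bool
    decided_at: str         # UTC timestamp in ISO 8601 form


def record_decision(model_id: str, input_ref: str, model_output: str,
                    explanation_shown: str, reviewer: str, final_decision: str) -> str:
    """Build one audit record and serialise it as a JSON line."""
    record = OversightRecord(
        model_id=model_id,
        input_ref=input_ref,
        model_output=model_output,
        explanation_shown=explanation_shown,
        reviewer=reviewer,
        final_decision=final_decision,
        # Assumption: any difference from the model output counts as an override.
        overridden=(final_decision != model_output),
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Storing the model identifier with every record ties each decision back to a specific registry entry, so a later question about one decision can be answered against the model version that actually produced it.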

Where the institution falls short on monitoring, it is usually not because the technical capability is missing. It is because no one owns the activity in the operating model, or because the monitoring reports go to a queue that nobody is empowered to act on. The governance programme has to name owners and decision rights, not just metrics.

How does the institution handle retraining, retirement, and lineage?

Models in production do not stay still. Customer behaviour shifts, regulations evolve, upstream providers release new versions, and the institution's own data improves. The governance programme has to accommodate change without losing the audit trail.

Retraining on the institution's own data follows a defined cadence — quarterly is a common starting point for financial services applications, though the cycle is often tied to the rate of regulatory or product change rather than to the calendar. The discipline that matters is that each retraining round produces its own documentation package: what data was added, what was removed, what evaluation results came out, what the human reviewer signed off on before promotion. The retrained model is a new entry in the registry, not an update to the existing entry.
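
The "new entry, not an update" rule can be sketched as an append-only registry in which each retrained model points at its parent. All names here (`ModelVersion`, `Registry`, `register`, `lineage`) are hypothetical, and the fields are a reduced set chosen to mirror the documentation package described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    model_id: str
    parent_id: Optional[str]       # None for a model taken directly from upstream
    data_added: tuple[str, ...]    # references to datasets added in this round
    data_removed: tuple[str, ...]  # references to datasets removed in this round
    evaluation_ref: str            # pointer to the evaluation results
    approved_by: str               # reviewer who signed off before promotion


class Registry:
    def __init__(self) -> None:
        self._entries: dict[str, ModelVersion] = {}

    def register(self, entry: ModelVersion) -> None:
        """Add a new entry; existing entries are never overwritten."""
        if entry.model_id in self._entries:
            raise ValueError(f"{entry.model_id} already exists; register a new id instead")
        if entry.parent_id is not None and entry.parent_id not in self._entries:
            raise KeyError(f"unknown parent {entry.parent_id}")
        self._entries[entry.model_id] = entry

    def lineage(self, model_id: str) -> list[str]:
        """Walk from a model back to its earliest registered ancestor."""
        chain: list[str] = []
        current: Optional[str] = model_id
        while current is not None:
            chain.append(current)
            current = self._entries[current].parent_id
        return chain
```

Because a parent must already be registered and identifiers cannot be reused, the lineage walk always terminates, and the full chain behind any production model can be produced on request.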

Upstream lineage tracking covers the version chain on the open-weight side: when the upstream provider releases a new version, the institution evaluates it, decides whether to upgrade, and documents the decision either way. Skipping a version is a valid choice; not documenting that the skip happened is not.

Decommissioning is the part most commonly forgotten. A retired model needs an explicit retirement event in the registry — when it was retired, why, what replaced it, where the audit trail of its operational period is archived. Under DORA third-party requirements, records of significant ICT systems need to be retained for years after the system itself is no longer running. The governance programme has to plan for this rather than discover it during an internal audit two years after the model went out of service.
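
A retirement event carrying the four elements named above (when, why, what replaced it, where the archive lives) might be sketched as follows. The names are hypothetical, and the retention period is deliberately a caller-supplied parameter: the sketch does not assume any particular number of years.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RetirementEvent:
    model_id: str
    retired_on: date
    reason: str
    replaced_by: Optional[str]  # None when nothing replaced the model
    archive_location: str       # where the operational audit trail is kept
    retain_until: date


def add_years(start: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def retire(model_id: str, retired_on: date, reason: str, replaced_by: Optional[str],
           archive_location: str, retention_years: int) -> RetirementEvent:
    """Create a retirement event, refusing one without a reason or an archive."""
    if not reason.strip():
        raise ValueError("a retirement needs a documented reason")
    if not archive_location.strip():
        raise ValueError("a retirement needs an archive location for the audit trail")
    return RetirementEvent(model_id, retired_on, reason, replaced_by,
                           archive_location, add_years(retired_on, retention_years))
```

Computing `retain_until` at the moment of retirement puts the retention deadline in the registry itself, so it does not depend on someone remembering the obligation years later.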

What governance gaps are most common in early open-weight deployments?

Across the engagements we have reviewed, three gaps account for most of the early-deployment governance problems.

The first is treating the open-weight model as "just software." Teams familiar with software-licence governance import that mental model and stop there: licence reviewed, deployment approved, ticket closed. The model-specific aspects — provenance documentation, bias profile, drift monitoring, lineage tracking — fall outside the existing software-governance frame and end up nobody's responsibility. The fix is not a separate AI-governance committee; it is bringing the model-specific requirements into the existing governance structure with explicit owners.

The second is under-documenting the upstream. When the institution itself did not train the model, there can be a tendency to defer the training-data and methodology documentation entirely to whatever the upstream provider published, without actively recording the gaps. A regulator asking how the model was trained will not accept "we used Mistral, see Mistral's documentation" as a complete answer if Mistral's documentation does not in fact cover what the regulator is asking about. The governance programme has to explicitly catalogue what is known and what is not — and the "not known" entries are a legitimate part of the documentation, not an embarrassment.
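
The difference between a documented gap and an unchecked one can be made explicit in the catalogue itself. In this sketch the enum `Availability`, the tuple `EXPECTED_ITEMS`, and the function `unresolved_items` are hypothetical names; the three expected items are the upstream documents this article lists for the registry.

```python
from enum import Enum


class Availability(Enum):
    PUBLISHED = "published"
    NOT_PUBLISHED = "upstream did not publish"  # a valid, documented gap
    NOT_CHECKED = "not yet checked"             # not an acceptable final state


EXPECTED_ITEMS = ("model_card", "training_data_summary", "evaluation_results")


def unresolved_items(catalogue: dict[str, Availability]) -> list[str]:
    """Items that are absent from the catalogue or were never checked."""
    return [
        item for item in EXPECTED_ITEMS
        if catalogue.get(item, Availability.NOT_CHECKED) is Availability.NOT_CHECKED
    ]
```

Under this scheme an entry marked `NOT_PUBLISHED` passes the check, because the gap has been looked at and recorded, while a missing or `NOT_CHECKED` entry is reported until someone resolves it.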

The third is conflating fine-tuned variants with the base model in the registry. A bank might have three fine-tuned versions of Llama for three different use cases, each with its own evaluation profile, deployment scope, and bias characteristics. Treating these as instances of "our Llama deployment" rather than as three distinct models in the registry collapses the audit trail in ways that become expensive to unwind when a supervisor asks specific questions about a specific use case.

What good looks like is mostly about discipline rather than tools. The institutions that handle open-weight governance well tend to have a model registry that anyone in the relevant control function can read; a change-management workflow that captures upstream upgrades and local fine-tuning as model events; a monitoring cadence that produces actionable reports rather than dashboards nobody opens; and a retirement protocol that closes models out cleanly with the records intact.

Key takeaways

The institution that runs an open-weight AI model holds the artefact and inherits the documentation burden the upstream provider would otherwise carry. That burden is the governance programme — registry, approvals, monitoring, lineage, retirement. The institutions that get this right treat the governance artefacts as the actual deliverables of the AI programme, not as paperwork attached to it.

The governance gaps that most commonly cause supervisory rework are not technical. They are organisational: ambiguous ownership of model-specific requirements, under-documenting upstream gaps, and treating fine-tuned variants as one deployment rather than several. The fixes are also organisational — clear owners in the existing control functions, an explicit register of known and unknown upstream information, and a registry that distinguishes derived models from the base.

Open-weight gives the institution control. Control comes with the operational cost of running the lifecycle that a vendor would otherwise run. The hidden-costs analysis covers the financial side of that cost; the governance programme is its regulatory side. Naming both costs upfront in the business case, rather than discovering them during the first supervisory review, is what separates the deployments that scale from the ones that get rolled back.

If your institution is standing up its open-weight AI governance programme and wants to scope the registry, the change-management workflow, and the documentation burden cleanly before the next regulatory review, we can help you work through it.

