What regulators ask in an AI review: a guide

Financial institutions preparing for AI regulatory review tend to over-prepare for the EU AI Act in the abstract and under-prepare for the questions a regulator will actually ask in the room. The EU AI Act sets the framework. DORA sets the third-party perimeter. National regulators — the FCA in the UK, BaFin in Germany, FINMA in Switzerland, the national authorities across the EU — set the operational expectations and conduct the actual reviews.

The institutions that handle these reviews well treat the regulator as the customer of their documentation. Everything that has been written down — the risk inventory, the model registry, the change-management log, the audit trail — is the deliverable. The model itself is what the documentation describes; it is rarely what the regulator inspects in detail.

This article walks through how an AI regulatory review actually proceeds in a regulated European financial institution: where it starts, what gets asked first, what "adequate human oversight" tends to look like when it is examined in practice, what the most common findings are, and where the named regulators converge and where their emphases differ.

Why do regulatory reviews of AI start where they start?

A regulator walking into an AI review is not opening the conversation with a model architecture question. The opening is almost always a request for the institution's AI system inventory — the document that lists every AI system in production, what it does, which business function owns it, and how it has been classified for risk.

The reason is that the regulator's first decision is one of scope. Until the inventory is on the table, there is no agreed list of what falls within the review. An institution that cannot produce a current inventory in a single document has already created a finding before the substantive questions begin.

The institutions that handle this opening cleanly tend to have invested in keeping the inventory current — typically as part of the broader governance programme — and to have the inventory mapped against the EU AI Act risk categories before the request arrives. The regulator reads the inventory, picks the systems most likely to be high-risk, and the rest of the review is structured around them.
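As a rough illustration of what a reviewable inventory entry might hold, the sketch below models the fields named above (what the system does, which business function owns it, how it was classified) and picks out the systems a review is likely to focus on. The field names, the `RiskCategory` labels, the one-year review threshold, and the example systems are all hypothetical; this is one possible shape, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class RiskCategory(Enum):
    # Illustrative labels only; an institution would use its own mapping
    # to the EU AI Act risk categories.
    HIGH = "high"
    LIMITED = "limited"
    MINIMAL = "minimal"


@dataclass(frozen=True)
class InventoryEntry:
    system_name: str
    purpose: str
    business_owner: str
    risk_category: RiskCategory
    classification_rationale: str  # empty string means nothing was recorded
    last_reviewed: date


def review_candidates(inventory: list[InventoryEntry]) -> list[InventoryEntry]:
    """Return the systems classified as high-risk."""
    return [e for e in inventory if e.risk_category is RiskCategory.HIGH]


def entries_needing_attention(inventory, today: date, max_age_days: int = 365):
    """Flag entries with no rationale or an overdue review (threshold is an assumption)."""
    flagged = []
    for e in inventory:
        if not e.classification_rationale.strip():
            flagged.append((e.system_name, "no classification rationale"))
        if (today - e.last_reviewed).days > max_age_days:
            flagged.append((e.system_name, "review overdue"))
    return flagged


# Hypothetical example data.
inventory = [
    InventoryEntry("credit-scoring", "Creditworthiness assessment", "Retail Lending",
                   RiskCategory.HIGH, "See classification memo (hypothetical)",
                   date(2025, 1, 15)),
    InventoryEntry("faq-chatbot", "Answers general product questions", "Customer Service",
                   RiskCategory.LIMITED, "", date(2023, 6, 1)),
]

print([e.system_name for e in review_candidates(inventory)])
print(entries_needing_attention(inventory, today=date(2025, 6, 1)))
```

Running a check like this on a schedule is one way to keep the inventory current between reviews rather than reconstructing it when the request arrives.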

What does the regulator look at first?

After the inventory, the focus moves quickly to two things: classification and documentation.

Risk classification. For each AI system in scope, the regulator asks how the institution classified it under EU AI Act Annex III and what rationale supports the classification. The expectation is not that the institution arrived at the same classification the regulator would — it is that the institution made a reasoned, documented decision. A creditworthiness assessment system that has been classified as low-risk without supporting reasoning is a clearer finding than a system classified as high-risk that the regulator disagrees with on the margins. The regulator reads the rationale.

Technical documentation completeness. Article 11 and Annex IV of the EU AI Act set out what the technical documentation for a high-risk AI system needs to cover: intended purpose, training-data description, architecture, performance metrics, validation results, risk-management documentation, post-market monitoring plan. The regulator is not looking for every element to be perfect on first inspection. They are looking for whether the institution can produce the document at all, whether it is current, and whether it covers the questions the regulation explicitly asks.

The institutions that handle this part of the review well have treated the technical documentation as a maintained artefact rather than a one-time deployment deliverable. The version published at conformity-assessment time becomes the baseline; every change to the model produces a documented update. Reviewers can read the history and understand how the institution has managed the system over time, not only how it was set up.
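A minimal sketch of how "maintained artefact" could be checked mechanically: the function below reports which documentation elements are missing, empty, or older than the most recent model change. The section keys mirror the list given in this article rather than the legal text of Annex IV, and the pack structure (a dict of sections with `text` and `updated` fields) is an assumption made for illustration.

```python
from datetime import date

# Section names follow the elements listed in this article; they are
# illustrative keys, not the wording of the regulation.
REQUIRED_SECTIONS = (
    "intended_purpose",
    "training_data_description",
    "architecture",
    "performance_metrics",
    "validation_results",
    "risk_management",
    "post_market_monitoring_plan",
)


def documentation_gaps(pack: dict, last_model_change: date) -> list[str]:
    """List sections that are missing, empty, or not updated since the last model change."""
    gaps = []
    for section in REQUIRED_SECTIONS:
        entry = pack.get(section)
        if entry is None or not entry.get("text", "").strip():
            gaps.append(f"{section}: missing")
            continue
        updated = entry.get("updated")
        if updated is None or updated < last_model_change:
            gaps.append(f"{section}: not updated since last model change")
    return gaps


# Hypothetical documentation pack with deliberate gaps.
pack = {
    "intended_purpose": {"text": "Pre-screening of loan applications.",
                         "updated": date(2025, 3, 1)},
    "architecture": {"text": "Gradient-boosted trees behind a scoring API.",
                     "updated": date(2024, 11, 1)},
    "performance_metrics": {"text": "", "updated": date(2025, 3, 1)},
}

gaps = documentation_gaps(pack, last_model_change=date(2025, 2, 1))
for gap in gaps:
    print(gap)
```

A check of this kind only shows that a section exists and has been touched; whether the content is adequate remains a matter for human review.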

What does a "technical documentation" review actually cover?

The regulator reading a technical documentation pack tends to focus on five areas, regardless of which national authority is conducting the review.

Intended purpose and scope. What was the system designed to do, and what was it explicitly not designed to do? Scope creep — using a system for a use case beyond the documented intended purpose — is one of the easier findings for a regulator to identify, because the institution's own documentation creates the boundary.

Training-data description. Where the data came from, how it was selected, what biases have been examined, what the limits of the training-data documentation are. For an open-weight model that the institution did not train itself, this section requires honesty about what the upstream provider published and where the gaps are. "Upstream did not disclose training-data composition" is an acceptable entry in the documentation; "no information available" without that scoping is not.

Architecture and intended behaviour. A description of the system architecture sufficient for a third party to understand how decisions are produced. The regulator is not testing for full technical mastery; they are testing whether the institution itself understands the system well enough to describe it without depending on the vendor.

Performance metrics and validation results. Accuracy, error rates, performance on protected groups, validation against the institution's own test data. Regulators are increasingly attentive to bias evaluation — whether the institution can show that performance is comparable across customer segments where regulation expects equivalent treatment.

Post-market monitoring plan. What the institution is monitoring in production, what triggers re-evaluation, what the cadence of review is. This is where the deployment connects back to the governance programme and where the regulator checks that the institution is treating the AI as a system that needs ongoing evidence of behaviour rather than a tool that was approved at launch and forgotten.

How do regulators test for adequate human oversight?

Adequate human oversight is a regulatory expectation that is easy to claim in documentation and harder to evidence in practice. Regulators testing for it tend to look at four operational dimensions, not just at policy.

The first is whether the human reviewer has the information they need to oversee meaningfully. Reviewing an AI-generated recommendation requires understanding what the AI considered, what it weighted, and why. Where the reviewer has only the output and not the reasoning, the oversight is procedural rather than substantive. The regulator often asks to see a sample of what the reviewer actually sees on screen — and the answer to that question reveals what the oversight quality actually is.

The second is whether the reviewer has the authority to override. Override paths that require engineering tickets, vendor support requests, or weeks of internal escalation are not meaningful oversight. The regulator probes the operational reality: when an analyst disagrees with the AI's recommendation, what happens, and how long does it take?

The third is whether the reviewer has the time to engage. If an analyst is reviewing two thousand AI recommendations per day at three seconds per case, the oversight is performative. The regulator looks at queue depth, review cycle time, and the share of outputs accepted without modification as proxies for whether the review is doing real work or rubber-stamping.

The fourth is whether the audit trail records the human review distinctly from the AI output. "The model recommended X, the analyst approved" with separate timestamps and an identifiable reviewer is auditable evidence. "X was produced" with no human review trail is not. As covered in the pilot-to-production article, the architecture that captures both states is the deployment that survives this question.
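The sketch below shows one way an audit record could keep the AI output and the human review as distinct states, and how the proxies mentioned above (review time and how often outputs are accepted unchanged) might be derived from such records. The record fields, the reviewer identifier, and the example cases are hypothetical, and no threshold is implied: neither number shows rubber-stamping on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class ReviewedDecision:
    case_id: str
    model_output: str          # what the model recommended
    model_timestamp: datetime  # when the recommendation was produced
    reviewer_id: str           # identifiable human reviewer
    reviewer_decision: str     # what the reviewer decided
    review_opened: datetime
    review_closed: datetime


def oversight_proxies(records: list[ReviewedDecision]) -> dict[str, float]:
    """Compute the share of outputs accepted unchanged and the median review time."""
    if not records:
        raise ValueError("no records to analyse")
    unchanged = sum(1 for r in records if r.reviewer_decision == r.model_output)
    durations = [(r.review_closed - r.review_opened).total_seconds() for r in records]
    return {
        "share_accepted_unchanged": unchanged / len(records),
        "median_review_seconds": median(durations),
    }


def _example(case_id, model_output, reviewer_decision, review_seconds):
    # Helper that builds hypothetical records for the demonstration below.
    produced = datetime(2025, 1, 6, 9, 0, 0)
    opened = produced + timedelta(minutes=5)
    closed = opened + timedelta(seconds=review_seconds)
    return ReviewedDecision(case_id, model_output, produced,
                            "analyst-17", reviewer_decision, opened, closed)


records = [
    _example("c-1", "approve", "approve", 3),
    _example("c-2", "decline", "decline", 4),
    _example("c-3", "decline", "approve", 120),
    _example("c-4", "approve", "approve", 5),
]

proxies = oversight_proxies(records)
print(proxies)
```

Because the model output and the reviewer decision are stored separately, the same records that support reconstruction of an individual case also support the aggregate view a regulator may ask for.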

What are the most common findings, and how can institutions avoid them?

Across the engagements we have reviewed, four findings account for most of what regulators raise on AI deployments.

Incomplete or out-of-date model registry. The institution can produce a registry but it does not match what is actually running in production. New fine-tuning rounds have not been recorded; upstream version changes have not been noted; deployment scope has expanded without re-approval. The fix is treating the registry as the system of record for what is running, not as an inventory snapshot.

Risk classification without documented rationale. Each AI system has been assigned a risk category, but the document does not explain how the institution arrived at that classification. This is particularly problematic for systems that interact with credit decisioning, insurance pricing, or AML triage, where Annex III boundaries are themselves a judgement call. The fix is documenting the classification decision with the same rigour as the model decision.

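Vague description of human oversight. The policy document says human oversight is in place; the operational reality is that nobody can describe what the analyst actually does with the AI's output. The regulator will ask for the operational evidence, and the gap between policy and practice becomes the finding. The fix is documenting what the reviewer sees, what they are able to override, and how their decision is recorded, so that the policy statement is backed by evidence of practice.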

Audit log gaps. The institution can show what the AI does, but cannot reconstruct specific historical cases on demand. For decisions that may be reviewed by the UK Financial Ombudsman Service (FOS), by an internal complaint process, or by a regulatory inquiry months later, the inability to reconstruct them is a structural finding. The fix is treating audit-log architecture as a deployment requirement rather than an operational add-on.

None of these findings is about the technology. They are about the operating discipline around the technology — which is what the regulator is actually evaluating.
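The registry finding in particular lends itself to an automated check. The sketch below compares a registry against a snapshot of what is deployed and reports the differences. It assumes both can be reduced to a mapping of system name to version string; how the production snapshot is obtained depends on the institution's deployment tooling and is not shown. All names and versions are hypothetical.

```python
def registry_drift(registry: dict[str, str],
                   production: dict[str, str]) -> dict[str, list[str]]:
    """Compare registered model versions against what is actually deployed."""
    registered, deployed = set(registry), set(production)
    return {
        # Running in production but absent from the registry.
        "unregistered": sorted(deployed - registered),
        # In the registry but no longer (or not yet) deployed.
        "not_deployed": sorted(registered - deployed),
        # Present in both, but the versions disagree.
        "version_mismatch": sorted(
            name for name in registered & deployed
            if registry[name] != production[name]
        ),
    }


# Hypothetical registry and production snapshot.
registry = {
    "credit-scoring": "3.1.0",
    "aml-triage": "2.0.0",
    "chat-assist": "1.4.0",
}
production = {
    "credit-scoring": "3.2.0",
    "aml-triage": "2.0.0",
    "doc-summariser": "0.9.0",
}

drift = registry_drift(registry, production)
print(drift)
```

An empty result in all three lists is what "the registry is the system of record" looks like in practice; anything else is a candidate finding that the institution can close before a review does.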

What differs between FCA, BaFin, FINMA, and other regulators?

The substantive expectations across European regulators are converging on the themes set by the EU AI Act and DORA, including at the FCA and FINMA, where those instruments do not apply directly, but national emphases still differ.

The FCA brings a particular focus on Consumer Duty outcomes and on vulnerable-customer treatment. Reviews of AI in customer-facing or customer-affecting functions tend to probe these dimensions in depth. The FCA has been notably willing to publish thematic findings, which gives institutions a useful preview of what the next review will look like.

BaFin tends to focus on governance documentation and on whether the institution's risk-management framework treats AI as ICT risk under MaRisk and IT-related regulatory expectations. Reviews are detailed on the documentation side and on the integration of AI governance into the broader internal-control framework.

FINMA brings a strong emphasis on operational resilience and on the documentation of AI in the context of FINMA's broader expectations for outsourcing and ICT third-party management. Reviews tend to probe how AI risk has been escalated to the board and how the senior-management accountability structure handles AI-specific incidents.

National authorities across the EU are now standing up AI-specific regulatory capability under the EU AI Act and DORA. The patterns from the established regulators above are good predictors of what the newer reviews will emphasise, with national variation around vulnerability frameworks, consumer-protection emphases, and sector-specific (banking versus insurance versus payments) priorities.

An institution that prepares well for review does not optimise for any one regulator; it builds documentation that can answer the converged set of expectations, with addenda where national variation is material.

Key takeaways

Regulatory review of AI in regulated European financial services starts with documentation rather than with technology. The institutions that handle reviews well have invested in the AI system inventory, the risk-classification rationale, the technical documentation pack, and the audit trail — treating these as the deliverables of the AI programme rather than as paperwork attached to it.

The four most common findings are operational rather than technical: incomplete model registry, undocumented risk classification, vague human-oversight evidence, and audit-log gaps. None requires a technology fix; all require operating discipline. The institutions that get this right have built the governance discipline into the deployment, not after it.

The FCA, BaFin, FINMA, and the EU national authorities are converging on a common set of substantive expectations while retaining national emphases on consumer protection, governance documentation, and operational resilience. Documentation that can answer the converged set is the most cost-effective preparation.

If your institution is preparing for an AI regulatory review and wants to pressure-test the documentation, the operating-model evidence, and the audit-trail completeness before the regulator does, we can help you work through it.
