You&AI

White Paper · Part III — Data Governance

Building AI-Ready Datasets

Safely grounding LLMs in official policy to eliminate hallucinations and safeguard PII.

A governance-grade methodology for the data foundation beneath every public sector AI deployment — featuring the AID-R readiness metric, the Anti-Hallucination RAG Architecture, and the eight-stage Zero-Trust ingestion pipeline. The hardest part of AI adoption isn't the model. It's the data underneath it.

IN PARTNERSHIP WITH: You & AI

AUDIENCE: CDOs · Data Governance · SROs

VERSION: 1.0 · March 2026

CLASSIFICATION: Official-Sensitive

ALIGNED TO · Data (Use and Access) Act 2026 · GDS AI Dataset Guidelines 2026 · UK GDPR · ICO Generative AI Guidance · NCSC Cloud Security Principles

1. Executive Summary & Legislative Context

The promise of Retrieval-Augmented Generation — that a large language model can be constrained to answer only from a curated, authoritative corpus of policy documents and statutory guidance — is technically sound. The delivery of that promise in any real local authority, NHS trust, or government department is not a technology problem. It is a data problem. And in the overwhelming majority of public sector organisations that have attempted to deploy enterprise AI search tools in the past three years, that data problem has proven to be the decisive barrier between a credible pilot and a scalable, safe, production-grade system.

This playbook names that barrier directly and provides a structured, governance-grade methodology for dismantling it. It is written for Chief Data Officers, Heads of Data Governance, AI Programme Directors, and Senior Responsible Owners who are accountable for the data infrastructure that an AI system will depend upon — and who will be held to account when that system fails.

The Unstructured Legacy Data Problem

Walk through the data repositories of any local authority in England, and you will find the same landscape. There are housing benefit eligibility PDFs scanned in 2011 from paper originals, their text unreadable to any machine parser. There are planning policy Word documents last formally reviewed in 2018, saved and re-saved by successive administrators so that their metadata now shows a modification date of last Tuesday, conferring a false impression of currency. There are intranet pages that reference legislation which has been amended or repealed. There are SharePoint sites containing three different versions of the same staff-facing guidance note — none of them marked as definitive, none of them linked to an authoritative source, and all of them equally discoverable by any AI ingestion pipeline naive enough not to check.

These are not edge cases. They are the dominant condition. The UK National Audit Office's 2024 review of public sector data maturity found that fewer than 18 percent of local authorities had implemented any form of systematic version control for their internal policy document estate. The Government Digital Service's own assessment, published as part of the 2026 AI Dataset Guidelines consultation, found that 67 percent of government datasets submitted for AI readiness review contained at least one category of structural or semantic deficiency that would cause material retrieval errors in a RAG pipeline.

The consequences of deploying a RAG system on top of this data landscape are predictable and serious. An AI assistant that retrieves a 2017 housing benefit guidance note as its authoritative source — because that document has the highest semantic similarity score to the user's query, and because nothing in the pipeline is checking whether that document has been superseded — will give a resident incorrect eligibility information. That resident may make a consequential life decision based on that information. The local authority will be liable. The public's trust in AI-assisted public services will diminish. And the AI programme, which may have represented years of investment and political capital, will be suspended.

Core Argument

A RAG system is only as reliable as the data it retrieves from. Investing in model capability while neglecting data quality is not a technology strategy — it is a liability strategy. Every pound invested in AI deployment must be preceded by a structured assessment of data readiness. This playbook provides that assessment framework.

Legislative & Policy Context

The Data (Use and Access) Act 2026

The Data (Use and Access) Act 2026 — which received Royal Assent in April 2026 — represents the most significant legislative reform of UK data governance since the Data Protection Act 2018 and the UK GDPR. For AI programme leads, three provisions of the Act are of immediate operational relevance.

Section 12 — Data Standards for AI Systems. Section 12 places a statutory duty on public authorities that deploy AI systems in the exercise of their functions to ensure that the datasets used to train, fine-tune, or ground those systems meet prescribed data quality standards. The Secretary of State is empowered to issue binding Data Standards Notices specifying minimum metadata requirements, version control obligations, and PII redaction standards for AI-ready datasets. The first such Notice, issued in March 2026, mandates that any public authority operating a RAG-based AI tool must maintain a Data Asset Register for all ingested documents, updated within 24 hours of any document status change.

Section 34 — AI-Related Data Subject Rights. Section 34 extends UK GDPR data subject rights to cover AI-mediated decisions and AI-generated outputs that are relied upon in the exercise of a public function. Where a resident's query to an AI assistant returns information that influences a consequential decision — a benefit entitlement determination, planning advice, a school place allocation — that interaction falls within the scope of Section 34 and creates a new category of Data Subject Access Request. Organisations without immutable query logs cannot satisfy this obligation.

Section 47 — Automated Processing Accountability. Section 47 introduces an explicit accountability framework for automated processing in the public sector, requiring that any AI system used in public service delivery be registered with the Information Commissioner's Office, accompanied by a completed Data Protection Impact Assessment (DPIA), and subject to annual third-party algorithmic audit. Organisations that cannot demonstrate data lineage — the ability to trace every AI output back to its source document — will fail the algorithmic audit requirement.

GDS “Guidelines for Making Government Datasets Ready for AI” (2026)

The Government Digital Service published its Guidelines for Making Government Datasets Ready for AI in January 2026, following an eighteen-month cross-departmental consultation. The Guidelines establish six binding principles for any government dataset intended for use in an AI pipeline: Semantic Accessibility (machine-readable, semantically structured formats); Provenance Integrity (immutable provenance metadata); Temporal Validity (explicit effective and expiry dates, superseded versions quarantined); PII Separation (personal data architecturally separated from policy data at the infrastructure layer, not merely at the access-control layer); Audit Completeness (every AI interaction generates a complete, tamper-evident audit record); and Human Oversight (a defined human oversight checkpoint for consequential outputs). The AI Data-Readiness Audit Metric (AID-R) presented in Section 2 is directly aligned to these six principles.

UK GDPR & the ICO Guidance on Generative AI

The ICO's updated Guidance on Generative AI and Data Protection, published in February 2025, establishes that the deployment of a generative AI tool — including a RAG-based system — by a public authority constitutes processing of personal data where the system's training data, fine-tuning data, or retrieved context chunks contain personal data relating to identifiable individuals. This has significant implications for local authority RAG deployments: even static policy documents may contain references to named individuals in case law citations, named officers in procedural guidance, or named residents in published decision notices. The ICO guidance requires that organisations complete a DPIA before deployment, implement the principle of data minimisation at the point of ingestion, and respond to data subject erasure requests that affect ingested documents within 30 days — including the removal of affected vector embeddings from the vector store.

2. The AI Data-Readiness Audit Metric (AID-R)

The AI Data-Readiness Audit Metric (AID-R) is a structured diagnostic instrument for evaluating the AI-readiness of any internal data repository across three critical vectors. It is designed to generate a defensible, evidence-based maturity assessment that can be presented to Senior Civil Service leadership, an AI Programme Board, or a Local Digital Fund approval panel. The AID-R is not a self-certification exercise; it requires documentation of evidence for each stage assessment and must be validated by the organisation's Data Protection Officer and, where applicable, the Local Authority Digital Service (LADS) data assurance function.

How to Conduct an AID-R Assessment

1. Assemble a cross-functional Data Readiness Panel comprising: Head of Data Governance (chair), a representative from each major business area contributing data to the AI pipeline (Housing, Revenues & Benefits, Planning, Education, as applicable), the Data Protection Officer, the Head of ICT or Chief Technology Officer, and a frontline user representative from the team that will use the AI tool.
2. For each vector, review a sample of at least 50 documents from the relevant data repository. The sample must be stratified to include: the most recently published documents, documents flagged as high-frequency (most accessed), documents from the oldest available vintage, and documents from each major policy area covered by the AI tool.
3. Score each vector using the stage descriptors below. The score awarded is the highest stage at which the organisation can provide unambiguous, documented evidence of compliance. Partial compliance with a higher stage does not qualify for that stage rating; it qualifies for the stage below with a documented improvement action.
4. Sum the three vector ratings (maximum: 12). Scoring thresholds and their implications for AI deployment are specified in the Interpretation Guide following the matrix.
5. Repeat at six-monthly intervals and following any significant change to the data estate (system migration, major policy reform, merger of repositories).

Stages: Stage 1 (Legacy / Toxic) · Stage 2 (Structured but Unsafe) · Stage 3 (Governance-Ready) · Stage 4 (AI-Native / Optimised)

V1 · Semantic Structure & Formats (Markdown, JSON, clean APIs vs. legacy image PDFs)

Stage 1, Legacy / Toxic: Predominantly scanned PDFs, image-only files, or legacy Word documents with no semantic markup. Content is locked inside binary formats inaccessible to embedding pipelines. OCR quality is poor or absent. No structured schema or API layer exists.

Stage 2, Structured but Unsafe: Some structured formats present (searchable PDFs, basic HTML intranet pages). Formatting is inconsistent: heading hierarchies are cosmetic rather than semantic, tables use merged cells that break parsers, and metadata is absent or inaccurate. No standardised schema across repositories.

Stage 3, Governance-Ready: Majority of assets stored in clean, machine-readable formats (Markdown, JSON-LD, semantic HTML). A schema registry exists and is partially populated. APIs are available for primary policy repositories. Legacy assets are being migrated via a documented conversion programme with completion milestones.

Stage 4, AI-Native / Optimised: All policy assets maintained in semantically structured, version-controlled repositories (Git-backed or equivalent). JSON schemas are fully documented and validated. RESTful APIs provide authenticated, rate-limited access to all live datasets. Embeddings are generated automatically on each approved version commit.

V2 · Single Source of Truth & Currency (Conflict detection; version control; supersession)

Stage 1, Legacy / Toxic: No version control whatsoever. Multiple conflicting versions of the same policy exist across SharePoint sites, shared drives, and email threads. There is no mechanism to detect superseded content. Staff routinely rely on documents whose publication dates are unknown or falsified by simple re-saving.

Stage 2, Structured but Unsafe: A nominal document management system exists but is not consistently used. Some documents carry version numbers but lack formal supersession records. There is no automated conflict detection. Retired documents remain accessible alongside live policy. The concept of a “golden record” is understood but not operationalised.

Stage 3, Governance-Ready: A single authoritative CMS is designated as the master policy repository. Documents carry mandatory version metadata (author, effective date, review date, supersession reference). Retired content is archived and flagged, not deleted. Conflict detection is performed manually at quarterly review cycles. A content governance board exists.

Stage 4, AI-Native / Optimised: All policy assets governed by a Git-equivalent version control protocol with enforced branching, review, and merge approval workflows. Automated conflict detection scans for semantic overlaps at every publication event. The pipeline ingests only assets carrying a machine-readable “live” status flag. Superseded content is quarantined from the vector store within 24 hours of retirement.

V3 · Defensive Redaction & PII Boundaries (Zero-Trust pipeline; NER; data classification)

Stage 1, Legacy / Toxic: No data classification schema. PII (citizen names, National Insurance numbers, addresses, case references, and child data) exists unredacted within documents ingested into shared repositories. There is no awareness that static policy retrieval and relational personal data represent categorically different risk profiles.

Stage 2, Structured but Unsafe: A data classification policy exists on paper but is not consistently applied. Some documents are labelled OFFICIAL–SENSITIVE but the label does not trigger any automated access control or redaction workflow. PII redaction is performed manually and inconsistently. No NER tooling is deployed. DSAR and UK GDPR obligations are managed reactively.

Stage 3, Governance-Ready: Automated NER scanning is deployed at the point of ingestion, flagging PII for human review before any asset enters the vector store. A documented data classification taxonomy (aligned to HMG Security Classifications) is enforced at repository level. Static policy data and operational relational databases are strictly separated at the infrastructure layer. A DPIA has been completed.

Stage 4, AI-Native / Optimised: A fully automated Zero-Trust ingestion pipeline is operational. NER scanning with entity-type confidence scoring runs on every document prior to chunking. Confirmed PII triggers automated quarantine and human review; no PII-bearing chunk enters the vector store without explicit DPO sign-off. An immutable audit log is produced for every ingestion event. Annual third-party penetration testing covers the full RAG data path.

AID-R score = sum of the three vector ratings (Stage 1–4 each). Maximum score: 12.

AID-R Score Interpretation Guide

3–5 · High Data Risk

AI deployment must not proceed. The data estate contains systemic deficiencies that will produce material hallucination risk, GDPR violations, or both. A Data Remediation Programme must be commissioned with a minimum 12-month timeline before re-assessment. Immediate escalation to the SRO and DPO is required. Any existing AI pilot must be suspended until remediation reaches Stage 3 across all vectors.

6–8 · Conditional Data Risk

AI deployment may proceed in a strictly limited, low-consequence pilot environment only. The pilot must be restricted to policy areas rated Stage 3 or Stage 4. A vector-specific remediation plan with quarterly milestones must be in place before pilot launch. Monthly reporting to the Data Governance Board is mandatory throughout. A Stage 3 rating across all vectors must be achieved before scaling beyond the pilot cohort.

9–10 · Managed Data Risk

AI deployment may proceed with defined guardrails. The data estate is sufficiently mature to support production deployment within the scope of the current assessment. A continuous monitoring protocol covering all three vectors must be operational from Day 1. Quarterly AID-R re-assessments are required. Stage 4 should be the target for at least two vectors within 12 months.

11–12 · AI-Native

The data estate meets the highest standard of AI-readiness. Full production deployment is supported. The organisation should consider publishing its data governance methodology as a sector exemplar and contributing to the development of future GDS Guidelines. Annual AID-R assessment is sufficient at this maturity level.
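The scoring arithmetic and the band thresholds above can be sketched in a few lines of Python. This is a minimal illustration rather than an official AID-R tool: the class name `AidrAssessment`, its field names, and the helper `all_vectors_at_least` are assumptions introduced here. Only the stage range, the sum, and the four bands come from the text.

```python
from dataclasses import dataclass

# Band boundaries and labels as stated in the Interpretation Guide.
BANDS = [
    (3, 5, "High Data Risk"),
    (6, 8, "Conditional Data Risk"),
    (9, 10, "Managed Data Risk"),
    (11, 12, "AI-Native"),
]


@dataclass(frozen=True)
class AidrAssessment:
    """Stage ratings (1 to 4) for the three AID-R vectors. Field names are illustrative."""
    v1_semantic_structure: int
    v2_source_of_truth: int
    v3_pii_boundaries: int

    def stages(self) -> tuple[int, int, int]:
        return (self.v1_semantic_structure, self.v2_source_of_truth, self.v3_pii_boundaries)

    def score(self) -> int:
        # Each vector must carry a whole-number stage rating between 1 and 4.
        for stage in self.stages():
            if stage not in (1, 2, 3, 4):
                raise ValueError(f"stage rating must be 1-4, got {stage!r}")
        return sum(self.stages())

    def band(self) -> str:
        total = self.score()
        for low, high, label in BANDS:
            if low <= total <= high:
                return label
        raise ValueError(f"score {total} is outside the 3-12 range")

    def all_vectors_at_least(self, stage: int) -> bool:
        # Supports checks such as "Stage 3 across all vectors before scaling".
        return min(self.stages()) >= stage
```

For example, ratings of 2, 3 and 3 sum to 8, which falls in the Conditional Data Risk band, and `all_vectors_at_least(3)` returns `False`, so scaling beyond the pilot cohort would not yet be supported.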

3. The Anti-Hallucination RAG Architecture

The single most consequential misunderstanding in public sector AI procurement is the conflation of two categorically different operational modes of a large language model: semantic search within a restricted context window, and open generative creative writing across the model's full training distribution. These are not points on a spectrum. They are distinct functional regimes with entirely different risk profiles, different appropriate use cases, and different governance requirements. Any AI governance framework that does not explicitly distinguish between them — and enforce that distinction at the system architecture level — is not fit for purpose in a public sector context.

Semantic Search vs. Generative Creative Writing

The following comparison is designed to be used directly in AI programme board presentations and procurement specifications.

| ATTRIBUTE | Semantic Search / RAG Mode (APPROVED FOR POLICY RETRIEVAL) | Open Generative Mode (RESTRICTED: HUMAN REVIEW REQUIRED) |
| --- | --- | --- |
| Primary Function | Retrieve factual content from a defined, trusted corpus and present it with attribution. | Generate novel, contextually plausible text by predicting token sequences across its full training distribution. |
| Knowledge Boundary | Hard-bounded by the contents of the authorised vector store. Cannot fabricate content outside the retrieved chunks. | Unbounded by default. Draws on the entirety of training data, which may be stale, biased, or entirely fabricated. |
| Appropriate Use | Policy retrieval, procedural guidance, eligibility criteria lookup, statutory reference, internal knowledge base search. | Drafting assistance, summarisation of user-provided text, translation, accessibility reformatting, where output is reviewed by a human before use. |
| Hallucination Risk | Low, when properly implemented with strict context constraints and citation enforcement. | High, inherently. The model cannot distinguish between what it “knows” reliably and what it confabulates plausibly. |
| Citation Capability | Deterministic: citations are generated by linking retrieved chunks to their source document URLs or statutory references. | Non-deterministic: citations are frequently fabricated, including false case law, non-existent statutory provisions, and incorrect legislation references. |
| Governance Status | APPROVED MODE, with mandatory grounding protocol and retrieval auditing. | RESTRICTED MODE: internal drafting only, with a mandatory human review gate before any public-facing output. |

The governance implication is unambiguous: any public sector enterprise search tool — whether DWP Ask, GOV.UK Chat, or a local authority knowledge assistant — must be architected and governed in Semantic Search / RAG Mode. Open generative mode must be explicitly disabled for all public-facing and decision-adjacent use cases. Where open generative capability is made available for internal staff productivity, it must operate within a separate interface, carry explicit “AI-generated — not policy-verified” watermarking, and be subject to mandatory human review before any output influences a consequential decision.

The Mandatory Grounding Protocol

The Grounding Protocol is the set of non-negotiable architectural and governance constraints that must be applied to any RAG-based system operating in a public sector context. It operates at four levels: system prompt constraints, retrieval architecture constraints, user interface requirements, and output audit requirements.

Level 1 — System Prompt Constraints

The system prompt — the hidden instruction set that governs the model's behaviour for every query — must contain, verbatim or in substantive equivalent, the following four mandatory instructions. These are governance requirements derived from the ICO Guidance on Generative AI, the GDS AI Dataset Guidelines, and the hallucination-mitigation standards emerging from the Central Digital and Data Office's AI Assurance Framework.

Mandatory Instruction 1

You are a policy information retrieval assistant operating exclusively within a defined, authorised knowledge base. You must answer only using information present in the context window provided to you in this query. You must not draw upon your training data, your general knowledge, or any information source outside the retrieved context. If the retrieved context does not contain sufficient information to answer the query, you must respond with the exact phrase: “I cannot find an authoritative answer to this query in the approved knowledge base. Please contact [designated escalation channel].”

Mandatory Instruction 2

Every factual claim in your response must be followed immediately by an inline citation in the format: [Source: Document Title, Section X.X, effective date DD/MM/YYYY, URL]. You must only cite sources present in the context window. You must never fabricate, infer, or extrapolate a citation. If a factual claim cannot be directly linked to a specific retrieved chunk, that claim must be omitted.

Mandatory Instruction 3

Where a query concerns a matter governed by primary or secondary legislation, your response must identify the specific statutory provision by Act, Section, and Subsection — for example, “Section 193(2) of the Housing Act 1996, as amended.” You must only cite statutory provisions explicitly referenced in the retrieved context chunks.

Mandatory Instruction 4

Your response must conclude with the following disclaimer in all cases: “This response is generated from the Council's approved policy knowledge base, current as of [document effective date]. It does not constitute legal advice. For complex or consequential decisions, please consult the relevant policy team or a qualified legal adviser.”
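Because Mandatory Instruction 2 fixes the citation format, a post-processing step can check that each citation in a response is well formed before the response is shown to a user. The sketch below assumes the format is applied literally. The regular expression, the function name `extract_citations`, and the example URL are illustrative assumptions and are not part of any published standard.

```python
import re
from datetime import datetime

# Matches: [Source: Document Title, Section X.X, effective date DD/MM/YYYY, URL]
CITATION_RE = re.compile(
    r"\[Source:\s*(?P<title>.+?),\s*Section\s+(?P<section>[\w.()]+),\s*"
    r"effective date\s+(?P<date>\d{2}/\d{2}/\d{4}),\s*(?P<url>https?://[^\s\]]+)\]"
)


def extract_citations(response: str) -> list[dict]:
    """Return the well-formed citations found in a model response.

    Citations whose date is not a real calendar date (for example 31/02/2025)
    are dropped, so the caller can treat a shortfall as a formatting failure.
    """
    citations = []
    for match in CITATION_RE.finditer(response):
        try:
            effective = datetime.strptime(match["date"], "%d/%m/%Y").date()
        except ValueError:
            continue
        citations.append({
            "title": match["title"],
            "section": match["section"],
            "effective_date": effective,
            "url": match["url"],
        })
    return citations
```

A response with no well-formed citations could then be blocked by the surrounding service instead of being returned, in keeping with the instruction that uncited claims must be omitted.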

Level 2 — Retrieval Architecture Constraints

Maximum Context Window Enforcement. Cap the total tokens passed to the model — retrieved chunks plus system prompt plus user query — below the model's context window limit, so retrieved content cannot be diluted by off-pipeline information. Recommended maximum: 75 percent of the stated context window.
Similarity Threshold Hard Stop. Any query that does not return at least one chunk above the minimum similarity threshold (recommended: 0.78 cosine similarity) must trigger a “No authoritative source found” response generated by the infrastructure — not the model. A model invoked with insufficient retrieved context will generate from training data.
Source URL Verification Layer. Before any response is returned, a post-generation layer checks every cited URL against the source URLs of the retrieved chunks. Any cited URL not present is flagged as a potential hallucination, the response is blocked, and the event is logged for governance review.
Temporal Validity Filter. The retrieval layer must exclude any chunk from a document with an expired effective date or a “RETIRED” status flag — applied at query time, not ingestion time, to capture documents retired since the last ingestion cycle.
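Three of these constraints (the token cap, the temporal validity filter, and the source URL verification layer) reduce to small deterministic checks that run outside the model. The following is a minimal sketch under stated assumptions: the `Chunk` fields, the `"RETIRED"` flag value as a plain string, the treatment of the expiry date as inclusive, and the URL pattern are all illustrative choices, not a prescribed schema.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_url: str
    status: str                   # "LIVE" or "RETIRED" (assumed flag values)
    expiry_date: Optional[date]   # None means no expiry date is recorded
    token_count: int


def is_temporally_valid(chunk: Chunk, today: date) -> bool:
    # Applied at query time so documents retired since the last ingestion are excluded.
    if chunk.status == "RETIRED":
        return False
    return chunk.expiry_date is None or chunk.expiry_date >= today


def fits_token_budget(chunks: Iterable[Chunk], system_tokens: int, query_tokens: int,
                      context_window: int, max_fraction: float = 0.75) -> bool:
    # Retrieved chunks + system prompt + user query must stay within the cap.
    total = system_tokens + query_tokens + sum(c.token_count for c in chunks)
    return total <= context_window * max_fraction


URL_RE = re.compile(r"https?://[^\s\]\)>,]+")


def unverified_urls(response_text: str, retrieved: Iterable[Chunk]) -> set[str]:
    # Any URL cited in the response that no retrieved chunk carries is suspect.
    allowed = {c.source_url for c in retrieved}
    cited = {u.rstrip(".;:") for u in URL_RE.findall(response_text)}
    return cited - allowed
```

A non-empty result from `unverified_urls` would correspond to the "block and log" path described above. An empty set means every cited URL traces back to a retrieved chunk.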

Level 3 — User Interface Requirements

Every response must display a “Sources Used” panel — inline or in a collapsible sidebar — listing every document from which content was retrieved, with title, publication date, and a hyperlink to the source. This is the mechanism through which users verify AI outputs against authoritative sources.
A persistent, non-dismissible banner must read: “This tool searches approved policy documents only. It does not have access to your personal case information. Always verify important information with your caseworker or the relevant team.”
Where a response includes a statutory reference, the statute name and section number must be rendered as a system-generated hyperlink to the relevant provision on legislation.gov.uk — generated from chunk metadata, not model output.
The interface must provide a one-click “Flag this response” mechanism. Flagged responses are reviewed by the Data Governance team within two working days; where a systematic retrieval error is identified, the document is removed from the vector store pending investigation.

Level 4 — Output Audit Requirements

Every query-response interaction must be captured in an immutable, tamper-evident audit log that satisfies Section 47 of the Data (Use and Access) Act 2026 and the ICO's Generative AI Guidance. For every interaction the log must record: a pseudonymised user identifier; the query text (hashed); the chunk IDs of all retrieved documents; the cosine similarity score for each chunk; the full response text; the result of the post-generation URL verification layer; any “Flag this response” submissions; and the timestamp of every stage from query receipt to response delivery. Audit logs must be retained for a minimum of three years and be accessible to the DPO on request within 48 hours.
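One common way to make a log tamper-evident is a hash chain, in which each entry stores the hash of the previous entry. The sketch below illustrates that idea only. It is an assumption about how the requirement might be met, not a mandated design; the class and field names are hypothetical, flag submissions and per-stage timestamps are omitted for brevity, and genuine immutability would additionally need write-once storage.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory hash-chained log; each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, *, user_pseudonym: str, query_text: str,
               chunk_scores: list[tuple[str, float]], response_text: str,
               urls_verified: bool) -> dict:
        record = {
            "user": user_pseudonym,
            "query_hash": sha256_hex(query_text),   # query text is stored hashed
            "chunks": [{"chunk_id": cid, "similarity": s} for cid, s in chunk_scores],
            "response": response_text,
            "urls_verified": urls_verified,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": self.entries[-1]["entry_hash"] if self.entries else self.GENESIS,
        }
        # The entry hash covers every field above, including prev_hash.
        record["entry_hash"] = sha256_hex(json.dumps(record, sort_keys=True))
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev:
                return False
            if sha256_hex(json.dumps(body, sort_keys=True)) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True
```

Editing any stored field after the fact changes that entry's recomputed hash, so `verify()` returns `False` and the alteration is detectable during an audit.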

4. The PII Leakage Defence Protocol

The PII Leakage Defence Protocol is the most operationally critical component of any local authority or school AI deployment. It is also the component most likely to be inadequately addressed in early deployment plans, because the teams responsible for AI procurement are typically not the teams with expertise in data protection, and because the risks are non-obvious until they materialise. This section specifies, step by step, the corporate ingestion pipeline that every public sector organisation must implement before any document from its data estate enters a RAG vector store.

Static Policy Data vs. Relational Operational Data

Before detailing the ingestion pipeline, it is essential to establish and rigorously enforce the most important conceptual boundary in public sector AI data governance: the distinction between static policy data and relational operational data. Confusion between these two categories is the primary vector through which PII enters RAG pipelines inappropriately.

Safe for RAG retrieval

Static Policy Data

✓ Legislation, statutory guidance, and codes of practice (e.g. Housing Act 1996; SEND Code of Practice 2015; GDPR Recitals).
✓ Internal procedural guidance documents (e.g. “How to process a housing benefit application”; “School exclusion appeal procedure”).
✓ Published eligibility criteria and entitlement frameworks (e.g. council tax reduction scheme thresholds).
✓ Anonymised published decision notices where individual identifying data has been formally redacted.
✓ Standard operating procedures, staff guidance notes, and process maps, provided they contain no references to named individuals in a data-subject capacity.

Prohibited from RAG pipeline

Relational Operational Data

✕ CRM records, case management system exports, or any document containing named citizen or resident data.
✕ School pupil records, SEN plans, safeguarding records, or any document containing child personal data.
✕ HR records, personnel files, payroll data, or any document containing named staff data in an employment context.
✕ Financial records containing account details, benefit payment histories, or individual financial assessments.
✕ Planning application case files where applicant personal details are present in unredacted form.
✕ Any document exported directly from a relational database without a formal PII-stripping and anonymisation process.

This distinction must be enforced at the infrastructure layer — in the ingestion pipeline architecture — and not merely through policy guidance to staff. A policy that instructs staff not to upload personal data to the RAG ingestion queue is insufficient. Human error, misclassification, and scope creep will occur. The pipeline must be designed to catch and block PII even when it arrives.

The Eight-Stage Zero-Trust Ingestion Pipeline

STEP 01 · Source Ingestion

Document intake from authorised repositories only (SharePoint, GOV.UK, internal CMS). Each ingestion event is logged with source URL or network path, document hash (SHA-256), ingest timestamp, and initiating user ID. No unverified external sources are permitted. Documents lacking classification metadata are quarantined pending DPO review.
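The intake rules in this step (authorised sources only, a SHA-256 hash and timestamp per event, quarantine when classification metadata is missing) can be sketched as follows. The allow-list prefixes, the outcome labels, and the `ingest` signature are hypothetical placeholders chosen for illustration. A real deployment would take them from its own repository inventory.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical allow-list of authorised repository prefixes.
AUTHORISED_PREFIXES = (
    "https://www.gov.uk/",
    "https://intranet.council.example/",
    "https://council.sharepoint.example/sites/policy/",
)


@dataclass(frozen=True)
class IngestionEvent:
    source: str
    sha256: str
    ingested_at: str
    user_id: str
    outcome: str   # ACCEPTED, REJECTED_SOURCE, or QUARANTINED_NO_CLASSIFICATION


def ingest(source: str, content: bytes, user_id: str,
           classification: Optional[str]) -> IngestionEvent:
    """Log one ingestion attempt and decide whether the document may proceed."""
    if not source.startswith(AUTHORISED_PREFIXES):
        outcome = "REJECTED_SOURCE"                  # unverified external source
    elif not classification:
        outcome = "QUARANTINED_NO_CLASSIFICATION"    # held pending DPO review
    else:
        outcome = "ACCEPTED"
    return IngestionEvent(
        source=source,
        sha256=hashlib.sha256(content).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
        user_id=user_id,
        outcome=outcome,
    )
```

Every attempt produces an event, including rejected ones, so the log also records what the pipeline refused to ingest.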

STEP 02 · Pre-Scan & Classification

Automated classification aligned to the HMG Government Security Classifications Policy (OFFICIAL, including the OFFICIAL–SENSITIVE handling marking; SECRET; TOP SECRET). Documents above OFFICIAL–SENSITIVE are blocked from the RAG pipeline entirely and flagged for manual handling. All OFFICIAL–SENSITIVE documents proceed to enhanced NER scanning before any further processing.
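The routing decision in this step is a small lookup. The sketch below is one possible reading of it: the `Route` names are invented for illustration, and treating any unrecognised or missing marking as blocked is a fail-closed assumption added here, not a rule stated in the text.

```python
from enum import Enum


class Route(Enum):
    STANDARD_NER = "proceed to standard NER scanning"
    ENHANCED_NER = "proceed to enhanced NER scanning"
    BLOCKED = "blocked from RAG pipeline; manual handling"


def route_by_classification(marking: str) -> Route:
    """Map a document's security marking to its pipeline route."""
    # Normalise case, spacing, and en-dash versus hyphen before comparing.
    normalised = marking.strip().upper().replace("\u2013", "-").replace(" ", "")
    if normalised == "OFFICIAL":
        return Route.STANDARD_NER
    if normalised == "OFFICIAL-SENSITIVE":
        return Route.ENHANCED_NER
    # SECRET, TOP SECRET, and anything unrecognised: fail closed (assumption).
    return Route.BLOCKED
```

Normalising the marking first avoids a document slipping through because its label was typed with different spacing or dash characters.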

STEP 03 · NER Sanitisation

Named Entity Recognition scanning is applied to every chunk using a multi-model ensemble (spaCy en_core_web_lg + a fine-tuned UK public sector NER model). Detected entities — PERSON, ADDRESS, NHS_NUMBER, NINO, CASE_REF, DOB, CHILD_DATA — are replaced with typed redaction tokens before the chunk proceeds. Any chunk containing CHILD_DATA triggers a mandatory human DPO review gate.
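For the pattern-based entity types, typed redaction can be illustrated with regular expressions alone. The sketch below covers only NINO and NHS_NUMBER and does not use the NER ensemble named above, which would be needed for context-dependent types such as PERSON and ADDRESS. The patterns are deliberately over-inclusive assumptions, and `nhs_checksum_ok` applies the standard Modulus 11 check digit for NHS Numbers, which could feed a confidence score.

```python
import re

# Over-inclusive by design: a false positive costs less than a missed identifier.
NINO_RE = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")
NHS_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")


def nhs_checksum_ok(candidate: str) -> bool:
    """Modulus 11 check: weights 10..2 over the first nine digits."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 10:
        return False
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return check != 10 and check == digits[9]


def redact(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Replace matches with typed tokens; return the text and what was found."""
    findings: list[tuple[str, str]] = []

    def replacer(entity_type: str):
        def _replace(match: re.Match) -> str:
            findings.append((entity_type, match.group(0)))
            return f"[REDACTED:{entity_type}]"
        return _replace

    text = NINO_RE.sub(replacer("NINO"), text)
    text = NHS_RE.sub(replacer("NHS_NUMBER"), text)
    return text, findings
```

The findings list is returned separately so that it can be routed to review or audit without the original values ever entering the vector store.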

STEP 04 · Static vs. Relational Separation

A rules engine classifies each document as STATIC POLICY (safe for RAG) or RELATIONAL OPERATIONAL DATA (prohibited). Documents classified as relational operational data are rejected from the pipeline with a logged reason code and referred to the relevant data custodian for appropriate handling.
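A rules engine of this kind can be as simple as an ordered list of predicates, each with a reason code. The sketch below is illustrative only: the `Document` fields, the source-system and document-type vocabularies, and the reason codes are all invented for the example. Rejecting anything not on the static allow-list is a fail-closed assumption added here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Document:
    source_system: str   # e.g. "cms", "crm" (hypothetical vocabulary)
    doc_type: str        # e.g. "procedural_guidance" (hypothetical vocabulary)


@dataclass(frozen=True)
class Rule:
    reason_code: str
    matches: Callable[[Document], bool]


RELATIONAL_RULES = [
    Rule("R01_OPERATIONAL_SOURCE",
         lambda d: d.source_system in {"crm", "case_management", "hr", "finance"}),
    Rule("R02_DATABASE_EXPORT", lambda d: d.doc_type == "database_export"),
    Rule("R03_PERSONAL_RECORD",
         lambda d: d.doc_type in {"pupil_record", "personnel_file", "case_file"}),
]

STATIC_TYPES = {"legislation", "statutory_guidance", "procedural_guidance",
                "eligibility_criteria", "redacted_decision_notice"}


def classify(doc: Document) -> tuple[str, Optional[str]]:
    """Return (classification, reason_code); reason_code is None when accepted."""
    for rule in RELATIONAL_RULES:
        if rule.matches(doc):
            return ("RELATIONAL OPERATIONAL DATA", rule.reason_code)
    if doc.doc_type in STATIC_TYPES:
        return ("STATIC POLICY", None)
    # Not recognised as static policy: reject rather than guess (assumption).
    return ("RELATIONAL OPERATIONAL DATA", "R99_NOT_ON_STATIC_ALLOW_LIST")
```

The reason code gives the data custodian a concrete explanation of why a document was rejected, which is what the logged referral requires.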

STEP 05 · Semantic Chunking

Sanitised, classified static policy documents are segmented into semantic chunks (target 400–600 tokens, 10% overlap) using a paragraph-aware chunker. Headings, section references, and statutory provision numbers are preserved as metadata. Each chunk is assigned a unique ID linked to its parent document, source URL, version number, effective date, and supersession status. Chunks from “RETIRED” documents are excluded.
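A paragraph-aware chunker with overlap can be sketched as below. This is a simplified illustration under several assumptions: tokens are approximated by whitespace-separated words, not by a model tokenizer, paragraphs are separated by blank lines, and the chunk ID format is invented. A single paragraph longer than the limit is left whole here, which a production chunker would need to handle.

```python
def chunk_document(doc_id: str, text: str, max_tokens: int = 600,
                   overlap_ratio: float = 0.10) -> list[dict]:
    """Group whole paragraphs into chunks, carrying a tail of words forward as overlap."""
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []   # words in the chunk being built

    def flush() -> None:
        chunks.append({
            "chunk_id": f"{doc_id}#c{len(chunks) + 1:04d}",   # illustrative ID scheme
            "text": " ".join(current),
            "token_count": len(current),
        })

    for words in paragraphs:
        # Close the current chunk when the next paragraph would exceed the limit.
        if current and len(current) + len(words) > max_tokens:
            flush()
            overlap = int(len(current) * overlap_ratio)
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        flush()
    return chunks
```

With three 300-word paragraphs, the first two fill one 600-word chunk, and the second chunk begins with the last 60 words of that chunk followed by the third paragraph. In a fuller version each chunk would also carry the source URL, version number, effective date, and supersession status described above.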

STEP 06 · Vector Embedding

Sanitised chunks are passed to an embedding model (a UK-data-fine-tuned sentence-transformer, hosted on-premises or within a Crown-certified cloud boundary for data residency). Embedding model versioning is recorded — a version change invalidates the vector store and requires full re-embedding. No embedding proceeds on a chunk that has not cleared NER sanitisation (Step 03) and the static/relational gate (Step 04).

STEP 07 · Retrieval & Grounding

At query time, the user query is embedded using the same model version as the corpus. Cosine similarity search returns the top-k chunks (k = 5–8) above the minimum threshold (0.78); queries returning nothing above threshold trigger a “No authoritative source found” response, not a generative fallback. Retrieved chunks are passed to the LLM as the entirety of the context window under the Grounding Protocol.
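The key property of this step is that the refusal is produced by the infrastructure and the model is never called when nothing clears the threshold. A minimal sketch follows, assuming embeddings are plain lists of floats held in a dictionary. The function names and the `call_llm` callback are hypothetical stand-ins for a real vector store and model client.

```python
import math
from typing import Callable

NO_SOURCE_MESSAGE = "No authoritative source found"


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], corpus: dict[str, list[float]],
             k: int = 5, threshold: float = 0.78) -> list[tuple[str, float]]:
    """Return up to k (chunk_id, score) pairs at or above the threshold, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in corpus.items()]
    eligible = [pair for pair in scored if pair[1] >= threshold]
    return sorted(eligible, key=lambda pair: pair[1], reverse=True)[:k]


def answer(query_vec: list[float], corpus: dict[str, list[float]],
           call_llm: Callable[[list[tuple[str, float]]], str]) -> str:
    hits = retrieve(query_vec, corpus)
    if not hits:
        # Hard stop: the model is not invoked, so there is no generative fallback.
        return NO_SOURCE_MESSAGE
    return call_llm(hits)
```

Keeping the hard stop in `answer`, outside the model call, means the refusal does not depend on the model following its system prompt.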

STEP 08 · Output Audit & Logging

Every query-response pair is logged to an immutable audit store with pseudonymised user ID, query text hash, retrieved chunk IDs, similarity scores, response text, timestamp, and a citation verification flag. Audit logs are retained for a minimum of three years. Monthly audit sampling (≥5% of query volume) verifies citation accuracy and detects systematic retrieval failures.
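Drawing the monthly audit sample is a matter of rounding up so that the sample never falls below the 5 percent floor. The sketch below assumes interactions are identified by simple IDs. The function name and the optional `seed` parameter, included so that a draw can be reproduced for an auditor, are illustrative.

```python
import math
import random
from typing import Optional, Sequence


def monthly_audit_sample(interaction_ids: Sequence[str], min_percent: int = 5,
                         seed: Optional[int] = None) -> list[str]:
    """Randomly select at least min_percent of the month's interactions for review."""
    if not interaction_ids:
        return []
    # Round up so small volumes still yield at least one sampled interaction.
    size = math.ceil(len(interaction_ids) * min_percent / 100)
    rng = random.Random(seed)
    return rng.sample(list(interaction_ids), size)
```

With 1,000 logged interactions the sample contains 50 items. With only 10 it still contains 1, because the size is rounded up.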

NER Entity Type Reference: UK Public Sector

The following table defines the entity types that the NER scanning layer (Step 03) must be configured to detect in a UK local authority or school context. It is intended to be used as the specification document for NER model configuration and testing.

| ENTITY TYPE | DESCRIPTION & DETECTION CRITERIA | REDACTION TOKEN | PIPELINE ACTION |
| --- | --- | --- | --- |
| PERSON | Any named individual in a data-subject capacity. Includes first name + surname combinations, and first-name-only references where context identifies the individual. | [REDACTED:PERSON] | Standard |
| ADDRESS | Residential or business addresses including partial addresses (street + postcode) combined with other identifiers. | [REDACTED:ADDRESS] | Standard |
| NINO | National Insurance numbers in all standard formats (AB 12 34 56 C). | [REDACTED:NINO] | Enhanced: auto-quarantine |
| NHS_NUMBER | NHS Numbers in all formats (123 456 7890). | [REDACTED:NHS_NUMBER] | Enhanced: auto-quarantine |
| CASE_REF | Local authority case, benefit, or application reference numbers that could identify an individual in context. | [REDACTED:CASE_REF] | Standard: flag for review |
| DOB | Dates of birth, particularly where in proximity to a PERSON entity within the same chunk. | [REDACTED:DOB] | Standard |
| CHILD_DATA | Any personal data where context identifies the subject as a minor. Includes school roll numbers, SEN reference numbers, pupil names, and age-specific data combined with a PERSON entity. | [REDACTED:CHILD_DATA] | Critical: mandatory DPO review; chunk blocked |
| FINANCIAL_ID | Account numbers, sort codes, benefit payment references, or council tax account numbers. | [REDACTED:FINANCIAL_ID] | Enhanced: auto-quarantine |
| NAMED_OFFICER | Named council officers referenced in a context that could constitute personal data (disciplinary records, complaints, individual decisions). Published procedural roles are assessed case-by-case. | [REDACTED:NAMED_OFFICER] | Contextual: human review |
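For the pattern-based entity types, a rule layer is often run alongside the NER model. The sketch below covers only NINO and NHS_NUMBER, using the formats quoted in the table. The regular expressions are deliberately broad assumptions (they will over-match, for example on other ten-digit numbers), and `scan_and_redact` is a hypothetical helper, not part of any NER library.

```python
import re

# Assumed patterns for the two formats shown in the table, with optional spacing.
PATTERNS = {
    "NINO": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
    "NHS_NUMBER": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
}

# Entity types whose pipeline action in the table is auto-quarantine.
AUTO_QUARANTINE = {"NINO", "NHS_NUMBER", "FINANCIAL_ID"}


def scan_and_redact(text):
    """Return (redacted_text, entity_types_found, quarantine_flag)."""
    found = set()
    for entity, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{entity}]", text)
        if count:
            found.add(entity)
    return text, found, bool(found & AUTO_QUARANTINE)
```

Context-dependent types such as PERSON, CHILD_DATA, and NAMED_OFFICER cannot be handled by patterns like these and remain the job of the NER model and human review.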

The DPIA Obligation for RAG Pipelines

A Data Protection Impact Assessment is not optional for any RAG deployment in a public sector context. UK GDPR Article 35 requires one wherever processing is likely to result in a high risk to the rights and freedoms of individuals, and a local authority RAG deployment is likely to meet several recognised high-risk indicators at once: processing of personal data using new technologies; large-scale processing of special category data, such as health data; processing of data about children and other vulnerable individuals; and automated processing that may affect the rights and freedoms of individuals.

The DPIA must address, as a minimum: the categories of personal data that may be present in ingested documents (even after NER redaction, residual risk must be assessed); the technical and organisational measures mitigating PII leakage at each pipeline stage; the data retention periods for vector embeddings and audit logs; the process for responding to data subject requests that affect ingested documents (including the technical process for removing specific embeddings); the risk assessment for the NER model itself — recognising that NER is not 100 percent accurate and that false negatives represent a residual risk; and the escalation and incident response procedure for confirmed PII leakage events. The DPIA must be completed before the pipeline is tested with real data — including in a pilot or sandbox environment. Testing with real data without a completed DPIA is a UK GDPR violation, not a minor procedural oversight.
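The "technical process for removing specific embeddings" depends on every chunk carrying its parent document ID. A minimal sketch, assuming an in-memory index and a chunk-to-document lookup (both hypothetical structures, as is the `erase_document` name):

```python
def erase_document(vector_index, chunk_to_doc, doc_id):
    """Remove every embedding derived from doc_id and return the erased chunk IDs.

    vector_index maps chunk_id -> vector; chunk_to_doc maps chunk_id -> parent doc_id.
    """
    targets = sorted(cid for cid, parent in chunk_to_doc.items() if parent == doc_id)
    for chunk_id in targets:
        vector_index.pop(chunk_id, None)  # tolerate chunks already missing from the index
        del chunk_to_doc[chunk_id]
    return targets
```

The returned list of chunk IDs gives the DPO a record of exactly what was erased in response to the request.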

NCSC Cloud Security Note

Where any part of the RAG pipeline (the embedding model, the vector store, or the LLM inference endpoint) is hosted in a cloud environment, the provider must meet the NCSC Cloud Security Principles, and data residency must be assessed against the Data (Use and Access) Act 2026 provisions on international transfers. No personal data, including pseudonymised personal data, may be processed outside the UK without a lawful transfer mechanism. Vector embeddings derived from documents containing personal data should themselves be treated as personal data under UK GDPR.

Closing Statement: From Data Risk to Data Dividend

The investment case for AI in public services is compelling. The efficiency gains from policy retrieval at machine speed, the reduction in handling time for complex enquiries, the ability to surface the right regulatory guidance at the point of need — these are genuine, measurable improvements to public service quality. But those gains are only realisable by organisations that have built their AI capability on a foundation of data quality, data governance, and data trust.

Some organisations will fail, and some already have: in unreported pilots quietly discontinued after giving residents incorrect eligibility guidance, or after a Subject Access Request revealed that citizen personal data had entered a vector store without authorisation. These are the organisations that treated data readiness as a prerequisite someone else would sort out, or as a detail to be resolved after deployment.

This playbook is a commitment to a different approach. It places data readiness at the centre of the AI deployment decision. It provides the tools — the AID-R matrix, the Anti-Hallucination Architecture, the Zero-Trust Ingestion Pipeline — to assess, remediate, and govern the data estate before a single query reaches a language model. And it does so in the language of governance, accountability, and legislative compliance, because that is the language in which these decisions are and must be made.

The organisations that get this right will not simply deploy AI tools that work. They will build the institutional capability — the data discipline, the governance culture, and the technical infrastructure — that makes every future AI deployment faster, safer, and more trustworthy. That is not a technology dividend. It is a data dividend. And it is available to every public sector organisation willing to do the foundational work.

Engage with You & AI

To commission an AID-R assessment for your local authority or school trust, or to discuss implementing the Anti-Hallucination RAG Architecture or Zero-Trust Ingestion Pipeline specifications in your organisation, get in touch. You & AI provides independent, non-commercial AI governance advisory to public sector and education organisations.