You&AI
White Paper · Part III — Data Governance
Building AI-Ready Datasets
Safely grounding LLMs in official policy to eliminate hallucinations and safeguard PII.
A governance-grade methodology for the data foundation beneath every public sector AI deployment — featuring the AID-R readiness metric, the Anti-Hallucination RAG Architecture, and the eight-stage Zero-Trust ingestion pipeline.
The hardest part of AI adoption isn't the model. It's the data underneath it.
IN PARTNERSHIP WITH: You & AI
AUDIENCE: CDOs · Data Governance · SROs
VERSION: 1.0 · March 2026
CLASSIFICATION: Official — Sensitive
ALIGNED TO: Data (Use and Access) Act 2026 · GDS AI Dataset Guidelines 2026 · UK GDPR · ICO Generative AI Guidance · NCSC Cloud Security Principles
1. Executive Summary & Legislative Context
The promise of Retrieval-Augmented Generation — that a large language model can be constrained to answer only from a curated, authoritative corpus of policy documents and statutory guidance — is technically sound. The delivery of that promise in any real local authority, NHS trust, or government department is not a technology problem. It is a data problem. And in the overwhelming majority of public sector organisations that have attempted to deploy enterprise AI search tools in the past three years, that data problem has proven to be the decisive barrier between a credible pilot and a scalable, safe, production-grade system.
This playbook names that barrier directly and provides a structured, governance-grade methodology for dismantling it. It is written for Chief Data Officers, Heads of Data Governance, AI Programme Directors, and Senior Responsible Owners who are accountable for the data infrastructure that an AI system will depend upon — and who will be held to account when that system fails.
The Unstructured Legacy Data Problem
Walk through the data repositories of any local authority in England, and you will find the same landscape. There are housing benefit eligibility PDFs scanned in 2011 from paper originals, their text unreadable to any machine parser. There are planning policy Word documents last formally reviewed in 2018, saved and re-saved by successive administrators so that their metadata now shows a modification date of last Tuesday, conferring a false impression of currency. There are intranet pages that reference legislation which has been amended or repealed. There are SharePoint sites containing three different versions of the same staff-facing guidance note — none of them marked as definitive, none of them linked to an authoritative source, and all of them equally discoverable by any AI ingestion pipeline naive enough not to check.
These are not edge cases. They are the dominant condition.
The UK National Audit Office's 2024 review of public sector data maturity found that fewer than 18 percent of local authorities had implemented any form of systematic version control for their internal policy document estate. The Government Digital Service's own assessment, published as part of the 2026 AI Dataset Guidelines consultation, found that 67 percent of government datasets submitted for AI readiness review contained at least one category of structural or semantic deficiency that would cause material retrieval errors in a RAG pipeline.
The consequences of deploying a RAG system on top of this data landscape are predictable and serious. An AI assistant that retrieves a 2017 housing benefit guidance note as its authoritative source — because that document has the highest semantic similarity score to the user's query, and because nothing in the pipeline is checking whether that document has been superseded — will give a resident incorrect eligibility information. That resident may make a consequential life decision based on that information. The local authority will be liable. The public's trust in AI-assisted public services will diminish. And the AI programme, which may have represented years of investment and political capital, will be suspended.
Core Argument
A RAG system is only as reliable as the data it retrieves from. Investing in model capability while neglecting data quality is not a technology strategy — it is a liability strategy. Every pound invested in AI deployment must be preceded by a structured assessment of data readiness. This playbook provides that assessment framework.
Legislative & Policy Context
The Data (Use and Access) Act 2026
The Data (Use and Access) Act 2026 — which received Royal Assent in April 2026 — represents the most significant legislative reform of UK data governance since the Data Protection Act 2018 and the UK GDPR. For AI programme leads, three provisions of the Act are of immediate operational relevance.
Section 12 — Data Standards for AI Systems. Section 12 places a statutory duty on public authorities that deploy AI systems in the exercise of their functions to ensure that the datasets used to train, fine-tune, or ground those systems meet prescribed data quality standards. The Secretary of State is empowered to issue binding Data Standards Notices specifying minimum metadata requirements, version control obligations, and PII redaction standards for AI-ready datasets. The first such Notice, issued in March 2026, mandates that any public authority operating a RAG-based AI tool must maintain a Data Asset Register for all ingested documents, updated within 24 hours of any document status change.
Section 34 — AI-Related Data Subject Rights. Section 34 extends UK GDPR data subject rights to cover AI-mediated decisions and AI-generated outputs that are relied upon in the exercise of a public function. Where a resident's query to an AI assistant returns information that influences a consequential decision — a benefit entitlement determination, planning advice, a school place allocation — that interaction falls within the scope of Section 34 and creates a new category of Data Subject Access Request. Organisations without immutable query logs cannot satisfy this obligation.
Section 47 — Automated Processing Accountability. Section 47 introduces an explicit accountability framework for automated processing in the public sector, requiring that any AI system used in public service delivery be registered with the Information Commissioner's Office, accompanied by a completed Data Protection Impact Assessment (DPIA), and subject to annual third-party algorithmic audit. Organisations that cannot demonstrate data lineage — the ability to trace every AI output back to its source document — will fail the algorithmic audit requirement.
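The 24-hour Data Asset Register requirement described under Section 12 lends itself to a simple automated check. The following is a minimal sketch only: the `RegisterEntry` fields, the status values, and the `overdue_entries` helper are hypothetical names chosen for illustration and are not drawn from the Act or from any Data Standards Notice. It flags documents whose status changed but whose register entry was not brought up to date within the 24-hour window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Window taken from the text: register updated within 24 hours of a status change.
UPDATE_WINDOW = timedelta(hours=24)


@dataclass
class RegisterEntry:
    """One ingested document in a (hypothetical) Data Asset Register."""
    document_id: str
    title: str
    status: str                               # illustrative: "current", "superseded", "withdrawn"
    status_changed_at: datetime               # when the document's status last changed
    register_updated_at: Optional[datetime]   # when the register last recorded this document


def overdue_entries(entries: List[RegisterEntry], now: datetime) -> List[RegisterEntry]:
    """Return entries whose latest status change was recorded late or is overdue."""
    overdue = []
    for entry in entries:
        deadline = entry.status_changed_at + UPDATE_WINDOW
        updated = entry.register_updated_at
        if updated is not None and updated >= entry.status_changed_at:
            # The register reflects the change; was it recorded inside the window?
            late = updated > deadline
        else:
            # The change is not yet recorded; has the window already closed?
            late = now > deadline
        if late:
            overdue.append(entry)
    return overdue
```

A scheduled job could run a check of this kind against the register and raise any overdue entries with the data governance team before the next ingestion run.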
GDS “Guidelines for Making Government Datasets Ready for AI” (2026)
The Government Digital Service published its Guidelines for Making Government Datasets Ready for AI in January 2026, following an eighteen-month cross-departmental consultation. The Guidelines establish six binding principles for any government dataset intended for use in an AI pipeline: Semantic Accessibility (machine-readable, semantically structured formats); Provenance Integrity (immutable provenance metadata); Temporal Validity (explicit effective and expiry dates, superseded versions quarantined); PII Separation (personal data architecturally separated from policy data at the infrastructure layer, not merely at the access-control layer); Audit Completeness (every AI interaction generates a complete, tamper-evident audit record); and Human Oversight (a defined human oversight checkpoint for consequential outputs). The AI Data-Readiness Audit Metric (AID-R) presented in Section 2 is directly aligned to these six principles.
UK GDPR & the ICO Guidance on Generative AI
The ICO's updated Guidance on Generative AI and Data Protection, published in February 2025, establishes that the deployment of a generative AI tool — including a RAG-based system — by a public authority constitutes processing of personal data where the system's training data, fine-tuning data, or retrieved context chunks contain personal data relating to identifiable individuals. This has significant implications for local authority RAG deployments: even static policy documents may contain references to named individuals in case law citations, named officers in procedural guidance, or named residents in published decision notices.
The ICO guidance requires that organisations complete a DPIA before deployment, implement the principle of data minimisation at the point of ingestion, and respond to data subject erasure requests that affect ingested documents within 30 days — including the removal of affected vector embeddings from the vector store.
2. The AI Data-Readiness Audit Metric (AID-R)
The AI Data-Readiness Audit Metric (AID-R) is a structured diagnostic instrument for evaluating the AI-readiness of any internal data repository across three critical vectors. It is designed to generate a defensible, evidence-based maturity assessment that can be presented to Senior Civil Service leadership, an AI Programme Board, or a Local Digital Fund approval panel. The AID-R is not a self-certification exercise; it requires documentation of evidence for each stage assessment and must be validated by the organisation's Data Protection Officer and, where applicable, the Local Authority Digital Service (LADS) data assurance function.
How to Conduct an AID-R Assessment
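As a minimal sketch of the scoring step, assuming each of the three vectors has already been rated at Stage 1 to 4 and that the ratings are simply summed to a maximum of 12: the function name `aidr_score` and its validation behaviour are illustrative assumptions, and the vectors are deliberately left unnamed here.

```python
from typing import Sequence

MIN_STAGE = 1      # lowest stage rating for a vector
MAX_STAGE = 4      # highest stage rating for a vector
VECTOR_COUNT = 3   # the assessment rates three vectors


def aidr_score(stage_ratings: Sequence[int]) -> int:
    """Sum three Stage 1-4 ratings into an AID-R score between 3 and 12."""
    if len(stage_ratings) != VECTOR_COUNT:
        raise ValueError(f"expected {VECTOR_COUNT} ratings, got {len(stage_ratings)}")
    for rating in stage_ratings:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(rating, bool) or not isinstance(rating, int):
            raise ValueError(f"rating must be a whole number, got {rating!r}")
        if not MIN_STAGE <= rating <= MAX_STAGE:
            raise ValueError(f"rating must be {MIN_STAGE} to {MAX_STAGE}, got {rating}")
    return sum(stage_ratings)


print(aidr_score([2, 3, 1]))  # prints 6
```

Rejecting out-of-range or missing ratings, rather than silently defaulting them, keeps the resulting score defensible when it is presented as evidence.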
AID-R score = sum of the three vector ratings (Stage 1–4 each). Maximum score: 12.
AID-R Score Interpretation Guide
3. The Anti-Hallucination RAG Architecture
The single most consequential misunderstanding in public sector AI procurement is the conflation of two categorically different operational modes of a large language model: semantic search within a restricted context window, and open generative creative writing across the model's full training distribution. These are not points on a spectrum. They are distinct functional regimes with entirely different risk profiles, different appropriate use cases, and different governance requirements. Any AI governance framework that does not explicitly distinguish between them — and enforce that distinction at the system architecture level — is not fit for purpose in a public sector context.
Semantic Search vs. Generative Creative Writing
The following comparison is designed to be used directly in AI programme board presentations and procurement specifications.
The governance implication is unambiguous: any public sector enterprise search tool — whether DWP Ask, GOV.UK Chat, or a local authority knowledge assistant — must be architected and governed in Semantic Search / RAG Mode. Open generative mode must be explicitly disabled for all public-facing and decision-adjacent use cases. Where open generative capability is made available for internal staff productivity, it must operate within a separate interface, carry explicit “AI-generated — not policy-verified” watermarking, and be subject to mandatory human review before any output influences a consequential decision.
The Mandatory Grounding Protocol
The Grounding Protocol is the set of non-negotiable architectural and governance constraints that must be applied to any RAG-based system operating in a public sector context. It operates at four levels: system prompt constraints, retrieval architecture constraints, user interface requirements, and output audit requirements.
Level 1 — System Prompt Constraints
The system prompt — the hidden instruction set that governs the model's behaviour for every query — must contain, verbatim or in substantive equivalent, the following four mandatory instructions. These are governance requirements derived from the ICO Guidance on Generative AI, the GDS AI Dataset Guidelines, and the hallucination-mitigation standards emerging from the Central Digital and Data Office's AI Assurance Framework.
Mandatory Instruction 1
You are a policy information retrieval assistant operating exclusively within a defined, authorised knowledge base. You must answer only using information present in the context window provided to you in this query. You must not draw upon your training data, your general knowledge, or any information source outside the retrieved context. If the retrieved context does not contain sufficient information to answer the query, you must respond with the exact phrase: “I cannot find an authoritative answer to this query in the approved knowledge base. Please contact [designated escalation channel].”
Mandatory Instruction 2
Every factual claim in your response must be followed immediately by an inline citation in the format: [Source: Document Title, Section X.X, effective date DD/MM/YYYY, URL]. You must only cite sources present in the context window. You must never fabricate, infer, or extrapolate a citation. If a factual claim cannot be directly linked to a specific retrieved chunk, that claim must be omitted.
Mandatory Instruction 3
Where a query concerns a matter governed by primary or secondary legislation, your response must identify the specific statutory provision by Act, Section, and Subsection — for example, “Section 193(2) of the Housing Act 1996, as amended.” You must only cite statutory provisions explicitly referenced in the retrieved context chunks.
Mandatory Instruction 4
Your response must conclude with the following disclaimer in all cases: “This response is generated from the Council's approved policy knowledge base, current as of [document effective date]. It does not constitute legal advice. For complex or consequential decisions, please consult the relevant policy team or a qualified legal adviser.”
Level 2 — Retrieval Architecture Constraints
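One retrieval-layer constraint consistent with the Temporal Validity principle can be sketched as follows. This is an illustration under stated assumptions, not the specification: the `Chunk` fields, the `filter_chunks` function, and the 0.75 similarity floor are all hypothetical. The filter discards superseded, not-yet-effective, or expired chunks, and anything below the similarity floor, before the context window is assembled.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Chunk:
    """A retrieved chunk plus the validity metadata of its source document."""
    chunk_id: str
    text: str
    similarity: float            # cosine similarity to the query
    effective_from: date         # date the source document took effect
    expires_on: Optional[date]   # None means no stated expiry
    superseded: bool             # True once a newer version replaces the source


def filter_chunks(chunks: List[Chunk], today: date,
                  min_similarity: float = 0.75) -> List[Chunk]:
    """Keep only in-force, sufficiently similar chunks, best match first."""
    eligible = [
        c for c in chunks
        if not c.superseded
        and c.effective_from <= today
        and (c.expires_on is None or today <= c.expires_on)
        and c.similarity >= min_similarity
    ]
    return sorted(eligible, key=lambda c: c.similarity, reverse=True)
```

If the filter returns an empty list, the safest behaviour is to skip generation and return the fixed refusal phrase from Mandatory Instruction 1, rather than passing weak or out-of-date context to the model.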
Level 3 — User Interface Requirements
Level 4 — Output Audit Requirements
Every query-response interaction must be captured in an immutable, tamper-evident audit log that satisfies Section 47 of the Data (Use and Access) Act 2026 and the ICO's Generative AI Guidance. For every interaction the log must record: a pseudonymised user identifier; the query text (hashed); the chunk IDs of all retrieved documents; the cosine similarity score for each chunk; the full response text; the result of the post-generation URL verification layer; any “Flag this response” submissions; and the timestamp of every stage from query receipt to response delivery. Audit logs must be retained for a minimum of three years and be accessible to the DPO on request within 48 hours.
4. The PII Leakage Defence Protocol
The PII Leakage Defence Protocol is the most operationally critical component of any local authority or school AI deployment. It is also the component most likely to be inadequately addressed in early deployment plans, because the teams responsible for AI procurement are typically not the teams with expertise in data protection, and because the risks are non-obvious until they materialise. This section specifies, step by step, the corporate ingestion pipeline that every public sector organisation must implement before any document from its data estate enters a RAG vector store.
Static Policy Data vs. Relational Operational Data
Before detailing the ingestion pipeline, it is essential to establish and rigorously enforce the most important conceptual boundary in public sector AI data governance: the distinction between static policy data and relational operational data. Confusion between these two categories is the primary vector through which PII enters RAG pipelines inappropriately.
Safe for RAG retrieval: Static Policy Data
Prohibited from RAG pipeline: Relational Operational Data
This distinction must be enforced at the infrastructure layer — in the ingestion pipeline architecture — and not merely through policy guidance to staff. A policy that instructs staff not to upload personal data to the RAG ingestion queue is insufficient. Human error, misclassification, and scope creep will occur. The pipeline must be designed to catch and block PII even when it arrives.
The Eight-Stage Zero-Trust Ingestion Pipeline
NER Entity Type Reference: UK Public Sector
The following table defines the entity types that the NER scanning layer (Stage 03) must be configured to detect in a UK local authority or school context. It is intended to be used as the specification document for NER model configuration and testing.
The DPIA Obligation for RAG Pipelines
A Data Protection Impact Assessment is not optional for any RAG deployment in a public sector context. It is a legal requirement under UK GDPR Article 35, triggered by the combination of: systematic processing of personal data using new technologies; large-scale processing of special category data (such as health data) or of data about children and other vulnerable individuals; and automated processing that may affect the rights and freedoms of individuals. All three triggers are likely to be met by a local authority RAG deployment.
The DPIA must address, as a minimum: the categories of personal data that may be present in ingested documents (even after NER redaction, residual risk must be assessed); the technical and organisational measures mitigating PII leakage at each pipeline stage; the data retention periods for vector embeddings and audit logs; the process for responding to data subject requests that affect ingested documents (including the technical process for removing specific embeddings); the risk assessment for the NER model itself, recognising that NER is not 100 percent accurate and that false negatives represent a residual risk; and the escalation and incident response procedure for confirmed PII leakage events.
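The "technical process for removing specific embeddings" is easier to document in a DPIA when the vector store keeps an index from each source document to its chunks. The sketch below is a toy in-memory illustration of that idea; the class `InMemoryVectorStore` and its methods are hypothetical and do not correspond to any particular vector database product, whose own deletion API would need to be used in practice.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


class InMemoryVectorStore:
    """Toy store that can erase every embedding derived from one document."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}                        # chunk_id -> embedding
        self._chunks_by_document: Dict[str, Set[str]] = defaultdict(set)  # document_id -> chunk_ids

    def add(self, document_id: str, chunk_id: str, embedding: Sequence[float]) -> None:
        """Store an embedding and record which document it came from."""
        self._vectors[chunk_id] = list(embedding)
        self._chunks_by_document[document_id].add(chunk_id)

    def erase_document(self, document_id: str) -> List[str]:
        """Delete all embeddings for a document; return the removed chunk IDs."""
        chunk_ids = sorted(self._chunks_by_document.pop(document_id, set()))
        for chunk_id in chunk_ids:
            del self._vectors[chunk_id]
        return chunk_ids  # suitable for recording in the audit trail

    def __len__(self) -> int:
        return len(self._vectors)
```

Returning the list of removed chunk IDs gives the DPO concrete evidence that an erasure request was carried out, and an empty list signals that the document was never ingested.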
The DPIA must be completed before the pipeline is tested with real data — including in a pilot or sandbox environment. Testing with real data without a completed DPIA is a UK GDPR violation, not a minor procedural oversight.
NCSC Cloud Security Note
Where the RAG pipeline — including the embedding model, the vector store, or the LLM inference endpoint — is hosted in a cloud environment, the provider must meet the NCSC Cloud Security Principles, and data residency must be assessed against the Data (Use and Access) Act 2026 provisions on international transfers. No personal data — including pseudonymised personal data — may be processed outside the UK without a lawful transfer mechanism. Vector embeddings derived from documents containing personal data are themselves personal data under UK GDPR and must be treated accordingly.
Closing Statement: From Data Risk to Data Dividend
The investment case for AI in public services is compelling. The efficiency gains from policy retrieval at machine speed, the reduction in handling time for complex enquiries, the ability to surface the right regulatory guidance at the point of need — these are genuine, measurable improvements to public service quality. But those gains are only realisable by organisations that have built their AI capability on a foundation of data quality, data governance, and data trust.
The organisations that will fail — and some already have, in unreported pilots quietly discontinued after producing incorrect eligibility guidance to residents, or after a Subject Access Request revealed that citizen personal data had entered a vector store without authorisation — are the organisations that treated data readiness as a prerequisite someone else would sort out, or as a detail to be resolved after deployment.
This playbook is a commitment to a different approach. It places data readiness at the centre of the AI deployment decision. It provides the tools — the AID-R matrix, the Anti-Hallucination Architecture, the Zero-Trust Ingestion Pipeline — to assess, remediate, and govern the data estate before a single query reaches a language model. And it does so in the language of governance, accountability, and legislative compliance, because that is the language in which these decisions are and must be made.
The organisations that get this right will not simply deploy AI tools that work. They will build the institutional capability — the data discipline, the governance culture, and the technical infrastructure — that makes every future AI deployment faster, safer, and more trustworthy. That is not a technology dividend. It is a data dividend. And it is available to every public sector organisation willing to do the foundational work.
Engage with You & AI
To commission an AID-R assessment for your local authority or school trust, or to discuss implementing the Anti-Hallucination RAG Architecture or Zero-Trust Ingestion Pipeline specifications in your organisation, get in touch. You & AI provides independent, non-commercial AI governance advisory to public sector and education organisations.