Data Governance
Purpose
This document defines how the AI Conveyor reviews data for initiatives: availability, quality, access, confidentiality, the legality of processing, and suitability for delivery.
Data is one of the main sources of initiative failure. If the data question is deferred until delivery, the team often finds out too late that the required source does not exist, access is impossible, quality is low, or the data cannot be used in the chosen environment.
Core ideas
- The data owner must be known. You cannot build an initiative on a source that no one is responsible for.
- Data quality is checked before delivery. At the assessment stage a preliminary check is enough, but before adoption you need facts.
- Access must match the purpose. Analysis, training, validation, and production operation are different access modes.
- Sensitive data requires separate control. Personal data, banking secrecy, trade secrets, and customer information must not end up in unsuitable environments.
- Minimization matters more than convenience. An initiative should use only the data that is truly needed for the result.
How it works
Data governance is embedded in the stage gates:
| Stage | What is checked | Typical decision |
|---|---|---|
| New | whether it is clear which data might be needed | send to assessment or clarify the brief |
| Assessment | whether sources exist, who the owner is, whether there are constraints | choose a product, request access, defer, or reject |
| Delivery | quality, access, processing environment, masking, logging | allow development, restrict the environment, request rework |
| Awaiting impact | whether actual data is available to measure the result | confirm the impact or revise the methodology |
| Support | source stability, quality control, access changes | continue, reconsider, or stop the solution |
Related sections: AI governance model, AI risks, architecture governance.
Minimum data card
For an initiative you need to record:
- which data sources are used;
- who owns each source;
- what the data is used for;
- which fields or datasets are needed;
- whether there is personal data;
- whether there is banking or trade secrecy;
- where the data will be processed;
- who gets access;
- how long the data is retained;
- how the data is deleted or anonymized;
- which data quality metric is critical for the result.
This does not always need to be a separate large form. At early maturity, a set of mandatory fields on the initiative card plus clarification tasks is enough.
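As an illustration, the mandatory fields can be modeled as a small record with a check for unanswered items. This is a minimal sketch, not a prescribed schema: the class name `DataCard`, the field names, and the helper `missing_fields` are all hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class DataCard:
    """Hypothetical minimum data card; field names are illustrative only."""
    sources: List[str] = field(default_factory=list)        # which data sources are used
    owners: Dict[str, str] = field(default_factory=dict)    # source name -> owner
    purpose: Optional[str] = None                           # what the data is used for
    fields_needed: List[str] = field(default_factory=list)  # fields or datasets needed
    has_personal_data: Optional[bool] = None                # None means "not answered yet"
    has_banking_or_trade_secrecy: Optional[bool] = None
    processing_location: Optional[str] = None               # where the data is processed
    access_granted_to: List[str] = field(default_factory=list)
    retention_period: Optional[str] = None
    deletion_or_anonymization: Optional[str] = None
    critical_quality_metric: Optional[str] = None


def is_unanswered(value) -> bool:
    """None and empty strings/lists/dicts count as unanswered; False is a valid answer."""
    if value is None:
        return True
    if isinstance(value, (str, list, dict)):
        return len(value) == 0
    return False


def missing_fields(card: DataCard) -> List[str]:
    """Return the names of unanswered fields, plus any source without an owner."""
    missing = [f.name for f in fields(card) if is_unanswered(getattr(card, f.name))]
    for source in card.sources:
        if source not in card.owners:
            missing.append(f"owner of {source}")
    return missing


# Example: a partly filled card; each missing item can become a clarification task.
card = DataCard(sources=["crm_contacts"], purpose="churn scoring", has_personal_data=False)
for item in missing_fields(card):
    print("clarify:", item)
```

Each item returned by `missing_fields` maps naturally onto one clarification task on the initiative card.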
Data classification
Minimum classification:
| Class | Examples | Control |
|---|---|---|
| Open | public reference data, published materials | basic source verification |
| Internal | regulations, instructions, anonymized metrics | access for employees and permitted environments only |
| Confidential | management reporting, commercial data, contracts | access restriction, logging, owner sign-off |
| Sensitive | personal data, banking secrecy, customer transactions | separate sign-off, minimization, masking, closed environment |
| Critical | data affecting money, risk, legal actions, or security | extended control, independent review, rollback plan |
The data class affects which AI product can be used, where the data may be processed, and which artifacts are needed to move forward.
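The classification table can be expressed as a lookup from class to controls. The sketch below is an assumption about how this might be encoded: the names `DataClass`, `CONTROLS`, and `required_controls` are hypothetical, and combining the controls of every class present in an initiative (rather than applying only the strictest one) is a design choice the table itself does not prescribe.

```python
from enum import Enum
from typing import Iterable, List


class DataClass(Enum):
    OPEN = "open"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    SENSITIVE = "sensitive"
    CRITICAL = "critical"


# Controls copied from the classification table; dict order follows the table rows.
CONTROLS = {
    DataClass.OPEN: ["basic source verification"],
    DataClass.INTERNAL: ["access for employees and permitted environments only"],
    DataClass.CONFIDENTIAL: ["access restriction", "logging", "owner sign-off"],
    DataClass.SENSITIVE: ["separate sign-off", "minimization", "masking", "closed environment"],
    DataClass.CRITICAL: ["extended control", "independent review", "rollback plan"],
}


def required_controls(classes: Iterable[DataClass]) -> List[str]:
    """Collect controls for every data class present (assumption: controls add up)."""
    present = set(classes)
    result: List[str] = []
    for data_class, controls in CONTROLS.items():
        if data_class in present:
            for control in controls:
                if control not in result:
                    result.append(control)
    return result


# Example: an initiative that mixes public reference data with customer transactions.
print(required_controls([DataClass.OPEN, DataClass.SENSITIVE]))
```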
Data quality check
Data quality is assessed relative to the task, not in the abstract.
Minimum criteria:
- completeness — is there enough data for the solution;
- timeliness — is the data out of date;
- accuracy — how well the data reflects the real process;
- stability — does the source structure change without warning;
- linkability — can the data be matched across systems;
- reproducibility — can the calculation or training be repeated;
- availability — can the data be obtained at the required frequency.
If data quality is unknown, an initiative may go into assessment, but it should not go into full delivery without a data check task.
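The rule about unknown quality can be sketched as a small check. The representation is an assumption: each criterion is recorded as `True`, `False`, or `None` (not yet assessed), and the function names are hypothetical. The sketch covers only the "unknown quality" rule; a criterion known to fail is a separate matter.

```python
from typing import Dict, List, Optional

QUALITY_CRITERIA = (
    "completeness",
    "timeliness",
    "accuracy",
    "stability",
    "linkability",
    "reproducibility",
    "availability",
)


def unknown_criteria(assessment: Dict[str, Optional[bool]]) -> List[str]:
    """Criteria that are absent or None have not been assessed yet."""
    return [c for c in QUALITY_CRITERIA if assessment.get(c) is None]


def may_enter_full_delivery(assessment: Dict[str, Optional[bool]],
                            has_data_check_task: bool) -> bool:
    """Unknown quality never blocks assessment, but full delivery then needs a data check task."""
    return not unknown_criteria(assessment) or has_data_check_task


# Example: only completeness has been looked at so far.
partial = {"completeness": True}
print(unknown_criteria(partial))
print(may_enter_full_delivery(partial, has_data_check_task=False))
```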
Access and environments
For each initiative you need to distinguish access modes:
- viewing data for assessment;
- exporting a limited set for validation;
- processing in an isolated environment;
- training or tuning the solution;
- production access in operation;
- access by the AI assistant or an agentic scenario.
The rule: the higher the data sensitivity and the impact of the solution, the fewer manual exports are allowed and the stricter the processing environment must be.
For external or cloud services, you must separately verify:
- whether data can be sent there;
- where the information is physically processed;
- who has access to the logs;
- whether the data is used to train external models;
- whether the data can be deleted on request;
- whether there are contractual and regulatory constraints.
What blocks a transition
An initiative should not move into delivery if:
- a data source has not been chosen;
- the data owner is unknown;
- the data is sensitive but there is no decision on the processing environment;
- there is no permission to use the data for the chosen purpose;
- data quality does not allow the hypothesis to be tested;
- the AI product is unsuitable for the data class;
- security has refused sign-off;
- the impact cannot be measured from the available sources.
An initiative may move forward with a restriction if the risk is clear, there is an exception owner, and compensating actions are assigned.
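The blocking conditions and the exception path can be sketched as a gate check. All names here (`GateInput`, `blockers`, `gate_decision`) are hypothetical. The sketch also assumes that the exception path may apply to any blocker, which the text above does not spell out; in practice some blockers, such as a refused security sign-off, may not be eligible for an exception.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GateInput:
    """Hypothetical facts known about an initiative at the delivery gate."""
    source_chosen: bool = False
    owner_known: bool = False
    is_sensitive: bool = False
    environment_decided: bool = False
    purpose_permitted: bool = False
    quality_supports_hypothesis: bool = False
    product_fits_data_class: bool = False
    security_signoff_negative: bool = False
    impact_measurable: bool = False
    # Exception path: all three must be present to proceed with a restriction.
    risk_is_clear: bool = False
    exception_owner: Optional[str] = None
    compensating_actions: Tuple[str, ...] = ()


def blockers(g: GateInput) -> List[str]:
    """Return the blocking conditions that currently hold, in the order listed above."""
    checks = [
        (not g.source_chosen, "a data source has not been chosen"),
        (not g.owner_known, "the data owner is unknown"),
        (g.is_sensitive and not g.environment_decided,
         "sensitive data without a decision on the processing environment"),
        (not g.purpose_permitted, "no permission to use the data for the chosen purpose"),
        (not g.quality_supports_hypothesis,
         "data quality does not allow the hypothesis to be tested"),
        (not g.product_fits_data_class, "the AI product is unsuitable for the data class"),
        (g.security_signoff_negative, "security sign-off is negative"),
        (not g.impact_measurable, "the impact cannot be measured from the available sources"),
    ]
    return [message for failed, message in checks if failed]


def gate_decision(g: GateInput) -> str:
    """'proceed', 'proceed with restriction', or 'blocked'."""
    if not blockers(g):
        return "proceed"
    # Assumption: a documented exception can cover any blocker.
    if g.risk_is_clear and g.exception_owner and g.compensating_actions:
        return "proceed with restriction"
    return "blocked"
```

Returning the list of blockers, rather than a bare yes or no, lets each one be turned into a rework or clarification task.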
The role of the AI assistant
The AI assistant can help to:
- assemble a description of data sources;
- ask questions about the owner and access;
- prepare a draft classification;
- point out processing risks;
- create data check tasks;
- prepare a description for security or architecture.
But the AI assistant must not authorize the use of sensitive data on its own. The decision rests with the data owner, security, and the accountable roles.
Anti-patterns
Signs of bad data governance:
- "we'll find the data later";
- exports to personal computers;
- training on data without a purpose or retention period;
- no source owner;
- using customer data in an unsuitable environment;
- impact measured by a metric that cannot be obtained from the available data;
- data quality checked after a prototype has been built.
Good data governance makes data part of the early assessment, not a late obstacle.