What do we need to know about accuracy and statistical accuracy?
Latest updates
15 March 2023 - This is a new chapter with existing content. Following the restructuring of this guidance under the data protection principles, the statistical accuracy content (which used to sit in the chapter ‘What do we need to do to ensure lawfulness, fairness, and transparency in AI systems?’) has moved into this new chapter on the accuracy principle. Statistical accuracy remains key for fairness, but we felt it was more appropriate to host it in a chapter that focuses on accuracy.
At a glance
Statistical accuracy refers to the proportion of answers that an AI system gets correct.
This section explains the controls you can implement so that your AI systems are sufficiently statistically accurate to ensure that the processing of personal data complies with the fairness principle.
Who is this section for?
This section is aimed at technical specialists, who are best placed to assess the statistical accuracy of an AI system and what personal data is required to improve it. It will also be useful for those in compliance-focused roles to understand how statistical accuracy is linked to fairness.
In detail
- What is the difference between accuracy in data protection law and ‘statistical accuracy’ in AI?
- How should we define and prioritise different statistical accuracy measures?
- What should we consider about statistical accuracy?
- What should we do about statistical accuracy post-deployment?
What is the difference between accuracy in data protection law and ‘statistical accuracy’ in AI?
It is important to note that the word ‘accuracy’ has a different meaning in the contexts of data protection and AI. Accuracy in data protection is one of the fundamental principles, requiring you to ensure that personal data is accurate and, where necessary, kept up to date. It requires you to take all reasonable steps to make sure the personal data you process is not ‘incorrect or misleading as to any matter of fact’ and, where necessary, is corrected or deleted without undue delay.
Broadly, accuracy in AI (and, more generally, in statistical modelling) refers to how often an AI system guesses the correct answer, measured against correctly labelled test data. The test data is usually separated from the training data prior to training, or drawn from a different source (or both). In many contexts, the answers the AI system provides will be personal data. For example, an AI system might infer someone’s demographic information or their interests from their behaviour on a social network.
So, for clarity, in this guidance, we use the terms:
- ‘accuracy’ to refer to the accuracy principle of data protection law; and
- ‘statistical accuracy’ to refer to the accuracy of an AI system itself.
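To make ‘statistical accuracy’ concrete, the following is a minimal sketch of how it is typically measured against labelled test data held back from training. It uses scikit-learn for convenience; the dataset and model are illustrative assumptions, not recommendations.

```python
# A minimal sketch of measuring statistical accuracy against held-back
# test data. The dataset and model choices are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold back 25% of the labelled data before training; the model never
# sees the test set, so the score estimates performance on unseen cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Statistical accuracy: the proportion of test cases labelled correctly.
print(f"Statistical accuracy: {model.score(X_test, y_test):.2%}")
```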
Fairness, in a data protection context, generally means that you should handle personal data in ways that people would reasonably expect and not use it in ways that have unjustified adverse effects on them. Improving the ‘statistical accuracy’ of your AI system’s outputs is one of your considerations to ensure compliance with the fairness principle.
Data protection’s accuracy principle applies to all personal data, whether it is information about an individual used as an input to an AI system, or an output of the system. However, this does not mean that an AI system needs to be 100% statistically accurate to comply with the accuracy principle.
In many cases, the outputs of an AI system are not intended to be treated as factual information about the individual. Instead, they are intended to represent a statistically informed guess as to something which may be true about the individual now or in the future. To avoid such personal data being misinterpreted as factual, you should ensure that your records indicate that they are statistically informed guesses rather than facts. Your records should also include information about the provenance of the data and the AI system used to generate the inference.
You should also record if it becomes clear that the inference was based on inaccurate data, or the AI system used to generate it is statistically flawed in a way which may have affected the quality of the inference.
Similarly, if the processing of an incorrect inference may have an impact on an individual, they may request the inclusion of additional information in their record countering that inference. This helps ensure that any decisions taken on the basis of the potentially incorrect inference are informed by any evidence that it may be wrong.
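One way to support this in practice is to store each inference alongside flags and provenance metadata, so that downstream users cannot mistake it for verified fact. The sketch below is illustrative only; all field names are hypothetical.

```python
# An illustrative sketch of recording a model output as a statistically
# informed guess rather than a fact. All field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class InferenceRecord:
    subject_id: str
    inferred_value: str                 # eg an inferred interest category
    is_inference: bool = True           # a guess, not a verified fact
    confidence: float = 0.0             # the model's confidence in the output
    model_version: str = ""             # provenance: which system produced it
    data_sources: list[str] = field(default_factory=list)      # provenance of inputs
    known_issues: list[str] = field(default_factory=list)      # eg inaccurate input data
    counter_evidence: list[str] = field(default_factory=list)  # added at the individual's request

record = InferenceRecord(
    subject_id="subject-123",
    inferred_value="interested in hiking",
    confidence=0.72,
    model_version="interest-classifier v2.1",
    data_sources=["social network activity log"],
)
```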
The UK GDPR mentions statistical accuracy in the context of profiling and automated decision-making at Recital 71. This states that organisations should put in place ‘appropriate mathematical and statistical procedures’ for the profiling of individuals as part of their technical measures. You should ensure that any factors that may result in inaccuracies in personal data are corrected and that the risk of errors is minimised.
If you use an AI system to make inferences about people, you need to ensure that the system is sufficiently statistically accurate for your purposes. This does not mean that every inference has to be correct, but you do need to factor in the possibility of them being incorrect and the impact this may have on any decisions that you may take on the basis of them. Failure to do this could mean that your processing is not compliant with the fairness principle. It may also impact on your compliance with the data minimisation principle, as personal data, which includes inferences, must be adequate and relevant for your purpose.
Your AI system therefore needs to be sufficiently statistically accurate to ensure that any personal data generated by it is processed lawfully and fairly.
However, overall statistical accuracy is not a particularly useful measure, and usually needs to be broken down into different measures. It is important to measure and prioritise the right ones. If you are in a compliance role and are unsure what these terms mean, you should consult colleagues in the relevant technical roles.
Further reading in this guidance
Model evaluation – How should we evaluate our model’s statistical accuracy?
Further reading outside this guidance
Read our guidance on accuracy as well as our guidance on the rights to rectification and erasure.
European guidelines on automated decision-making and profiling.
How should we define and prioritise different statistical accuracy measures?
Statistical accuracy, as a general measure, is about how closely an AI system’s predictions match the correct labels as defined in the test data.
For example, if an AI system is used to classify emails as spam or not spam, a simple measure of statistical accuracy is the number of emails that were correctly classified as spam or not spam, as a proportion of all the emails that were analysed.
However, such a measure could be misleading. For example, if 90% of all emails arriving in an inbox are spam, you could create a 90% accurate classifier simply by labelling everything as spam. But this would defeat the purpose of the classifier, as no genuine email would get through.
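A short sketch makes the pitfall explicit, using the illustrative 90% figure from the example above:

```python
# On an inbox where 90% of email is spam, a "classifier" that labels
# everything as spam scores 90% accuracy while delivering no genuine email.
true_labels = ["spam"] * 90 + ["genuine"] * 10  # 90% of email is spam
predictions = ["spam"] * 100                    # label everything as spam

accuracy = sum(t == p for t, p in zip(true_labels, predictions)) / len(true_labels)
print(f"Headline accuracy: {accuracy:.0%}")     # 90%

genuine_delivered = sum(
    t == p == "genuine" for t, p in zip(true_labels, predictions)
)
print(f"Genuine emails delivered: {genuine_delivered}")  # 0
```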
For this reason, you should use alternative measures of statistical accuracy to assess how good a system is. If you are in a compliance role, you should work with colleagues in technical roles to ensure that you have in place appropriate measures of statistical accuracy given your context and the purposes of processing.
These measures should reflect the balance between two different kinds of errors:
- a false positive or ‘type I’ error: cases that the AI system incorrectly labels as positive when they are actually negative (eg emails classified as spam, when they are genuine); or
- a false negative or ‘type II’ error: cases that the AI system incorrectly labels as negative when they are actually positive (eg emails classified as genuine, when they are actually spam).
It is important to strike the right balance between these two types of error. More useful measures of statistical accuracy reflect this balance (both are illustrated in the sketch after this list), including:
- precision: the percentage of cases identified as positive that are in fact positive (also called ‘positive predictive value’). For example, if nine out of 10 emails that are classified as spam are actually spam, the precision of the AI system is 90%; or
- recall (or sensitivity): the percentage of all cases that are in fact positive that are identified as such. For example, if 10 out of 100 emails are actually spam, but the AI system only identifies seven of them, then its recall is 70%.
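The following sketch computes both measures from the worked figures above (the counts are illustrative):

```python
# Precision and recall computed from raw error counts.
def precision(true_positives: int, false_positives: int) -> float:
    """Of the cases labelled positive, what fraction really are positive?"""
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    """Of the cases that really are positive, what fraction are found?"""
    return true_positives / (true_positives + false_negatives)

# Precision: nine of the 10 emails classified as spam are actually spam.
print(f"Precision: {precision(true_positives=9, false_positives=1):.0%}")  # 90%

# Recall: 10 emails are actually spam, but the system finds only seven.
print(f"Recall: {recall(true_positives=7, false_negatives=3):.0%}")        # 70%
```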
There are trade-offs between precision and recall, which can be assessed using measures such as precision-recall curves or the F1 score. If you place more importance on finding as many of the positive cases as possible (maximising recall), this may come at the cost of some false positives (lowering precision).
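One common way this trade-off arises is through the decision threshold a classifier applies to its scores, as in the sketch below (the scores and labels are invented for illustration):

```python
# Lowering the decision threshold catches more spam (higher recall) but
# misclassifies more genuine email (lower precision). Values are invented.
from sklearn.metrics import precision_score, recall_score

labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = spam, 0 = genuine
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1]  # model spam scores

for threshold in (0.75, 0.5, 0.25):
    preds = [1 if s >= threshold else 0 for s in scores]
    print(
        f"threshold={threshold:.2f}  "
        f"precision={precision_score(labels, preds):.0%}  "
        f"recall={recall_score(labels, preds):.0%}"
    )
# threshold=0.75  precision=100%  recall=50%
# threshold=0.50  precision=75%   recall=75%
# threshold=0.25  precision=67%   recall=100%
```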
In addition, there may be important differences between the consequences of false positives and false negatives on individuals, which could affect the fairness of the processing.
Example
If a CV filtering system used to assist with selecting qualified candidates for interview produces a false positive, an unqualified candidate may be invited to interview, wasting both the employer’s and the applicant’s time.
If it produces a false negative, a qualified candidate will miss an employment opportunity and the organisation will miss a good candidate.
You should prioritise avoiding certain kinds of error based on the severity and nature of the risks.
In general, statistical accuracy as a measure depends on how possible it is to compare the performance of a system’s outputs to some ‘ground truth’ (ie checking the results of the AI system against the real world). For example, a medical diagnostic tool designed to detect malignant tumours could be evaluated against high quality test data, containing known patient outcomes.
In some other areas, a ground truth may be unattainable. This could be because no high-quality test data exists or because what you are trying to predict or classify is subjective (eg whether a social media post is offensive). There is a risk that statistical accuracy is misconstrued in these situations, so that AI systems are seen as being highly statistically accurate even though they are reflecting the average of what a set of human labellers thought, rather than objective truth.
To avoid this, your records should indicate where AI outputs are not intended to reflect objective facts, and any decisions taken on the basis of such personal data should reflect these limitations. This is also an example of where you must take into account the accuracy principle – for more information, see our guidance on the accuracy principle, which refers to accuracy of opinions.
Finally, statistical accuracy is not a static measure. While it is usually measured on static test data (held back from the training data), in real life AI systems are applied to new and changing populations. A system that is statistically accurate on an existing population’s data (eg customers in the last year) may not continue to perform well if the characteristics of that population change, or if the system is applied to a different population in future. Behaviours may change, either of their own accord or because people adapt in response to the system, and the AI system may become less statistically accurate over time.
This phenomenon is referred to in machine learning as ‘concept drift’ or ‘model drift’, and various methods exist for detecting it. For example, you can measure the distance between classification errors over time; increasingly frequent errors may suggest drift.
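As a simple illustration of this idea (more formal detectors, such as the Drift Detection Method, are available in open-source libraries), you could monitor the error rate over a rolling window of recent labelled outcomes and flag when it rises well above the rate observed at deployment. The window size and tolerance below are illustrative assumptions:

```python
# A minimal sketch of drift monitoring via a rolling error rate.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_error_rate: float, window: int = 500,
                 tolerance: float = 2.0):
        self.baseline = baseline_error_rate    # error rate at deployment
        self.tolerance = tolerance             # how much worse is acceptable
        self.recent = deque(maxlen=window)     # 1 = error, 0 = correct

    def record(self, prediction, actual) -> bool:
        """Record one labelled outcome; return True if drift is suspected."""
        self.recent.append(int(prediction != actual))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        error_rate = sum(self.recent) / len(self.recent)
        return error_rate > self.tolerance * self.baseline

# If record() returns True, investigate and consider retraining on new data.
```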
You should regularly assess drift and retrain the model on new data where necessary. As part of your accountability responsibilities, you should decide and document appropriate thresholds for determining whether your model needs to be retrained, based on the nature, scope, context and purposes of the processing and the risks it poses. For example, if your model is scoring CVs as part of a recruitment exercise, and the kinds of skills candidates need in a particular job are likely to change every two years, you should anticipate assessing the need to retrain your model on fresh data at least that often.
In other application domains where the main features don’t change so often (eg recognising handwritten digits), you can anticipate less drift. You will need to assess this based on your own circumstances.
Further reading in this guidance
See ‘What do we need to know about accuracy and statistical accuracy?’.
Further reading outside this guidance
See our guidance on the accuracy principle.
See ‘Learning under concept drift: an overview’ for a further explanation of concept drift.
What should we consider about statistical accuracy?
You should always think carefully from the start about whether it is appropriate to automate any prediction or decision-making process. This should include assessing the effectiveness of the AI system in making statistically accurate predictions about the individuals whose personal data it processes.
You should assess the merits of using a particular AI system in light of its effectiveness in making statistically accurate, and therefore valuable, predictions. Not all AI systems demonstrate a sufficient level of statistical accuracy to justify their use.
If you decide to adopt an AI system, then to comply with the data protection principles, you should:
- ensure that all functions and individuals responsible for its development, testing, validation, deployment, and monitoring are adequately trained to understand the associated statistical accuracy requirements and measures;
- make sure data is clearly labelled as inferences and predictions, and is not claimed to be factual;
- ensure you have managed trade-offs and reasonable expectations; and
- adopt a common terminology that staff can use to discuss statistical accuracy performance measures, including their limitations and any adverse impact on individuals.
What should we do about statistical accuracy post-deployment?
As part of your obligation to implement data protection by design and by default, you should consider statistical accuracy and the appropriate measures to evaluate it from the design phase, and test these measures throughout the AI lifecycle. After deployment, you should implement monitoring, the frequency of which should be proportionate to the impact an incorrect output may have on individuals. The higher the impact, the more frequently you should monitor and report on it. You should also review your statistical accuracy measures regularly to mitigate the risk of concept drift. Your change policy procedures should take this into account from the outset.
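As a sketch of what proportionate monitoring could look like in practice, you might tier deployed systems by the impact of an incorrect output and schedule statistical accuracy reviews accordingly. The tiers and intervals below are illustrative assumptions, not recommended values:

```python
# Illustrative only: review intervals scaled to the impact of errors.
from datetime import date, timedelta

REVIEW_INTERVALS = {
    "high": timedelta(days=7),     # eg outputs informing decisions about people
    "medium": timedelta(days=30),
    "low": timedelta(days=90),     # eg low-stakes content ordering
}

def next_accuracy_review(impact_tier: str, last_review: date) -> date:
    """Schedule the next statistical accuracy review for a deployed system."""
    return last_review + REVIEW_INTERVALS[impact_tier]

print(next_accuracy_review("high", date(2023, 3, 15)))  # 2023-03-22
```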
Statistical accuracy is also an important consideration if you outsource the development of an AI system to a third party (either fully or partially) or purchase an AI solution from an external vendor. In these cases, you should examine and test any claims made by third parties as part of the procurement process.
Similarly, you should agree regular updates and reviews of statistical accuracy to guard against changing population data and concept/model drift. If you are a provider of AI services, you should ensure that they are designed in such a way as to allow organisations to fulfil their data protection obligations.
Finally, the vast quantity of personal data you may hold and process as part of your AI systems is likely to put pressure on any pre-existing non-AI processes you use to identify and, if necessary, rectify or delete inaccurate personal data, whether it is used as input data or training/test data. Therefore, you need to review your data governance practices and systems to ensure they remain fit for purpose.