‘Consistently more accurate than humans’ – Cabinet Office algorithm helps sort and delete millions of files


Central department releases transparency info with details of automated tool that is being used by knowledge and information management team to determine whether files need to be deleted or archived

The Cabinet Office has found an algorithm used to analyse, sort – and delete – millions of government documents is “consistently more accurate than humans” in performing such checks.

First developed in the first half of 2022, the Automated Digital Document Review tool has been used by the central department’s Digital Knowledge and Information Management (DKIM) team to “review 5.1 million legacy files to date”, according to newly published transparency information outlining the background and operation of the automated tool. The goal of these reviews was to ascertain which files “need to be retained as the official record and information that should be destroyed because it is redundant, outdated, trivial or ephemeral in nature, or has reached its retention period”.

A further 300,000 files are expected to be analysed over the course of the 2023-24 year and, thereafter, the DKIM team believes the program will be used to review a total of between 30,000 and 80,000 files annually. This data will come from the operations of teams across the Cabinet Office, before the department’s information-management team is tasked with conducting reviews and deleting or permanently storing files, as appropriate.

The algorithm, which is based on technology from specialist supplier Automated Intelligence, is programmed to identify “key words or phrases commonly used by civil servants in documents that are recognised as important records [and], conversely, patterns of language… that are commonly found in redundant, outdated and trivial information which is of little or no value”.

The tool will create a relevancy score for files based on the occurrence of keywords – both overall volume and frequency. The precise lexicon used will be subject to “tuning” for each collection of files to which the algorithm is applied.

“A term that significantly increases the likelihood that a document is valuable in documents created in 2005 does not always have the same result on files created in 2023,” transparency documents said.

The lexicon will be reviewed at least annually.
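
The transparency information does not publish the scoring formula, so the following is only a minimal sketch of how a lexicon-based relevancy score of this kind could work. The terms, weights, threshold and function names are all hypothetical and are not taken from the Cabinet Office tool; the sketch simply combines how often each term appears (volume) with how often it appears per 1,000 words (frequency).

```python
import re

# Hypothetical lexicon, invented for illustration only.
# Positive weights suggest a record worth keeping; negative weights
# suggest redundant, outdated or trivial material.
EXAMPLE_LEXICON = {
    "ministerial submission": 3.0,
    "decision": 1.5,
    "minutes": 1.0,
    "out of office": -2.0,
    "lunch": -1.0,
}


def relevancy_score(text, lexicon):
    """Score a document from term volume and term frequency.

    Assumed formula: weight * (hits + hits per 1,000 words), summed
    over every term in the lexicon.
    """
    # Lower-case the text and strip punctuation so phrases match cleanly.
    normalised = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
    total_words = max(len(normalised.split()), 1)

    score = 0.0
    for term, weight in lexicon.items():
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", normalised))
        per_thousand = 1000 * hits / total_words
        score += weight * (hits + per_thousand)
    return score


def recommend(text, lexicon, threshold=0.0):
    """Return 'retain' if the score clears the (assumed) threshold, else 'delete'."""
    return "retain" if relevancy_score(text, lexicon) > threshold else "delete"
```

In a sketch like this, the “tuning” described in the transparency information would amount to supplying a different lexicon, or different weights, for each collection of files, for example one dictionary for files created in 2005 and another for files created in 2023.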


The automated software is reportedly able to perform checks and determine outcomes not only at a far greater scale than would be possible with human review, but also more accurately, according to the newly released guidance.

“Automation is consistently more accurate than humans at making decisions about the records value of a document; our tests showed that human error was [about] 1% but the automation showed an error rate likely to be [less than] 0.6%,” it said. “We estimate that a human reviewer could reasonably review up to 200,000 documents per annum. With automation we could achieve a review of several million files with no increase in human resource required to accommodate the higher volumes of files to be reviewed.”
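
To put the quoted figures in context, the short calculation below applies them to the 5.1 million files reviewed to date. It is a rough illustration only: the variable names are invented here, the 200,000 figure is an upper estimate for one reviewer, and both error rates are approximate, so the results are indicative rather than official numbers.

```python
# Figures quoted in the transparency information.
FILES_REVIEWED = 5_100_000        # legacy files reviewed to date
HUMAN_RATE_PER_YEAR = 200_000     # upper estimate for one human reviewer
HUMAN_ERROR = 0.01                # "about 1%"
AUTOMATION_ERROR_UPPER = 0.006    # "likely to be less than 0.6%"

# Minimum effort for one person to review the same volume by hand.
reviewer_years = FILES_REVIEWED / HUMAN_RATE_PER_YEAR

# Approximate number of misjudged files at each error rate.
human_misjudged = round(FILES_REVIEWED * HUMAN_ERROR)
automated_misjudged = round(FILES_REVIEWED * AUTOMATION_ERROR_UPPER)

print(f"Reviewer-years needed: {reviewer_years}")
print(f"Files misjudged at 1%: {human_misjudged:,}")
print(f"Files misjudged at 0.6%: {automated_misjudged:,}")
```

On these assumptions, manual review of the same files would take a single reviewer at least 25.5 years, and the gap between the two error rates corresponds to roughly 20,000 files.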

The document added: “The previous method of disposal before the creation of the lexicon model consisted of digital archivists manually reviewing files both at a folder level and individual file level. This method was fairly accurate but extremely slow compared with our now automated solution, and was exclusively carried out using paper documents. Review of digital documents had not previously been attempted at scale.”

Human touch
Once the automated tool has finished its analysis, it creates a “report detailing the recommended files to be deleted”. A DKIM professional will then “review this report, analyse and test the final results” to make sure the algorithm’s rules have been correctly applied.

The ongoing role for humans will also include deciding whether a collection of files is suitable to be analysed by the algorithm, and monitoring its performance to ensure it is meeting the “minimum acceptable level as approved by departmental governance, which is 1% or less of files reviewed are incorrectly identified for destruction”.
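
The acceptance threshold can be expressed as a simple check. The sketch below assumes a human has verified a batch of the tool’s output and counted how many files were wrongly recommended for destruction; the function name and inputs are hypothetical, and the way the department actually samples and tests results is not described in the published information.

```python
def meets_threshold(files_reviewed, wrongly_flagged):
    """Return True if 1% or less of reviewed files were wrongly flagged for destruction."""
    if files_reviewed <= 0:
        raise ValueError("files_reviewed must be a positive number")
    # Integer comparison avoids floating-point rounding at exactly 1%.
    return wrongly_flagged * 100 <= files_reviewed


# Example: 100 wrong recommendations in 10,000 files is exactly 1%, so it passes;
# 101 wrong recommendations does not.
print(meets_threshold(10_000, 100))
print(meets_threshold(10_000, 101))
```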

The detailed information on the automated tool was provided under the publication scheme of the Algorithmic Transparency Recording Standard. The standard, which is overseen by government’s Centre for Data Ethics and Innovation (CDEI), was first unveiled two years ago and is intended to provide a consistent framework through which public bodies can provide information on their use of algorithmic tools and the decision-making contexts in which they are being used.

The Cabinet Office’s report on the use of the digital file-review tool is the seventh transparency report released so far – and the first since 2022.

In a recent interview with PublicTechnology, executive director Felicity Burch discussed her ambitions for CDEI – which is part of the Department for Science, Innovation and Technology – to engage with government bodies over the coming months to help release more information.

“It’s a really important question, and the answer to whether we want more organisations to use this is: yes, absolutely,” she said. “And, indeed, we are working with a number of organisations at the moment to get them ready to publish their records. We have done this in quite an iterative way and are working with our colleagues in the public sector to make this a tool that’s easy and practical for them to use. We want to [continue] that iterative rollout – but I would absolutely like to see this scale up.”

Sam Trendall
