Last year, two Whitehall agencies commissioned the expert team known as the ‘nudge unit’ to lead review exercises comparing the speed and quality of work produced by machines and humans
Government has revealed that two departments last year took part in an exercise to test and compare the performance of humans and generative artificial intelligence tools in producing reports to support policymaking.
A new research document published on GOV.UK indicates that the Behavioural Insights Team (BIT) – formerly part of government itself, and popularly known as the ‘Nudge Unit’ – was commissioned in early 2024 to “investigate the robustness and reliability of using generative AI to help produce rapid evidence reviews”.
These reviews, which are widely used across government to provide headline insights for policymakers and analysts, are intended to “collate and examine the best available academic evidence on a particular topic”.
To test the capabilities of AI in conducting such exercises, BIT – working on behalf of the Departments for Culture, Media and Sport, and Science, Innovation and Technology – ran two parallel reviews, the first of which relied solely on work from humans. The second was “AI-assisted”, with researchers using a variety of tools – including free versions of Elicit and Consensus, and paid-for versions of Claude 2 and ChatGPT 4 – to scan, select, analyse, and synthesise evidence on the topic of the “impact of technology diffusion on UK growth and productivity”.
The generative platforms were provided with a 350-word prompt, with the ultimate instruction that “based on your assessment of the evidence, distil any insights into 3/4 main policymaking implications for the UK government to consider”.
Once the reviews were completed, the outcomes were assessed against two main metrics: speed and quality.
The first headline finding is that the AI-assisted review took 23% less time overall – with work completed in 90.5 hours, compared with 117.75 hours for the human-only exercise.
The biggest disparity was in the analysis phase, where humans took more than twice as long: 34 hours compared with 15.
The AI-based process was also much quicker at scanning and synthesis, taking a cumulative 34.5 hours, while humans working alone required 55.5 hours.
However, humans were more rapid – taking 10 hours, versus 14 for AI – in selecting which studies to use, “having engaged more deeply with the papers during the scanning phase”, the BIT research says.
There was an even bigger advantage for humans in the final phase of making revisions to the report. Revising the AI-produced document required 27.5 hours, while the entirely human-authored version needed just 18.25 hours.
The research adds: “It took longer to produce a draft of the human report, but our DCMS/DSIT partners assessed it as stronger than the AI-assisted first draft. It consequently required less time for revisions.”
On the issue of the quality of the reviews, BIT found that both the human and AI processes identified “credible” sources – albeit with “surprisingly little overlap”.
“There were 16 references [that] only [featured] in the human-only evidence review, 18 only in the AI-assisted evidence review and 4 in both,” the research finds.
The two processes each ultimately delivered three conclusions – two of which “were thematically similar”.
While the AI-led process produced a similar – and speedier – outcome to its human-only counterpart, the BIT study notes that “no single AI tool currently exists that can effectively conduct every step of the literature review process”.
The research also advises that its findings are not necessarily “generalisable [as] the ability of both humans and AI models to review literature will vary substantially, including by topic”.
“However, we think AI has the potential to enhance the process of conducting rapid evidence reviews,” the document says. “It is not yet a game-changer – it still produces occasional, peculiar hallucinations and errors which mean its outputs require manual verification. [But it] is improving quickly, so these issues may soon be reduced. We therefore recommend that more work be undertaken to understand how and when AI can be implemented in evidence reviews.”
BIT was created in 2010 as part of the Cabinet Office but was spun out in 2014 to become an independent consultancy. Since 2021, it has been owned by innovation charity Nesta.