Surge in Low-Quality Papers Exploiting Public Data Sets and AI

Priyadharshini S May 15, 2025 | 11:50 AM Technology

Last year, Matt Spick, an associate editor at Scientific Reports, noticed an unusual influx of nearly identical papers submitted for peer review. All these papers used data from a publicly accessible U.S. source: the National Health and Nutrition Examination Survey (NHANES), which compiles health and dietary information from over 130,000 individuals through exams, blood tests, and interviews. “I was receiving almost identical submissions daily, sometimes two per day,” says Spick, a statistician at the University of Surrey.

Figure 1. Rise of Low-Quality Research Flooding Public Data Sets with AI Assistance.

What Spick observed at his journal is part of a broader issue. Recently, there has been a sharp increase in low-quality papers leveraging NHANES data, likely driven by illicit paper mills and aided by AI-generated text, according to a study he co-authored in PLOS Biology. This trend highlights how vulnerable large public health data sets are to exploitation. Figure 1 illustrates the rise of low-quality research drawing on public data sets with AI assistance.

These open data resources enable almost anyone to apply established research methods with slight variable changes to produce new, but often meaningless, “findings,” creating what Reese Richardson, a metascientist at Northwestern University, describes as “research Mad Libs.” Similar surges in low-quality publications have been identified across various fields, including genetic studies and analyses of gender disparities or scientific productivity metrics.

The NHANES papers that Spick received all followed a similar pattern: each selected a health condition, an environmental or physiological factor potentially linked to it, and a specific population group. Examples included exploring the relationship between vitamin D levels and depression in men over 65, or between poor dental health and diabetes in women aged 18 to 45. “It seemed like every possible combination was being tested by someone,” Spick explains.

To assess how widespread these types of studies were, Spick and his team searched two major scientific databases, PubMed and Scopus, for NHANES-based papers analyzing single associations. They identified 341 papers published across 147 journals, including Scientific Reports, BMC Public Health, and BMJ Open. From 2014 to 2021, only about four such papers appeared annually on average. From 2022 onward, however, the numbers climbed sharply, with 190 papers published in 2024 alone as of October. This growth significantly outpaced the general rise in health studies using large datasets, indicating that an additional factor was driving the spike in NHANES research.

The timing coincides with the rise of AI chatbots like ChatGPT, which can generate coherent text from simple prompts and input data. Jennifer Byrne, a molecular biologist at the University of Sydney who peer-reviewed the PLOS Biology study, suggests these tools may have been used to endlessly rephrase the same NHANES results to evade plagiarism detection. While it’s impossible to confirm that paper mills, commercial operations that sell authorship on fraudulent or low-quality papers, were responsible, Byrne notes that the scale and timing strongly imply some level of coordinated activity behind the surge.

Many of the recent NHANES studies analyzed only part of the data without clear justification, for instance by focusing on certain years or age groups. This suggests the authors were searching for statistically significant results to produce easy publications, Spick explains. However, fishing through such a large dataset inevitably yields many false positives: at the conventional 5% significance threshold, about one in 20 tests of a nonexistent effect will look significant by chance alone. When his team examined 28 NHANES studies on depression, the findings of only 13 remained statistically significant after adjusting for the risk of false positives.
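The article does not say which adjustment Spick's team applied. As an illustration only, the sketch below uses the Benjamini-Hochberg procedure, one common way to control the false discovery rate, as an assumed stand-in. The function name and the p-values are hypothetical and invented for this example; they are not taken from any of the studies.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where a finding survives FDR control."""
    m = len(p_values)
    # Sort indices by ascending p-value, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Every finding ranked at or below that k is kept.
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            keep[idx] = True
    return keep


# Hypothetical p-values, invented for illustration (not from any study).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
survivors = benjamini_hochberg(example_p)
print(sum(p < 0.05 for p in example_p), "nominally significant")
print(sum(survivors), "significant after correction")
```

In this made-up example, five results fall below 0.05 when each is judged in isolation, but only two survive once the number of tests is taken into account.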

Spick and colleagues believe their analysis likely underestimates the scale of the issue. Their search targeted only NHANES papers fitting the specific pattern Spick had noticed, yet a broader search shows that papers using NHANES data rose from 4,926 in 2023 to 7,876 in 2024, an increase of roughly 60%. Other major health datasets, like the Global Burden of Disease study, may be similarly vulnerable, Spick notes. These datasets can be accessed programmatically with languages such as Python or R, which makes them easy for researchers to work with but also easy to exploit. His team was able to write code that pulled all NHANES data and systematically tested combinations of diseases and health variables. This “industrialization” of low-quality research floods the literature with meaningless findings. “Honestly, I got really hopping mad about it,” Spick admits.
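To see why exhaustive pairing produces "findings" from nothing, consider the minimal simulation below. It is not the team's code and does not touch NHANES: it assumes purely random, unrelated columns standing in for exposures and outcomes, tests every pairing, and counts how many cross the p < 0.05 line. All names and sizes are hypothetical, and the p-value relies on the Fisher z-transformation, which is an approximation.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def two_sided_p(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_chance_findings(n_people=200, n_exposures=20, n_outcomes=20,
                          alpha=0.05, seed=0):
    """Test every exposure/outcome pairing on pure noise; count 'hits'."""
    rng = random.Random(seed)

    def make_column():
        # Synthetic, independent noise: no real association exists.
        return [rng.gauss(0, 1) for _ in range(n_people)]

    exposures = [make_column() for _ in range(n_exposures)]
    outcomes = [make_column() for _ in range(n_outcomes)]
    hits = 0
    for exposure in exposures:
        for outcome in outcomes:
            if two_sided_p(pearson_r(exposure, outcome), n_people) < alpha:
                hits += 1
    return hits, n_exposures * n_outcomes


hits, tests = count_chance_findings()
print(f"{hits} of {tests} pure-noise pairings look significant at p < 0.05")
```

With 400 pairings and a 5% threshold, about 20 spurious associations are expected on average, even though every column is noise. Restricting each test to a subgroup or a range of survey years multiplies the number of pairings, and with it the number of chance results.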

Reese Richardson points out that these papers highlight deeper issues in scientific publishing and research incentives. “All of the publishers named accepted fees, likely around $1,000 each, to publish this junk,” he says, referring to the open-access journals that published the papers, which charge author fees to make articles freely available. Richardson adds that researchers are often rewarded for the quantity rather than the quality of their publications, which drives the problem. Without a radical restructuring of incentives in scientific publishing, he warns, the situation is only going to get worse.

Reference:

  1. https://www.science.org/content/article/low-quality-papers-are-surging-exploiting-public-data-sets-and-ai

Cite this article:

Priyadharshini S (2025), Surge in Low-Quality Papers Exploiting Public Data Sets And AI, AnaTechMaz, pp.125
