A Strange Phrase is Appearing in Scientific Papers—We Tracked It to a Glitch in AI Training Data
Earlier this year, researchers came across an unusual term in published papers: "vegetative electron microscopy."
This seemingly technical but meaningless phrase has become a “digital fossil”—an error embedded and perpetuated within artificial intelligence (AI) systems, making it incredibly difficult to eliminate from our collective knowledge sources.
Much like biological fossils encased in stone, these digital remnants risk becoming lasting fixtures in the landscape of our information ecosystem.

Figure 1. How a strange phrase spread into scientific papers via a glitch in AI training data.
The "vegetative electron microscopy" case offers a concerning example of how AI systems can sustain and magnify errors within our shared body of knowledge.
A Faulty Scan and a Translation Mistake
"Vegetative electron microscopy" appears to have emerged from an improbable chain of unrelated errors.
Initially, two papers from the 1950s, originally published in the journal Bacteriological Reviews, were scanned and digitized [1]. During the digitization process, however, an error occurred where "vegetative" from one text column was mistakenly merged with "electron" from another, ultimately giving rise to this spurious term.
Decades later, "vegetative electron microscopy" resurfaced in a handful of Iranian scientific papers from 2017 and 2019, appearing in their English captions and abstracts.
This likely resulted from a translation error, as the Farsi words for "vegetative" and "scanning" differ by just a single dot—making the mistake both subtle and easy to overlook.
A Growing Error
As of this writing, Google Scholar shows "vegetative electron microscopy" mentioned in 22 papers. One of these faced a disputed retraction from a Springer Nature journal, while Elsevier issued a correction for another.
Furthermore, the term is featured in news articles related to investigations of publication integrity. Its frequency notably increased in the 2020s. Uncovering the reason required us to examine modern AI models and perform some digital archaeology through the extensive datasets they were trained on.
Empirical Proof of AI-Driven Data Contamination
Modern AI chatbots like ChatGPT are powered by large language models trained on vast amounts of text to predict the next word in a sequence. The specific contents of a model's training data are typically kept confidential.
To determine whether a model "knew" about vegetative electron microscopy, we tested it by providing excerpts from the original papers to see if the model would complete the sentences with the erroneous term or more logical alternatives.
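The gist of this test can be sketched in a few lines. Below is a minimal sketch, assuming the Hugging Face transformers and PyTorch libraries and using the openly available GPT-2 model (one of the models we compared); the prompt is an illustrative stand-in, not an actual excerpt from the 1950s papers. A higher total log-probability for a continuation means the model considers it more likely.

```python
# A minimal version of the completion test. The prompt below is an
# illustrative stand-in, not one of the actual paper excerpts.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation`
    when it follows `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total = 0.0
    offset = prompt_ids.shape[1]
    for i in range(cont_ids.shape[1]):
        # Logits at position p predict the token at position p + 1.
        total += log_probs[0, offset + i - 1, cont_ids[0, i]].item()
    return total

prompt = "The bacterial samples were examined by"  # illustrative prompt
for phrase in (" scanning electron microscopy", " vegetative electron microscopy"):
    print(phrase.strip(), continuation_logprob(prompt, phrase))
```

Applied across model generations, a jump in the relative likelihood of the bogus phrase points to when the contamination entered the training data.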
The results were telling. OpenAI's GPT-3 consistently completed phrases with "vegetative electron microscopy," while earlier models like GPT-2 and BERT did not. This pattern helped pinpoint when and where the contamination began.
We also found that the error persists in later models, including GPT-4 and Anthropic's Claude 3.5, indicating that the term may now be permanently ingrained in AI knowledge repositories.
By analyzing the training datasets of various models, we traced the term's origins to CommonCrawl, a dataset of scraped internet pages that appears to be the most likely source from which AI models first encountered "vegetative electron microscopy."
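For a sense of what this digital archaeology involves, here is a rough sketch of scanning a single Common Crawl WET (extracted plain-text) file for the phrase, assuming the requests and warcio libraries. The file path is a hypothetical placeholder, and a real search would have to repeat this over the many thousands of files listed in each crawl's manifest.

```python
# Sketch: stream one Common Crawl WET file and report which captured
# pages contain the phrase. The WET path is a hypothetical placeholder;
# real paths are listed in each crawl's wet.paths.gz manifest.
import requests
from warcio.archiveiterator import ArchiveIterator

PHRASE = b"vegetative electron microscopy"
WET_URL = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/"
           "segments/EXAMPLE/wet/EXAMPLE.warc.wet.gz")  # placeholder path

resp = requests.get(WET_URL, stream=True, timeout=120)
resp.raise_for_status()

# WET files hold one "conversion" record of plain text per captured page.
for record in ArchiveIterator(resp.raw):
    if record.rec_type != "conversion":
        continue
    text = record.content_stream().read()
    if PHRASE in text.lower():
        print(record.rec_headers.get_header("WARC-Target-URI"))
```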
The Issue of Scale
Identifying errors like this is challenging, and correcting them may be nearly impossible.
One reason is scale. The CommonCrawl dataset, for instance, spans millions of gigabytes. The computational resources needed to handle data at this scale are often beyond the reach of researchers outside major tech companies.
Another obstacle is the lack of transparency in commercial AI models. OpenAI and many other developers withhold detailed information about the training data used for their models. Efforts to reverse-engineer these datasets have been hindered by copyright takedowns.
When errors are identified, fixing them is far from straightforward. While simple keyword filtering might remove problematic terms like "vegetative electron microscopy," it would also delete legitimate references (such as this article).
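To make the trade-off concrete, here is a toy sketch of such a filter (both example documents are invented for illustration): it drops contaminated text and legitimate discussion of the error alike.

```python
DOCS = [
    "Samples were imaged by vegetative electron microscopy at 10 kV.",
    'The bogus term "vegetative electron microscopy" arose from a scanning error.',
]

BLOCKLIST = {"vegetative electron microscopy"}

def keep(doc: str) -> bool:
    # Drop any document containing a blocklisted phrase, regardless of context.
    return not any(term in doc.lower() for term in BLOCKLIST)

kept = [d for d in DOCS if keep(d)]
print(kept)  # [] -- both documents are dropped, including the legitimate one
```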
More concerningly, this case prompts a troubling question: How many other nonsensical terms lurk within AI systems, waiting to be uncovered?
Implications for Science and Academic Publishing
This "digital fossil" also raises significant concerns about the integrity of knowledge as AI-assisted research and writing become increasingly prevalent.
Publishers have reacted inconsistently when alerted to papers containing "vegetative electron microscopy." Some have retracted the affected papers, while others have defended them. Elsevier, for instance, initially attempted to justify the term’s validity before ultimately issuing a correction.
It remains unclear whether other similar anomalies exist within large language models, but the likelihood is high. Regardless, the use of AI systems has already introduced complications into the peer-review process.
For example, experts have noted the rise of "tortured phrases" designed to bypass automated integrity checks, such as replacing "artificial intelligence" with "counterfeit consciousness." Phrases like "I am an AI language model" have also been found in other retracted papers.
Some automated screening tools, such as Problematic Paper Screener, now flag "vegetative electron microscopy" as a potential indication of AI-generated content. However, these tools can only address known errors and cannot detect undiscovered ones.
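Conceptually, such screening amounts to matching a dictionary of known fingerprint phrases, as in the rough sketch below; the phrase list here is illustrative and is not the Problematic Paper Screener's actual dictionary.

```python
import re

# Illustrative fingerprints only, drawn from the examples above.
FINGERPRINTS = [
    "vegetative electron microscopy",
    "counterfeit consciousness",
    "i am an ai language model",
]
PATTERN = re.compile("|".join(re.escape(p) for p in FINGERPRINTS), re.IGNORECASE)

def flag(text: str) -> list[str]:
    """Return every known fingerprint phrase found in the text."""
    return PATTERN.findall(text)

print(flag("Imaging relied on vegetative electron microscopy."))
# -> ['vegetative electron microscopy']
```

As the text notes, a check like this can only ever catch fingerprints that have already been discovered.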
Living with Digital Artifacts
The rise of AI creates opportunities for errors to become permanently ingrained in our knowledge systems, through processes beyond the control of any single entity [2]. This poses challenges for tech companies, researchers, and publishers alike.
Tech companies need to be more transparent about their training data and methods. Researchers must develop new strategies for evaluating information in a world where AI-generated misinformation can be highly convincing. Scientific publishers must enhance their peer review processes to detect both human and AI-induced errors.
Digital fossils highlight not only the technical challenge of monitoring vast datasets, but also the deeper issue of preserving reliable knowledge in systems where errors can become self-reinforcing.
References:
[1] The Conversation: https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463
[2] TechXplore: https://techxplore.com/news/2025-04-weird-phrase-plaguing-scientific-papers.html
Cite this article:
Janani R (2025), A Strange Phrase is Appearing in Scientific Papers—We Tracked It to a Glitch in AI Training Data, AnaTechMaz, p. 130