AnaTech Maz Technology Magazine

Machine Learning Method Reduces Fraud Detection Costs by Generating Accurate Labels from Imbalanced Datasets

Janani R | April 16, 2025 | 10:59 AM | Technology

Fraud is prevalent in the United States and is increasingly fueled by technology. For instance, 93% of credit card fraud now stems from remote account access rather than physical theft. In 2023, fraud losses exceeded $10 billion for the first time.

The financial impact is immense: credit card fraud leads to $5 billion in annual losses, affecting 60% of U.S. cardholders, while identity theft caused $16.4 billion in losses in 2021. Medicare fraud costs $60 billion each year, and government losses range from $233 billion to $521 billion annually, with improper payments reaching $2.7 trillion since 2003.

Figure 1. Machine Learning Reduces Fraud Detection Costs with Accurate Labels from Imbalanced Datasets

Machine learning is essential in fraud detection because it helps identify patterns and anomalies in real time. By analyzing large datasets, it can learn what normal behavior looks like and flag significant deviations, such as unusual transactions or unauthorized account access. However, detecting fraud is difficult: fraud cases are rare compared with genuine ones, and the data is often noisy or unlabeled. Figure 1 illustrates how machine learning reduces fraud detection costs by generating accurate labels from imbalanced datasets.

To tackle these challenges, researchers from the College of Engineering and Computer Science at Florida Atlantic University have developed an innovative method for generating binary class labels in highly imbalanced datasets, offering a promising solution for fraud detection in sectors like healthcare and finance. This approach does not rely on labeled data, which is a significant advantage in industries where privacy concerns and the high cost of labeling data are major obstacles.

The team tested their method on two large-scale, real-world datasets with severe class imbalance (less than 0.2% fraud cases): European credit card transactions (over 280,000 from September 2013) and Medicare Part D claims (more than 5 million from 2013 to 2019), both labeled as fraudulent or genuine. Because fraud cases are vastly outnumbered by non-fraud cases, these datasets provide a realistic test bed for evaluating fraud detection techniques.
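To make the scale of that imbalance concrete, the short sketch below works through the arithmetic for a trivial model that labels every transaction as genuine. The counts are rounded, illustrative assumptions based on the figures quoted above ("over 280,000" transactions, "less than 0.2%" fraud), not the exact dataset statistics.

```python
# Illustrative counts only: 280,000 transactions and a 0.2% fraud rate are
# rounded assumptions taken from the figures quoted in the article.
total = 280_000
fraud = total * 2 // 1000        # 0.2% of the total, computed with integers
genuine = total - fraud

# A trivial model that labels every transaction as genuine.
true_positives = 0               # no fraud case is ever flagged
false_negatives = fraud          # every fraud case is missed
true_negatives = genuine         # every genuine case is labeled correctly

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"fraud cases: {fraud}")
print(f"accuracy:    {accuracy:.3%}")
print(f"recall:      {recall:.0%}")
```

Under these assumptions the trivial model is about 99.8% accurate while catching no fraud at all, which is why plain accuracy is a poor yardstick on data this imbalanced.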

The results of the study, published in the Journal of Big Data, demonstrate that this new labeling method effectively tackles the challenge of labeling highly imbalanced data within an unsupervised framework. Unlike traditional methods, this approach directly evaluates the newly generated fraud and non-fraud labels without relying on a supervised classifier.

"Machine learning in fraud detection offers numerous advantages," said Taghi Khoshgoftaar, Ph.D., senior author and Motorola Professor in the FAU Department of Electrical Engineering and Computer Science. "Machine learning algorithms can label data much faster than human annotators, significantly boosting efficiency. Our method marks a major breakthrough in fraud detection, particularly in highly imbalanced datasets.

"It reduces the workload by minimizing the need for further inspection, which is critical in fields like Medicare and credit card fraud, where rapid data processing is essential to prevent financial losses and improve operational efficiency."

The study shows that the new method outperformed the widely used Isolation Forest algorithm, offering a more efficient way to detect fraud while reducing the need for further investigation. This demonstrates the method's ability to generate reliable binary class labels for fraud detection, even in challenging datasets. It provides a scalable alternative to manually labeled data, which is expensive and time-consuming to produce, especially for large datasets.
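For readers unfamiliar with the baseline, the following is a minimal sketch of how Isolation Forest can produce binary fraud labels without any labeled training data, using scikit-learn. The synthetic data, the 0.002 contamination value, and the variable names are assumptions chosen for illustration; this is not the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the study's datasets):
# 5,000 "genuine" points near the origin and 10 "fraud" points far away.
genuine = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))
fraud = rng.normal(loc=6.0, scale=0.5, size=(10, 2))
X = np.vstack([genuine, fraud])

# contamination is the expected share of anomalies; 0.002 mirrors the
# "less than 0.2%" imbalance quoted in the article.
model = IsolationForest(n_estimators=100, contamination=0.002, random_state=0)
model.fit(X)

# predict() returns -1 for anomalies and 1 for inliers.
# Map that to 1 = fraud, 0 = genuine.
labels = (model.predict(X) == -1).astype(int)

# score_samples() is lower for more anomalous points.
scores = model.score_samples(X)

print("instances flagged as fraud:", int(labels.sum()))
```

The contamination parameter fixes roughly how many instances are flagged, so the quality of the resulting labels depends heavily on how well that share is estimated.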

"Our method generates labels for both fraud (positive) and non-fraud (negative) instances, which are then refined to minimize the number of fraud labels," explained Mary Anne Walauskis, first author and Ph.D. candidate in the FAU Department of Electrical Engineering and Computer Science. "By applying our method, we minimize false positives, or in other words, genuine instances incorrectly marked as fraud, which is crucial for improving fraud detection.

"This approach ensures that only the most confidently identified fraud cases are retained, improving accuracy and reducing unnecessary alarms, making fraud detection more efficient."
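The effect of retaining only the most confident fraud labels can be illustrated with a toy example. The scores and ground-truth flags below are hypothetical values made up for illustration, and `label_top_k` is a helper introduced here, not part of the researchers' method.

```python
# Hypothetical anomaly scores (higher = more suspicious) paired with the
# true class; made-up values, not data from the study.
scored = [
    (0.98, True), (0.95, True), (0.91, False), (0.90, True),
    (0.72, False), (0.65, False), (0.60, True), (0.41, False),
    (0.30, False), (0.12, False),
]

def label_top_k(scored, k):
    """Label the k highest-scoring instances as fraud.

    Returns (false_positives, precision) for that choice of k.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:k]
    true_pos = sum(1 for _, is_fraud in flagged if is_fraud)
    false_pos = k - true_pos
    return false_pos, true_pos / k

for k in (6, 4, 2):
    fp, precision = label_top_k(scored, k)
    print(f"top {k}: false positives = {fp}, precision = {precision:.2f}")
```

In this toy data, shrinking the set of fraud labels from six to two removes all false positives, at the price of leaving some true fraud cases unflagged. Choosing how many positives to keep is therefore a trade-off between false alarms and missed fraud.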

The method combines two strategies: an ensemble of three unsupervised learning techniques built with the scikit-learn library, and a percentile-gradient approach. The goal is to minimize false positives by focusing on the most confidently identified fraud cases. This is achieved by refining the labels and reducing errors in both the ensemble of unsupervised methods (EUM) and the percentile-gradient method (PGM).
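The article does not name the three unsupervised techniques or describe how their outputs are combined, so the sketch below only illustrates the general idea of an ensemble that favors confident labels. It assumes a unanimous-vote rule, and `consensus_fraud_labels` and `top_fraction` are hypothetical names; it makes no attempt to reproduce the authors' EUM or PGM.

```python
import numpy as np

def consensus_fraud_labels(score_matrix, top_fraction):
    """Flag an instance only if every detector ranks it in its own top fraction.

    score_matrix: shape (n_detectors, n_instances), higher = more anomalous.
    top_fraction: share of instances each detector may nominate
                  (a hypothetical tuning knob, not from the article).
    """
    scores = np.asarray(score_matrix, dtype=float)
    # Per-detector cutoff at the (1 - top_fraction) quantile of its own scores.
    cutoffs = np.quantile(scores, 1.0 - top_fraction, axis=1, keepdims=True)
    votes = scores >= cutoffs                 # boolean vote per detector
    return votes.all(axis=0).astype(int)      # 1 = fraud only if all agree

# Toy scores from three hypothetical detectors over eight instances.
scores = [
    [0.1, 0.2, 0.9, 0.3, 0.8, 0.2, 0.1, 0.4],
    [0.2, 0.1, 0.8, 0.7, 0.9, 0.3, 0.2, 0.1],
    [0.3, 0.2, 0.7, 0.1, 0.9, 0.8, 0.2, 0.3],
]
labels = consensus_fraud_labels(scores, top_fraction=0.25)
print(labels)
```

With these toy scores, only the instance at index 4 is nominated by all three detectors, so it is the only one labeled as fraud. Requiring agreement reduces the number of positive labels, which is the same direction the article describes: fewer but more confident fraud labels.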

This refinement yields a subset of high-confidence labels. These labels are then used to generate confidence intervals and finalize the labeling, requiring minimal domain knowledge to determine the number of positive instances.

"This innovative approach holds significant potential for industries affected by fraud, providing a more accessible and effective way to detect fraudulent activity and protect both financial and healthcare systems," said Stella Batalama, Ph.D., dean of the College of Engineering and Computer Science.

"Fraud's impact extends beyond financial losses, causing emotional distress, reputational harm, and diminished trust in organizations. Healthcare fraud, in particular, erodes care quality and increases costs, while identity theft can lead to significant stress. Tackling fraud is essential to reducing its widespread societal consequences."

Looking ahead, the research team plans to further enhance the method by automating the determination of the optimal number of positive instances, thereby improving efficiency and scalability for large-scale applications.

The current journal article, "Unsupervised Label Generation for Severely Imbalanced Fraud Data," is an updated version of the researchers' earlier work, "Confident Labels: A Novel Approach to New Class Labeling and Evaluation on Highly Imbalanced Data."

The original paper was presented and published at the IEEE 36th International Conference on Tools with Artificial Intelligence (ICTAI) in November 2024, where it won the Best Student Paper Award. ICTAI, known for its prestige, has an acceptance rate of about 25% from over 400 submissions.

References:

https://www.fau.edu/newsdesk/articles/machine-learning-fraud-detection
https://techxplore.com/news/2025-04-machine-method-fraud-generating-accurate.html

Cite this article:

Janani R (2025), Machine Learning Method Reduces Fraud Detection Costs by Generating Accurate Labels from Imbalanced Datasets, AnaTechMaz, pp.131
