New Approach Effectively Protects Sensitive AI Training Data
Data privacy comes at a cost. Security techniques that keep sensitive user data, such as customer addresses, from being extracted from AI models by attackers often reduce those models' accuracy.

Figure 1. Innovative Method Safeguards Sensitive AI Training Data Efficiently.
MIT researchers recently developed a framework based on a new privacy metric, PAC Privacy, which helps maintain an AI model's performance while ensuring sensitive data, such as medical images or financial records, remain safe from potential attackers. Now, they have made the technique more computationally efficient, improved the balance between accuracy and privacy, and created a formal template that can be applied to privatize virtually any algorithm, even without access to its inner workings (Figure 1).
The team applied this updated version of PAC Privacy to privatize several classic algorithms used in data analysis and machine-learning tasks.
They also found that more "stable" algorithms are easier to privatize with this method. A stable algorithm's predictions remain consistent even when its training data is slightly altered. This greater stability helps the algorithm make more accurate predictions on previously unseen data.
According to the researchers, the new PAC Privacy framework's increased efficiency and its four-step template for implementation make it more practical for real-world applications.
“We often see robustness and privacy as unrelated or even conflicting with creating high-performance algorithms. First, we build a working algorithm, then we make it robust, and finally, we add privacy. Our work shows that this is not always the right approach. If you make your algorithm perform better in various settings, you can essentially achieve privacy for free,” says Mayuri Sridhar, an MIT graduate student and lead author of a paper on this privacy framework.
She is joined in the paper by Hanshen Xiao, PhD ’24, who will start as an assistant professor at Purdue University in the fall, and senior author Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering at MIT. The research will be presented at the IEEE Symposium on Security and Privacy.
Estimating Noise
To protect sensitive data used in training an AI model, engineers often add noise—randomness—to make it harder for adversaries to reverse-engineer the original training data. However, this noise can degrade a model's accuracy, so minimizing the noise is desirable.
PAC Privacy automatically estimates the smallest amount of noise required to achieve a specified level of privacy.
The original PAC Privacy algorithm runs an AI model multiple times on different dataset samples, measuring variance and correlations among these outputs to estimate how much noise to add.
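To make this concrete, here is a rough Python sketch of that estimation loop. The function name train_and_output, the subsampling scheme, and the parameter values are illustrative assumptions for this article, not the paper's actual implementation.

import numpy as np

def estimate_output_covariance(train_and_output, dataset, n_trials=500,
                               subsample_frac=0.5, rng=None):
    # Run a black-box algorithm on many random subsamples of the data and
    # estimate the covariance of its vector-valued outputs. The dataset is
    # assumed to be a NumPy array of examples; all names here are
    # illustrative, not the published PAC Privacy code.
    rng = np.random.default_rng() if rng is None else rng
    n = len(dataset)
    outputs = []
    for _ in range(n_trials):
        # Draw a random subsample and run the algorithm on it.
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        outputs.append(train_and_output(dataset[idx]))
    outputs = np.stack(outputs)            # shape: (n_trials, output_dim)
    return np.cov(outputs, rowvar=False)   # full output covariance matrix

How much these outputs move when the underlying data changes is what determines how much noise has to be added.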
The new version of PAC Privacy operates in the same manner but eliminates the need to represent the entire data correlation matrix. It only needs the output variances, making it much faster and better suited for larger datasets.
“Because the thing you are estimating is much smaller than the entire covariance matrix, you can do it much faster,” Sridhar explains, enabling scalability for large datasets.
Adding noise can reduce the utility of the results, so minimizing this loss is crucial. The original PAC Privacy algorithm was limited to adding isotropic noise (uniform in all directions). The new version, however, estimates anisotropic noise tailored to the specific characteristics of the training data, allowing for less noise while achieving the same level of privacy and improving the accuracy of the privatized algorithm.
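The difference can be pictured with a short sketch: the first function adds one uniform noise level in every direction, while the second scales the noise per output coordinate using only the estimated variances. The scale parameter stands in for the privacy-dependent calibration constant, which the actual framework derives formally; everything here is an illustrative assumption rather than the paper's code, and outputs are assumed to be 1-D vectors.

import numpy as np

def add_isotropic_noise(output, trial_outputs, scale=1.0, rng=None):
    # One noise level in all directions, set from the largest observed
    # output variance (an illustrative stand-in for the original approach).
    rng = np.random.default_rng() if rng is None else rng
    sigma = scale * np.sqrt(trial_outputs.var(axis=0).max())
    return output + rng.normal(0.0, sigma, size=output.shape)

def add_anisotropic_noise(output, trial_outputs, scale=1.0, rng=None):
    # Per-coordinate noise using only the output variances, so directions
    # in which the output barely varies receive far less noise.
    rng = np.random.default_rng() if rng is None else rng
    sigmas = scale * np.sqrt(trial_outputs.var(axis=0))
    return output + rng.normal(0.0, sigmas, size=output.shape)

In the anisotropic case, output coordinates that are already stable are perturbed only slightly, which is where the accuracy gain comes from.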
Privacy and Stability
Sridhar hypothesized that more stable algorithms would be easier to privatize using PAC Privacy. She tested this hypothesis with the more efficient version of PAC Privacy on several classical algorithms.
Stable algorithms exhibit less variance in their outputs when their training data is slightly modified. PAC Privacy divides the dataset into chunks, runs the algorithm on each chunk, and measures the variance among outputs. The greater the variance, the more noise is required to privatize the algorithm.
By employing stability techniques to reduce the variance in an algorithm’s outputs, one can reduce the amount of noise needed to privatize it, Sridhar explains.
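A small, self-contained experiment illustrates the point. The clipped mean below is a generic stabilization chosen only for this example, not a technique taken from the paper, and the data are synthetic.

import numpy as np

def output_spread(algorithm, dataset, n_chunks=10):
    # Split the data into disjoint chunks, run the algorithm on each chunk,
    # and report the per-coordinate variance of the outputs: the quantity
    # that, per the article, determines how much noise privatization needs.
    chunks = np.array_split(dataset, n_chunks)
    outputs = np.stack([np.atleast_1d(algorithm(chunk)) for chunk in chunks])
    return outputs.var(axis=0)

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)
data[::100] += rng.normal(0.0, 200.0, size=10)          # inject a few extreme outliers

plain_mean = lambda x: x.mean()                         # unstable under outliers
clipped_mean = lambda x: np.clip(x, 0.0, 10.0).mean()   # a simple stabilization

print(output_spread(plain_mean, data))     # large spread: would need heavy noise
print(output_spread(clipped_mean, data))   # small spread: needs much less noise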
The team showed that the privacy guarantees remained strong across different tested algorithms, and the new PAC Privacy version required far fewer trials to estimate the noise. They also tested the method in attack simulations, demonstrating that its privacy guarantees withstood state-of-the-art attacks.
“We want to explore how algorithms could be co-designed with PAC Privacy from the start to ensure they are more stable, secure, and robust,” says Devadas. The researchers also plan to test the method with more complex algorithms and further explore the privacy-utility tradeoff.
"The question now is: When do these win-win situations occur, and how can we make them happen more often?" Sridhar concludes.
Source: MIT NEWS
Cite this article:
Priyadharshini S (2025), New Approach Effectively Protects Sensitive AI Training Data, AnaTechMaz, pp.123