Larger Datasets Don’t Always Mean Better Results

Priyadharshini S | November 29, 2025, 3:15 PM | Technology

Planning the least expensive route for a new subway line beneath a city like New York is an immense challenge, with thousands of potential paths across hundreds of city blocks, each carrying uncertain construction costs. Traditionally, planners would need extensive field studies at many locations to estimate these costs—a process that is both time-consuming and expensive.

Figure 1. Bigger Datasets Aren’t Always Better.

To minimize these costs while still gathering the most useful information, city planners need a way to determine where to start. MIT researchers have developed a new algorithmic method that could help. Their mathematical framework identifies the smallest dataset that guarantees finding the optimal solution, often requiring far fewer measurements than conventional approaches (Figure 1).

For the subway example, the method takes into account the problem’s structure—city block networks, construction constraints, and budget limits—along with uncertainties in costs. The algorithm then identifies the minimum set of locations where field studies are needed to ensure the least expensive route is found, and shows how this strategically collected data can be used to make the optimal decision.
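
To make this concrete, here is a hypothetical toy instance in Python. Every detail in it, the route names, segments, and dollar figures, is our own illustration rather than the researchers' code: three candidate routes built from segments, two of which have known costs while the other two are only bounded until a field study is done.

```python
# Hypothetical toy instance (our illustration, not the researchers' code):
# three candidate routes built from segments; two segment costs are known,
# two are only bounded until a field study pins them down.

routes = {
    "A": ["s1", "s2"],                  # route A uses segments s1 and s2
    "B": ["s1", "s3"],
    "C": ["s4"],
}
known_costs = {"s1": 40.0, "s4": 95.0}            # $M, from past projects
uncertain_bounds = {"s2": (30.0, 80.0),           # $M, plausible range
                    "s3": (20.0, 90.0)}           # before any field study

def route_cost(route, costs):
    """Total cost of a route under a full assignment of segment costs."""
    return sum(costs[s] for s in routes[route])
```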

This framework applies to a wide range of structured decision-making problems under uncertainty, from supply chain management to electricity network optimization.

“Data are one of the most important aspects of the AI economy. Models are trained on more and more data, consuming enormous computational resources. But most real-world problems have structure that can be exploited. We’ve shown that with careful selection, you can guarantee optimal solutions with a small dataset, and we provide a method to identify exactly which data you need,” says Asu Ozdaglar, MathWorks Professor and head of MIT’s Department of Electrical Engineering and Computer Science (EECS), deputy dean of the Schwarzman College of Computing, and principal investigator in the Laboratory for Information and Decision Systems (LIDS).

Ozdaglar, co-senior author of the study, is joined by co-lead authors Omar Bennouna, an EECS graduate student, and his brother Amine Bennouna, a former MIT postdoc now an assistant professor at Northwestern University, as well as co-senior author Saurabh Amin, co-director of MIT’s Operations Research Center and professor in the Department of Civil and Environmental Engineering. The research will be presented at the Conference on Neural Information Processing Systems.

An Optimality Guarantee

The team began by developing a precise geometric and mathematical definition of a “sufficient” dataset. In many decision problems, such as travel planning, construction, or energy distribution, each possible set of costs determines one optimal choice. These “optimality regions” partition the space of possible costs: within a region, the same decision stays optimal. A dataset is sufficient if it can pinpoint which region contains the true costs.
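
As a toy illustration of those regions, continuing the hypothetical instance above (our numbers, not the paper's): once s1 and s4 are known, the optimal route depends only on the uncertain pair (s2, s3), and each region of that plane corresponds to one winning route.

```python
# Toy illustration of optimality regions (our numbers, not the paper's).
# With s1 = 40 and s4 = 95 known, the best route depends only on the
# uncertain pair (s2, s3); each region of that plane has one winner.

routes = {"A": ["s1", "s2"], "B": ["s1", "s3"], "C": ["s4"]}

def optimal_route(s2, s3):
    costs = {"s1": 40.0, "s4": 95.0, "s2": s2, "s3": s3}
    totals = {r: sum(costs[s] for s in segs) for r, segs in routes.items()}
    return min(totals, key=totals.get)

print(optimal_route(30, 60))   # region where A wins -> 'A'
print(optimal_route(70, 25))   # region where B wins -> 'B'
print(optimal_route(70, 65))   # region where C wins -> 'C'
```

A dataset is sufficient exactly when it pins down which of these regions contains the true costs, even if it never measures every segment.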

This theoretical foundation led to a practical algorithm that identifies datasets guaranteeing the optimal solution. Surprisingly, the researchers found that a small, carefully chosen dataset is often enough.

Capturing the Right Data

To use the tool, users input the task’s structure—its objective, constraints, and known information. For example, a supply chain manager might aim to minimize costs across multiple shipment routes. Some routes’ costs may already be known, while others remain uncertain.

The algorithm works iteratively: it asks, “Is there any scenario that could change the optimal decision in a way my current data can’t detect?” If yes, it adds a measurement that captures that scenario. If no, the dataset is provably sufficient.
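
Below is a minimal sketch of that loop on the toy instance, under two simplifications we introduce for illustration: uncertainty is modeled as an interval per segment, and a field study reveals the exact cost. The greedy measure-the-widest-interval rule is our placeholder, not the authors' selection procedure.

```python
# A minimal sketch of the iterative sufficiency check, under simplifying
# assumptions of ours (interval uncertainty, exact field measurements).
# This illustrates the idea; it is not the authors' algorithm.

routes = {"A": ["s1", "s2"], "B": ["s1", "s3"], "C": ["s4"]}
known = {"s1": 40.0, "s4": 95.0}                    # costs already on file
bounds = {"s2": (30.0, 80.0), "s3": (20.0, 90.0)}   # pre-study ranges
true_costs = {"s2": 55.0, "s3": 25.0}               # revealed only by a study

def cost_range(route):
    """Lowest and highest possible total cost given current knowledge."""
    lo = hi = 0.0
    for s in routes[route]:
        lo += known[s] if s in known else bounds[s][0]
        hi += known[s] if s in known else bounds[s][1]
    return lo, hi

def provably_optimal():
    """A route whose worst case beats every rival's best case, if one exists."""
    for r in routes:
        worst = cost_range(r)[1]
        if all(worst <= cost_range(q)[0] for q in routes if q != r):
            return r
    return None

while provably_optimal() is None:
    # Some scenario consistent with current data could still flip the
    # decision, so add a measurement: here, greedily the widest interval.
    s = max(bounds, key=lambda k: bounds[k][1] - bounds[k][0])
    known[s] = true_costs[s]                        # field study reveals the cost
    del bounds[s]

print("measured:", [s for s in known if s in ("s2", "s3")])
print("optimal route:", provably_optimal())
```

On these numbers the loop stops after a single field study (of s3), certifying route B without ever surveying s2.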

This process identifies exactly which locations or parameters need to be measured to guarantee the minimum-cost solution. Once collected, another algorithm computes the optimal decision—such as which shipment routes to use.
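
In the toy sketch above, the closing provably_optimal() call plays that second role: once the one measured cost is on file, it returns the route that is cheapest under every cost scenario consistent with the collected data.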

Source: MIT News

Cite this article:

Priyadharshini S (2025), “Larger Datasets Don’t Always Mean Better Results”, AnaTechMaz, pp. 178.
