Data preprocessing is essential for improving the performance of machine learning models, especially in streaming environments, where training data is continuously collected and updated. Traditional approaches to filtering noisy training data either evaluate each instance only once, which increases the risk of misclassification, or check each instance against all available classifiers, which incurs high computational cost. In this paper, we propose an ensemble-based adaptive filtering approach for cleaning training data in streaming settings, so that models are trained on high-quality datasets. Our method partitions the incoming training data into n streaming segments, incrementally trains n Random Forest classifiers on these segments, and evaluates each instance using a dynamically selected subset of three classifiers. If all three classifiers agree with the actual class label, the instance is retained in the training dataset. Otherwise, it undergoes a majority voting mechanism to determine its correctness. Instances judged misclassified are filtered out, so that only reliable, high-quality data contributes to model training. This approach balances computational efficiency and accuracy, avoiding unnecessary recomputation while maintaining robust classification. Experimental results on benchmark streaming datasets demonstrate that our method effectively removes noisy instances, leading to cleaner training data and better-performing machine learning models compared to conventional training data preprocessing techniques. We achieve significant gains in computational performance without compromising much on accuracy.
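The filtering rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: the "dynamically selected subset" is drawn at random, the fallback majority vote is taken over all n classifiers, and each trained Random Forest is represented by a plain callable that maps a feature vector to a label. The names `filter_instances`, `classifiers`, and `stream` are hypothetical.

```python
import random
from collections import Counter


def filter_instances(instances, classifiers, k=3, rng=None):
    """Keep instances whose given label is supported by the ensemble.

    instances   -- list of (features, label) pairs
    classifiers -- list of callables features -> predicted label
                   (stand-ins for the n trained Random Forest models)
    k           -- size of the quick-check subset (three in the abstract)
    """
    rng = rng or random.Random(0)
    kept = []
    for features, label in instances:
        # Cheap path: consult only k classifiers (random choice is an assumption).
        subset = rng.sample(classifiers, min(k, len(classifiers)))
        if all(clf(features) == label for clf in subset):
            kept.append((features, label))
            continue
        # Fallback (assumed): majority vote over every classifier.
        # Ties are resolved in favour of the label that was voted first.
        votes = Counter(clf(features) for clf in classifiers)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == label:
            kept.append((features, label))
    return kept


# Toy stand-ins: five threshold rules on a single feature.
classifiers = [lambda x, t=t: int(x[0] > t) for t in (0.40, 0.45, 0.50, 0.55, 0.60)]

stream = [
    ([0.1], 0),   # clean
    ([0.9], 1),   # clean
    ([0.9], 0),   # mislabeled: every rule predicts 1
    ([0.52], 1),  # borderline: three of five rules predict 1
    ([0.1], 1),   # mislabeled: every rule predicts 0
]

clean = filter_instances(stream, classifiers)
print(clean)
```

In this toy run the two mislabeled instances are dropped, and the borderline instance is retained either directly or through the fallback vote, whichever subset is drawn. Most clean instances pass the three-classifier check, so the full ensemble is consulted only for the disputed ones, which is the source of the cost saving the abstract claims.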
Keywords
Ensemble Learning, Noise Reduction, Majority Voting, Machine Learning Optimization, Big Data Processing, Apache Spark.
CRediT Author Statement
The authors confirm contribution to the paper as follows:
Writing - Original Draft: Vranda Jajoo;
Visualization: Sanjay Tanwani;
Revision: Sanjay Tanwani;
Writing - Review and Editing: Vranda Jajoo;
All authors reviewed the results and approved the final version of the manuscript.
Acknowledgements
The author(s) thank Dr. Sanjay Tanwani for his support in completing this research.
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Availability of data and materials
Data sharing is not applicable to this article as no new data were created or analysed in this study.
Author information
Contributions
All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.
Corresponding author
Vranda Jajoo
School of Computer Science and Information Technology, Devi Ahilya Vishwavidhyalaya, Indore, Madhya Pradesh, India.
Open Access This article is licensed under a Creative Commons Attribution NoDerivs license, which is a more restrictive license. It allows you to redistribute the material commercially or non-commercially, but the user cannot make any changes whatsoever to the original, i.e. no derivatives of the original work. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/
Cite this article
Vranda Jajoo and Sanjay Tanwani, “An Adaptive Ensemble Based Filtering Approach for Noise Removal in Online Data Streams,” Journal of Machine and Computing, vol. 6, no. 1, pp. 242–252, 2026, doi: 10.53759/7669/jmc202606018.