Journal of Machine and Computing


A Novel Fuzzy K-Means Clustering Approach Optimized by Bacterial Foraging Algorithm for Document Categorization




Received On : 18 November 2024

Revised On : 26 February 2025

Accepted On : 06 March 2025

Published On : 05 April 2025

Volume 05, Issue 02

Pages : 1023-1031


Abstract


Document categorization is a crucial task in organizing large collections of text. Traditional clustering methods such as K-means assign each document to exactly one cluster and therefore struggle with uncertainty in the data. This paper presents a novel approach, FKM-BFO, that combines Fuzzy K-Means (FKM) clustering with Bacterial Foraging Optimization (BFO) to enhance document clustering performance. FKM allows documents to belong to multiple clusters with varying degrees of membership, reflecting the inherent overlap in topics and making it more suitable for real-world text data. However, FKM is sensitive to initial centroid placement and may get stuck in local optima. To address this, BFO, inspired by the foraging behaviour of bacteria, is used to optimize the initial centroids and guide the FKM algorithm toward a global optimum. This combination improves clustering accuracy by better determining the cluster centers and membership values. We evaluate the FKM-BFO approach on benchmark datasets such as 20 Newsgroups and Reuters-21578. The results show that FKM-BFO outperforms traditional clustering methods, such as K-means and Fuzzy C-Means, in terms of accuracy and robustness, especially on noisy and high-dimensional data. This hybrid approach offers an effective solution for document categorization, providing higher accuracy and stability. Future work could explore its scalability and applicability to larger, real-time document clustering tasks.
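The abstract does not state the update rules or the BFO parameters, so the following is only a minimal sketch of the general idea under stated assumptions: standard fuzzy c-means style updates with fuzzifier m = 2, squared Euclidean distance, a simplified BFO with chemotaxis (tumble and swim) and reproduction only, and synthetic 2-D points standing in for document vectors. All function names and parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def sq_dists(X, C):
    # Squared Euclidean distance from every point to every centroid: shape (n, k)
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def memberships(X, C, m=2.0):
    # Fuzzy membership matrix U (n x k); each row sums to 1
    d2 = np.maximum(sq_dists(X, C), 1e-12)  # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def objective(X, C, m=2.0):
    # Fuzzy objective: sum_i sum_j u_ij^m * ||x_i - c_j||^2
    U = memberships(X, C, m)
    return float(((U ** m) * sq_dists(X, C)).sum())

def fuzzy_kmeans(X, C, m=2.0, iters=100, tol=1e-6):
    # Alternate membership and centroid updates until centroids stop moving
    for _ in range(iters):
        W = memberships(X, C, m) ** m
        C_new = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.abs(C_new - C).max()
        C = C_new
        if shift < tol:
            break
    return C, memberships(X, C, m)

def bfo_init(X, k, n_bacteria=10, chem_steps=20, swim_len=4, step=0.1, seed=0):
    # Simplified BFO (assumption): no swarming term, no elimination-dispersal,
    # and current cost is used as the health measure during reproduction.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each bacterium is one candidate set of k centroids, seeded from random points
    pop = [X[rng.choice(n, size=k, replace=False)].copy() for _ in range(n_bacteria)]
    cost = [objective(X, C) for C in pop]
    half = max(1, n_bacteria // 2)
    for _ in range(chem_steps):
        for i in range(n_bacteria):
            # Tumble: pick a random unit direction in centroid space
            d = rng.normal(size=pop[i].shape)
            d /= np.linalg.norm(d)
            # Swim: keep moving along d while the objective improves
            for _ in range(swim_len):
                cand = pop[i] + step * d
                c = objective(X, cand)
                if c >= cost[i]:
                    break
                pop[i], cost[i] = cand, c
        # Reproduction: the healthier half is copied over the weaker half
        order = np.argsort(cost)[:half]
        pop, cost = ([pop[order[j % half]].copy() for j in range(n_bacteria)],
                     [cost[order[j % half]] for j in range(n_bacteria)])
    best = int(np.argmin(cost))
    return pop[best], cost[best]

# Synthetic stand-in for document vectors: three Gaussian blobs in 2-D
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
C0, cost0 = bfo_init(X, k=3)   # BFO searches for promising initial centroids
C, U = fuzzy_kmeans(X, C0)     # FKM refines them and yields soft memberships
print("membership of first point:", U[0].round(3))
```

In a document setting, `X` would be replaced by a document-term representation such as TF-IDF vectors, and a cosine-based distance may be a better fit than the Euclidean distance assumed here.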


Keywords


Document Clustering, Natural Language Processing, Information Retrieval, Bacterial Foraging Optimization, Convergence Speed and Cluster Quality.



CRediT Author Statement


The authors confirm contribution to the paper as follows:

Conceptualization: Periyasamy S, Kaniezhil R, Venkatesan R, Sathish Kumar R, Sivaramakrishnan A and Karthikeyan K; Methodology: Periyasamy S, Kaniezhil R and Venkatesan R; Writing - Original Draft Preparation: Periyasamy S, Kaniezhil R, Venkatesan R, Sathish Kumar R, Sivaramakrishnan A and Karthikeyan K; Validation: Sathish Kumar R, Sivaramakrishnan A and Karthikeyan K; Writing - Reviewing and Editing: Periyasamy S, Kaniezhil R, Venkatesan R, Sivaramakrishnan A and Karthikeyan K. All authors reviewed the results and approved the final version of the manuscript.


Acknowledgements


The authors would like to thank the reviewers for their helpful comments on the manuscript.


Funding


No funding was received to assist with the preparation of this manuscript.


Ethics declarations


Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.


Availability of data and materials


Data sharing is not applicable to this article as no new data were created or analysed in this study.


Author information


Contributions

All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.


Corresponding author


Rights and permissions


Open Access. This article is licensed under a Creative Commons Attribution NoDerivs license, which is a more restrictive license. It allows you to redistribute the material commercially or non-commercially, but the user cannot make any changes whatsoever to the original, i.e. no derivatives of the original work. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/


Cite this article


Periyasamy S, Kaniezhil R, Venkatesan R, Sathish Kumar R, Sivaramakrishnan A and Karthikeyan K, “A Novel Fuzzy K-Means Clustering Approach Optimized by Bacterial Foraging Algorithm for Document Categorization”, Journal of Machine and Computing, pp. 1023-1031, April 2025, doi: 10.53759/7669/jmc202505081.


Copyright


© 2025 Periyasamy S, Kaniezhil R, Venkatesan R, Sathish Kumar R, Sivaramakrishnan A and Karthikeyan K. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.