Journal of Machine and Computing


Detecting Auto Bot Text Content Document Based on Subspace Relative Lexicon Depth Measure Using Bigram Inverse Frequency Key Term Analyzer



Received On : 10 January 2024

Revised On : 15 September 2024

Accepted On : 22 February 2025

Published On : 05 April 2025

Volume 05, Issue 02

Pages : 900-913


Abstract


The proliferation of automated text generation poses significant challenges for cybersecurity and digital communication. This paper proposes a novel approach for detecting bot-generated text content using a Subspace Relative Lexicon Depth (SRLD) measure combined with a Bigram Inverse Frequency Key Term (BIFKT) analyzer. The SRLD measure evaluates the depth and spread of word usage within a specified lexicon to distinguish between human-authored and bot-generated content. The BIFKT analyzer uses bigrams and their inverse frequency to identify key terms that are less common in human writing but appear frequently in automated content. Integrating the two techniques yields a robust framework that improves accuracy and reduces false positives compared with existing methods. The proposed detection system was validated through extensive experiments on diverse datasets, including social media posts, online reviews, and news articles, and the results showed a significant improvement in detection rates.
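As a rough illustration of the bigram inverse frequency idea mentioned in the abstract, the following minimal sketch weights each bigram by the logarithm of the number of documents divided by the number of documents containing it, then ranks a document's bigrams by count times that weight. The weighting formula, the whitespace tokenizer, the function names (`bigrams`, `bigram_inverse_frequency`, `key_terms`), and the toy corpus are all assumptions made for illustration; the abstract does not specify the paper's actual BIFKT formulation.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_inverse_frequency(corpus):
    """Map each bigram to log(N / df); df = number of documents containing it.

    This IDF-style weight is an assumed stand-in, not the paper's definition.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Use a set so each document counts a bigram at most once.
        doc_freq.update(set(bigrams(doc)))
    return {bg: math.log(n_docs / df) for bg, df in doc_freq.items()}


def key_terms(doc, inv_freq, top_k=3):
    """Rank a document's bigrams by (count in doc) * (inverse frequency).

    Bigrams unseen in the reference corpus get weight 0.0 (a design choice).
    """
    counts = Counter(bigrams(doc))
    scored = {bg: c * inv_freq.get(bg, 0.0) for bg, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


# Hypothetical toy corpus: two templated messages and one unrelated sentence.
corpus = [
    "click here to win a free prize",
    "click here to claim a free prize",
    "we walked along the river at dusk",
]
inv_freq = bigram_inverse_frequency(corpus)
top = key_terms(corpus[0], inv_freq, top_k=2)
```

In this toy corpus, bigrams shared by the two templated messages receive a lower weight than bigrams unique to one document, so the ranking surfaces the pairs that set a document apart from the rest of the corpus. How such weights would feed a human-versus-bot decision is not described in the abstract and is left open here.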


Keywords


Automated Text Detection, Bot-Generated Content, Subspace Relative Lexicon Depth (SRLD), Bigram Inverse Frequency Key Term (BIFKT), Pattern Recognition and Text Analytics.


CRediT Author Statement


The authors confirm contribution to the paper as follows:

Conceptualization: Banumathy D, Maheskumar V, Vijayarajeswari R and Thiyagarajan P; Methodology: Maheskumar V, Vijayarajeswari R and Thiyagarajan P; Software: Banumathy D and Maheskumar V; Data Curation: Vijayarajeswari R and Thiyagarajan P; Writing - Original Draft Preparation: Banumathy D, Maheskumar V, Vijayarajeswari R and Thiyagarajan P; Visualization: Banumathy D and Maheskumar V; Investigation: Vijayarajeswari R and Thiyagarajan P; Supervision: Banumathy D and Maheskumar V; Validation: Banumathy D, Maheskumar V, Vijayarajeswari R and Thiyagarajan P; Writing - Reviewing and Editing: Banumathy D, Maheskumar V, Vijayarajeswari R and Thiyagarajan P. All authors reviewed the results and approved the final version of the manuscript.


Acknowledgements


The authors thank the reviewers for taking the time and effort necessary to review the manuscript.


Funding


No funding was received to assist with the preparation of this manuscript.


Ethics declarations


Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.


Availability of data and materials


Data sharing is not applicable to this article as no new data were created or analysed in this study.


Author information


Contributions

All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.


Rights and permissions


Open Access: This article is licensed under a Creative Commons Attribution NoDerivs license, a more restrictive license that allows the material to be redistributed commercially or non-commercially but does not permit any changes to the original, i.e. no derivatives of the original work. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/


Cite this article


Banumathy D, Maheskumar V, Vijayarajeswari R and Thiyagarajan P, “Detecting Auto Bot Text Content Document Based on Subspace Relative Lexicon Depth Measure Using Bigram Inverse Frequency Key Term Analyzer”, Journal of Machine and Computing, pp. 900-913, April 2025, doi: 10.53759/7669/jmc202505071.


Copyright


© 2025 Banumathy D, Maheskumar V, Vijayarajeswari R and Thiyagarajan P. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.