Object detection (OD) is a computer vision technique for locating objects in digital images. Our study addresses the need for robust OD algorithms in human activity recognition, a domain spanning human-computer interaction, sports analysis, and surveillance. Three-dimensional convolutional neural networks (3DCNNs) are now a standard method for recognizing human activity. Building on recent advances in Deep Learning (DL), we present a novel fusion framework that improves on conventional methods by integrating 3DCNNs with Convolutional Long Short-Term Memory (ConvLSTM) layers. The proposed model exploits the spatiotemporal features inherent in video streams, an important aspect often overlooked in existing OD methods. We assess the efficacy of the proposed architecture on the UCF-50 dataset, which is well known for its diverse range of human activities. In addition to designing a novel deep-learning architecture, we applied data augmentation techniques that expand the dataset, improve model robustness, reduce overfitting, and enhance performance on imbalanced data. In comprehensive experiments, the proposed model achieved an accuracy of 98.11% in classifying human activity. Furthermore, when benchmarked against state-of-the-art methods, our system delivers competitive accuracy and class-average accuracy across the 50 activity categories.
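The abstract does not say which augmentation techniques were applied, so the following is only a minimal sketch of how clip-level augmentation for video data can work, assuming two common choices: a random temporal crop and a horizontal flip. The function name `augment_clip`, the parameter `out_frames`, and the synthetic input are hypothetical and are not taken from the paper.

```python
import numpy as np


def augment_clip(clip, out_frames, rng):
    """Return a randomly time-cropped and possibly mirrored copy of a clip.

    clip: array of shape (frames, height, width, channels).
    out_frames: number of consecutive frames to keep (hypothetical parameter).
    rng: numpy.random.Generator, passed in so results are reproducible.
    """
    frames = clip.shape[0]
    if out_frames > frames:
        raise ValueError("out_frames exceeds clip length")

    # Random temporal crop: keep a contiguous window so motion stays coherent.
    start = int(rng.integers(0, frames - out_frames + 1))
    window = clip[start:start + out_frames]

    # Horizontal flip with probability 0.5, applied to every frame alike
    # so the mirrored clip is still a consistent video.
    if rng.random() < 0.5:
        window = window[:, :, ::-1, :]

    return np.ascontiguousarray(window)


rng = np.random.default_rng(0)
# Synthetic stand-in for a decoded video clip (assumed size, not from the paper).
clip = rng.random((20, 64, 64, 3), dtype=np.float32)
augmented = augment_clip(clip, out_frames=16, rng=rng)
print(augmented.shape)  # (16, 64, 64, 3)
```

Each call yields a different view of the same clip, which is one way augmentation can enlarge the effective training set. Whether mirroring is appropriate depends on the activity classes involved.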
Keywords
Object Detection, Human Activity Recognition, Deep Learning, 3DCNN, ConvLSTM.
Acknowledgements
The authors thank Dr. Humera Khanam M for support in completing this research.
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Availability of data and materials
Data sharing is not applicable to this article as no new data were created or analysed in this study.
Author information
Contributions
All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.
Corresponding author
Roopa R
Department of CSE, S V University College of Engineering, S V University, Tirupati, Andhra Pradesh, India.
Open Access. This article is licensed under a Creative Commons Attribution NoDerivs license, a more restrictive license. It allows you to redistribute the material commercially or non-commercially, but you cannot make any changes whatsoever to the original, i.e. no derivatives of the original work. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/
Cite this article
Roopa R and Humera Khanam M, “Advancements in Real Time Human Activity Recognition via Innovative Fusion of 3DCNN and Convlstm Models”, Journal of Machine and Computing, pp. 759-771, July 2024. doi: 10.53759/7669/jmc202404071.