Journal of Computing and Natural Science

Object Recognition to Content Based Image Retrieval: A Study of the Developments and Applications of Computer Vision

Received On: 03 March 2023

Revised On: 25 June 2023

Accepted On: 08 August 2023

Published On: 05 January 2024

Volume 04, Issue 01

Pages: 041-052


Natural Language Processing (NLP) and Computer Vision (CV) are interconnected fields within the domain of Artificial Intelligence (AI). CV is concerned with enabling computer systems to interpret and recognize visual data, while NLP is concerned with comprehending and processing human language. The two fields have practical applicability in tasks such as image description generation, object recognition, and question answering over visual input. Deep learning techniques such as word embeddings are typically employed to enhance the performance of Content-Based Image Retrieval (CBIR). Together, NLP and CV play a vital role in improving how computers comprehend and engage with both visual and written information. This paper reviews several major elements of computer vision, including CBIR, image captioning, video captioning, visual learning, and visual question answering, and surveys the datasets, techniques, and methods employed in each area. The authors focus on the challenges and progress in each area and offer new strategies for improving the performance of CV systems.
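The CBIR pipeline alluded to above, retrieving images by similarity of learned feature vectors rather than by text metadata, can be illustrated with a minimal sketch. This is not the paper's method; it is a generic toy example in which each image is represented by a hypothetical pre-extracted feature vector (in practice, the output of a deep network), and retrieval is nearest-neighbor search under cosine similarity:

```python
import numpy as np

def build_index(features: np.ndarray) -> np.ndarray:
    # L2-normalize each image's feature vector so that a plain
    # dot product with a normalized query equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / norms

def retrieve(index: np.ndarray, query: np.ndarray, k: int = 3) -> list:
    # Normalize the query, score every database image, and
    # return the indices of the k most similar images.
    q = query / np.linalg.norm(query)
    scores = index @ q
    return np.argsort(-scores)[:k].tolist()

# Toy "database" of four images, each described by a 3-D feature vector.
db = np.array([[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
index = build_index(db)
query = np.array([1.0, 0.05, 0.0])
print(retrieve(index, query, k=2))  # indices of the two most similar images
```

Real CBIR systems differ mainly in how the feature vectors are produced (e.g., CNN embeddings) and in how the nearest-neighbor search is accelerated at scale, for instance with locality-sensitive hashing.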


Keywords: Content-Based Image Retrieval, Computer Vision, Human Object Interaction, Natural Language Processing, Artificial Intelligence.



The author thanks the University of Peradeniya for research lab and equipment support.


No funding was received to assist with the preparation of this manuscript.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Availability of data and materials

No data are available for the above study.

Author information


All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.

Corresponding author

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

Cite this article

Udula Mangalika, “Object Recognition to Content Based Image Retrieval: A Study of the Developments and Applications of Computer Vision”, Journal of Computing and Natural Science, vol. 4, no. 1, pp. 041-052, January 2024. doi: 10.53759/181X/JCNS202404005.


© 2024 Udula Mangalika. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.