Machine Learning Based Performance Analysis of Video Object Detection and Classification Using Modified YOLOv3 and MobileNet Algorithm

Abstract – Detecting foreground objects in video is crucial in various machine vision applications and computerized video surveillance technologies. Object tracking and detection are essential in object identification, surveillance, and navigation approaches. Object detection is the technique of differentiating between the background and foreground features of an image. Recent improvements in vision systems, including distributed smart cameras, have inspired researchers to develop enhanced machine vision applications for embedded systems. Compared with conventional object detection methods, the efficiency of feature-based object detection algorithms declines as the amount of dynamic video data increases. Blurred moving subjects, fast-moving objects, and background occlusion remain particularly challenging cases.


I. INTRODUCTION
Understanding the dynamic features of objects is critical in autonomous environments. Outdoor surveillance systems employ freely moving event cameras. However, external variables mean that the scene is not static, resulting in higher energy and time utilisation [1]. Drawing on decades of machine vision research, specialised object recognition challenges have been addressed, including computerised assembly-line sorting and inspection systems, handwriting detection on postal sorting machines, and ATM bill inspection. In these successful applications, the appearance of objects can be tightly constrained in a well-controlled sensing environment, resulting in reliable and practical solutions for industrial problems that involve perception and robot navigation [2].
The most crucial point to bear in mind is that event cameras do not generate output pixel intensity levels but rather accurately time-stamped spikes, defined as events signalling a sufficient change in the intensity captured at a pixel. As a result, event cameras use less transmission bandwidth and consume only a few hundred milliwatts. To summarise, event-based cameras adopt a distinct approach to visual imaging by concentrating on low latency and lightweight algorithms [3]. The reliability of the adaptive neuro-fuzzy inference system (ANFIS) for categorising moving objects in a Street View application has been examined. Neuro-fuzzy modelling combines the benefits of fuzzy logic and neural learning models, helping the framework justify actions based on object classification judgements [4][5][6].
II. RELATED WORK
An earlier framework achieved precision rates of more than 90%, accuracy values greater than 70%, and recall levels greater than 70%, verifying its intended functionality [17]. Srinivas et al. (2022) examined frameworks for these activities using improved cyber-security control facilities. The technique is divided into two stages: detection of numerous objects using a cyber-security probabilistic Gaussian mixture model with background suppression, and tracking of multiple moving objects using a kernel convolution moving window with a Kalman filter. Simulation outcomes show that the suggested approach can identify and locate objects in complicated and shifting environments with excellent efficiency, resilience, and accuracy. The proposed model also yields a noise-free image [18].
Kyung Pyo Kim et al. (2020) proposed a methodology for enhancing deep learning (DL)-based identification accuracy using shape data gathered from LiDAR point clouds. The work also presents a layer-based accumulation technique that takes the three-degrees-of-freedom motion of dynamic objects into account in order to augment this shape information properly. In experiments, the suggested accumulation technique outperforms existing log-based algorithms. Furthermore, in a real-car data test, the DL algorithm trained on simulated data performed better when the LiDAR point cloud was accumulated [19].
Guray Sonugur et al. (2022) suggested a two-stage Interconnected Artificial Neural Network (ICANN) framework. In the first stage, live images are transformed into binary images at the end of a GPS-assisted image registration procedure, and object silhouettes are then created by labelling connected components in the image. Two interlinked neural networks are employed in the second stage: the first network determines whether the silhouettes are objects or noise. The maximum success rate for object classification in the experimental investigations was 96.1%. The findings are compared against the currently popular YOLO object identification technique [20].
OA Pakhomova et al. (2019) developed a motion-detection approach to enhance the effectiveness of the movement-vector search technique used by the detection subsystem; the basic idea is to break each frame into blocks and look for similar sections in subsequent frames. The study outcomes demonstrate the method's efficacy. To filter out irrelevant regions, the motion detection module is integrated into a multipurpose machine vision framework that collects images from cameras at the input and communicates accumulated information about the observed objects through parallel streams at the output. The detection module is in charge of searching for and detecting movement, as well as concealing extraneous information and presenting only the areas required for further classification [21].
Jing Yunduo et al. (2021) suggested an asynchronous, real-time corner extraction and tracking technique for event cameras. The primary motivation is to increase corner identification and tracking accuracy while maintaining computing efficiency. To enable corner-event tracking, a data-association strategy with temporal, velocity, and spatial-direction constraints is offered, in which a newly arrived corner event is associated with the last active corner in its neighbourhood that fulfils the speed-direction requirement. Trials on a conventional event camera dataset reveal that the technique performs exceptionally well in corner detection and tracking [22].
Justas Furmonas et al. (2022) summarize the currently known event-based approaches and systems that have been reported. An analytical examination of these approaches and frameworks supports the conclusions reached. The paper finishes with suggestions and proposals for future improvements in the domain of event-camera-based depth estimation. Recent studies demonstrate the use of SNNs as well as unsupervised and supervised neural networks. Nevertheless, many approaches continue to perform poorly due to a shortage of suitable training data sets [23].
Takehiro Ozawa et al. (2022) suggested a method for estimating motion in bird's-eye-view space using contrast optimization. The paper reduces the problem from a 3D motion estimate to a 2D motion estimate by translating the data to a bird's-eye view, employing a homography derived from the camera position. This conversion solves the issue of non-convex loss functions in previous approaches. Experimental findings with CARLA and real-world data show that the suggested approach is efficient and accurate [24].

Problem Statement
A deeper design gives tenfold greater expressive capability when compared to standard shallow models. To achieve high detection accuracy, the following issues must be resolved:
• Intra-class variance: shape, size, material, colour, and position differences among real-world objects.
• Image conditions and unconstrained surroundings: variables including blur, lighting, shadow, weather, clutter, occlusion, physical object location, motion, and viewpoint.
• Imaging noise: compression artefacts, filter distortions, and low-resolution images.
• The detector must discriminate between thousands of structured and unstructured real-world object categories.
• Low-end mobile devices possess restricted speed, memory, and processing capabilities.
• Distinctions must be made between thousands of open-world object classes.
• Large-scale image or video data.
• Inability to handle previously unseen objects.
The fundamental idea is to apply a CNN to the image to complete the task. The CNN operates on image patches, and these highlighted regions can be produced using region-proposal networks, including the Region-based Convolutional Neural Network (RCNN), the Fast Region-based Convolutional Neural Network (Fast-RCNN), and the Faster Region-based Convolutional Neural Network (Faster-RCNN). A hierarchical clustering approach is utilized to perform a selective search for object regions. These approaches have a few bottlenecks that can be addressed with newer techniques, including You Only Look Once (YOLO) and the Single Shot Detector (SSD). An effective object identification method recognizes bounding boxes for objects of all real-world sizes while using available computing resources at a fast processing speed. YOLO and SSD provide promising outcomes; however, there is a trade-off between speed and precision. As a result, the choice of method is application-specific [25].
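To make this speed/precision trade-off concrete, the sketch below (illustrative only, not this paper's implementation; it assumes pretrained torchvision COCO models and a random stand-in frame) times a two-stage detector against a single-shot detector on one image.

```python
# Illustrative sketch (not the paper's implementation): comparing the
# speed/precision trade-off between a two-stage and a single-shot detector.
import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, ssd300_vgg16

image = torch.rand(3, 480, 640)  # stand-in for a video frame, values in [0, 1]

for name, builder in [("Faster-RCNN", fasterrcnn_resnet50_fpn),
                      ("SSD", ssd300_vgg16)]:
    model = builder(weights="DEFAULT").eval()  # pretrained COCO weights
    with torch.no_grad():
        start = time.perf_counter()
        out = model([image])[0]  # dict with 'boxes', 'labels', 'scores'
        elapsed = time.perf_counter() - start
    kept = (out["scores"] > 0.5).sum().item()  # confident detections only
    print(f"{name}: {elapsed:.2f}s per frame, {kept} boxes above 0.5")
```

On typical hardware the single-shot model runs noticeably faster while the two-stage model tends to score higher on localization quality, which is exactly why the choice is application-specific.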
With dynamic objects in the background, the LIBS approach does not deliver the most accurate results; even a slight change in the background, such as a swinging sheet or any other minor alteration, degrades its output. W4, using its cardboard model, can only detect upright humans, which becomes problematic when people are in other positions, such as crawling or climbing. Behavioural subtraction has difficulty detecting spatial irregularities such as U-turns, although the approach can identify both temporal and spatial outliers when required. Behavioural camouflage occurs when foreground objects become visible during background activities. The Kalman filter, the mean shift algorithm, and the GMM all struggle to detect multiple objects with even minor occlusions. Conventional object detection techniques cannot identify regions in images containing numerous objects, and existing color detection algorithms can only detect primary colors accurately.
Existing approaches detect colors incorrectly if the image contains other colors. Aside from that, a common difficulty is that a change in background illumination can be misinterpreted as a foreground object. Some approaches also have difficulty detecting shadows. The visual closeness between foreground and background items leads to camouflage problems. Another problem is non-static background modelling: in high-traffic locations, the background is frequently obscured by many foreground objects, and this constant shift makes it challenging to separate the permanent foreground from the backdrop [26].
III. PROPOSED METHODOLOGY
This work aims to create feature selection and classification approaches that address existing issues with the detection of moving objects captured by event cameras. Low-rank approximation techniques are utilised to extract the dynamic properties of the frames. To minimise battery usage, a newly enhanced YOLOv3 is employed for feature selection. The suggested approach assesses each frame's entropy, lowering power usage.
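As a hedged illustration of this entropy-based gating (the grayscale-histogram formulation and the threshold below are assumptions, not the paper's exact procedure), a frame's Shannon entropy can be computed from its intensity histogram and used to skip low-information frames:

```python
# Illustrative frame-entropy gate (assumed formulation): compute Shannon
# entropy of a grayscale frame's intensity histogram and skip frames whose
# entropy falls below a threshold, saving computation and power.
import numpy as np

def frame_entropy(gray: np.ndarray) -> float:
    """Shannon entropy (bits) of an 8-bit grayscale frame."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins (log of 0 undefined)
    return float(-(p * np.log2(p)).sum())

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
if frame_entropy(frame) > 4.0:        # assumed, tunable threshold
    pass                              # run detection on this frame
```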
Additionally, to reduce the computational time of the suggested enhanced YOLOv3, data set classification is accomplished by combining YOLOv3 with the MobileNet architecture. The ranking is accomplished through a comparative examination of live data sets. MOT20, a real-time video dataset, is used in this work and has been converted into short video frames. The suggested design displays the detected objects, and the noise identified for the several moving objects, in each object's frame. A convolutional moving-window Kalman filter is used to remove and smooth noise, and the video frames are processed and analyzed after denoising. Using noisy measurements acquired over time, the Kalman filter estimates process parameters and predicts future observations: at every stage it makes a prediction, collects a measurement, and subsequently updates its estimate based on the comparison between forecast and measurement. This mathematical estimator can predict and update the state of a wide range of linear processes. In the YOLOv3 network, a binary cross-entropy loss is utilized for the multi-label prediction of bounding-box classes, rather than a softmax, to improve performance [27]. As shown in Table 1, MobileNet factorizes a standard convolution into a depthwise convolution followed by a 1×1 pointwise convolution; this factorization considerably reduces both the computation time and the size of the model. The computational cost of the factorized convolution is

$d_k \cdot d_k \cdot m \cdot d_f \cdot d_f + m \cdot n \cdot d_f \cdot d_f$  (1)

where $m$ and $n$ are the numbers of input and output channels, $d_k$ denotes the convolution kernel size, and $d_f$ denotes the size of the feature map. The depthwise and pointwise convolutions, each followed by BN and ReLU blocks, are shown in Fig 2. The computing cost of a standard convolution, on the other hand, is

$d_k \cdot d_k \cdot m \cdot n \cdot d_f \cdot d_f.$  (2)

Comparing (1) and (2), [28] decomposes conventional convolutions into depthwise convolutions and 1×1 convolutions to minimize model size. The MobileNet framework has been found to require 8-9 times less computation than standard convolutions, with only a minor loss of precision. As a result, the MobileNet network, rather than the Darknet-53 model, serves as the backbone of the YOLOv3 framework for object detection in this work. The YOLO network's convolutional layers are inextricably linked to the underlying Darknet technology; for a more cohesive design, they can be replaced with their pointwise equivalents. The suggested object detection technique is presented in this section: a new object detection system is created using YOLOv3 and MobileNet, and Fig 3 depicts the suggested architecture. The suggested approach begins by rescaling image data from the event-based real-time dataset. MobileNet is an essential feature-extraction component in this approach due to its excellent accuracy and effectiveness. In contrast to the conventional YOLOv3 model's selection of fixed feature maps, this research re-examines how to select the feature maps for object detection by matching receptive field to object scale. The revised selection of feature maps significantly improves the suggested object detection model's performance [29][30].
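As a quick numerical check of equations (1) and (2), the sketch below computes both costs for illustrative (assumed) values of m, n, d_k, and d_f; the resulting ratio reproduces the 8-9x saving cited for MobileNet.

```python
# Numerical check of equations (1) and (2): computational cost of a
# depthwise separable convolution vs. a standard convolution.
m, n = 512, 512   # assumed input/output channel counts
d_k = 3           # kernel size
d_f = 14          # feature map size

separable = d_k * d_k * m * d_f * d_f + m * n * d_f * d_f   # eq. (1)
standard = d_k * d_k * m * n * d_f * d_f                    # eq. (2)

print(f"separable/standard cost ratio: {separable / standard:.3f}")
# Ratio = 1/n + 1/d_k^2 ≈ 0.113 here, i.e. roughly the 8-9x saving
# reported for MobileNet when d_k = 3.
```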
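The denoiser above is described as a convolutional moving-window Kalman filter; as a hedged stand-in, the sketch below implements only the textbook predict/measure/update cycle for a 1-D constant-velocity state, with all noise covariances assumed.

```python
# Minimal 1-D constant-velocity Kalman filter: an illustrative sketch of the
# predict/measure/update cycle, not the paper's moving-window implementation.
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition for [pos, vel]
H = np.array([[1.0, 0.0]])               # we only measure position
Q = np.eye(2) * 1e-3                     # assumed process noise covariance
R = np.array([[0.5]])                    # assumed measurement noise covariance

x = np.zeros((2, 1))                     # initial state estimate
P = np.eye(2)                            # initial estimate covariance

def kalman_step(z: np.ndarray) -> float:
    """One predict/update cycle for a (1, 1) position measurement z."""
    global x, P
    # Predict: propagate the state and its uncertainty forward in time.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: blend the prediction with the new measurement.
    y = z - H @ x                        # innovation (measurement residual)
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return float(x[0, 0])                # smoothed position estimate

noisy_track = [1.0, 2.2, 2.9, 4.1, 5.0]  # synthetic noisy object positions
smoothed = [kalman_step(np.array([[z]])) for z in noisy_track]
print(smoothed)
```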

IV. RESULTS AND DISCUSSION

Accuracy Analysis
Accuracy refers to the degree of agreement between an actual value and its estimate under noise. Table 2 depicts the accuracy study of the proposed approach. On the X-axis of Fig 5, multiple video frame sequences from the MOT 20 data set are given, and the accuracy in percentage is assessed on the Y-axis. The proposed approach reaches a maximum accuracy of 99%. The proposed model's accuracy estimates are validated against existing models using feature masking in video frames. The comparative study takes into account the multiple-object tracking of ten items as well as the classification accuracy of the suggested model.

Fig 5. Analysis of Accuracy

Precision Analysis
Precision is the degree to which repeated measurements under noise produce the same results under similar circumstances. Table 3 depicts the precision analysis of the suggested technique.

Table 3. Precision analysis (%)
Frames   YOLOv3   YOLOv2   F-RCNN   RCNN   SSD   Proposed
100      73       77       82       85     88    90
200      75       78       86       87     89    93
300      77       79       88       90     93    95
400      79       80       89       92     94    96
500      81       82       91       93     95    98

On the X-axis of Fig 6, multiple video frame sequences from the MOT 20 data set are shown, while the precision in percentage is assessed on the Y-axis. The proposed approach obtains the highest precision of 98%. The calculated precision values reveal that the suggested model achieves precision levels superior to the state of the art: existing techniques provide precision rates of 95%, 93%, 91%, 82%, and 81% for SSD, RCNN, F-RCNN, YOLOv2, and YOLOv3, respectively, whereas the observed precision of the suggested model is 98%. These comparisons show that it outperforms traditional techniques. Table 4 shows the suggested technique's recall analysis; the corresponding plot shows multiple video frame sequences from the MOT 20 data set on the X-axis and recall percentages on the Y-axis. The proposed approach has a maximum recall value of 95%. A comparison of the suggested method with the current state of the art reveals that it outperforms conventional procedures in both tracking and categorization.
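For reference, the accuracy, precision, and recall figures reported in these tables follow the standard count-based definitions; the sketch below uses the textbook formulas with hypothetical counts, not the paper's evaluation code.

```python
# Standard detection metrics from raw counts: an illustrative sketch using
# the textbook definitions, not the paper's evaluation code.
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # agreement with ground truth
    precision = tp / (tp + fp)                   # reliability of positives
    recall = tp / (tp + fn)                      # coverage of actual objects
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical counts for one frame sequence:
print(detection_metrics(tp=95, fp=2, fn=5, tn=98))
```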

Fig 7. Analysis of Precision

TP Analysis
True positive (TP) analysis is a criterion for evaluating the performance of a tracker. The first step is to determine whether each proposed output is a TP that corresponds to an actual target. The TP of the proposed approach is evaluated in Table 5.

Table 5. True positive analysis (%)
Frames   YOLOv3   YOLOv2   F-RCNN   RCNN   SSD   Proposed
100      53       55       59       61     64    78
200      55       57       61       64     66    84
300      57       61       65       66     68    88
400      58       63       67       68     70    90
500      61       64       68       70     72    95

Fig 8 shows multiple video frame sequences from the MOT 20 data set on the X-axis and the true positive score on the Y-axis. The suggested approach yields the highest true positive value of 95%.
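A common way to decide whether a proposed output corresponds to an actual target is IoU matching against ground-truth boxes; the sketch below assumes this convention (the 0.5 threshold and greedy matching are illustrative, not taken from the paper).

```python
# Illustrative TP/FP matching via IoU (a common convention, assumed here):
# a detection is a TP if it overlaps an unmatched ground-truth box by >= 0.5.
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def count_tp_fp(detections, ground_truth, thresh=0.5):
    """Greedily match each detection to at most one ground-truth box."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for det in detections:
        best = max(unmatched, key=lambda g: iou(det, g), default=None)
        if best is not None and iou(det, best) >= thresh:
            unmatched.remove(best)   # each GT box may be matched only once
            tp += 1
        else:
            fp += 1
    return tp, fp

dets = [(10, 10, 50, 50), (100, 100, 140, 150)]
gts = [(12, 11, 52, 49)]
print(count_tp_fp(dets, gts))   # (1, 1): one TP, one FP
```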

Fig 8. Analysis of TP

FP Analysis
The first step is to determine whether each hypothesized output is an FP, i.e., a false alarm. The analysis of false positives is shown in Table 6. (FPS, by contrast, represents the number of images recognized or classified by a model per second in image classification and object detection applications, and can be utilized for estimating the model's average processing speed; for a display, the FPS value refers to the number of frames transmitted to the screen every second.) The X-axis in Fig 9 shows distinct video frame sequences from the MOT 20 dataset, while the Y-axis shows false positives. The proposed approach yields a false positive score of 94%.

MAP Analysis
Position information containing the coordinate details linked to a single image is called the mean absolute position (MAP). The mean absolute position analysis of Fig 12 is given in Table 9, where several video frame sequences from the MOT 20 data set are shown on the X-axis and the MAP is examined on the Y-axis. The best mean absolute position score obtained was 95%. A comparison of the suggested and state-of-the-art MAP scores reveals that the suggested approach outperforms the existing literature. According to quantitative research, combining static and dynamic models can increase saliency detection performance; traditional approaches cannot recognize the relevance of video objects when merging static foreground networks with dynamic highlight networks. The modelling approach is trained using static foreground data, which results in more accurate predictions than other methods. From the above analysis, it can be assumed that performance drops as training data decreases, and vice versa; this implies that the suggested approach is data-driven. The computational load of the proposed technique and existing algorithms is compared in Table 10, which makes clear that the suggested approach is faster than the alternatives. The method has been found to reduce computation time and eliminate a significant bottleneck in execution efficiency; in most circumstances, motion or edge computations impede video saliency. The outcomes are depicted in Figs 13-14, covering both static and dynamic effects as well as the static and dynamic relationships between the suggested approach and other conventional methods. Compared with purely static or dynamic procedures and other similar methods, the suggested strategy utilizing both static and dynamic procedures minimizes MAE and computing costs. Because computing time is minimized, fewer network resources are needed to process incoming data.
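Under the paper's coordinate-based reading of mean absolute position, one plausible scoring, assumed here rather than taken from the paper, is the mean absolute error between matched predicted and ground-truth box centres:

```python
# A plausible scoring of coordinate-based position quality (assumed
# interpretation, not the paper's exact metric): mean absolute error
# between predicted and ground-truth object centres.
def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def mean_absolute_position_error(preds, gts):
    """Average L1 distance between matched predicted and true centres."""
    errors = []
    for p, g in zip(preds, gts):   # assumes detections are already matched
        (px, py), (gx, gy) = centre(p), centre(g)
        errors.append(abs(px - gx) + abs(py - gy))
    return sum(errors) / len(errors)

preds = [(10, 10, 50, 50), (98, 102, 142, 148)]
gts = [(12, 11, 52, 49), (100, 100, 140, 150)]
print(mean_absolute_position_error(preds, gts))
```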

FPGA Performance
The hardware configuration and functionality of a Xilinx Zynq-7020 FPGA at 100 MHz are compared directly against the outcomes of the algorithms. For testing, this work employs the ISIM integrated logic simulation software. After synthesis and deployment, the timing findings and latency requirements are first analyzed to confirm that the required behavior is met.

Power Consumption
Table 11 contrasts the suggested system's energy consumption with that of the more advanced technique. The event camera in the suggested system consumes only a fraction of a watt (0.33 W); the algorithm alone contributes only 0.33 W of dynamic power to the device. The current study employs the hybrid computing capabilities of Xilinx Zynq devices but is limited by the high latency of frame-based systems. The Zynq module is a strong and versatile development platform; it can employ sleep mode and non-volatile memory, and its efficiency is significantly higher, with a substantially lower overall power consumption than the SmartFusion FPGA. In other words, with correct hardware selection and design effort, there are numerous possibilities for keeping the framework's power consumption low (less than 1 W). The recall comparison for the proposed approach is shown in Fig 15.

Fig 15. Comparison of Recall of Proposed Algorithms

V. CONCLUSION
Traditional detection approaches cannot meet the criteria for high-precision, real-time object recognition and classification with dynamic event cameras. This research therefore develops an enhanced YOLOv3 network. YOLOv3 and MobileNet were used to create a new object detection method. Rather than picking the fixed feature maps of the conventional YOLOv3 architecture, the technique for determining feature maps in MobileNet is optimized based on an examination of the receptive fields. Experimental outcomes show that the suggested approach can identify and track foreground objects in complicated and dynamic environments with excellent precision, resilience, and efficacy. The approach also yields frames that are smooth and free of noise, and it demands less computing time. The suggested method does not suffer from false object tracking even under varying lighting, making the system more efficient and resilient. Future work can be extended by experimenting with more video streams in congested areas.

Data Availability
No data was used to support this study.

Conflicts of Interests
The author(s) declare(s) that they have no conflicts of interest.

Funding
No funding was received to assist with the preparation of this manuscript.

Ethics Approval and Consent to Participate
The research has been granted ethical approval and consent to participate.