Introduction
“Deep learning” is a term we often encounter in connection with inspection robots. In this article, Super Dimensional Technology explains how deep learning is applied in its inspection robots.
1 Overview
Deep Learning (DL) is a class of machine learning algorithms based on representation learning of data. It can discover complex structure in big data and uses backpropagation to indicate how a machine should compute the representation in each layer from the representation in the previous layer, thereby adjusting each layer's internal parameters. The purpose of deep learning is to give robots an analytical learning ability similar to that of humans, since it can enhance their perception, decision-making, and control capabilities. For the Super Dimensional intelligent inspection robot specifically, this mainly involves object detection, object classification, and feature matching algorithms. By using these algorithm models together, the robot can accurately perceive the status of alarm indicators and equipment on server cabinets. By combining the models in different ways, it can also be applied in power distribution rooms to detect the position and status of rotary switches, pressure-plate switches, LEDs, LCDs, and other components on distribution cabinets. Each model is introduced in detail below.
2 Object detection models
The basic task of object detection is to determine the category of each object of interest in an image, localize it with a rectangular bounding box that gives its position and size, and report a corresponding confidence score. A fundamental problem in computer vision, object detection also underpins many other vision tasks such as image segmentation, object tracking, and image captioning. There are currently two mainstream families of deep learning object detection algorithms: two-stage detectors based on region proposals, such as R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, FPN, and Mask R-CNN; and single-stage detectors based on regression, such as the YOLO series, the SSD series, and RetinaNet. The basic architectures of the two families are shown in the figure below:
(a) Two-stage object detection algorithm
(b) Single-stage object detection algorithm
Two-stage object detection algorithms have developed rapidly and their accuracy keeps improving, but limitations inherent in their architecture constrain detection speed. The biggest difference between single-stage and two-stage detectors is that the former have no candidate-region proposal stage and a comparatively simple training process: the target category and the bounding box are produced directly in a single stage.
Considering the computing power available for on-site deployment and the required detection speed, Super Dimensional Technology adopts the single-stage object detection algorithm YOLOv5. The YOLO series has gone through five generations from v1 to v5, the latest release being YOLOv5 6.0. YOLOv5 can be regarded as a culmination of object detection enhancement techniques, including Mosaic data augmentation, adaptive anchor box calculation and adaptive image scaling at the input, Focus and CSP structures in the backbone, an FPN+PAN structure in the neck, and GIoU loss during training.
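To make the deployment concrete, below is a minimal sketch of running a single-stage YOLOv5 detector on an inspection photo through the public ultralytics/yolov5 torch.hub interface. The weights file name indicator_best.pt, the image file name, and the confidence threshold are illustrative assumptions, not Super Dimensional Technology's actual configuration.

```python
# Minimal sketch: running a single-stage YOLOv5 detector on one inspection photo.
# "indicator_best.pt" is a hypothetical weights file trained on cabinet indicator lights.
import torch

# Load a custom-trained YOLOv5 model through the public torch.hub entry point.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='indicator_best.pt')
model.conf = 0.25  # confidence threshold for reported detections

# Inference on one image; YOLOv5 accepts file paths, PIL images, or numpy arrays.
results = model('inspection_photo.jpg')

# Each row is [x1, y1, x2, y2, confidence, class_id] in pixel coordinates.
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    print(f"class={int(cls)} conf={conf:.2f} box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
```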
Building on YOLOv5, Super Dimensional Technology has improved the model with the following techniques to adapt it to the small indicator-light targets found in server rooms, effectively raising detection accuracy.
(1) Modify the neck of the model, replacing the original PANet with BiFPN to better extract fine detail from the image.
(2) Detect on image tiles: each image is divided into 9 sub-images with a 10-pixel overlap between adjacent tiles, so that a single target is not split in two at a seam. After all 9 tiles have been detected, non-maximum suppression removes the duplicate boxes from the overlap regions and yields the detection result for the whole image (a sketch of this tile-and-merge scheme follows the list).
(3) Two photos are used for each detection, one with a long exposure and one with a short exposure. Both photos are processed as in method (2), and a further round of non-maximum suppression then produces the final indicator-light detections.
(4) Replace the IoU loss of the original YOLOv5 with an α-IoU loss (its basic form is given after the list), which increases the loss weight of small targets and reduces the probability of missed detections due to region overlap.
(5) Fuse the feature map output by the fourth layer of the backbone with the feature maps of the pyramid feature-extraction network to form an additional P2 detection head, which raises the detection resolution of the model, strengthens the extraction of fine feature textures, and further reduces missed detections of small targets.
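The tile-and-merge scheme of steps (2) and (3) can be sketched as follows. Only the 3x3 tiling idea and the pixel overlap come from the text; the detect() callable, the torchvision NMS call, and all other details are assumptions used for illustration.

```python
# Sketch of tiled detection: split the image into a 3x3 grid with a small overlap,
# detect on every tile, shift tile-local boxes back to image coordinates, and fuse
# duplicates from the overlap bands with non-maximum suppression.
import torch
from torchvision.ops import batched_nms

def tiled_detect(image, detect, rows=3, cols=3, overlap=10, iou_thr=0.5):
    """image: HxWxC array; detect(tile) -> (boxes Nx4 xyxy, scores N, classes N)."""
    h, w = image.shape[:2]
    th, tw, pad = h // rows, w // cols, overlap // 2
    boxes, scores, classes = [], [], []
    for r in range(rows):
        for c in range(cols):
            # Tile bounds, grown by half the overlap on each inner edge so that
            # adjacent tiles share a 10-pixel band and a light on a seam stays whole.
            y0, x0 = max(r * th - pad, 0), max(c * tw - pad, 0)
            y1 = h if r == rows - 1 else (r + 1) * th + pad
            x1 = w if c == cols - 1 else (c + 1) * tw + pad
            b, s, k = detect(image[y0:y1, x0:x1])
            if len(b) == 0:
                continue
            b = torch.as_tensor(b, dtype=torch.float32)
            b[:, [0, 2]] += x0            # shift tile-local boxes back to image coordinates
            b[:, [1, 3]] += y0
            boxes.append(b)
            scores.append(torch.as_tensor(s, dtype=torch.float32))
            classes.append(torch.as_tensor(k, dtype=torch.int64))
    if not boxes:
        return torch.empty(0, 4), torch.empty(0), torch.empty(0, dtype=torch.int64)
    boxes, scores, classes = torch.cat(boxes), torch.cat(scores), torch.cat(classes)
    keep = batched_nms(boxes, scores, classes, iou_thr)  # drop duplicates from overlaps
    return boxes[keep], scores[keep], classes[keep]
```

For step (3), the same fusion is repeated across exposures: run tiled_detect on both the long- and short-exposure photos, concatenate the results, and apply non-maximum suppression once more.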
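Assuming step (4) refers to the α-IoU family of losses, its basic form simply raises the IoU term to a power α (the original paper typically uses α = 3):

$$\mathcal{L}_{\alpha\text{-IoU}} = 1 - \mathrm{IoU}^{\alpha}, \qquad \alpha > 0$$

The generalized variants apply the same power to the penalty terms of GIoU-, DIoU-, and CIoU-style losses.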
Detection results for the small indicator lights
After these improvements to YOLOv5, detection accuracy for small indicator-light targets exceeds 97%.
3 Object classification model
Image classification is a core task in computer vision: given an input image, it assigns a label from a known set of categories based on the features reflected in the image. In the inspection task, the image classification model is mainly used to classify the state of detected targets: the target screenshots produced by the object detection model are fed into the classification model, which determines the target's state, for example the color of an indicator light, the position of a switch, or the gear of a rotary switch. Classic deep learning image classification models include AlexNet, VGG, GoogLeNet, and ResNet; these models are relatively shallow and their generalization ability is comparatively weak.
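A minimal sketch of this detect-then-classify hand-off is shown below. The state label set, the TorchScript weights file state_classifier.pt, and the 224x224 preprocessing are hypothetical stand-ins, not the actual deployed classifier.

```python
# Sketch: crops from the detector are passed to a separate classifier that outputs
# the target state (e.g. indicator color).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
STATE_NAMES = ['green', 'red', 'yellow', 'off']             # hypothetical state labels

classifier = torch.jit.load('state_classifier.pt').eval()   # any TorchScript classifier

def classify_detections(image_path, boxes):
    """boxes: iterable of (x1, y1, x2, y2) from the detection model."""
    image = Image.open(image_path).convert('RGB')
    states = []
    with torch.no_grad():
        for x1, y1, x2, y2 in boxes:
            crop = image.crop((x1, y1, x2, y2))                 # screenshot of one target
            logits = classifier(preprocess(crop).unsqueeze(0))  # 1 x num_states
            states.append(STATE_NAMES[int(logits.argmax(dim=1))])
    return states
```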
To cope with on-site challenges such as viewpoint, lighting, scale, occlusion, deformation, background clutter, intra-class variation, motion blur, and the diversity of categories, Super Dimensional Technology uses the CSWin Transformer model for image classification. CSWin Transformer is a Transformer-based image classification model proposed in 2021. Compared with traditional convolutional neural networks, the Transformer learns more about the interrelationships between features, generalizes better, and does not rely solely on the data itself; it attends not only to local information but also has a local-to-global diffusion mechanism for finding more suitable feature representations.
CSWin Transformer adopts a cross-shaped window self-attention mechanism that computes attention weights in the horizontal and vertical directions simultaneously. In addition, it uses a locally-enhanced positional encoding, which has two advantages over previous positional encodings:
(1) It can adapt to input features of different sizes;
(2) It has a stronger local inductive bias.
Another reason for choosing CSWin Transformer is that, thanks to the cross-shaped self-attention mechanism, its parameter count and computational cost are smaller than those of other Transformer-based image classification models. The architecture of the model is shown below:
With this object classification model, the color misclassification rate stays within 1% under different lighting conditions and viewing angles, which meets the needs of on-site inspection.
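To make the cross-shaped window mechanism described above concrete, here is a simplified, self-contained sketch: the channels are split into two halves, one half attends within horizontal stripes and the other within vertical stripes, and the results are concatenated. It omits the multi-head projections and the locally-enhanced positional encoding of the real CSWin Transformer, so it illustrates the idea only.

```python
# Simplified sketch of cross-shaped window self-attention (not the official CSWin code).
import torch
import torch.nn.functional as F

def stripe_attention(x, stripe, axis):
    """x: (B, H, W, C). Self-attention within stripes of width `stripe` along `axis`."""
    B, H, W, C = x.shape
    if axis == 'horizontal':                      # stripes span the full image width
        x = x.reshape(B, H // stripe, stripe * W, C)
    else:                                         # vertical stripes span the full height
        x = x.permute(0, 2, 1, 3).reshape(B, W // stripe, stripe * H, C)
    attn = F.softmax(x @ x.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ x                                # (B, n_stripes, tokens, C)
    if axis == 'horizontal':
        return out.reshape(B, H, W, C)
    return out.reshape(B, W, H, C).permute(0, 2, 1, 3)

def cross_shaped_attention(x, stripe=2):
    half = x.shape[-1] // 2                       # split channels into two directional halves
    h = stripe_attention(x[..., :half], stripe, 'horizontal')
    v = stripe_attention(x[..., half:], stripe, 'vertical')
    return torch.cat([h, v], dim=-1)              # merge the two directions

x = torch.randn(1, 8, 8, 64)                      # toy feature map
print(cross_shaped_attention(x).shape)            # torch.Size([1, 8, 8, 64])
```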
4 Feature matching
Feature matching refers to identifying and aligning, at the pixel level, content or structures in two images that have the same or similar attributes. Generally, the images to be matched are taken of the same or similar scenes or targets, or are other kinds of image pairs sharing shape or semantic information, and therefore have a certain degree of matchability. In the inspection task, the main function of feature matching is to determine the ID of each small indicator light, and hence its function, by matching the lights in the inspection photo against the lights in a reference photo. When a server indicator light raises an alarm, the alarm information can then be obtained quickly from that light's ID.
The classic feature matching algorithms are SIFT and ORB. SIFT performs global feature-point detection over the whole image, which is time-consuming, so it runs slowly and its matching results are unsatisfactory; ORB runs fast, but its descriptors are not scale-invariant, so its matching performance is also unsatisfactory. We therefore use SuperPoint to extract image features and the SuperGlue feature matching algorithm, proposed in 2020, to match the server's small indicator lights.
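The ID-assignment idea can be sketched with classical tools as follows: match the inspection photo against the reference photo, estimate a homography from the matches, and project each labelled reference light into the inspection image. OpenCV's SIFT and RANSAC are used here only because they are readily available; the deployed system extracts features with SuperPoint and matches them with SuperGlue, but the ID transfer step is analogous. The ref_lights dictionary and function names are hypothetical.

```python
# Sketch: transfer indicator-light IDs from a labelled reference photo to an inspection photo.
import cv2
import numpy as np

def match_light_ids(ref_img, insp_img, ref_lights):
    """ref_lights: dict {light_id: (x, y)} center of each labelled light in the reference image."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(ref_img, None)
    k2, d2 = sift.detectAndCompute(insp_img, None)

    # Ratio-test matching between the two descriptor sets.
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(d1, d2, k=2) if m.distance < 0.75 * n.distance]

    # Robustly estimate the reference -> inspection homography.
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Project every labelled reference light into the inspection photo.
    ids = list(ref_lights.keys())
    pts = np.float32(list(ref_lights.values())).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    return dict(zip(ids, projected))   # light_id -> expected (x, y) in the inspection image
```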
The overall architecture of the model is shown in the following figure:
(a) SuperPoint extracts feature points from images
(b) SuperGlue matches based on the extracted feature points
(c) Matching rendering (reference photo on the left, inspection photo on the right)
The matching algorithm based on SuperPoint and SuperGlue achieved a matching accuracy of over 98% in field testing.