Yang, Jie; Zhu, Wenchao; Sun, Ting; Ren, Xiaojun; Liu, Fang

Introduction

Forests are an important part of the Earth's ecosystem: they help maintain the Earth's ecological balance and promote the development of human society and the economy [1]. However, the safety of forest areas has been threatened by the frequent occurrence of forest fires in recent years. These fires destroy millions of hectares of forest and devastate the ecological environment, causing huge economic losses and losses of animal and plant life every year [2]. According to incomplete statistics, China experiences on average more than 10,000 forest fires and more than 650,000 hectares of affected forest land each year, causing direct property losses of hundreds of millions of yuan. In 2020, severe wildfires in Australia burned at least 19 million hectares and thousands of buildings, and caused the deaths of 34 people and over a billion animals [3]. Forest fires are usually sudden, intense, destructive, and difficult to tackle. Once a fire has broken out, flames can spread freely through the forest area, making rapid suppression difficult. If forest fires can be detected effectively in real time and areas of fire smoke can be rapidly identified, firefighters can take measures promptly and control the spread of forest fires. Therefore, detecting forest fire and smoke quickly and accurately is critical for forest fire safety.

Commonly used methods for forest fire prevention include manual patrolling of forest areas and sensor detection. The manual method requires forest rangers to patrol the forest area, find the fire, and report it in a timely manner to the relevant departments, which then take fire-extinguishing measures. However, this method has several drawbacks, such as small patrol regions, high cost, and delayed firefighting, which can result in huge losses. Sensor-based methods for forest fire and smoke detection mainly employ contact fire detectors, such as fire detectors based on chemical sensors [4] and smoke detectors [5]. These methods have shortcomings for large-scale forest fire detection: the detection effect is easily influenced by sensor angles, sensor distances, tree shading, signal transmission, and uncertainties in the surrounding environment. The aforementioned smoke and fire detection techniques are therefore not very effective in providing early warning of forest fires.

With the rapid development of computer vision technology, a large number of video surveillance systems have been installed and employed in forest fire early warning systems. Compared with traditional fire detection technology, video-based fire detection built on computer vision has the advantages of contactless operation, a wide detection range, low maintenance cost, fast response time, and good detection performance. The primary goal of this approach is to extract the visual features of smoke and fire, such as color [6], texture [7], motion [8], background contrast [9], and combinations of different visual features [10]. However, fire and smoke detection techniques based on traditional machine learning suffer from many problems, such as the complex backgrounds of forest fire images, inconspicuous pixel features, weak generalization of image recognition, and low detection accuracy.
Deep learning techniques for target detection in computer vision have made rapid progress since 2012: computing power has increased, Graphics Processing Units (GPUs) in particular have developed rapidly, and excellent public datasets have been created and released, bringing significant gains in accuracy and efficiency at lower computational cost. As a result, deep learning is widely used in the field of smoke and fire detection. These algorithms can essentially be classified into two categories. The first is the two-stage algorithm, whose model structure is divided into two stages: ROI (region of interest) candidate regions are first generated, and then classification and positioning are performed within those regions. Classical algorithms of this type include R-CNN [11] (regions with CNN features), Fast R-CNN [12], and Faster R-CNN [13]. The other is the one-stage algorithm, which directly predicts the target's category and position with a regression-based detection network; classical examples are the YOLO (you only look once) series [14] and the SSD (single-shot multibox detector) algorithm [15]. In addition, researchers have proposed improved CNN-based methods for smoke and fire detection [16-18]. Recently, researchers have started to use Transformer [19] backbone networks to improve neural networks, typical examples being ViT (vision transformer) [20], Swin [21], and PVT (pyramid vision transformer) [22]. In this approach, pretraining weights from a large-scale image classification database are used as the initial weights of the detector backbone, and target detection is then performed with Transformer codec-based feature fusion [23, 24]. Transformer models perform well, but their computation cost is expensive and they are difficult to train.

The smoke and fire detection technologies based on the deep learning methods above have achieved good results, but the higher the performance of the detection algorithm, the more convolutional layers the network has and the larger the model becomes. In practice, this results in a model with a large number of weight parameters and low detection efficiency, which is unfavorable for deployment on resource-constrained devices. Therefore, lightweight networks are needed for smoke and flame detection. To address oversized models and low detection efficiency, researchers usually apply network pruning [25], knowledge distillation [26], network parameter quantization [27], and lightweight ConvNet design. For example, [28] designed a lightweight fire detection network, FireNet, which can run smoothly on low-cost embedded platforms such as the Raspberry Pi. [29] proposed a convolutional neural network based on YOLOv2 for real-time fire and smoke detection in a fire monitoring system, deployed on a low-cost embedded device (Jetson Nano) for smoke and fire video monitoring. Furthermore, [30] proposed a deep learning fire recognition algorithm based on model compression and the lightweight network MobileNetV3, which adopted knowledge distillation to improve the accuracy of the pruned model and effectively decreased computational costs in embedded intelligent forest smoke and fire monitoring systems.
Although the lightweight algorithms in the works above alleviate the problems of inadequate performance and limited applicability in forest smoke and fire monitoring systems, the following problems remain: (1) methods that achieve high detection accuracy in certain cases carry a large computational load, a large model, and slow inference; (2) some lightweight models reduce the number of parameters but are unable to balance accuracy and speed. YOLOv5 is one of the more mature detection algorithms in the YOLO family and is widely employed in many target identification tasks, offering fast speed, high accuracy, and small weights. YOLOv5s provides both fast detection speed and high detection accuracy, meeting the requirements of forest fire detection. [31] employed the YOLOv5 detection algorithm for flame detection and optimized the algorithm's network structure, achieving effective detection outcomes; however, the large model size and complexity hindered deployment. To address this challenge, this study introduces a lightweight network model based on YOLOv5s for forest smoke and fire detection, with the following main contributions: The lightweight C3Ghost and Ghost modules are introduced into the Backbone and Neck networks, so the model is compressed while the accuracy and speed of detection are maintained. The PAN structure is improved by adding learnable weight parameters to feature fusion, which effectively improves network performance. The CA attention mechanism is introduced into the Backbone network to highlight the key information of smoke and fire while suppressing invalid background information, thereby improving the detection accuracy of the algorithm. The remainder of this article is organized as follows: the second section introduces the YOLOv5 algorithm and the improved YOLOv5 algorithm; the third section reports experiments with the model before and after improvement; the fourth section discusses the experimental results; the fifth section summarizes the work.

Method

YOLOv5s method

The network structure of YOLOv5s is shown in Fig 1. Its structure is mainly divided into four parts: image data input, image feature extraction, feature fusion, and image target detection. In the image input stage, the dataset is extended by a series of image enhancement methods such as mosaic augmentation and adjustment of image brightness. In the image feature extraction stage, the C3 module and the convolution module perform feature extraction, and the SPPF module further reduces the dimensionality of the convolutionally extracted information, which allows higher-order features to be extracted and enhances feature stability. In the feature fusion stage, the PAN (path aggregation network) structure fuses features from different layers. Finally, detectors at three scales are output to detect targets of different sizes. The prediction boxes are filtered by NMS (non-maximum suppression), and the prediction box with the highest confidence is retained as the final detection result (a minimal sketch of this filtering step is given below, after Fig 1).

Fig 1. YOLOv5s network structure diagram. https://doi.org/10.1371/journal.pone.0291359.g001
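To make the last post-processing step concrete, the following is a minimal sketch of greedy IoU-based NMS, assuming PyTorch; the (x1, y1, x2, y2) box layout and the IoU threshold are illustrative choices and are not tied to the exact YOLOv5s post-processing code.

```python
import torch

def nms_sketch(boxes, scores, iou_thr=0.45):
    """Greedy non-maximum suppression (illustrative sketch).

    boxes:  (N, 4) tensor of corner coordinates (x1, y1, x2, y2)
    scores: (N,)   tensor of confidence scores
    Returns the indices of the boxes that are kept.
    """
    order = scores.argsort(descending=True)   # highest confidence first
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())                 # keep the most confident box
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Discard boxes that overlap the kept box too much; keep the rest
        order = rest[iou <= iou_thr]
    return keep
```

In practice the detector also discards low-confidence predictions before this step and applies the suppression per class, but the greedy IoU filtering above is the core of how redundant prediction boxes are removed.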
Improvement of YOLOv5s

In order to make the algorithm model run smoothly on resource-constrained embedded devices and improve the speed and accuracy of smoke and fire detection, this paper proposes a lightweight fire and smoke detection network based on YOLOv5s. Fig 2 shows the improved YOLOv5s structure. The inputs of each module are derived from the outputs of the previous module, and the sizes of the feature maps at each module's inputs and outputs are unchanged from the unimproved YOLOv5s. In the Backbone and Neck networks, in order to reduce model size and FLOPs, ordinary convolution is replaced by the Ghost module [32], which is also embedded in the C3 module to form the C3Ghost module. The CoordAttention [33] (CA) module is embedded after the four C3Ghost modules in the Backbone to enhance the feature extraction ability of the network by highlighting the key information for flame detection. In the Neck network, in order to distinguish the importance of different features during feature fusion, weight parameters are added to the PAN [34] structure so that they can be learned by the neural network.

Fig 2. The structure of the optimized lightweight YOLOv5s model proposed in this paper (added Ghost (①), CA (②), and PAN-weight (③) to the original model). https://doi.org/10.1371/journal.pone.0291359.g002

Ghost.

Deploying neural network algorithms on mobile devices is a long-standing problem, mainly because the computing ability of mobile devices cannot match that of high-performance GPU devices, which makes it difficult for existing models to achieve the desired performance on mobile devices. The Ghost module, as a new basic unit of neural networks, can produce more feature maps with a lower calculation amount and fewer parameters. Its structure is shown in Fig 3.

Fig 3. Ordinary convolution and Ghost module. https://doi.org/10.1371/journal.pone.0291359.g003

Fig 3 compares the ordinary convolution method and the Ghost module. A convolutional neural network usually contains many convolution operations, which leads to a large calculation volume; at the same time, the output feature maps usually contain redundant features, such as highly similar feature maps. Suppose the size of the input feature map is h × w × c and the size of the output feature map is h' × w' × n, where h and w are the height and width of the input feature map, h' and w' are the height and width of the output feature map, and the convolution kernel size is k × k. With ordinary convolution, the number of FLOPs can be expressed as n × h' × w' × c × k × k. With the Ghost module, a small number of feature maps is first generated by ordinary convolution; these feature maps then undergo a series of cheap transformation operations, and finally the required number of feature maps is obtained. In the cheap transformation operation, we assume that the channel of the feature map is m (m<
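As an illustration of the cost comparison above (not the authors' exact implementation), the following is a minimal PyTorch sketch of the Ghost idea, assuming a ratio s = 2 and a 3 × 3 depthwise convolution as the cheap operation; the class and parameter names are illustrative. Following the GhostNet formulation [32], the primary convolution produces only m = n / s intrinsic feature maps at a cost of roughly (n / s) × h' × w' × c × k × k FLOPs, and the cheap depthwise operation adds the remaining (s − 1) × (n / s) maps at a much smaller cost, so the module is roughly s times cheaper than ordinary convolution for the same n output channels.

```python
import torch
import torch.nn as nn

class GhostModuleSketch(nn.Module):
    """Minimal sketch of the Ghost idea: produce n output feature maps
    from c input channels with a primary convolution plus a cheap
    depthwise transformation (assumed ratio s, illustrative only)."""

    def __init__(self, c, n, k=1, cheap_k=3, s=2):
        super().__init__()
        m = n // s  # intrinsic feature maps from the primary convolution
        self.primary = nn.Sequential(
            nn.Conv2d(c, m, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(m),
            nn.SiLU(),
        )
        # Depthwise convolution as the "cheap" operation: each intrinsic
        # map is transformed independently, so the cost no longer scales
        # with the number of input channels c.
        self.cheap = nn.Sequential(
            nn.Conv2d(m, n - m, cheap_k, padding=cheap_k // 2,
                      groups=m, bias=False),
            nn.BatchNorm2d(n - m),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)                      # m intrinsic feature maps
        return torch.cat([y, self.cheap(y)], 1)  # n feature maps in total


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)           # c = 64, h = w = 80
    out = GhostModuleSketch(64, 128)(x)      # n = 128 output channels
    print(out.shape)                         # torch.Size([1, 128, 80, 80])
```

Replacing ordinary convolutions in the Backbone and Neck with modules of this form, and embedding them in the C3 block to obtain C3Ghost, is what allows the improved network to keep the same feature-map sizes while cutting parameters and FLOPs.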