1. Introduction
Maize is an important food crop and one of the most widely planted food crops in the world, and a healthy maize industry plays an essential role in ensuring global food security. However, with changes in climate and environment, the stress of diseases and insect pests causes irreversible losses to maize and other crops. Leaf diseases caused by various pathogens greatly restrict the photosynthesis of maize leaves and the transport of nutrients, seriously affecting maize yield and quality [1]. Only targeted spraying of pesticides can control the spread of disease and minimize losses, so it is essential for disease control to detect and identify the disease type in time and to select suitable pesticides for precise treatment. The traditional approach of relying on plant pathologists to identify disease types on site is time-consuming, labor-intensive, inefficient, and prone to subjective error, especially in the field environment, which greatly increases labor costs [2]. In recent years, machine vision combined with image processing technology has steadily overcome the shortcomings of manual recognition, such as frequent misjudgment and dependence on expert experience and manpower [3, 4]. However, these studies are usually based on the color, texture, and spatial structure of the image [5, 6]. Fixed thresholds cannot handle all of the complex-background images obtained under natural conditions, so the achievable recognition accuracy is limited, and problems such as poor adaptability and weak anti-interference ability severely restrict practical application. At present, more and more researchers are working in the field of deep learning. Compared with traditional recognition methods, the emergence of the convolutional neural network (CNN) has effectively improved recognition efficiency and accuracy, clearly outperforming traditional machine vision. Since LeNet [7] was proposed in 1998, convolutional neural networks have continuously evolved through models such as AlexNet [8] in 2012, GoogLeNet [9] in 2014, and ResNet [10] in 2015. Many novel CNN models have also been proposed for plant classification. For example, Muhammad Rizwan Latif [11] proposed a convolutional neural network using a deep learning architecture, serial feature fusion, and optimal feature selection. Nazar Hussain et al. [12] proposed a new deep-learning-based framework for plant leaf disease identification that includes feature fusion and selection of the best features. Network models have become deeper and more complex, and problems such as vanishing and exploding gradients during backpropagation have been addressed. In practice, however, original images of maize leaf diseases suffer from noise and background interference, resulting in low classification accuracy. To address these issues, we propose a WG-MARNet-based maize disease identification method capable of identifying maize leaf pests and diseases. Our two main contributions are as follows. First, we use data augmentation to improve data quality, diversify data features, and expand the size of our dataset to achieve better results. Second, we propose WG-MARNet, with the following design, for the classification of maize leaf diseases.
To minimize image noise at the input side and to construct a high- and low-frequency multi-channel network structure, a wavelet-threshold-guided bilateral filtering (WT-GBF) algorithm is integrated into the network structure based on the characteristics of maize diseases. Average down-sampling and tiling operations are used to improve the multi-scale feature fusion technique, and the improved fusion enhances the expressiveness of target features. An attenuation factor is added to the high- and low-frequency channels to increase the stability of the network parameters during learning.
For readability, Table 1 lists the abbreviations that appear frequently in this paper.
Table 1. Abbreviations appearing in this article. https://doi.org/10.1371/journal.pone.0267650.t001
2. Related work
In recent years, with the application of computer vision technology in agriculture, substantial progress has been made in image segmentation for crop and disease localization. In 2010, Hengqiang Su et al. [13] extracted maize leaf lesion areas by super-green segmentation and Otsu threshold segmentation combined with morphological operations, then extracted color and texture feature parameters of the different lesion areas, and finally classified the experimental data with an SVM. In 2012, to better identify common types of maize diseases, Baiyi Zhang [14] segmented the lesion areas by preprocessing and an improved level-set algorithm and achieved good recognition accuracy. In 2013, to identify maize varieties, Donglai Ma et al. [15] first used the Otsu algorithm to segment the maize images, extracted six characteristic parameters, and then applied the K-means clustering algorithm to identify the varieties. In 2014, Shanwen Zhang et al. [16] proposed a local discriminant projection (LDP) algorithm to identify maize diseases: based on image segmentation and LDP, the dimensionality of the disease images is reduced, and a database is then built to identify the disease images with high accuracy. In 2016, Liangfeng Xu et al. [17] proposed an adaptive multi-classifier method for identifying maize leaf diseases, combined with cluster analysis to obtain adaptive weights; the proposed method improves the accuracy of maize leaf disease identification. However, because traditional methods such as support vector machines lack robustness, they perform poorly in complex field environments. Researchers have also contributed in terms of data. To help farmers with the early detection of plant diseases and to provide data for other scholars, Hafiz Tayyab Rauf et al. [18] built a citrus fruit dataset covering the classes Healthy, Blackspot, Canker, Scab, Greening, and Melanose. With the development of deep learning for target detection and image processing, convolutional neural networks (CNNs) have been widely used in image recognition and classification [19, 20]. In research on identifying plant diseases and insect pests, CNNs have been shown to outperform traditional machine learning methods. Brahimi et al.
[19] used 15,000 tomato disease images to classify and recognize 9 diseases in the dataset based on the AlexNet model and obtained good recognition results. In 2017, Mansheng Long et al. [21] applied transfer learning in the convolutional network training process, constructed an AlexNet model based on TensorFlow, and classified algae spot, yellow disease, coal pollution disease, and soft rot disease of Camellia oleifera with 96.53% accuracy. In 2018, Rehman M Z U [22] proposed a new technique for apple and grape disease detection and classification based on new adaptive thresholding and optimized weighted segmentation fusion; the method is highly efficient in terms of accuracy, sensitivity, precision, and F1 score. In 2021, Muhammad Zia Ur Rehman [23] proposed a deep-learning-based classification method for citrus diseases that achieved 95.7% classification accuracy. Also in 2021, Jaweria Kianat [24] proposed a deep-learning framework for cucumber disease classification based on feature fusion and selection techniques, obtaining 93.50% accuracy on the selected dataset. Ahmad Almadhor et al. [25] trained advanced classifiers for image-level and disease-level classification using a high-resolution guava leaf and fruit dataset and obtained an overall classification accuracy of 99%. Almetwally M. Mostafa et al. [26] proposed an AI-driven framework for recognizing guava plant diseases through machine learning: after pre-processing and enhancement, the data were augmented over nine angles using the affine transformation method, and the augmented data were used to train five DL networks whose final layers were modified; applied to the different networks, the approach obtained good results. In 2022, Zia ur Rehman [27] proposed a new method for real-time apple leaf disease detection and classification using Mask R-CNN and deep-learning-based feature selection, achieving a best accuracy of 96.6% with Integrated Subspace Discriminant Analysis (ISDA) classification. Although the above research has produced positive results, most previous work on identifying disease types with deep learning and CNNs was carried out in the laboratory or under controlled conditions. Image sets collected in the field tend to be small, which limits the generalization of the model, while large public datasets often have overly simple image backgrounds and are seriously unrepresentative. In practical applications, this lack of representativeness reduces the model's ability to extract disease-region features against complex backgrounds; with real maize images, which suffer from noise, unclear features, and background interference, recognition accuracy and speed drop sharply. To address these problems, we propose WG-MARNet for the classification of maize leaf diseases. The method suppresses noise in maize images, enhances the salient characteristics of the maize lesions, and achieves high-precision recognition of maize disease images.
The improvements of WG-MARNet are as follows: Based on the principle that maize lesions show large feature differences between high- and low-frequency images, wavelet-threshold-guided bilateral filtering is used for high- and low-frequency decomposition, and a high- and low-frequency multi-channel network structure is established to improve feature extraction. The multi-scale feature fusion method is improved by using average down-sampling and tiling operations, which not only enhances the expressiveness of target features but also limits the growth in the number of features and reduces the risk of overfitting. Attenuation factors are introduced on the high- and low-frequency multi-channels to mitigate the unstable performance encountered when training deep networks. Based on comparative experiments on convergence and accuracy, we use PReLU and AdaBound instead of the ReLU activation function and the Adam optimizer. The flow chart of data enhancement and the WG-MARNet framework is shown in Fig 1.
Fig 1. Flowchart of data enhancement and WG-MARNet framework. https://doi.org/10.1371/journal.pone.0267650.g001
3. Materials and methods
3.1 Data acquisition and preprocessing
The dataset used in the experiments was derived from dataset websites and field collection. The websites include the China Science Data Network (http://www.csdata.org/) and Digipathos, from which 150 images of 9 common fungal maize diseases were carefully selected. The other part of our dataset was collected in cooperation with the Hunan Academy of Agricultural Sciences, China. We used a Sony ILCE-7M2 camera to capture optical images of the different diseases from multiple angles in the morning, at noon, and in the evening, under both sunny and cloudy weather conditions. Such photos reflect the many complex conditions under which maize grows in the field and ensure that the collected images are representative. In total, 1000 images were collected at a resolution of 3600×2700 pixels, including 458 samples with uniform illumination under sunny conditions, 263 samples with uneven illumination, and 279 samples under cloudy conditions. Together with the website images, 1150 images of the 9 diseases were finally obtained. The collated dataset is available at https://github.com/FXD96/Corn-Diseases. To improve data quality, increase the diversity of data features, and reduce the hardware demands that complex backgrounds place on the convolutional network, data augmentation is performed on the collected disease images: multi-angle flipping, brightness adjustment, saturation adjustment, and added Gaussian noise expand the dataset to 8 times its original size (a minimal sketch of such a pipeline is given below). Each transformed image is uniformly resized to 224×224×3 (height×width×color channels). The original and augmented sample size distributions are shown in Table 2.
Table 2. Profile of sample images for nine types of diseases. https://doi.org/10.1371/journal.pone.0267650.t002
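As a rough illustration of the augmentation step described above, the following Python sketch builds a comparable pipeline with torchvision. The specific parameter values (flip probabilities, jitter ranges, noise standard deviation) and the AddGaussianNoise helper are our own assumptions, since the paper does not report them.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a tensor image in [0, 1] (assumed std)."""
    def __init__(self, std=0.02):
        self.std = std

    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

# Illustrative augmentation pipeline; applied repeatedly, it expands the dataset.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                  # multi-angle flipping
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, saturation=0.3),  # brightness/saturation adjustment
    transforms.Resize((224, 224)),                           # uniform 224x224x3 input size
    transforms.ToTensor(),
    AddGaussianNoise(std=0.02),                              # Gaussian noise
])
```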
3.2 WG-MARNet
To improve the recognition accuracy of maize diseases and to solve the problem of low accuracy caused by noise and unclear features in original maize images obtained from complex environments, this paper designs the WG-MARNet model. First, the maize leaf disease image dataset is used as the input of the model. The WT-GBF processing layer eliminates image noise and decomposes the input image into high- and low-frequency components, which improves resistance to environmental interference and exploits the distinct appearance of maize disease spots in the high- and low-frequency images. Second, a high- and low-frequency multi-channel multi-scale fusion network structure (MARNet) with attenuation factors is established, which improves the feature extraction ability of the model while enhancing the robustness of the deep network. Finally, PReLU and AdaBound are selected as the activation function and optimizer of WG-MARNet. The structure of WG-MARNet is shown in Fig 2.
Fig 2. Structure diagram of the WG-MARNet. https://doi.org/10.1371/journal.pone.0267650.g002
The network input is a three-channel 224×224 image. WT-GBF decomposes the image into a low-frequency image and a high-frequency image; the high-frequency image is then filtered again, weakening the background information and yielding a higher-frequency image in which the lesion information is enhanced. The low-frequency image and the higher-frequency image are each fed into the network through their own channel. The following analysis takes the low-frequency channel as an example. The first layer of the network is a convolution layer with 64 channels. A batch normalization layer then re-normalizes the distribution of the feature maps, and a nonlinear activation layer (PReLU) introduces nonlinearity into this layer of the network. A nonlinear activation function must be added after every two feature extraction layers; otherwise, multiple feature extraction layers could be represented by a single one, introducing no additional feature extraction ability while wasting computing resources. The residual structure is not used directly from the first convolution layer because a directly connected shortcut should carry feature maps rather than the original input image. The subsequent feature extraction network is divided into 16 residual blocks in 4 groups, conv2_x to conv5_x. Each residual group conv-x contains the corresponding number of residual blocks; Table 3 lists the number, size, and stride of the convolution kernels in each residual block.
Table 3. Parameter configuration of residual blocks in the conv1-conv5 groups. https://doi.org/10.1371/journal.pone.0267650.t003
The block diagram of a residual block in the conv2_x residual group is shown in Fig 3.
Fig 3. Conv2_1 parameter operation diagram. https://doi.org/10.1371/journal.pone.0267650.g003
Fig 3 shows the structure of the first residual block, Conv2_1, in the Conv2_x residual group. After the Conv1_x convolution, the input is a 64-channel 56×56 feature map; after feature extraction by three convolution kernels, the output becomes a 64-channel 28×28 feature map. In the figure, "3" indicates that the convolution kernel size is 3×3, "2" indicates that the stride of the convolution layer is 2, and "64" indicates the number of output channels. The stride determines whether the size of the output feature map changes: when the stride is 2, the output feature map is half the size of the input. The shortcut branch therefore applies its own convolution operation so that its output matches the main branch in size and number of channels, allowing the element-wise add to be performed. In addition, dropout is added to the fully connected layer to further prevent overfitting. A minimal sketch of such a residual block follows.
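The PyTorch module below sketches a Conv2_1-style residual block as just described: a three-convolution main branch with stride 2, a strided shortcut convolution, and an element-wise add. The exact kernel configuration is given in Table 3, which is not reproduced here, so the layer choices below are assumptions that merely reproduce the quoted 64-channel 56×56 to 28×28 behavior.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv2_1-style block: three convolutions on the main branch, a strided
    1x1 convolution on the shortcut, then element-wise add (assumed layout)."""
    def __init__(self, in_ch=64, out_ch=64, stride=2):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Shortcut convolution matches the main branch's spatial size and channels
        # so that the element-wise add is valid.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))

# 64-channel 56x56 input -> 64-channel 28x28 output, as described for Conv2_1.
x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # torch.Size([1, 64, 28, 28])
```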
After average pooling, the output feature maps of conv2_x, conv3_x, conv4_x, and conv5_x have sizes [56, 56, 256], [28, 28, 512], [14, 14, 1024], and [7, 7, 2048], respectively. If the conv2_x, conv3_x, and conv5_x feature vectors are selected for multi-feature fusion, the fused features are spliced into a 2816-dimensional feature vector (256 + 512 + 2048), and the high- and low-frequency feature vectors are then fused to generate a 5632-dimensional feature vector for the subsequent classification network.
3.2.1 Denoising and high and low frequency decomposition of maize leaf disease images.
Generally speaking, owing to the image acquisition equipment and shooting conditions, there is always some noise in maize images. The goal of WT-GBF processing is to reduce the impact of this noise (small-scale texture details, outliers, spots, etc.) and to highlight useful information (object edges, foreground and background boundaries, etc.), thereby improving the accuracy of subsequent network recognition. At the same time, owing to the characteristics of maize diseases (large spot disease, small spot disease, rust, etc.), the leaf disease information appears mainly in the high-frequency part of the image, and only a small portion of the disease information remains in the low-frequency background image. WT-GBF decomposes the image into high- and low-frequency components so that the subsequent network can process them in separate channels, improving feature extraction.
WT-GBF is chosen for the high- and low-frequency decomposition because, compared with other commonly used filters, it retains the detailed texture of the low-frequency background image well and preserves boundaries. The two kernels in bilateral filtering are the spatial-domain kernel and the range kernel, and it is precisely these two kernels that give bilateral filtering its edge- and detail-preserving behavior; wavelet threshold denoising, in turn, is highly effective. Therefore, the maize disease image is first denoised with a hard wavelet threshold to obtain a smoother image, and this smoothed image is then used as the guide image when computing the kernel functions of the bilateral filter. Fig 4 compares several common filters: the low-frequency image obtained with wavelet-threshold-guided bilateral filtering retains detail better.
Fig 4. High and low frequency decomposition effect picture. (a) Low-frequency patch images. (b) Higher-frequency patch images. https://doi.org/10.1371/journal.pone.0267650.g004
After the maize lesion image is processed by WT-GBF, the low-frequency background image IL and the high-frequency lesion image IH are obtained. To obtain the higher-frequency image IHH, the high-frequency image IH is passed through the filter again, which can be expressed as
I_HH = I_H - F_WT-GBF(I_H), (1)
where F_WT-GBF(·) denotes the low-frequency output of the filter. The high- and low-frequency decompositions of the maize disease images are shown in Fig 4.
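The following Python sketch shows one plausible realization of the WT-GBF decomposition under stated assumptions: hard wavelet thresholding with the standard universal threshold (the paper does not specify the wavelet basis or threshold rule) produces the guide image, and OpenCV's joint bilateral filter (from the opencv-contrib ximgproc module) plays the role of the guided bilateral filter; residuals then give IH and IHH in the sense of Eq (1).

```python
import cv2              # requires opencv-contrib-python for cv2.ximgproc
import numpy as np
import pywt

def wt_gbf_decompose(gray, wavelet="db4", level=2, d=9, sigma_color=25.0, sigma_space=9.0):
    """Decompose a grayscale image into low-, high-, and higher-frequency parts.
    All parameter values are illustrative assumptions."""
    img = gray.astype(np.float32)

    # Step 1: hard wavelet-threshold denoising -> smooth guide image.
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745     # noise estimate (finest cD)
    thr = sigma * np.sqrt(2.0 * np.log(img.size))          # universal threshold
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(c, thr, mode="hard") for c in lvl) for lvl in coeffs[1:]
    ]
    guide = pywt.waverec2(denoised, wavelet)[: img.shape[0], : img.shape[1]]
    guide = guide.astype(np.float32)

    # Step 2: guide-driven bilateral filtering -> low-frequency background I_L.
    low = cv2.ximgproc.jointBilateralFilter(guide, img, d, sigma_color, sigma_space)

    # Step 3: high-frequency lesion image I_H as the residual.
    high = img - low

    # Step 4: filter I_H once more; the residual is the higher-frequency I_HH (Eq 1).
    higher = high - cv2.ximgproc.jointBilateralFilter(guide, high, d, sigma_color, sigma_space)
    return low, high, higher
```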
3.2.2 Multiscale feature fusion.
As a deep neural network is trained, the content of the features extracted by the network varies greatly across feature levels [28], and each level of information has its own characteristics. For image tasks, the features extracted by the shallow layers generally contain rich detail about the image content, but because the feature level is low they also carry a great deal of redundant information, and using them directly for classification is often unsatisfactory for lack of high-level semantic information. The features extracted by the deep layers contain more semantic information but, compared with shallow features, retain little detail of the image content, so the localization information is not accurate enough; because the extracted information is highly refined and abstract, some information is lost and its completeness cannot be guaranteed. Mid-level features lie between the shallow and deep features: they contain a certain amount of detail about the image content as well as high-level semantic information, and their information content is relatively complete.
In the feature fusion method shown in Fig 5, we first apply average down-sampling to the extracted shallow-, middle-, and high-level features to reduce the feature map size and the number of features. We then tile (flatten) the features and concatenate the tiled features of each level into a fused feature vector, which serves as the final feature vector from which the trained classifier predicts the result. This method eliminates the operations required for progressive fusion and reduces the number of features; the three levels of features are fused directly for classification. Compared with existing popular methods, the fusion method proposed in this paper is better suited to this research task. A minimal sketch of the fusion is given below.
Fig 5. Multi-scale feature fusion method based on task design. https://doi.org/10.1371/journal.pone.0267650.g005
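A minimal PyTorch sketch of the fusion described above: each level's feature map is average-down-sampled, tiled into a vector, and concatenated, and the high- and low-frequency vectors are then concatenated in turn. The paper specifies only the kernel-4, stride-4 average pooling (Section 3.2.3); the extra global average pooling used here to land exactly on the quoted 2816- and 5632-dimensional vectors is our assumption.

```python
import torch
import torch.nn as nn

def fuse_multiscale(shallow, middle, deep):
    """Average-down-sample each level, tile (flatten), and concatenate."""
    pool = nn.AvgPool2d(kernel_size=4, stride=4)  # pooling named in Sec 3.2.3
    gap = nn.AdaptiveAvgPool2d(1)                 # assumption: reduce to 1x1 so the
                                                  # fused vector is 256+512+2048 = 2816-d
    flat = nn.Flatten()
    return torch.cat([flat(gap(pool(f))) for f in (shallow, middle, deep)], dim=1)

# Feature map sizes quoted in the text for conv2_x, conv3_x, and conv5_x.
low  = [torch.randn(1, 256, 56, 56), torch.randn(1, 512, 28, 28), torch.randn(1, 2048, 7, 7)]
high = [torch.randn(1, 256, 56, 56), torch.randn(1, 512, 28, 28), torch.randn(1, 2048, 7, 7)]

low_vec, high_vec = fuse_multiscale(*low), fuse_multiscale(*high)  # each 2816-d
fused = torch.cat([low_vec, high_vec], dim=1)                      # 5632-d, to the FC classifier
print(fused.shape)  # torch.Size([1, 5632])
```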
3.2.3 Multi-channel feature fusion.
The feature fusion layer of a multi-channel convolutional neural network can fuse different kinds of feature information, making the fused features more discriminative and more expressive for images. After the input image passes through the convolution and pooling of each channel of the multi-channel network, the output feature maps of the high-frequency channel and the low-frequency channel are obtained, respectively. Before feature fusion, average pooling with a kernel size of 4 and a stride of 4 is applied to the feature maps to reduce the dimensionality of the extracted features. The two-dimensional feature maps are then flattened into one-dimensional feature vectors, and the data are batch-normalized so that the data distribution becomes more dispersed and closer to that of the test set, which reduces model overfitting. Finally, the processed one-dimensional data are fed into the fully connected layer for feature information fusion. The multi-channel feature fusion process is shown in Fig 6.
Fig 6. Multi-channel feature fusion process diagram. https://doi.org/10.1371/journal.pone.0267650.g006
3.2.4 Attenuation factor.
To increase the stability of the learning process of the neural network parameters, a convolution attenuation factor is introduced into the convolution channel, imposing a sparsity restriction on the output feature map of each convolution module [29], and the data flow on each channel of WG-MARNet is controlled and managed. The structure block scheme shown in Fig 7 is adopted in this paper.
Fig 7. Attenuation factor introduction graph. https://doi.org/10.1371/journal.pone.0267650.g007
The nonlinear mapping function for a single structure is given in Eq (2). In the expression, Σ denotes summing over the neurons in the corresponding output feature map of each channel, and λ1, λ2, and λ3 are convolution attenuation factors with 0 < λ3 < λ2 < λ1 ≤ 1. All neurons on the output feature map of a convolution module share one convolution attenuation factor, and different convolution modules use attenuation factors of different sizes. Eq (2) shows that the output of WG-MARNet is determined by the output data of each channel. Based on the contribution of each channel to the network output, the concept of the network output contribution ratio is introduced to quantify how much each channel, weighted by its attenuation factor, contributes to the network output. The shortcut channel has no attenuation factor, so its factor is equivalent to 1, and the contribution ratio of the shortcut channel is M = 1/(1 + λ1 + λ2 + λ3). The attenuation factors λ1, λ2, and λ3 of the three convolution modules on the convolution channel are manually set network hyperparameters, so their network output contribution ratios are
M_ni = λi / (1 + λ1 + λ2 + λ3), i = 1, 2, 3. (3)
Because 0 < λ3 < λ2 < λ1 ≤ 1, it follows that M_n3 < M_n2 < M_n1 ≤ M. A small numerical check is given below.
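To make the contribution-ratio argument concrete, the short Python check below evaluates Eq (3) for illustrative (assumed) attenuation factors and verifies the ordering M_n3 < M_n2 < M_n1 ≤ M.

```python
# Numerical check of the contribution ratios (Eq 3) for illustrative
# attenuation factors satisfying 0 < l3 < l2 < l1 <= 1 (values assumed).
l1, l2, l3 = 0.9, 0.6, 0.3
total = 1.0 + l1 + l2 + l3                  # shortcut contributes with factor 1
M = 1.0 / total                             # shortcut contribution ratio
Mn = [l / total for l in (l1, l2, l3)]      # conv-module contribution ratios
print(M, Mn)                                # 0.357..., [0.321..., 0.214..., 0.107...]
assert Mn[2] < Mn[1] < Mn[0] <= M           # Mn3 < Mn2 < Mn1 <= M
```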