Introduction

Lung cancer is divided into non-small cell lung cancer (NSCLC) and small cell lung cancer. NSCLC accounts for approximately 85% of newly diagnosed lung cancers each year [1]. The emergence of targeted therapy has substantially increased the survival rate of NSCLC patients. Before targeted therapy can be administered, it must be determined whether key disease-causing genes are mutated. KRAS is a common causative gene in NSCLC, and approximately one-third of NSCLC patients carry KRAS mutations. The usual diagnostic tool is needle biopsy. However, this invasive method has many limitations: it is not suitable for all body types, and it carries unpredictable consequences such as an increased risk of cancer metastasis [2]. There is therefore an urgent need for a non-invasive diagnostic method that can accurately predict KRAS mutations in lung cancer patients. Such a method would not only improve treatment outcomes but also guide prognosis.

In recent years, researchers have used CT images to predict gene mutations with traditional radiomics and machine learning. Song et al. [3] proposed a machine-learning model for predicting EGFR and KRAS mutation status, extracting statistical, shape, pathological, and deep learning features from 144 CT scans of tumor regions. Shiri et al. [4] used minimum redundancy maximum relevance feature selection and a random forest classifier to build a multivariate model that analyzed radiomic features extracted from tumor images and successfully predicted EGFR and KRAS mutation status in cancer patients. These radiomics and machine learning methods have successfully predicted gene mutations, but most of them rely on hand-crafted features. In recent years, deep learning based on convolutional neural networks has attracted much attention in medical image computing; this data-driven approach can automatically extract complex image features [5–7]. In addition, imaging genomics, a high-throughput research method that correlates imaging features with genomic data, integrates disease imaging data and genomic data and is therefore better suited to deep learning analysis than single-modality data. In recent imaging genomics studies, researchers have proposed a series of deep learning algorithms and theoretical models based on image or genetic data. Dong et al. [8] proposed a multichannel and multitasking deep learning (MMDL) model that fuses radiological features of CT images with clinical information of patients to improve the accuracy of predicting KRAS gene mutations. Hou et al. [9] proposed an attention-based multimodal information fusion module that successfully predicted lymph node metastasis by fusing deep learning features of CT images with genetic data. Machine learning and deep learning-based imaging genomics approaches therefore have great potential for predicting KRAS gene mutation status in NSCLC.
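As a point of reference for the pipelines in [3, 4], the following is a minimal sketch of a conventional radiomics workflow: hand-crafted features feed a feature selector and a random forest classifier. The feature matrix here is a random placeholder, and the mutual-information filter merely stands in for the mRMR selection of [4]; this is an illustration under our own assumptions, not the published implementations.

```python
# Sketch of a conventional radiomics + machine-learning pipeline in the spirit
# of [3, 4]. X (n_patients x n_radiomic_features) and the mutation labels y are
# assumed to come from an upstream feature extractor; the mutual-information
# filter below is a simple stand-in for mRMR feature selection.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(144, 200))       # placeholder radiomic feature matrix
y = rng.integers(0, 2, size=144)      # placeholder KRAS mutation labels

pipeline = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=20)),   # relevance filter
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.3f}")
```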
Although the models discussed above achieved considerable performance, deep learning methods based on image and genetic data for predicting KRAS mutation status in NSCLC still face several challenges:

1) Most deep learning methods for classification tasks [8, 9] focus only on the classification method itself and do not use segmentation features generated by a segmentation task to facilitate the classification task and improve its performance. Lesion segmentation and classification are highly related tasks: segmentation helps remove distractions from CT images and is therefore highly beneficial for improving the accuracy of lesion classification.

2) Most studied fusion methods use simple direct concatenation, which ignores the correlations and differences between medical images and genetic data. This not only leads to ineffective mining of useful semantic features between multi-scale image features and gene features but also fails to make full use of the complementarity of multimodal information.

3) Many studies use models that overemphasize abstract deep features of the lesion while paying insufficient attention to detailed shallow features, which limits prediction accuracy.

To overcome these difficulties and achieve non-invasive, accurate prediction of KRAS gene mutations in NSCLC, we propose a Semi-supervised Multimodal Multiscale Attention Model (S2MMAM) for predicting KRAS gene mutation status in NSCLC. The model uses the Mean Teacher [10] framework as its main structure. Mean Teacher makes full use of labeled images to support prediction on unlabeled images, diminishing the network's dependence on manual annotation. To compensate for the information loss caused by single-modal unlabeled image data, the model not only has the Semi-supervised Multimodal Fusion Classification Network (S2MF-CN) share parameters with the Supervised Multilevel Fusion Segmentation Network (SMF-SN) to enrich key lesion information, but also fuses the patient's genetic data with the image data to expand mutation knowledge. Specifically, SMF-SN introduces a new Triple Attention-guided Feature Aggregation (TAFA) module, which adaptively fuses high-level and low-level semantic features through an attention-guided mechanism. TAFA can ignore background noise and localize the extraction of key lesion features. In S2MF-CN, we propose an Intra and Inter Mutual Guidance Attention Fusion (I2MGAF) module to guide intra-modal and inter-modal information fusion in a staged manner. I2MGAF can effectively extract complementary information from different modalities at different scales to improve classification performance.

In contrast to conventional radiomics and machine learning for KRAS mutation prediction [3, 4], we use a convolutional neural network for CT image feature extraction. This technique is more efficient, reduces the cost of manual annotation, and can realize the prospect of end-to-end applications. Studies [5–9] that predict other diseases in multimodal classification tasks have used simple multimodal fusion methods. In contrast, our proposed method focuses on extracting different dimensions of information from different modal data to achieve complementary fusion.
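To make the contrast with direct concatenation concrete, the sketch below shows one simple way to let each modality re-weight the other through attention gates before fusion. It illustrates the general idea only; the dimensions, gating design, and classifier head are our assumptions, not the actual I2MGAF module described later.

```python
# Illustrative sketch (not the authors' exact I2MGAF module) of attention-guided
# fusion between CT image features and gene features, as opposed to direct
# concatenation: each modality produces a gate that re-weights the other.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, img_dim=512, gene_dim=64, fused_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.gene_proj = nn.Linear(gene_dim, fused_dim)
        self.img_gate = nn.Sequential(nn.Linear(fused_dim, fused_dim), nn.Sigmoid())
        self.gene_gate = nn.Sequential(nn.Linear(fused_dim, fused_dim), nn.Sigmoid())
        self.classifier = nn.Linear(2 * fused_dim, 2)   # mutant vs. wild type

    def forward(self, img_feat, gene_feat):
        img = self.img_proj(img_feat)      # (B, fused_dim)
        gene = self.gene_proj(gene_feat)   # (B, fused_dim)
        img_attended = img * self.gene_gate(gene)   # genes re-weight image features
        gene_attended = gene * self.img_gate(img)   # images re-weight gene features
        return self.classifier(torch.cat([img_attended, gene_attended], dim=1))

logits = CrossModalAttentionFusion()(torch.randn(4, 512), torch.randn(4, 64))
```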
The contributions of this paper are as follows:

1) A Semi-supervised Multimodal Multiscale Attention Model (S2MMAM) based on imaging genomics is proposed, which effectively solves the problem of difficult intermediate fusion of multimodal heterogeneous data. S2MMAM exploits the facilitation of supervised segmentation features for the semi-supervised classification task to improve the model's performance in predicting KRAS gene mutation status in NSCLC.

2) A new Triple Attention-guided Feature Aggregation (TAFA) module is designed. Built on attention, it adaptively fuses high-level and low-level semantic features. TAFA can suppress low-level background noise and retain detailed local semantic information.

3) We use the Intra and Inter Mutual Guidance Attention Fusion (I2MGAF) module to guide segmentation and classification feature fusion, as well as CT image and genetic data fusion. It achieves multi-scale multimodal information fusion and improves classification performance.

Related work

Mean Teacher in semi-supervised learning

Semi-supervised learning has long been studied in the medical imaging community [11, 12]. It can reduce the human workload of labeling data, and current research has shown its potential to improve network performance when labels are scarce. Three semi-supervised models are based on the principle of consistency: the Π-Model [13], Temporal Ensembling (TE) [13], and the Mean Teacher model [10]. To compare these three consistency-based semi-supervised methods succinctly, we summarize their advantages and disadvantages in Table 1.

Table 1. Comparison of three commonly used consistency-based semi-supervised methods. https://doi.org/10.1371/journal.pone.0297331.t001

In recent years, Mean Teacher has achieved good results as a basic framework for semi-supervised classification tasks. Wang et al. [14] successfully identified diabetic macular edema with a Mean Teacher model using a small amount of roughly labeled data and a large amount of unlabeled data. Liu et al. [15] used a Mean Teacher-based network to achieve skin lesion diagnosis on the ISIC 2018 challenge and thorax disease classification on ChestX-ray14. Wang et al. [16] proposed a model that unifies diverse knowledge into a generic knowledge distillation framework for skin disease classification, enabling the student model to acquire richer knowledge from the teacher model. These studies demonstrate that Mean Teacher achieves excellent results in semi-supervised classification tasks, so we adopt it as the basic framework of our S2MMAM.
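For concreteness, the following is a minimal sketch of one Mean Teacher [10] training step: a supervised loss on labeled data plus a consistency loss that pulls the student's predictions on perturbed unlabeled inputs toward those of an exponential-moving-average (EMA) teacher. The perturbation, loss weighting, model, and hyperparameters are illustrative assumptions, not the settings used in this paper.

```python
# Minimal Mean Teacher [10] training step: supervised loss on labeled data plus
# a consistency loss against an EMA teacher on unlabeled data.
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.99):
    # Teacher weights track an exponential moving average of student weights.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(alpha).add_(s, alpha=1 - alpha)

def train_step(student, teacher, opt, x_lab, y_lab, x_unlab, lam=1.0):
    sup_loss = F.cross_entropy(student(x_lab), y_lab)
    # Consistency: student and teacher see differently perturbed inputs.
    noise = lambda x: x + 0.1 * torch.randn_like(x)
    with torch.no_grad():
        teacher_prob = F.softmax(teacher(noise(x_unlab)), dim=1)
    student_prob = F.softmax(student(noise(x_unlab)), dim=1)
    loss = sup_loss + lam * F.mse_loss(student_prob, teacher_prob)
    opt.zero_grad(); loss.backward(); opt.step()
    ema_update(teacher, student)
    return loss.item()

# Typical setup: the teacher starts as a copy of the student and is updated
# only through ema_update, never by gradients.
student = torch.nn.Linear(10, 2)      # placeholder model
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters())
train_step(student, teacher, opt, torch.randn(8, 10),
           torch.randint(0, 2, (8,)), torch.randn(16, 10))
```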
Segmentation facilitates classification

Using a segmentation task to facilitate a classification task is a basic form of multitask learning [17]. In multitask learning, a segmentation task associated with the classification task can assist the classification task in learning the target, thus improving classification performance [18]. This idea can likewise be borrowed in a single-task classification model: the information captured by a segmentation branch can be transferred to the classification model to enrich lesion information. The supervised segmentation task is trained on mask-labeled data, with the aim of obtaining the most comprehensive high-level semantic features of the target region while reducing the learning of noisy background. Rich segmentation features support the classification task in learning more and richer semantic information. Thus, a supervised segmentation network can assist the classification task by suppressing the background noise introduced by missing physician annotations in semi-supervised classification networks, improving classification accuracy.

The works summarized in Table 2 demonstrate that segmentation has a facilitating effect on classification. However, they share a common problem: they all study supervised models, which incur high data-labeling costs. We believe that combining segmentation and classification tasks can make the network more informative. Therefore, our research combines the idea of segmentation facilitating classification with a semi-supervised model, uniting two related tasks: NSCLC lesion segmentation and KRAS gene mutation status prediction. S2MMAM allows S2MF-CN to obtain key lesion features at initialization through the strategy of sharing network parameters between SMF-SN and S2MF-CN. In S2MF-CN, the segmentation features are guided to merge with the classification features to obtain the extracted key features. This strategy enriches the lesion information and improves the classification performance of the network.

Table 2. Summary of related works in which segmentation facilitates classification. https://doi.org/10.1371/journal.pone.0297331.t002

Multiscale features and attention learning

Traditional convolution operations mostly extract local features. Because local features contain limited information, the model cannot learn the full content of the region of interest well. Multi-scale features contain local features of multiple regions of interest; fusing these local features with other operations yields comprehensive information about the target, which helps the network model learn. To extract multi-scale features, the Atrous Spatial Pyramid Pooling (ASPP) module [21] captures contextual information by convolving the target region with different dilation rates. In the medical image domain, the PSE module [22] uses a patch-level pyramid design to extend SE operations to multiple scales, allowing the network to adaptively focus on vessels of variable width. The Scale-aware Feature Aggregation (SFA) module [23] effectively extracts hidden multi-scale background information and aggregates multi-scale features to improve the model's ability to handle complex vasculature. The Convolutional Block Attention Module (CBAM) [24] introduces channel and spatial attention, extracting key feature information from both dimensions to enrich the network content. In the medical image application domain, the Context-assisted full Attention Network (CAN) [25] combines Non-Local Attention (NLA), Channel Attention (CA), and Dual-pathway Spatial Attention (DSA) to extract lesion information in multiple directions.
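As an example of combining channel and spatial attention, the sketch below is a compact rendering of the CBAM [24] idea: a shared MLP re-weights channels using average- and max-pooled descriptors, then a convolution over channel-wise statistics re-weights spatial positions. The reduction ratio and kernel size follow common defaults rather than any setting used in this paper.

```python
# Compact rendering of channel and spatial attention in the style of CBAM [24]:
# channel attention selects "what" is informative, spatial attention "where".
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel attention: MLP over average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: conv over channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

out = CBAM(64)(torch.randn(2, 64, 32, 32))   # shape preserved: (2, 64, 32, 32)
```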
It is now widely believed that both multi-scale features and attention mechanisms help models enhance the recognition of feature maps from different dimensions. However, the works above share a common problem: they do not combine the ideas of multi-scale features and attention mechanisms. Therefore, we combine these two techniques and design the TAFA module. On the one hand, TAFA fuses high- and low-dimensional segmentation features to obtain both abstract and detailed information. On the other hand, it fuses segmentation and classification features at different levels, guiding the features to learn key factors adaptively and enhancing the network's ability to capture lesions. Thus, the predictive capability of the model is improved.
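The following sketch illustrates the general pattern of attention-guided fusion of high-level and low-level features on which such a module builds: the high-level map produces a gate that suppresses background in the low-level map before the two are merged. It is a generic illustration under our own assumptions, not the actual TAFA design detailed later.

```python
# Illustrative attention-guided fusion of high- and low-level feature maps
# (an assumption, not the exact TAFA module): the upsampled high-level map
# gates the low-level map, combining abstract and detailed information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidedFusion(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(high_ch, low_ch, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(low_ch + high_ch, out_ch, 3, padding=1)

    def forward(self, low, high):
        # Upsample the high-level map to the low-level spatial size.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        low = low * self.gate(high)   # high-level semantics gate low-level detail
        return self.merge(torch.cat([low, high], dim=1))

fused = AttentionGuidedFusion(64, 256, 128)(
    torch.randn(1, 64, 64, 64), torch.randn(1, 256, 16, 16))
```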
Method

Overview

In this paper, we propose a Semi-supervised Multimodal Multiscale Attention Model (S2MMAM).
The overall architecture of the model is divided into two parts, a Supervised Multilevel Fusion Segmentation Network (SMF-SN) and a Semi-supervised Multimodal Fusion Classification Network (S2MF-CN), as shown in Fig 1. In this model, the useful information in CT images is captured by SMF-SN and transferred to S2MF-CN to facilitate the image prediction task. S2MMAM fuses CT images and genetic data to accurately predict whether KRAS is mutated in NSCLC.

Fig 1. Overview of our S2MMAM, including: (a) the Supervised Multilevel Fusion Segmentation Network (SMF-SN), whose inputs are CT images and pixel-level mask images and whose outputs are segmented lesion images; (b) the Semi-supervised Multimodal Fusion Classification Network (S2MF-CN); and (c) the processing of gene data. In the S2MMAM, the useful information in CT images is captured by SMF-SN and transferred to S2MF-CN to facilitate the image prediction task. S2MMAM fuses CT images and genetic data to accurately predict whether KRAS is mutated in NSCLC. https://doi.org/10.1371/journal.pone.0297331.g001

In the NSCLC dataset, each patient corresponds to a set of CT images and gene data (Section Dataset). Specifically, in our problem setting, we are given a training set containing N labeled data and M unlabeled data, where N < M.
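The overview states that information captured by SMF-SN is transferred to S2MF-CN through shared parameters. A minimal sketch of such a weight transfer at initialization is shown below; the encoder is a placeholder of our own, not the paper's actual backbone.

```python
# Sketch of the parameter-sharing strategy described above: the trained
# segmentation encoder (from SMF-SN) initializes the classification encoder
# (in S2MF-CN), so the classifier starts from lesion-aware features.
import torch
import torch.nn as nn

class Encoder(nn.Module):             # shared backbone (placeholder)
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
    def forward(self, x):
        return self.conv(x)

seg_encoder = Encoder()               # trained inside SMF-SN (training not shown)
cls_encoder = Encoder()               # encoder of the S2MF-CN classifier

# Transfer the trained segmentation weights into the classification encoder.
cls_encoder.load_state_dict(seg_encoder.state_dict())
```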