Malinda Vania, Deukhee Lee

Abstract

Lower back pain is one of the major global health challenges. Medical imaging is rapidly taking a predominant position in the diagnosis and treatment of lower back abnormalities. Magnetic resonance imaging (MRI) is a primary tool for detecting anatomical and functional abnormalities in the intervertebral disc (IVD) and provides valuable data for both diagnosis and research. Deep learning methods perform well in computer vision when labeled general image training data are abundant. In medical imaging practice, labeled segmentation data are produced manually, which leads to two main issues: delineation takes considerable time, and reproducibility is called into question. To handle this problem, we developed an automated approach for IVD instance segmentation that utilizes both T1 and T2 images to address the data limitation and computational time problems and to improve the generalization of the algorithm. The method builds upon mask-RCNN: we propose a multistage optimization mask-RCNN (MOM-RCNN) for deep learning segmentation networks. We use a multi-optimization training scheme that combines stochastic gradient descent and adaptive moment estimation (Adam) with T1 and T2 data in MOM-RCNN. The proposed method showed a significant improvement in processing time and segmentation results compared to commonly used segmentation methods. We evaluated the results using several key performance measures. We obtained a Dice coefficient of 99%. Our method delineates the IVD with 88% sensitivity and recognizes non-IVD regions with 98% specificity. The results also show increased precision (92%) with a low global consistency error (0.03), approaching 0 (the best possible score). On the spatial distance measures, the results show a promising reduction in root mean square error from 0.407 ± 0.067 to 0.095 ± 0.026 mm, in Hausdorff distance from 12.313 ± 3.015 to 5.155 ± 1.561 mm, and in average symmetric surface distance from 1.944 ± 0.850 to 0.49 ± 0.23 mm compared with other state-of-the-art methods. We used MRI images from 263 patients to demonstrate the efficiency of our proposed method.

Keywords: magnetic resonance image, intervertebral disc segmentation, automatic instance segmentation, regional convolutional neural network, multistage optimization mask-RCNN, multistage segmentation

Highlights: A method to enhance the accuracy of instance intervertebral disc segmentation from magnetic resonance imaging data is proposed. An efficient multi-optimization training scheme at different stages handles the computational time problem in deep learning. A robust transfer learning scheme compensates for low dataset availability and improves the deep learning model's generalization. The influence of group normalization and dropout regularization on handling mixed T1 and T2 data is analysed.

1 Introduction

An aging population increases the pervasiveness of musculoskeletal pain affecting the bones, muscles, ligaments, tendons, and nerves (Bressler et al., 1999). The most common type of musculoskeletal pain occurs in the lower back. The problem can have a rapid onset with severe short-term (acute) or long-lasting (chronic) symptoms (Podichetty et al., 2003).
Two-thirds of adults worldwide have experienced lower back pain at some time during their lives. The frequency and intensity only increase with the natural deterioration of the intervertebral discs (IVDs) as they age (degenerative disc disease; Prince et al., 2015). IVDs serve two functions: motion and shock absorption. As we age, a disc loses its ability to distribute stress loads to other areas (bone and joints). Changes in the body's ability to transfer these stresses cause arthritic changes in the surrounding anatomy, resulting in increased back pain and stiffness. Each IVD has two parts. The annulus fibrosus is the firm, tough outer layer; it contains the nerves, and tearing it creates considerable pain. The nucleus pulposus is the soft, jelly-like core; it contains proteins that can cause the tissues they touch to become swollen and tender, so if these proteins leak out to the nerves of the annulus fibrosus, the individual experiences a great deal of pain.

Medical imaging, especially magnetic resonance imaging (MRI), has become increasingly common in diagnosing and treating IVD-related conditions (Takatalo et al., 2009; Seon-Yu et al., 2012; Korez et al., 2017; Fallah et al., 2019; Pang et al., 2021). Accurate automatic segmentation of the IVDs from MRI is crucial for the accurate diagnosis of degenerative disc disease and the creation of a treatment plan (Korez et al., 2017; Fallah et al., 2019; Zhou et al., 2019; Pang et al., 2021). Medical imaging and computer technologies have revolutionized healthcare, improving diagnostic accuracy and increasing patient safety and comfort through the massive quantities of medical images now available. However, the scarcity of qualified experts, in contrast with the massive amounts of medical images to examine, drives the need for efficient, robust, and problem-tailored computer-aided image analysis (Kumar et al., 2020). Organ segmentation in MRI images is a rapidly growing research area in medicine and industry. Recent and continued achievements in artificial intelligence and deep learning have enhanced the performance of medical image analysis and computer-aided diagnosis (Lundervold & Lundervold, 2019; Chan et al., 2020; Kumar et al., 2020; Tang, 2020). Automated segmentation overcomes weaknesses associated with manual segmentation, such as the considerable time required to produce an accurate segmentation. Moreover, manual segmentation places stress on specialists: demand for their knowledge and skill far outstrips its availability, and, as with any labor-intensive task, long examination times increase the risk of human error (Korez et al., 2017). In contrast, automated image segmentation and deep learning can support fast and intelligent computer-aided diagnosis systems, providing medical practitioners with models to accurately diagnose and treat conditions, including degenerative disc disease (Liu et al., 2019; Park et al., 2020). Medical image analysis relies on accurate segmentation to extract regions of interest and provide more intuitive and meaningful medical information than raw images (Jain et al., 2010; Gao et al., 2011; Vania et al., 2019). The rapid growth of deep learning research in recent years has contributed tremendously to image processing.
The extraction of intricate features from raw data using deep learning eliminates the time-consuming task of manually identifying and labeling those features (Kamnitsas et al., 2017; Li et al., 2018; Lundervold & Lundervold, 2019; Kumar et al., 2020). Automated feature extraction combined with high model accuracy makes deep learning highly favorable in the medical community. Deep learning is widely adopted for representative feature extraction, segmentation, and recognition in general 2D imaging (Li et al., 2018). However, medical image segmentation comes with particular challenges because different data formats and acquisition parameters require specialized treatment during learning model development. Technical methods are needed to adapt networks and models to achieve clinically acceptable segmentation accuracy. For instance, CT and MRI images are two important diagnostic bases in clinical diagnosis. Both are frequently in three-dimensional (3D) format, and the segmentation of soft tissues needs to be done slice by slice on 2D images (Liu et al., 2020). If all medical images were hand-marked by radiologists, each image would take up to 15 min, a time-consuming task with low inter-rater agreement. Therefore, it is necessary to develop automated methods that can segment the soft tissues. Such segmentation methods are expected to have a broad impact by supporting clinician decisions (Liu et al., 2020).

At present, methods based on deep learning have made remarkable achievements in the field of image segmentation (Liu et al., 2021). The fully convolutional network was the first to successfully use deep learning for semantic image segmentation and was the pioneering work on using convolutional neural networks for segmentation and diagnosis in medical images (Zhang et al., 2021). Outstanding segmentation networks such as U-Net (Ronneberger et al., 2015), mask-RCNN (He et al., 2017), and ResNet (He et al., 2016) have become pioneers in improving the performance of medical image segmentation (Liu et al., 2021). At the same time, extensive research has focused on IVD image segmentation from MRI (Chen & Belavy, 2014; Zheng et al., 2017; Wang et al., 2019). However, there are still challenges related to MRI modality limitations (Pang et al., 2021), including dataset limitations (Zhou et al., 2019). High-performance deep neural networks depend profoundly on the broad availability of labeled training data for proper training. Privacy restrictions limit the number of images available for training, and the limited availability of human expertise to manually select features and annotate large datasets for network training is the most substantial challenge. Also, datasets are often unbalanced due to image size, high variability across patient anatomy, and an unbalanced occurrence of cases (diagnoses) in most datasets. This scarcity of samples increases the difficulty of training deep learning models to recognize patterns of a condition in medical images. Additionally, deep learning training procedures can in many cases be very costly in terms of time, computing power, and scarce GPU availability (Lundervold & Lundervold, 2019; Pang et al., 2021). These challenges and limitations often dissuade practitioners from adopting deep learning techniques. Mitigating these barriers will notably advance the application of deep learning, especially in medical imaging.
In recent studies, techniques have been developed to take advantage of large pre-trained networks by adapting only the values of the trainable coefficients to the particular problem being treated. The most widely used of these techniques is transfer learning (Pan & Yang, 2010; Shin et al., 2016; Gopalakrishnan et al., 2017; Huang et al., 2017; Masi et al., 2018; Vidal et al., 2021). Transfer learning is a method in which a model developed for one task is reused as a starting point to solve a different but related task involving new data (Pan & Yang, 2010; Vidal et al., 2021). Therefore, to solve this problem, we propose a methodology that adapts knowledge from a well-known domain to a new domain with a small number of samples. Recently, multimodal medical image segmentation using deep learning has gained great interest (Dolz et al., 2019; Zhou et al., 2019). Multimodality is widely used in medical imaging because it provides complementary information about a target (tumor, organ, or tissue) and can be utilized to improve segmentation (Zhou et al., 2019). We attempt to overcome the dataset limitations while increasing the quantitative assessment quality of medical images using deep learning technologies with an efficient trade-off in computational time, focusing specifically on IVD instance image segmentation utilizing multimodality information. Such systems need to automate the difficult and tedious IVD instance segmentation task while producing highly accurate results.

In this paper, we present a fully automated approach for IVD instance segmentation from MRI using multistage optimization mask-RCNN (MOM-RCNN) with the following main contributions: We propose an efficient multistage optimization training scheme utilizing a combination of stochastic gradient descent (SGD; Robbins & Monro, 1951) and adaptive moment estimation (Adam; Kingma & Ba, 2015) to reduce the computational time by achieving faster iterations without trading off a bad convergence rate. A robust transfer learning scheme (combining T1 and T2 data from MRI) compensates for the low dataset availability while improving our deep learning model's generalization. We demonstrate the capability of MOM-RCNN to significantly improve the accuracy of IVD instance segmentation. Robustness to various levels of impulse noise and heterogeneous datasets allows fully automatic instance segmentation, rendering the proposed method suitable for efficient instance segmentation in computer-aided diagnosis systems. The method reduces the manual pre-processing time spent on denoising and reduces the post-processing required by conventional segmentation methods, such as multiple thresholding. The proposed method encompasses a series of contributions, and for these reasons, we believe it is feasible for efficient segmentation in any computer-aided diagnosis system.

2 Related Works

2.1 U-Net

U-Net is chosen as one of our benchmarks because it is a convolutional neural network architecture that is widely used for medical image segmentation and currently state-of-the-art in medical image segmentation performance (Liu et al., 2020). The architecture consists of encoder and decoder structures (Ronneberger et al., 2015).
More recently, a variant of the U-Net architecture (Attention U-Net) has been used to segment the pancreas and multiple abdominal organ classes from CT, showing that U-Net can handle different datasets and training sizes while achieving state-of-the-art performance without requiring multiple CNN models (Schlemper et al., 2019). Furthermore, USE-Net, which incorporates squeeze-and-excitation (SE) blocks into U-Net, exploits adaptive channel-wise feature recalibration to boost generalization performance for prostate MRI segmentation (Rundo et al., 2019).

U-Net constructs feature maps using convolutional layers to extract essential features, followed by max-pooling layers to reduce map sizes during encoding. This process is repeated four times. Two 3 × 3 convolutional layers finally connect the encoder to the decoder. During decoding, U-Net up-samples the feature maps with 2 × 2 transposed convolution (up-convolution) operations, each followed by 3 × 3 convolutional layers; this process also repeats four times. Finally, a 1 × 1 convolution is applied to retrieve the segmentation map. During encoding, the feature maps from the convolutional layers are transferred to the decoder before the pooling layers are applied. These intermediate maps are concatenated with the output of the up-sampling operation, and the concatenated feature map is propagated to successive layers. U-Net predicts a segmentation map by combining global image context information with localization information of the target structure (e.g. an organ); a minimal sketch of these encoder and decoder steps is given at the end of this subsection. However, U-Net hyper-parameters are determined for a specific task and are highly dependent on the dataset (e.g. T1 only or T2 only). The model requires abundant paired training data, and a lack of sufficient, consistently labeled training samples may result in poor segmentation maps, a common problem for medical image segmentation deep learning algorithms.

More recently, integrating multimodal images into deep learning segmentation methods has also gained growing attention (Dolz et al., 2019; Fallah et al., 2019; Zhou et al., 2019). However, it is still not fully exploited for IVD localization and segmentation. IVD-Net, which builds on U-Net, performs IVD localization and segmentation sequentially rather than jointly; localization is assessed after the segmentation is done (Dolz et al., 2019; Fallah et al., 2019). This means that the localization process itself is not optimized during training and is prone to accumulated error. We aim to solve this by combining the two processes into one through IVD instance segmentation. Moreover, the segmentation results are often compromised by translation invariance, which yields segmentation maps with relatively low resolution. Image details, which could be integral to computer-aided diagnosis, are lost during max-pooling in the encoder because it retains only the pixel with the largest value among the four neighboring pixels and discards the information of the others. Max-pooling therefore efficiently detects the dominant information representing image characteristics at the cost of detailed information, potentially lowering overall quality, and the missing detail is not restored by up-sampling during decoding. Finally, U-Net does not consider the semantic gap between corresponding levels of encoding and decoding (Ibtehaz & Rahman, 2020), which arises when feature maps transferred through skip connections are simply concatenated from the encoding to the decoding stages.
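To make the encoder-decoder description above concrete, the following is a minimal Keras sketch of one U-Net encoder step and one decoder step. It is an illustrative reconstruction from the description in this section, not the original authors' code, and the function names are ours.

```python
from tensorflow.keras import layers

def encoder_step(x, filters):
    # Two 3x3 convolutions extract features; 2x2 max-pooling halves the map.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    skip = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    down = layers.MaxPooling2D(pool_size=2)(skip)
    return down, skip  # the skip map is reused on the decoder side

def decoder_step(x, skip, filters):
    # A 2x2 transposed convolution up-samples; the encoder skip map is then
    # concatenated before two 3x3 convolutions refine the merged features.
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```

The plain concatenation in decoder_step is exactly where the semantic gap discussed above arises: encoder and decoder features at the same level are merged without any adaptation.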
This suggests that it is difficult to achieve detailed instance segmentation with U-Net, and it may lead to unsatisfactory results for complex objects such as IVDs.

2.2 Mask-RCNN

Mask-RCNN is chosen as our other benchmark because it is one of the outstanding segmentation networks (Liu et al., 2021). Mask-RCNN is a regional convolutional neural network with a two-phase framework for instance image segmentation (He et al., 2017), as shown in Fig. 1. The first phase uses a region proposal network (RPN) to process the image and generate candidate object bounding boxes for delivery into phase two. The second phase classifies the candidate object bounding boxes, generates refined bounding boxes, and predicts masks. Mask-RCNN involves several hyper-parameters that must be tuned carefully for each application. These hyper-parameters come from the following three main modules: Convolutional backbone: responsible for feature extraction from the whole image. RPN: computes the region proposals, after which Region of Interest Align (RoIAlign) extracts features from each candidate object bounding box. Box Head and Mask Head: perform two parallel operations, bounding-box recognition (classification and regression) and mask prediction.

Figure 1: Framework of mask-RCNN.

The Box Head handles organ detection, classification, and bounding box regression. The Mask Head, after RoIAlign, outputs high-accuracy segmentation masks. Overall, the loss function consists of classification loss, bounding box regression loss, and mask loss (He et al., 2017). This model also requires abundant paired training data. In addition, it suffers from several particular challenges. The model cannot handle heterogeneous data types (e.g. mixed T1 and T2 data; it must use T1 data only or T2 data only). It is prone to overfitting and is computationally expensive (time consuming). Studies have shown that learning with high-resolution data and long training times on powerful devices can lead to overfitting (Lin et al., 2016; Zhang et al., 2018). There is a fundamental effort in deep learning to prevent overfitting by properly controlling or regularizing training and improving generalization (Zhang et al., 2018). Improvements in instance segmentation algorithms are required to mitigate the downsides of mask-RCNN and other currently available deep learning methods.

3 Methodology

3.1 MOM-RCNN

The proposed method extends the mask-RCNN (He et al., 2017) framework, the state-of-the-art in instance image segmentation, which has demonstrated impressive performance in various instance segmentation studies. As shown in Fig. 2, the proposed MOM-RCNN consists of four stages: Backbone, Neck, DenseHead, and ROIHead (Region of Interest Head). The Backbone transforms the input image into a raw feature map; we use a modified ResNet-50 based on the architecture of He et al. (2016). The Neck connects the backbone and the heads; refinement and reconfiguration are performed on the raw feature map. It consists of a top-bottom pathway and lateral connections (Lin et al., 2017). The top-bottom path generates a feature pyramid map similar in size to the raw feature map, and the lateral connections are convolution-and-add operations between the two corresponding levels of the two paths. The DenseHead performs predictions over dense locations of the feature maps.
The RPN scans each region and predicts whether an object is present. One significant advantage of the RPN is that it does not observe the actual image: the network scans the feature map using a predefined number of anchor boxes, making it much faster. The ROIHead (BBoxHead and MaskHead) extracts and faithfully preserves the exact spatial locations of ROI-impacting features from multiple feature maps using ROIAlign. It receives ROI features as input and makes ROI-wise task-specific predictions through two parallel tasks. BBoxHead: the bounding box is located and classified in the detection branch for IVD detection. MaskHead: a fully convolutional network (FCN) generates the corresponding IVD mask and background image segmentation in the segmentation branch.

Figure 2: Pipeline of MOM-RCNN.

The loss function consists of classification loss, bounding box regression loss, and mask loss, added together without weighting. In the following, we introduce our network's critical steps in detail, as shown in Fig. 3. MOM-RCNN uses SGD optimization (Robbins & Monro, 1951) and a method for stochastic optimization, Adam (Kingma & Ba, 2015); both are used in the training process. SGD is simple and effective for finding global optima, but it falters and becomes challenging to use around local optima. Adam combines the best properties of the adaptive gradient method (Duchi et al., 2011) and root mean square propagation (Tieleman & Hinton, 2012) to provide an optimization algorithm that can handle sparse gradients on noisy problems, and it has been suggested as the default optimization method for deep learning applications (Ruder, 2016).

Figure 3: The illustrations of the MOM-RCNN framework.

Our ResNet-50 feature extractor is initialized with weights trained on ImageNet (Deng et al., 2009). All other weights (e.g. in the RPN) are initialized using Xavier initialization (Glorot & Bengio, 2010). We train the network on the T1 and T2 MRI training dataset with a batch size of two on a single-GPU machine; for the original mask-RCNN, an effective batch size of 16 was used. The training consists of three stages. In the first stage, only the MaskHead, and not the proposed ResNet-50 backbone, is trained. In the second stage, the prediction heads (DenseHead and ROIHead) and part of the backbone [starting at layer 4 (CN4)] are optimized. Finally, in the third stage, all the model components (backbone and heads) are trained together. For the first two training stages, we use T1 data with SGD optimization; for the last stage, we use T2 data with Adam optimization. Each optimizer has its own configuration: for SGD, learning rate = 0.001, learning momentum = 0.9, and weight decay = 0.0001; for Adam, alpha (learning rate or step size) = 1.0E-6 (slowing learning right down during training), beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-08 (a very small number to prevent division by zero). A minimal sketch of this staged optimizer hand-off is given below.
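The following sketch illustrates the three-stage freeze/unfreeze schedule and the SGD-to-Adam hand-off, using a toy Keras model in place of MOM-RCNN. The stand-in network, layer names, and dummy data are our own illustrative assumptions; the optimizer settings follow the configuration above (the SGD weight decay of 0.0001 would be applied through layer regularizers and is omitted here for brevity).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy stand-in for MOM-RCNN: two "backbone" convolutions and one "head".
inputs = layers.Input((64, 64, 1))
x = layers.Conv2D(8, 3, activation="relu", name="backbone_cn1")(inputs)
x = layers.Conv2D(8, 3, activation="relu", name="backbone_cn4")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid", name="head_mask")(x)
model = models.Model(inputs, outputs)

def set_trainable(prefixes):
    # Freeze every layer, then unfreeze those whose names match a prefix.
    for layer in model.layers:
        layer.trainable = any(layer.name.startswith(p) for p in prefixes)

# Dummy T1/T2 batches standing in for the MRI training data.
x_t1 = np.random.rand(8, 64, 64, 1)
x_t2 = np.random.rand(8, 64, 64, 1)
y = np.random.randint(0, 2, (8, 1)).astype("float32")

sgd = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
adam = tf.keras.optimizers.Adam(learning_rate=1e-6, beta_1=0.9,
                                beta_2=0.999, epsilon=1e-8)

# Stage 1: heads only, T1 data, SGD.
set_trainable(["head"])
model.compile(optimizer=sgd, loss="binary_crossentropy")
model.fit(x_t1, y, epochs=1, verbose=0)

# Stage 2: heads plus backbone from CN4 upward, still T1 data with SGD.
set_trainable(["head", "backbone_cn4"])
model.compile(optimizer=sgd, loss="binary_crossentropy")
model.fit(x_t1, y, epochs=1, verbose=0)

# Stage 3: all components trainable, T2 data, Adam.
set_trainable([""])  # the empty prefix matches every layer
model.compile(optimizer=adam, loss="binary_crossentropy")
model.fit(x_t2, y, epochs=1, verbose=0)
```

Recompiling after each trainable change is required for Keras to rebuild the list of weights the optimizer updates.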
3.2 Normalization in ResNet-50

We extract features from five feature maps using the ResNet-50 architecture (CN1, CN2, CN3, CN4, and CN5), as shown in Fig. 3. Our ResNet-50 architecture differs from the original ResNet-50 (He et al., 2016), as shown in Figs 4 and 5. We use a top-bottom approach to generate the final feature maps, starting from the smallest feature map and continuing down to bigger ones through upscale operations. In the diagram, layer 2 generates a feature map, which is operated on with 1 × 1 convolutions to bring the number of channels down to 256. These elements are then added to the up-sampled output from the previous iteration. The outcomes of this process are operated on with a 3 × 3 convolutional layer with stride 2 to create the final four feature maps (FP2, FP3, FP4, and FP5). The fifth feature map (FP6) is obtained by a max-pooling operation on FP5. All five feature maps are used in the RPN to generate candidate object bounding boxes, but only four (FP2, FP3, FP4, and FP5) are used when associating features with ROIs.

Figure 4: The overview of the original ResNet-50 architecture. At stage 1, the feature map size is down-sampled with a convolutional layer with stride 2, followed by batch normalization and a ReLU layer. Starting from stage 2, the number of filters used by the layers is the same within each stage. Each stage has a convolutional (Conv) block and several identity blocks.

Figure 5: The overview of the proposed ResNet-50 architecture. Every convolutional layer is followed by group normalization, dropout regularization, and then a ReLU layer at each stage. Each stage has a convolutional (Conv) block and a different number of identity blocks.

In the original ResNet-50 architecture, each convolutional block and identity block contains a set of convolutional layers followed by batch normalization and the ReLU activation function. We adopt group normalization (Wu & He, 2018) and dropout regularization (Srivastava et al., 2014) and make several modifications to handle the mixed training data of T1 and T2 more successfully. MOM-RCNN employs high-resolution training data, which limits the batch size to one or two images per batch. For this reason, batch normalization is not sufficient for IVD segmentation using MOM-RCNN: batch normalization requires a sufficiently large batch size [e.g. 32 per worker (Wu & He, 2018)], while a small batch size leads to inaccurate estimation of the batch statistics, and reducing the batch size risks increasing the model error significantly (Wu & He, 2018). After the group normalization, we add dropout regularization. In deep learning, regularization is crucial to prevent models from overfitting and to increase the generalization effect. As a regularization method, dropout has been successfully applied to many deep learning models (Dahl et al., 2013; Inoue, 2019) and has been analysed to avoid co-adaptation problems among the hidden nodes of deep feed-forward neural networks by dropping out randomly selected hidden nodes (Hinton et al., 2012; Helmbold & Long, 2017).
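A minimal sketch of the modified block layout (convolution, then group normalization, then dropout, then ReLU) is shown below. It assumes a recent Keras with a built-in GroupNormalization layer (older setups used the tensorflow-addons equivalent); the group count is an illustrative assumption, and the dropout ratio of 0.5 follows the ablation settings in Section 5.3.

```python
from tensorflow.keras import layers

def conv_gn_block(x, filters, kernel_size=3, stride=1, groups=32, drop=0.5):
    """One unit of the modified ResNet-50: Conv -> GroupNorm -> Dropout -> ReLU,
    replacing the original Conv -> BatchNorm -> ReLU. `filters` must be
    divisible by `groups`."""
    x = layers.Conv2D(filters, kernel_size, strides=stride,
                      padding="same", use_bias=False)(x)
    # Group normalization computes statistics over channel groups rather than
    # the batch, so it stays stable at the 1-2 images per batch used here.
    x = layers.GroupNormalization(groups=groups)(x)
    x = layers.Dropout(drop)(x)
    return layers.ReLU()(x)
```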
3.3 Loss functions

MOM-RCNN uses a composite loss function calculated as the sum of the losses at each stage of the model; each loss corresponds to the criterion the model should satisfy at that stage. The features obtained by ROIAlign are used as input to the BBoxHead, for classification and bounding box regression, and as input to the MaskHead, for segmentation. Classification is done by passing the output of the FCN layer, using all the features, through a softmax layer. The MOM-RCNN loss function indicates the difference between the predicted value and the ground truth, and it plays an essential role in training the IVD segmentation model. In our MOM-RCNN, a joint loss function is defined to train bounding box refinement regression, class prediction classification, and mask prediction generation. Class prediction classification losses (Lr,class and Lm,class) and bounding box refinement regression losses (Lr,box and Lm,box) are obtained from both the RPN and mask prediction generation stages, while the mask prediction generation loss (Lmask) is obtained only from the mask prediction generation stage. Lmask is defined per class to avoid competition among mask outputs. The MOM-RCNN loss function is defined as
$$L_{\text{MOM-RCNN}} = L_{r,\mathrm{class}}+L_{m,\mathrm{class}}+L_{r,\mathrm{box}}+L_{m,\mathrm{box}}+L_{\mathrm{mask}}, \tag{1}$$
where: Lr,class corresponds to the loss assigned to an RPN's improper classification of anchor boxes (presence/absence of any object); it should be high when the final output misses objects, to ensure that the RPN will capture them. Lr,box corresponds to the localization accuracy of the RPN; it is used for tuning when the object is detected but the bounding box should be corrected. Lm,class corresponds to the loss assigned to the improper classification of an object present in the proposed region; it is high when the object is detected in the image but misclassified. Lm,box corresponds to the loss assigned to the localization of the identified class's bounding box; it is high if the object is correctly classified but imprecisely localized. Lmask corresponds to the masks created on the identified objects.

The class prediction classification error (Lr,class and Lm,class) is computed by
$$L_{r,\mathrm{class}} = \frac{1}{M_{\mathrm{class}}}\sum_{i} - \log \Big[pr_{i}^{*}\,pr_{i}+\big(1-pr_{i}^{*}\big)\big(1-pr_{i}\big)\Big], \tag{2}$$
where Mclass indicates the number of categories and pri is the probability that the i-th ROI is predicted to be a positive sample (IVD). When the i-th ROI is a positive sample, $pr_{i}^{*}=1$; otherwise, $pr_{i}^{*}=0$. The same equation works for Lm,class.
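As a concrete reading of equation (2), the sketch below computes this binary cross-entropy over a handful of ROIs with NumPy. The normalization by the number of terms and the clipping for numerical safety are our assumptions for illustration.

```python
import numpy as np

def class_loss(pr, pr_star):
    """Equation (2): pr is the predicted probability that each ROI is an
    IVD (positive sample); pr_star is 1 for positive ROIs and 0 otherwise."""
    pr = np.clip(pr, 1e-7, 1 - 1e-7)  # avoid log(0)
    return np.mean(-np.log(pr_star * pr + (1 - pr_star) * (1 - pr)))

# Three ROIs: two true IVDs predicted confidently, one background ROI.
print(class_loss(np.array([0.9, 0.8, 0.1]), np.array([1.0, 1.0, 0.0])))
# Low loss (about 0.14); confident wrong predictions would drive it up.
```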
The bounding box refinement regression loss (Lr,box and Lm,box) is computed by
$$L_{r,\mathrm{box}} = \frac{1}{M_{\mathrm{regress}}}\sum_{i} pr_{i}^{*}\, S\big(\mathrm{trans}_{i}, \mathrm{trans}_{i}^{*}\big), \tag{3}$$
where Mregress is the number of pixels in the feature map, transi denotes the four translation scaling parameters from the positive-sample ROIs to the prediction region, $\mathrm{trans}_{i}^{*}$ denotes the four translation scaling parameters from the positive-sample ROIs to the real label, and S( · ) is a smooth function. The same equation works for Lm,box.

The mask prediction generation loss Lmask is the average binary cross-entropy over the n × n mask region, computed by
$$L_{\mathrm{mask}}= - \frac{1}{n^2}\sum_{1\le x,y \le n} \Big[lbl_{xy}\log lbl_{xy}^{p} + \big(1-lbl_{xy}\big)\log \big(1-lbl_{xy}^{p}\big)\Big], \tag{4}$$
where lblxy is the label value of the coordinate point (x, y) in the n × n region and $lbl_{xy}^{p}$ is the predicted value for the p-th class at that point.

4 Experimental Design and Results

Several factors affect the diversity of medical image data, including the type of MRI machine used, the image type (T1 and T2), the scan time, and the patients. To prove the robustness of our method, we tested our algorithm on a diverse array of public patient datasets.

4.1 Data sources

We obtained the public IVDM3Seg dataset (Chen & Belavy, 2014) from the MICCAI Challenge, which consists of 48 patients, and the public SpineSegT2W dataset from the Kaggle Challenge (accessed 10 August 2020), which consists of 215 patients, to train and validate the models. The SpineSegT2W dataset contains 215 T2-weighted spine MR images of patients with disc herniation and degeneration. Each image is paired with ground truth labeled by expert radiologists. The SpineSegT2W dataset was acquired from patients in China with the following protocol: slice thickness 4.4 mm, pixel spacing 0.34 mm, repetition time (TR) = 68.5 s, and echo time (TE) = 90 ms. In the sagittal direction, the image size is 880 × 880, and the number of slices varies from 12 to 15 across patients. Due to privacy concerns, all the data have been anonymized; therefore, there is no information about the gender and age of the patients. The latest public musculoskeletal dataset is IVDM3Seg, published at MICCAI 2018. The goal of the IVDM3Seg challenge is to provide a standard evaluation framework for investigating automatic IVD segmentation algorithms. The T1 data of 48 patients are paired with the manually annotated ground truth provided by the organizer. The data from the MICCAI Challenge were acquired with a 1.5 Tesla Siemens MR scanner (Siemens Healthcare, Erlangen, Germany) with the following protocol: slice thickness 2.0 mm, pixel spacing 1.25 mm, repetition time (TR) = 10.6 ms, and echo time (TE) = 4.76 ms. All the images were scanned from patients involved in the second Berlin BedRest study (Belavy et al., 2010). A summary of the demographic information of the patients is shown in Table 1.

Table 1: Demographic information of the patients in the MICCAI Challenge dataset.

Subject information | Mean | SD
Age (year) | 35.1 | 8.5
Weight (kg) | 69.8 | 8.0
Height (cm) | 176.0 | 0.06
The data from the MICCAI and Kaggle Challenges ensure diversity among patients and scanning machines in a clinical setting because they were obtained from two different providers on two different continents. In our work, diversity among patients and MRI machine types was considered important. As a result, we obtained data from 263 patients.

4.2 Parameter selection, training, and testing computational time

Because spine MR images are volumetric data, they are processed frame by frame (Liu et al., 2020). During training, the 2D slices were sampled randomly across all scans, without forcing each batch to contain only consecutive slices or slices from the same subject. The MOM-RCNN code was implemented in Python 3.7.3,1 and the deep convolutional neural network structures were built on the Keras2 framework with a TensorFlow backend.3 Each training run was carried out on a single NVIDIA GeForce GTX 1080Ti GPU on an Ubuntu workstation with CUDA 9.0. Figure 6 shows the training and validation loss curves over 150 epochs of training with five-fold cross-validation.

Figure 6: Convergence analysis for 150 epochs. Training and validation loss curves of the five-fold cross-validation. The solid lines represent the mean loss curves over the five folds, and the shaded band indicates the ranges. The dark blue line represents the training loss, and the light blue line the validation loss.

To artificially increase the dataset size and diversify the data, we employed several common augmentation methods from the Python imgaug library, applying the multiply function, Gaussian blur, crop, and four-point perspective transformation to the images (a minimal sketch of this pipeline is given below). As a result, we obtained a total of 16 548 images. We split them randomly into 80% for training and 20% for the test set. Five-fold cross-validation was employed; the training set was therefore further split into 80% for training (10 590 images) and 20% (2648 images) for validation.
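The sketch below builds this augmentation pipeline with imgaug. The augmenters are the ones named above; the parameter ranges are illustrative assumptions, since the exact settings are not stated.

```python
import numpy as np
import imgaug.augmenters as iaa

seq = iaa.Sequential([
    iaa.Multiply((0.8, 1.2)),                     # brightness variation
    iaa.GaussianBlur(sigma=(0.0, 1.0)),           # mild smoothing
    iaa.Crop(percent=(0.0, 0.1)),                 # random border crops
    iaa.PerspectiveTransform(scale=(0.01, 0.1)),  # four-point warp
])

# Four dummy 880x880 slices standing in for sagittal MRI frames.
slices = np.random.randint(0, 255, (4, 880, 880, 1), dtype=np.uint8)
augmented = seq(images=slices)  # one randomly augmented copy per slice
```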
All training images are resized to a resolution of 600 × 600 pixels while preserving their aspect ratio; as the training images may have different aspect ratios, the remaining space is zero-padded. This differs from the image resolution used in the original mask-RCNN (He et al., 2017), in which resizing targets a minimum of 800 pixels and a maximum of 1000 pixels. Training the network took approximately 96 hours, and segmentation took about 1.6 minutes (5 seconds per image) per patient. U-Net took approximately 132 hours to train and around 8 minutes (25 seconds per image) to process one patient. Mask-RCNN took approximately 118 hours to train and around 3.8 minutes (12 seconds per image) per patient. Our method thus improves both training and segmentation time compared with the current state-of-the-art U-Net and mask-RCNN. The summary of the comparative analysis of processing time is shown in Table 2.

Table 2: Comparative analysis of processing time for evaluated segmentation methods.

Method | Training time | Inference time (1 slice)
U-Net | ±132 hours | ±25 seconds
Mask-RCNN | ±118 hours | ±12 seconds
MOM-RCNN with SGD | ±120 hours | ±6 seconds
MOM-RCNN with Adam | ±108 hours | ±6 seconds
MOM-RCNN with SGD+Adam | ±96 hours | ±5 seconds

The majority of the testing dataset was kept separate from the training set to increase the objectivity of the analysis. Objectivity here is necessary to show that the algorithm can produce good segmentation results from previous training data without first seeing the input data. To cross-check the analysis results, we also conducted testing on the training dataset itself.

5 Experimental Results

Several types of measurement metrics are used to evaluate different quality aspects in medical image segmentation according to the types of segmentation errors (Taha & Hanbury, 2015). The evaluation was computed for each segmentation result. The metrics chosen for quantitative analysis were divided into spatial overlap-based, pair counting-based, information theoretic-based, probabilistic-based, and spatial distance measures. The spatial overlap-based metrics consist of the Dice coefficient (DC), sensitivity, specificity, F-measure, precision, and global consistency error (GCE). The pair counting-based metrics use the Rand Index (RI) and the Adjusted RI (ARI). The information theoretic-based metrics use Mutual Information (MI) and Variation of Information (VOI). The probabilistic-based metrics use Cohen's Kappa (KAP), the area under the ROC curve (AUC), and the Matthews Correlation Coefficient (MCC). The spatial distance measures use the root mean square error (RMSE), Hausdorff distance (HD), and average symmetric surface distance (ASSD; Mansilla et al., 2020). In this work, we tested the datasets using U-Net, mask-RCNN, MOM-RCNN with SGD only, and MOM-RCNN with Adam only, and compared those results against our proposed method (MOM-RCNN with SGD+Adam).

5.1 Qualitative results

A few representative slices of the testing set results are selected. The comparison of our qualitative evaluation results against those obtained using different methods is presented in Fig. 7. The results demonstrate that the proposed method segments more accurately than the other methods. The U-Net result is shown in only one color because U-Net is not designed for instance segmentation.

Figure 7: Qualitative comparison for U-Net, mask-RCNN, MOM-RCNN with SGD, MOM-RCNN with Adam, and MOM-RCNN with SGD+Adam.
5.2 Quantitative results

Each segmentation result from every method was converted into a binary image to constitute the query image S, and the labeled images were used as the gold standard (GS).

Spatial overlap measures: We used the confusion matrix to calculate the spatial overlap measures considering four variables: the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). As demonstrated in Table 3, our method shows a significant improvement in the segmentation results compared to other methods based on the one-way ANOVA test. Table 8 shows a statistically significant difference between the DCs, with p-values ranging from <0.001 to 0.028. Based on the DC score, we obtained around 99% similarity with the ground truth. Our method shows the most considerable improvement in the sensitivity and specificity categories, with p-values ranging from <0.001 to 0.043 and from <0.001 to 0.01, respectively; sensitivity is 88% and specificity is 98%. These results show a significant improvement in accuracy compared with the other methods. Our method obtains an F-measure of 0.92, approaching 1 (the best possible score), with a precision (PPV) of 0.9215. This improvement is further aided by a GCE approaching zero, indicating a very low segmentation error.

Pair count measures: The pair count measures analyse the ground truth and the segmentation result as two sets and consider four cardinalities, a, b, c, and d, defined from the confusion matrix (TP, FP, TN, and FN, respectively). The RI considers the four abovementioned cardinalities, and a value of zero indicates no correlation between the segmentation and the ground truth. As shown in Table 4, the proposed method reaches an RI of 0.98 and an ARI of 0.89, approaching 1 (the best possible score). These results show a considerable improvement in segmentation accuracy based on the RI, with p-values ranging from <0.001 to 0.022 (Table 8), indicating that the proposed method achieves the highest agreement with the GS compared to the other methods.

Information measures: As demonstrated in Table 5, the improvement in MI to 0.07 and the reduction in VOI to 0.05, very close to 0 (identical information), show that the proposed method achieves the highest shared information with the GS compared to the other methods.

Probabilistic measures: As shown in Table 6, the improvement in KAP, AUC, and MCC shows that the proposed method achieves the highest agreement with the GS compared to the other methods. The KAP score ranges between 0 and 1. The MCC ranges between −1 and 1, where a score of 1 represents a perfect segmentation, 0 is no better than random segmentation, and −1 indicates total disagreement between GS and S. For the probabilistic measures, the best possible score is 1.

Spatial distance measures: Our method also outperforms the other methods on the spatial distance measures, demonstrated by the reduction in RMSE from 0.407 ± 0.067 to 0.095 ± 0.026 mm and in HD from 12.313 ± 3.015 to 5.155 ± 1.561 mm, as shown in Table 7. We also report the ASSD, computed as the average distance between the segmentation contours and the corresponding ground truth. Table 7 shows the precision of the proposed method, which obtains a final ASSD of 0.49 ± 0.23 mm; the differences from the other methods are also statistically significant, with p-values ranging from <0.001 to 0.015. A minimal sketch of how the overlap measures are computed from binary masks is given below.
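The sketch below shows how the confusion-matrix-based overlap measures reported in Table 3 can be computed from a binary query mask S and gold standard GS; it is an illustration of the definitions, not the evaluation code used in the study.

```python
import numpy as np

def overlap_metrics(s, gs):
    """Spatial overlap measures from binary masks (query S vs gold standard GS)."""
    s, gs = s.astype(bool), gs.astype(bool)
    tp = np.sum(s & gs)    # IVD pixels correctly segmented
    fp = np.sum(s & ~gs)   # background wrongly marked as IVD
    tn = np.sum(~s & ~gs)  # background correctly rejected
    fn = np.sum(~s & gs)   # IVD pixels missed
    return {
        "dice":        2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
    }

# 2x2 example: one hit, one miss, one false alarm, one correct rejection.
print(overlap_metrics(np.array([[1, 1], [0, 0]]),
                      np.array([[1, 0], [1, 0]])))
```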
Table 3: Spatial overlap measures for evaluated segmentation methods (mean ± SD).

Method | DC | Sensitivity | GCE
U-Net | 0.9446 ± 0.0324 | 0.7315 ± 0.0311 | 0.2157 ± 0.1573
Mask-RCNN | 0.9644 ± 0.0157 | 0.8056 ± 0.0327 | 0.0891 ± 0.0951
MOM-RCNN with SGD | 0.9804 ± 0.0147 | 0.8667 ± 0.0157 | 0.0357 ± 0.0203
MOM-RCNN with Adam | 0.9876 ± 0.0174 | 0.8756 ± 0.1111 | 0.0304 ± 0.0247
MOM-RCNN with SGD+Adam | 0.9946 ± 0.0051 | 0.8857 ± 0.1567 | 0.0288 ± 0.0517

Method | Specificity | F-measure | Precision
U-Net | 0.8915 ± 0.0154 | 0.5237 ± 0.0847 | 0.6115 ± 0.0387
Mask-RCNN | 0.9152 ± 0.0732 | 0.7125 ± 0.0357 | 0.7035 ± 0.0157
MOM-RCNN with SGD | 0.9723 ± 0.0516 | 0.8678 ± 0.0327 | 0.8815 ± 0.0427
MOM-RCNN with Adam | 0.9745 ± 0.0178 | 0.8815 ± 0.0892 | 0.8915 ± 0.0325
MOM-RCNN with SGD+Adam | 0.9897 ± 0.0561 | 0.9167 ± 0.0412 | 0.9215 ± 0.0641

Table 4: Pair count measures for evaluated segmentation methods (mean ± SD).

Method | RI | ARI
U-Net | 0.8982 ± 0.0601 | 0.6504 ± 0.0673
Mask-RCNN | 0.9251 ± 0.1481 | 0.7128 ± 0.3684
MOM-RCNN with SGD | 0.9802 ± 0.0079 | 0.8828 ± 0.0505
MOM-RCNN with Adam | 0.9751 ± 0.0152 | 0.8789 ± 0.0712
MOM-RCNN with SGD+Adam | 0.9821 ± 0.0059 | 0.8989 ± 0.0651

Table 5: The information theoretic-based measurement results (mean ± SD).

Method | MI | VOI
U-Net | 0.0489 ± 0.0136 | 0.2331 ± 0.0831
Mask-RCNN | 0.0653 ± 0.0417 | 0.1684 ± 0.0378
MOM-RCNN with SGD | 0.0721 ± 0.0165 | 0.0677 ± 0.0597
MOM-RCNN with Adam | 0.0714 ± 0.0118 | 0.0699 ± 0.0715
MOM-RCNN with SGD+Adam | 0.0751 ± 0.0178 | 0.0575 ± 0.0257

Table 6: The probabilistic-based measurement results (mean ± SD).

Method | KAP | AUC | MCC
U-Net | 0.4569 ± 0.1656 | 0.5687 ± 0.0705 | 0.5687 ± 0.0705
Mask-RCNN | 0.7601 ± 0.2467 | 0.6687 ± 0.3587 | 0.6687 ± 0.3587
MOM-RCNN with SGD | 0.8537 ± 0.0799 | 0.7751 ± 0.0978 | 0.7751 ± 0.0978
MOM-RCNN with Adam | 0.8145 ± 0.0897 | 0.7611 ± 0.1578 | 0.7611 ± 0.1578
MOM-RCNN with SGD+Adam | 0.8711 ± 0.1511 | 0.8751 ± 0.0911 | 0.8751 ± 0.0911

Table 7: Spatial distance measures for evaluated segmentation methods (mean ± SD, mm).

Method | RMSE | HD | ASSD
U-Net | 0.4071 ± 0.0671 | 12.3137 ± 3.0157 | 1.9440 ± 0.8505
Mask-RCNN | 0.3651 ± 0.0726 | 10.6771 ± 2.5071 | 1.2240 ± 0.6975
MOM-RCNN with SGD | 0.1098 ± 0.1567 | 8.3521 ± 1.1121 | 0.8022 ± 0.4302
MOM-RCNN with Adam | 0.1541 ± 0.1309 | 7.4897 ± 2.6109 | 0.6652 ± 0.5722
MOM-RCNN with SGD+Adam | 0.0952 ± 0.0269 | 5.1559 ± 1.5611 | 0.4993 ± 0.2328

Table 8: p-values between different methods for the DC, sensitivity, specificity, RI, and ASSD.

Method | DC | Sensitivity | Specificity | RI | ASSD
U-Net vs Mask-RCNN | 0.005 | <0.001 | 0.015 | 0.034 | 0.003
U-Net vs MOM-RCNN with SGD | <0.001 | <0.001 | 0.003 | 0.49 | <0.001
U-Net vs MOM-RCNN with Adam | <0.001 | <0.001 | 0.139 | <0.001 | <0.001
U-Net vs MOM-RCNN with SGD+Adam | <0.001 | <0.001 | 0.011 | <0.001 | <0.001
Mask-RCNN vs MOM-RCNN with SGD | 0.002 | <0.001 | 0.003 | 0.042 | 0.011
Mask-RCNN vs MOM-RCNN with Adam | <0.001 | 0.189 | <0.001 | <0.001 | 0.015
Mask-RCNN vs MOM-RCNN with SGD+Adam | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
MOM-RCNN with SGD vs MOM-RCNN with Adam | 0.067 | 0.086 | <0.001 | 0.003 | 0.215
MOM-RCNN with SGD vs MOM-RCNN with SGD+Adam | <0.001 | 0.032 | <0.001 | <0.001 | 0.002
MOM-RCNN with Adam vs MOM-RCNN with SGD+Adam | 0.028 | 0.043 | <0.001 | 0.022 | 0.015

5.3 Ablation study

In recent years, ablation studies have been included in several notable publications to evaluate deep learning performance (Horvitz & Apacible, 2003; Hessel et al., 2018; Meyes et al., 2019). Following the original ablation studies in physiology, these experiments remove specific features or components of the model from the training process and observe the resulting performance. Additionally, ablation studies can support efforts towards designing explainable and interpretable machine learning systems (Lipton & Steinhardt, 2018; Meyes et al., 2019).

Layer ablation: Another kind of ablation experiment results from adding new layers or new types of connections between neurons or layers. In this layer ablation study, we inspect the effect of adding layers by using ResNet-101. The training time increases tremendously, to approximately 1 week, with little to no improvement in accuracy.

Feature ablation: We conduct a feature ablation study on the group normalization layer. To verify the effectiveness of this feature, we conduct two kinds of experiments.
Ablation study on using Adam: To verify the effectiveness of our algorithm, we first replaced the SGD optimizer with the Adam optimizer and then with the combined SGD + Adam schedule. The initial learning rate for SGD was set to 0.01 with momentum 0.9, and the learning rate was decayed by a factor of 0.1 every 20 epochs. The corresponding results are reported in Tables 1–6: SGD + Adam consistently outperforms SGD alone. The two-stage schedule is sketched below.
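The two-stage schedule can be expressed with standard Keras optimizers, as in the following sketch. The SGD hyperparameters are the ones stated above; the loss, epoch counts, and Adam learning rate are illustrative placeholders rather than the values used for MOM-RCNN.

    import tensorflow as tf

    def step_decay(epoch, lr):
        """Multiply the learning rate by 0.1 every 20 epochs, as stated above."""
        return lr * 0.1 if epoch > 0 and epoch % 20 == 0 else lr

    def train_two_stage(model, train_data, val_data, sgd_epochs=40, adam_epochs=40):
        """Two-stage schedule: SGD with momentum first, then Adam on the same weights."""
        early_stop = tf.keras.callbacks.EarlyStopping(patience=5,
                                                      restore_best_weights=True)
        # Stage 1: SGD, initial learning rate 0.01, momentum 0.9, step decay.
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01,
                                                        momentum=0.9),
                      loss="binary_crossentropy")  # placeholder loss
        model.fit(train_data, validation_data=val_data, epochs=sgd_epochs,
                  callbacks=[tf.keras.callbacks.LearningRateScheduler(step_decay),
                             early_stop])
        # Stage 2: re-compiling swaps the optimizer (and resets its state) while
        # the learned weights carry over, so Adam continues from the SGD solution.
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                      loss="binary_crossentropy")
        model.fit(train_data, validation_data=val_data, epochs=adam_epochs,
                  callbacks=[early_stop])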
6 Discussion and Conclusion

We proposed a new deep learning method called MOM-RCNN. The proposed method uses the multistage optimization of SGD and Adam to generate MRI segmentation labels from T1 and T2 IVD images. Training combines the class prediction classification losses (Lr,class and Lm,class), the bounding box refinement regression losses (Lr,box and Lm,box), and the mask prediction generation loss (Lmask) obtained from the related stages.
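The combined objective is not printed in this section; assuming the standard Mask R-CNN decomposition (He et al., 2017), with the subscript r denoting the region proposal stage and m the detection and mask heads, these terms would sum to the total training loss:

    L_{\mathrm{total}} = \underbrace{L_{r,\mathrm{class}} + L_{r,\mathrm{box}}}_{\text{region proposal stage}} + \underbrace{L_{m,\mathrm{class}} + L_{m,\mathrm{box}} + L_{\mathrm{mask}}}_{\text{detection and mask heads}}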
The proposed method solves the generalization problem through its capability to generate segmentation labels for both T1 and T2 IVD images. It also improves processing performance, consuming only 5 seconds per image, while widely used medical image segmentation networks require 12–25 seconds per image; it remains competitive with SpineParseNet (Pang et al., 2021), which consumes 4.17 seconds on a higher-end GPU. The proposed method was tested on real public medical data and demonstrated its effectiveness in the qualitative assessment (Fig. 8), where MOM-RCNN achieves competitive results compared with several other widely used segmentation methods. Quantitatively, the method was evaluated through similarity metrics: DC (99%), sensitivity (88%), specificity (98%), F-measure (0.9), precision (92%), GCE (0.03), RI (0.98), ARI (0.89), MI (0.07), VOI (0.05), KAP (0.87), AUC (0.87), MCC (0.87), RMSE (0.095 ± 0.026 mm), HD (5.155 ± 1.561 mm), and ASSD (0.49 ± 0.23 mm). Both the qualitative and the quantitative experimental results demonstrate that our proposed method improves accuracy, corrects segmentation errors, and exhibits better segmentation performance than previous methods: high specificity and sensitivity, high overlap, and a short distance between the manual annotation and the predicted segmentation. Although the proposed method exhibits competitive results in the test case, we intend to conduct further assessments using different backbone architecture combinations and broader training data to improve performance. Such segmentation methods are expected to have a broad impact by supporting clinician decisions. Furthermore, we will carry this research on to other tasks; we aim to apply the method to the Medical Segmentation Decathlon dataset for further improvement.

The main reason the segmentation is conducted in 2D is that the manual segmentations were performed slice-wise in the sagittal plane; the same problem was faced by Dolz et al. (2019). Computational limitations are another reason, and a 3D network does not always perform better than 2D segmentation (Abulnaga & Rubin, 2019). SpineParseNet utilized 3D graph convolutional segmentation, yet it still required an additional 2D stage to refine the IVD segmentation (Pang et al., 2021). Nevertheless, these works show that the idea can be extended to a 3D network once sufficient computation becomes available at an affordable price. We seek to develop a system capable of handling a more comprehensive range of medical data at affordable cost and of further reducing training time through code optimization.

Acknowledgments

This work was supported by the World Class 300 Project (R&D) (no. S2482672) funded by the Ministry of SMEs and Startups. This research was also partially supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea), the Ministry of Science & ICT (MSIT, Korea), and the Ministry of Health & Welfare (MOHW, Korea) under the Technology Development Program for AI-Bio-Robot-Medicine Convergence (20001655).

Conflict of interest statement

None declared.

Footnotes

1. Available at: https://www.python.org/downloads/release/python-373/.
2. Available at: https://github.com/fchollet/keras.
3. Available at: https://www.tensorflow.org/.

REFERENCES

Abulnaga, S. M., & Rubin, J. (2019). Ischemic stroke lesion segmentation in CT perfusion scans using pyramid pooling and focal loss. Lecture Notes in Computer Science, 11383, 352–363.
Belavy, D., Bock, O., Börst, H., Armbrecht, G., Gast, U., Degner, C., Beller, G., Soll, H., Salanova, M., Habazettl, H., Heer, M., de Haan, A., Stegeman, D. F., Cerretelli, P., Blottner, D., Rittweger, J., Gelfi, C., Kornak, U., & Felsenberg, D. (2010). The 2nd Berlin bedrest study: Protocol and implementation. Journal of Musculoskeletal and Neuronal Interactions, 10(3), 207–219.
Bressler, H. B., Keyes, W. J., Rochon, P. A., & Badley, E. (1999). The prevalence of low back pain in the elderly: A systematic review of the literature. Spine, 24(17), 1813–1819.
Chan, H.-P., Hadjiiski, L. M., & Samala, R. K. (2020). Computer-aided diagnosis in the era of deep learning. Medical Physics, 47(5), e218–e227.
Chen, C., Belavy, D., & Zheng, G. (2014). 3D intervertebral disc localization and segmentation from MR images by data-driven regression and classification. In Machine Learning in Medical Imaging (pp. 50–58). Springer International Publishing.
Dahl, G., Sainath, T., & Hinton, G. E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 8609–8613).
Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F.-F. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Dolz, J., Desrosiers, C., & Ben Ayed, I. (2019). IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet. In Computational Methods and Clinical Applications for Spine Imaging (pp. 130–143). Springer International Publishing.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Fallah, F., Walter, S. S., Bamberg, F., & Yang, B. (2019). Simultaneous volumetric segmentation of vertebral bodies and intervertebral discs on fat-water MR images. IEEE Journal of Biomedical and Health Informatics, 23(4), 1692–1701.
Gao, Y., Mas, J., Kerle, N., & Navarrete Pacheco, J. (2011). Optimal region growing segmentation and its effect on classification accuracy. International Journal of Remote Sensing, 32(13), 3747–3763.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track, 9, 249–256.
Gopalakrishnan, K., Khaitan, S. K., Choudhary, A., & Agrawal, A. (2017). Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Construction and Building Materials, 157, 322–330.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 2980–2988).
Helmbold, D. P., & Long, P. M. (2017). Surprising properties of dropout in deep networks. Journal of Machine Learning Research, 18(1), 7284–7311.
Hessel, M., Modayil, J., Hasselt, H. V., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., & Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In AAAI.
Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
Horvitz, E., & Apacible, J. (2003). Learning and reasoning about interruption. In Proceedings of the 5th International Conference on Multimodal Interfaces, ICMI '03 (pp. 20–27). Association for Computing Machinery.
Huang, Z., Pan, Z., & Lei, B. (2017). Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data. Remote Sensing, 9, 907.
Ibtehaz, N., & Rahman, M. S. (2020). MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks, 121, 74–87.
Inoue, H. (2019). Multi-sample dropout for accelerated training and better generalization. arXiv:1905.09788.
Jain, V., Seung, H. S., & Turaga, S. C. (2010). Machines that learn to segment images: A crucial technology for connectomics. Current Opinion in Neurobiology, 20(5), 653–666.
Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P., Kane, A. D., Menon, D. K., Rueckert, D., & Glocker, B. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis, 36, 61–78.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015.
Korez, R., Ibragimov, B., Likar, B., Pernus, F., & Vrtovec, T. (2017). Intervertebral disc segmentation in MR images with 3D convolutional networks. In Medical Imaging 2017: Image Processing (Vol. 10133, pp. 43–52). International Society for Optics and Photonics.
Kumar, D., Jain, N., Khurana, A., Mittal, S., Satapathy, S. C., Senkerik, R., & Hemanth, J. D. (2020). Automatic detection of white blood cancer from bone marrow microscopic images using convolutional neural networks. IEEE Access, 8, 142521–142531.
Li, C., Tong, R., & Tang, M. (2018). Modelling human body pose for action recognition using deep neural networks. Arabian Journal for Science and Engineering, 43(12), 7777–7788.
Lin, J., Camoriano, R., & Rosasco, L. (2016). Generalization properties and implicit regularization for multiple passes SGM. In Proceedings of Machine Learning Research (Vol. 48, pp. 2340–2348).
Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 936–944).
Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. arXiv:1807.03341.
Liu, X., Faes, L., Kale, A. U., Wagner, S. K., Fu, D. J., Bruynseels, A., Mahendiran, T., Moraes, G., Shamdas, M., Kern, C., Ledsam, J. R., Schmid, M. K., Balaskas, K., Topol, E. J., Bachmann, L. M., Keane, P. A., & Denniston, A. K. (2019). A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. The Lancet Digital Health, 1(6), e271–e297.
Liu, L., Cheng, J., Quan, Q., Wu, F.-X., Wang, Y.-P., & Wang, J. (2020). A survey on U-shaped networks in medical image segmentations. Neurocomputing, 409, 244–258.
Liu, X., Song, L., Liu, S., & Zhang, Y. (2021). A review of deep-learning-based medical image segmentation methods. Sustainability, 13(3), 1–29.
Lundervold, A. S., & Lundervold, A. (2019). An overview of deep learning in medical imaging focusing on MRI. Zeitschrift für Medizinische Physik, 29(2), 102–127.
Mansilla, L., Milone, D. H., & Ferrante, E. (2020). Learning deformable registration of medical images with anatomical constraints. Neural Networks, 124, 269–279.
Masi, I., Wu, Y., Hassner, T., & Natarajan, P. (2018). Deep face recognition: A survey. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI) (pp. 471–478).
Meyes, R., Lu, M., de Puiseau, C. W., & Meisen, T. (2019). Ablation studies in artificial neural networks. arXiv:1901.08644.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pang, S., Pang, C., Zhao, L., Chen, Y., Su, Z., Zhou, Y., Huang, M., Yang, W., Lu, H., & Feng, Q. (2021). SpineParseNet: Spine parsing for volumetric MR image by a two-stage segmentation framework with semantic image representation. IEEE Transactions on Medical Imaging, 40(1), 262–273.
Park, C.-W., Seo, S. W., Kang, N., Ko, B., Choi, B. W., Park, C. M., Chang, D. K., Kim, H., Kim, H., Lee, H., Jang, J., Ye, J. C., Jeon, J. H., Seo, J. B., Kim, K. J., Jung, K.-H., Kim, N., Paek, S., & Shin, S.-Y. (2020). Artificial intelligence in health care: Current applications and issues. Journal of Korean Medical Science, 35(42), e379. https://doi.org/10.3346/jkms.2020.35.e379.
Podichetty, V. K., Mazanec, D. J., & Biscup, R. S. (2003). Chronic non-malignant musculoskeletal pain in older adults: Clinical issues and opioid intervention. Postgraduate Medical Journal, 79(937), 627–633.
Prince, M. J., Wu, F., Guo, Y., Gutierrez Robledo, L. M., O'Donnell, M., Sullivan, R., & Yusuf, S. (2015). The burden of disease in older people and implications for health policy and practice. Lancet (London, England), 385(9967), 549–562.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400–407.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 9351 of LNCS (pp. 234–241). Springer.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
Rundo, L., Han, C., Nagano, Y., Zhang, J., Hataya, R., Militello, C., Tangherloni, A., Nobile, M. S., Ferretti, C., Besozzi, D., Gilardi, M. C., Vitabile, S., Mauri, G., Nakayama, H., & Cazzaniga, P. (2019). USE-Net: Incorporating squeeze-and-excitation blocks into U-Net for prostate zonal segmentation of multi-institutional MRI datasets. Neurocomputing, 365, 31–43.
Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., & Rueckert, D. (2019). Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis, 53, 197–207.
Seon-Yu, K., In-Sik, L., Bo-Ram, K., Jeong-Hoon, L., Jongmin, L., Seong-Eun, K., Beom, K. S., & Lee, P. S. (2012). Magnetic resonance findings of acute severe lower back pain. Annals of Rehabilitation Medicine, 36(1), 47–54.
Shin, H., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., & Summers, R. M. (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5), 1285–1298.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56), 1929–1958.
Taha, A. A., & Hanbury, A. (2015). Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Medical Imaging, 15, 29.
Takatalo, J., Karppinen, J., Niinimäki, J., Taimela, S., Näyhä, S., Järvelin, M.-R., Kyllönen, E., & Tervonen, O. (2009). Prevalence of degenerative imaging findings in lumbar magnetic resonance imaging among young adults. Spine, 34(16), 1716–1721.
Tang, X. (2020). The role of artificial intelligence in medical imaging research. BJR—Open, 2(1), 20190031.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Vania, M., Mureja, D., & Lee, D. (2019). Automatic spine segmentation from CT images using convolutional neural network via redundant generation of class labels. Journal of Computational Design and Engineering, 6(2), 224–232.
Vidal, P. L., de Moura, J., Novo, J., & Ortega, M. (2021). Multi-stage transfer learning for lung segmentation using portable X-ray devices for patients with COVID-19. Expert Systems with Applications, 173, 114677.
Wang, C., Guo, Y., Chen, W., & Yu, Z. (2019). Fully automatic intervertebral disc segmentation using multimodal 3D U-Net. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC) (Vol. 1, pp. 730–739).
Wu, Y., & He, K. (2018). Group normalization. In ECCV.
Zhang, C., Vinyals, O., Munos, R., & Bengio, S. (2018). A study on overfitting in deep reinforcement learning. arXiv:1804.06893.
Zhang, Y.-D., Satapathy, S. C., Liu, S., & Li, G.-R. (2021). A five-layer deep convolutional neural network with stochastic pooling for chest CT-based COVID-19 diagnosis. Machine Vision and Applications, 32(1), 14.
Zheng, G., Chu, C., Belavy, D. L., Ibragimov, B., Korez, R., Vrtovec, T., Hutt, H., Everson, R., Meakin, J., Andrade, I. L., Glocker, B., Chen, H., Dou, Q., Heng, P.-A., Wang, C., Forsberg, D., Neubert, A., Fripp, J., Urschler, M., & Li, S. (2017). Evaluation and comparison of 3D intervertebral disc localization and segmentation methods for 3D T2 MR data: A grand challenge. Medical Image Analysis, 35, 327–344.
Zhou, T., Ruan, S., & Canu, S. (2019). A review: Deep learning for medical image segmentation using multi-modality fusion. Array, 3-4, 100004.

© The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Computational Design and Engineering. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

TI - Intervertebral disc instance segmentation using a multistage optimization mask-RCNN (MOM-RCNN)
JF - Journal of Computational Design and Engineering
DO - 10.1093/jcde/qwab030
DA - 2021-06-18
UR - https://www.deepdyve.com/lp/oxford-university-press/intervertebral-disc-instance-segmentation-using-a-multistage-4buPNERhlM
SP - 1023
EP - 1036
VL - 8
IS - 4
DP - DeepDyve
ER -