TY - JOUR
AU - Zhang, Jian J.
AB - 1 Introduction
Medical visualisation commonly involves volumetric medical data such as CT, MRI, and PET scans and confocal spectral microscopy images. It is essential in clinical practice across biomedical disciplines such as radiology, nuclear medicine, surgery planning, and nearly all neuroscience sub-fields. However, the generated volume data often becomes very large, sometimes reaching terabyte scale. For instance, biological volumetric datasets that capture microscale details of cells or tissues are commonly produced [1–5]. The emerging challenges lie in organising, storing, transmitting, manipulating, and rendering such terabyte-scale volume data.
Recent advances in deep neural networks have led to their rapid application in medical imaging [6–8]. In particular, implicit neural representations such as SIREN [9] have become an approach for compressing volumetric medical images by storing the parameters of a trained neural network instead of explicit voxel data. However, the compression rate is often limited, and volumetric data still require considerable memory, especially GPU memory. This results in high memory demands and longer training times for deep learning applications. In addition, little research currently addresses these specific challenges.
To address these challenges, this paper presents an end-to-end architecture that improves compression rates and reduces GPU memory usage, building on our previous work [10]. The proposed architecture consists of three key modules: a downsampling module, an Implicit Neural Representation (INR) module, and a 3D Super-Resolution (SR) module (e.g., [11]). The downsampling module reduces the data size, enabling the INR module to represent the volume using a compact deep neural network. The SR module then reconstructs the original high-resolution volume from the INR module output.
This architecture reduces memory needs and allows for more efficient neural network training. The main challenge lies in achieving a high compression rate with minimal reconstruction loss. To address this, we propose a trade-off point method that optimises the configuration of each module to achieve peak performance. This approach can be generalised to a wide range of deep network designs.
Our key contributions are twofold. First, we propose an end-to-end architecture with three computational modules, designed to optimise volumetric data compression by achieving a high compression rate while maintaining superior reconstruction quality and minimising GPU memory consumption. Second, we introduce a trade-off point method to determine the optimal configuration for the proposed end-to-end architecture, balancing key performance metrics such as compression rate and reconstruction quality.
The rest of the paper is structured as follows. Section 2 briefly reviews related work. Section 3 presents the proposed architecture and the trade-off point method. Section 4 presents experimental results and analysis. Finally, Section 5 concludes our work.
2 Background and relevant literature
In our previous work [10], we developed an architecture that leveraged existing pre-trained deep networks to decrease the volume data size. The basic idea is to transform volume data into an implicit neural representation, such as SIREN [9], to compress the data while maintaining reconstruction accuracy. However, pre-trained deep networks often struggle to generalise well, especially to medical volume data. Many pre-trained super-resolution networks require fine-tuning for each medical dataset, and a "one-size-fits-all" approach does not work because each dataset has its own characteristics. Therefore, this paper aims to train an end-to-end deep network rather than simply piecing together multiple pre-trained networks.
2.1 Implicit neural representation
Representing 3D geometry for rendering and reconstruction involves trade-offs across fidelity, efficiency, and compression capabilities. The DeepSDF model [12] uses a continuous Signed Distance Function (SDF) to represent shapes. Another approach [13] employs an encoder-decoder neural architecture for lossless compression, but it has a high inference time due to explicit optimisation requirements. MedZip [14] proposes a lossless compression technique employing Long Short-Term Memory (LSTM) for volumetric MRI and CT. NeRF [15] presents a notable method for synthesising new views of a volumetric scene through an implicit neural representation as a continuous function. However, it is more time-consuming than SIREN [9]. The method in [16] reduces memory usage by predicting an occupancy function for a continuous volume. COIN [17] applies a multi-layer perceptron (MLP) to implicit neural network compression by encoding geometric inputs, but it performs worse than state-of-the-art compression methods. INR-GAN [18] applies a GAN model to multi-scale Implicit Neural Representations (INRs) but struggles with artefacts when dealing with high-frequency features. NeRP [19] introduces a novel approach to generate a computational image from sampled sensor data. Sparsely sampled images pose additional hurdles because of the limited data points. Unlike previous deep learning methods for image reconstruction, NeRP leverages both the internal structure of an image prior and the physics governing sparsely sampled measurements to represent the entire subject.
2.2 Super-resolution techniques
Numerous techniques leveraging convolutional neural networks (CNNs) have demonstrated exceptional performance in image super-resolution (SR).
The pioneering work of SRCNN [20] introduced CNNs to SR by learning a non-linear mapping from low-resolution to high-resolution images with only three convolution layers. CNN-based methods have shown impressive SR performance, but they become impractical under constraints on time and memory resources [21–30]. SRNO [11] is designed for continuous super-resolution tasks. It treats each image as a function and learns a mapping between finite-dimensional function spaces, enabling it to train and generalise across various discretisation levels. Experiments demonstrate that SRNO surpasses other arbitrary-scale super-resolution methods in both performance and computational time, and that it excels at capturing global image structures, which is important in medical imaging.
Table 1 highlights the gaps between the proposed method and four state-of-the-art models (SIREN [9], MedZip [14], NeRF [15], and COIN [17]) across several key metrics: high compression rate, low GPU memory consumption, high reconstruction quality (PSNR > 40), good visual similarity (SSIM > 0.9), scalability to large datasets, fast training time, adaptability to medical imaging, and handling of high-frequency features. The proposed method addresses several limitations of existing models, particularly in achieving high compression rates and excellent reconstruction quality, while maintaining efficiency in GPU memory usage and adaptability to medical imaging tasks.
Table 1. Identifying gaps in state-of-the-art models compared to the proposed method. https://doi.org/10.1371/journal.pone.0314944.t001
3 Methodology
In this section, we first present the end-to-end architecture and then introduce the trade-off point approach to evaluate the proposed architecture in terms of compression efficiency and reconstruction accuracy.
3.1 Proposed end-to-end architecture
Our end-to-end architecture, shown in Fig 1, is composed of three core modules: Downsampling, Implicit Neural Representation (INR), and Super-Resolution (SR). The Downsampling module does not require training, whereas the INR and SR modules are trained jointly in an end-to-end way. We employ an L1 loss function to evaluate reconstruction quality.
In the following sections, we explain each module individually.
Fig 1. Workflow of the proposed end-to-end architecture, including downsampling, implicit neural representation (INR), and super-resolution (SR) modules. https://doi.org/10.1371/journal.pone.0314944.g001
3.1.1 3D downsampling module.
Given a high-resolution volume x, this module aims to acquire its low-resolution counterpart y. The relationship between x and y can be modelled as
$y = \mathcal{F}_{L}^{-1}\,\mathcal{D}\,\mathcal{F}_{H}\,x + n$ (1)
where $\mathcal{F}_{H}$ is the FFT operator for the high-resolution regime, $\mathcal{F}_{L}^{-1}$ is the inverse FFT operator for the low-resolution regime, $\mathcal{D}$ is the low-pass operator on the frequency domain, and n is the noise. The Fourier transform is widely employed in medical imaging [31]. The operator in the frequency domain is both controllable and easy to implement. In our case, it effectively generates low-resolution volumes at three different downsampling scales. This module also requires no training.
3.1.2 3D implicit neural representation (INR).
The INR module harnesses the capabilities of implicit neural networks to encode volumetric data efficiently. Specifically, using an INR for low-resolution volumes helps prevent memory overflow. Unlike conventional explicit representations, INRs depict the volume as a continuous function that maps spatial coordinates to voxel intensity values. This enables a concise representation that can be readily adjusted to different levels of detail. Drawing inspiration from recent breakthroughs in implicit neural representations, we employ a multi-layer perceptron (MLP) architecture with periodic activation functions (i.e., SIREN [9]) to capture the intricate structures within the volumetric data.
3.1.3 3D super resolution (SR) module.
The SR module employs the super-resolution model SRNO [11].
The SRNO model uses deep learning to learn intricate transformations from low-resolution to high-resolution data. Beyond enhancing resolution, SRNO models frequently possess intrinsic denoising abilities, resulting in cleaner and clearer images. Compared to other super-resolution techniques, SRNO models can produce images with fewer artefacts, such as ringing and blurring [11]. Moreover, the number of channels in the attention structure can significantly influence the SRNO model's performance. We therefore regard it as a hyper-parameter of the SRNO model and evaluate SRNO with respect to it.
3.2 Trade-off point approach
To achieve an overall optimal performance for our proposed end-to-end architecture, we propose a metric system to measure overall performance and then determine the optimal setting for each module accordingly. This design method is called the Trade-off Point Method. Our metric system includes four measurements: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Bitrate, and Compression Rate (CR), defined below. PSNR measures pixel-level accuracy by calculating the ratio of signal power to noise power, yet it often does not correspond to human visual perception. In contrast, SSIM assesses perceptual quality by comparing luminance, contrast, and structure, but may overlook precise pixel-wise errors. Recognising the limitations of using PSNR or SSIM alone, we combine both metrics to evaluate image quality thoroughly.
3.2.1 Metric definition.
Peak Signal-to-Noise Ratio (PSNR): PSNR measures the quality of a reconstructed or compressed signal compared to the original signal. It is expressed in decibels (dB) and is calculated as
$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$ (2)
where MAX is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image), and MSE is the Mean-Squared Error between the original and reconstructed images.
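The PSNR definition, PSNR = 10 log10(MAX^2 / MSE), can be illustrated with a minimal sketch. It assumes the volume has been flattened to a plain sequence of intensities; the function name `psnr` and the sample values are illustrative and do not come from the paper.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """PSNR in dB between two equal-length sequences of voxel intensities."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean-squared error over all voxels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# Illustrative 8-bit intensities (not data from the paper).
original = [0, 64, 128, 255]
reconstructed = [0, 66, 126, 255]
print(round(psnr(original, reconstructed), 2))
```

Because the MSE sits in the denominator, halving the error raises PSNR by about 3 dB, which is why small PSNR differences can correspond to noticeable changes in reconstruction error.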
A high PSNR value indicates a high-quality reconstruction, as it signifies that the reconstructed signal is closer to the original signal in terms of fidelity.
Structural Similarity Index Measurement (SSIM): SSIM assesses the similarity between a reference (original) image and a distorted or processed image by considering three key components: luminance, contrast, and structure. It is defined as
$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$ (3)
where $\mu_x$ and $\mu_y$ are the means of the original and distorted images, respectively, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is the covariance of the original and distorted images, and $C_1$ and $C_2$ are small constants added for numerical stability. The SSIM value ranges from -1 to 1, with 1 indicating perfect similarity. High SSIM values indicate high similarity between the images, while low values suggest more significant differences or distortions.
Bitrate: Bitrate, expressed in bits per pixel (Bpp), quantifies the amount of data assigned to each pixel in a raster image. Bpp indicates the level of detail or precision in representing colour or intensity information for each pixel. It is computed as, (4) In greyscale images, each pixel is represented by a single channel (e.g., luminance), and Bpp reduces to, (5) When compression techniques are applied, the Bitrate measures the density of the pixel values of the image and is used to assess the trade-off between image quality and file size. High Bitrate values generally result in high-quality but large image files, while low Bitrate values correspond to more aggressive compression and small files, with potential quality loss.
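The SSIM definition in Eq (3) can be sketched in a few lines. This is a simplified illustration that computes the statistics once over the whole signal, whereas common implementations average SSIM over local windows. The constants C1 = (0.01 MAX)^2 and C2 = (0.03 MAX)^2 are the values conventionally used for SSIM and are an assumption here, since the text does not state which constants were used. The function name `ssim_global` is hypothetical.

```python
def ssim_global(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over two equal-length sequences of intensities."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilising constants (conventional choice, assumed here).
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Illustrative intensities (not data from the paper).
ramp = [10, 20, 30, 40]
print(ssim_global(ramp, ramp))        # identical signals
print(ssim_global(ramp, ramp[::-1]))  # same mean and variance, reversed structure
```

The second call shows why SSIM complements PSNR: the reversed ramp has the same mean and variance as the original, but its negative covariance drives the structure term, and hence the score, below zero.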
Downsampling Scale (DS): Let $D_x$, $D_y$, and $D_z$ be the original dimensions of the 3D image stack in an (x, y, z) coordinate system, and let $(d_x, d_y, d_z)$ be the dimensions after downsampling. The DS $(s_x, s_y, s_z)$ is defined as
$(s_x, s_y, s_z) = \left(\frac{D_x}{d_x}, \frac{D_y}{d_y}, \frac{D_z}{d_z}\right)$
We may simply set $s_x$, $s_y$, and $s_z$ to the same value.
Number of neurons in SIREN (SN): With SIREN's layer count set at 3, each layer contains an identical number of neurons. We vary the neuron count per layer from 30 to 230 and use it to represent SIREN's size.
Number of Channels (NC): We incorporate the 3D version of SRNO into the SR module. The cornerstone of a super-resolution network lies in its feature extractor, and existing super-resolution models have their own feature-extractor topologies. The number of channels indicates the feature extractor's size and thereby reflects the complexity of the super-resolution network. Within our proposed architecture, this complexity is particularly influenced by the downsampling scale, which leads to a significant increase in channel numbers due to the abundance of volume data. To minimise the size of the SR module, we first assess its performance with different sizes of the attention mechanism and fully connected layer submodules, and then fix the topologies and sizes of these two submodules. The channel number of the feature extractor, however, remains adaptable to accommodate varying reconstruction accuracy requirements.
Compression Rate (CR): The CR is the ratio of the uncompressed data's size to the compressed data's size. A high compression rate indicates an efficient compression process, as it signifies a large reduction in data size. It is defined as
$\mathrm{CR} = \frac{\text{size of uncompressed data}}{\text{size of compressed data}}$ (6)
In this paper, we define the size of a deep network by its weight count and the size of a volume by its voxel number.
3.2.2 Trade-off settings.
To find the trade-off settings for the individual modules, we first apply the PSNR, SSIM, and CR metrics defined in the above section separately to a specific volume, varying three dimensions: DS, NC, and SN. The different combinations of DS, NC, and SN result in different measurements, which are stored in a 3D array, as shown in Fig 2. We need to balance the performance in terms of PSNR, SSIM, and CR associated with each combination of the three dimensions (DS, NC, SN) to determine the trade-off point for our end-to-end architecture. This may be described as, (7) where 3DA denotes the 3D array with the three dimensions DS, NC, and SN, DSmax denotes the given maximum value for DS, and the other bounds are defined similarly. Applying the Augmented Lagrangian method here yields, (8) where α denotes the Lagrange multipliers and β is the penalty parameter. The resulting (x, y, z) is called the trade-off point. To visualise it, we compute the marginal distributions over the three dimensions separately on 3DA as below, (9)
Fig 2. Illustration of the data structure in the context of the metrics, PSNR, SSIM and CR, according to the DS, NC and SN dimensions. https://doi.org/10.1371/journal.pone.0314944.g002
There are three sets of marginal distributions in total. Each set illustrates, in turn, the PSNR bounds, SSIM bounds, and CR bounds along one dimension at the scale specified by the trade-off point. Theoretical equivalence is expected among these three sets of PSNR, SSIM, and CR bounds at the trade-off point. The trade-off point indicates the tolerance of the proposed architecture in the three dimensions at an expected level of PSNR, SSIM, and CR bounds. The area delimited by the trade-off point intuitively and quantitatively illustrates the proposed architecture's performance.
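The data structure described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the grid values for DS and NC are hypothetical, `toy_metrics` is a made-up stand-in for training and evaluating one configuration (its numbers are not results), the per-dimension "bounds" are taken to be the minimum and maximum over the other two dimensions, and a brute-force constrained search replaces the Augmented Lagrangian formulation. The thresholds PSNR ≥ 40 and SSIM ≥ 0.9 echo the quality levels quoted for Table 1.

```python
from itertools import product

# Hypothetical grid values; only the SN range (30 to 230) comes from the text.
DS = [2, 4, 8]
NC = [16, 32]
SN = [30, 130, 230]

def toy_metrics(ds, nc, sn):
    """Made-up stand-in for evaluating one configuration (not real data)."""
    psnr = 30.0 + 0.05 * sn + 0.2 * nc - 1.5 * ds
    ssim = min(1.0, 0.80 + 0.0005 * sn + 0.002 * nc - 0.01 * ds)
    cr = 1000.0 * ds / (sn + nc)
    return psnr, ssim, cr

# The "3D array": one (PSNR, SSIM, CR) triple per (DS, NC, SN) combination.
grid = {cfg: toy_metrics(*cfg) for cfg in product(DS, NC, SN)}

def marginal_bounds(axis, metric):
    """(min, max) of one metric for each value along one dimension.

    axis: 0 = DS, 1 = NC, 2 = SN; metric: 0 = PSNR, 1 = SSIM, 2 = CR.
    """
    bounds = {}
    for cfg, values in grid.items():
        value = values[metric]
        lo, hi = bounds.get(cfg[axis], (value, value))
        bounds[cfg[axis]] = (min(lo, value), max(hi, value))
    return bounds

def trade_off_point(min_psnr=40.0, min_ssim=0.9):
    """Configuration with the highest CR that still meets both quality floors."""
    feasible = {c: v for c, v in grid.items()
                if v[0] >= min_psnr and v[1] >= min_ssim}
    if not feasible:
        return None
    return max(feasible, key=lambda c: feasible[c][2])

print(trade_off_point())
print(marginal_bounds(axis=0, metric=0))
```

An exhaustive search is only practical because the grid is small. It is used here to make the selection rule explicit, and a continuous optimiser would be needed for finer grids.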
3.1 Proposed end-to-end architecture Our end-to-end architecture, shown in Fig 1, is composed of three core modules: Downsampling, Implicit Neural Representation (INR), and Super-Resolution (SR). The Downsampling module does not require training. We need to train the INR and SR modules in an end-to-end way. We employ a L1 loss function to evaluate reconstruction quality here. In the following sections, we will explain each module individually. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 1. Workflow of the proposed end-to-end architecture, including downsampling, implicit neural representation (INR), and super-resolution (SR) modules. https://doi.org/10.1371/journal.pone.0314944.g001 3.1.1 3D downsampling module. Given a high-resolution volume of x, this module aims to acquire its low-resolution counterpart y. The relationship between x and y can be modelled as follows, (1) where, is the FFT operator for the high-resolution regime, is the inverse FFT operator for the low-resolution regime, is the low-pass operator on the frequency domain, and n is the noise. Fourier Transform technique is widely employed in medical imaging [31]. We hope to point out that the operator in the frequency domain is both controllable and easy to implement. In our case, it effectively generates low-resolution volumes at downsampling scales of , , and . Additionally, it can be noted that this module does not need training. 3.1.2 3D implicit neural representation (INR). The INR module harnesses the capabilities of implicit neural networks to efficiently encode volumetric data. Specifically, using INR for low-resolution volumes helps prevent memory overflow. Unlike conventional explicit representations, INRs depict the volume as a continuous function that maps spatial coordinates to voxel intensity values. This enables a concise representation that can be readily adjusted to different levels of detail. 
Drawing inspiration from recent breakthroughs in implicit neural representations, we employed a multi-layer perceptron (MLP) architecture with periodic activation functions (i.e., SIREN [9]) to effectively capture the intricate structures within the volumetric data. 3.1.3 3D super resolution (SR) module. The SR module employs the super-resolution model, SRNO [11]. SRNO model utilises deep learning to learn intricate transformations from low-resolution to high-resolution data. Beyond enhancing resolution, SRNO models frequently possess intrinsic denoising abilities, resulting in cleaner and clearer images. Compared to other super-resolution techniques, SRNO models can produce images with fewer artefacts, such as ringing and blurring [11]. Moreover, the number of channels in the attention structure can significantly influence the SRNO model’s performance. Thus, we regard it as a hyper-parameter of the SRNO models and evaluate the SRNO by it. 3.1.1 3D downsampling module. Given a high-resolution volume of x, this module aims to acquire its low-resolution counterpart y. The relationship between x and y can be modelled as follows, (1) where, is the FFT operator for the high-resolution regime, is the inverse FFT operator for the low-resolution regime, is the low-pass operator on the frequency domain, and n is the noise. Fourier Transform technique is widely employed in medical imaging [31]. We hope to point out that the operator in the frequency domain is both controllable and easy to implement. In our case, it effectively generates low-resolution volumes at downsampling scales of , , and . Additionally, it can be noted that this module does not need training. 3.1.2 3D implicit neural representation (INR). The INR module harnesses the capabilities of implicit neural networks to efficiently encode volumetric data. Specifically, using INR for low-resolution volumes helps prevent memory overflow. 
Unlike conventional explicit representations, INRs depict the volume as a continuous function that maps spatial coordinates to voxel intensity values. This enables a concise representation that can be readily adjusted to different levels of detail. Drawing inspiration from recent breakthroughs in implicit neural representations, we employed a multi-layer perceptron (MLP) architecture with periodic activation functions (i.e., SIREN [9]) to effectively capture the intricate structures within the volumetric data. 3.1.3 3D super resolution (SR) module. The SR module employs the super-resolution model, SRNO [11]. SRNO model utilises deep learning to learn intricate transformations from low-resolution to high-resolution data. Beyond enhancing resolution, SRNO models frequently possess intrinsic denoising abilities, resulting in cleaner and clearer images. Compared to other super-resolution techniques, SRNO models can produce images with fewer artefacts, such as ringing and blurring [11]. Moreover, the number of channels in the attention structure can significantly influence the SRNO model’s performance. Thus, we regard it as a hyper-parameter of the SRNO models and evaluate the SRNO by it. 3.2 Trade-off point approach To achieve an overall optimal performance for our proposed end-to-end architecture, we propose a metric system to measure overall performance and further determine the optimal setting for each module accordingly. This design method is called the Trade-off Point Method. Our metric system includes four measurements: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Bitrate, and Compression Rate (CR) as below. PSNR provides a measure of pixel-level accuracy by calculating the ratio of signal power to noise power, yet it often does not correspond to human visual perception. In contrast, SSIM assesses perceptual quality by comparing luminance, contrast, and structure, but may overlook precise pixel-wise errors. 
Recognising the limitations of using PSNR or SSIM alone for performance measurement, we combine both metrics to evaluate image quality thoroughly. 3.2.1 Metric definition. Peak Signal to Noise Ratio(PSNR) is a metric used to measure the quality of a reconstructed or compressed signal compared to the original signal. It is expressed in decibels (dB) and is calculated using the following formula: (2) where: MAX is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image), and MSE is the Mean-Squared Error between the original and reconstructed images. A high PSNR value indicates a high-quality reconstruction, as it signifies that the reconstructed signal is closer to the original signal in terms of fidelity. Structural Similarity Index Measurement(SSIM): The Structural Similarity Index Measurement(SSIM) is a metric to assess the similarity between a reference image (original) and a distorted or processed image. SSIM quantifies similarity by considering three key components: luminance, contrast, and structure. SSIM is defined as, (3) where: μx and μy are the means of the original and distorted images, respectively, and are the variances of the original and distorted images, respectively, σxy is the covariance of the original and distorted images, C1 and C2 are small constants added for numerical stability. The SSIM value ranges from -1 to 1, with 1 indicating perfect similarity. High SSIM values indicate high similarity between the images, while low values suggest more significant differences or distortions. Bitrate: Bitrate is a metric used in digital imaging to quantify the amount of data assigned to each pixel in a raster image. Bpp indicates the level of detail or precision in representing colour or intensity information for each pixel. High Bpp values typically result in high image quality but large file size, while low Bpp values lead to low quality but small files. 
It is computed as, (4) In greyscale images, each pixel is represented by a single channel (e.g., luminance). Bpp is degraded as, (5) When compression techniques are applied, the Bitrate measures the density of the pixel value of the image to assess the trade-off between image quality and file size. High Bitrate values generally result in high-quality but large image files, while low Bitrate values lead to more aggressive compression and small files but with potential quality loss. Downsampling Scale (DS): Let Dx, Dy, and Dz be the original dimensions of the 3D image stacks in a (x, y, z) coordinate system, respectively; and the new dimensions be (dx, dy, dz) after downsampling. The DS (sx, sy, sz) is defined as, We may simply set (sx, sy, sz) identically. Number of the neurons in SIREN (SN): With SIREN’s layer count set at 3, each layer contains an identical number of neurons. We adjust the neuron count per layer from 30 to 230, using this to represent SIREN’s size. Number of Channels (NC): We incorporate the 3D version of SRNO into the SR module. The cornerstone of a super-resolution network lies in its feature extractor. Existing super-resolution models possess their own topologies for their feature extractors. The number of Channels indicates the feature extractor’s size, thereby reflecting the complexity of the super-resolution network. This complexity is particularly influenced by the downsampling scale within our proposed architecture, leading to a significant increase in channel numbers due to the abundance of volume data. To minimise the size of the SR module in our proposed architecture, we initially assess the performance of the SR module with different sizes of attention mechanisms and fully connected layer submodules, after which we fix the topologies and sizes of these two submodules. However, the channel number of the feature extractor remains adaptable to accommodate varying reconstruction accuracy requirements. 
Compression Rate (CR): The CR refers to the ratio of the compressed data’s size over the uncompressed data’s size. A high compression rate indicates an efficient compression process, as it signifies a remarkable reduction in data size. It is defined as, (6) In this paper, we define the size of a deep network by its weight count and the size of a volume by its voxel number. 3.2.2 Trade-off settings. To find the trade-off settings for the individual modules, we first apply the metrics of PSNR, SSIM, and CR defined in the above section separately to a specific volume of data concerning three dimensions: DS, NC, and SN. The different combinations of DS, NC, and SN result in different measurements, which are stored in a 3D array, as shown in Fig 2. We need to balance the performance of (PSNR, SSIM, and CR) associated with the combination of three dimensions (DS, NC, SN) to determine the trade-off point for our end-to-end architecture. This may be described as, (7) where, 3DA denotes the 3D array with 3 dimensions, DS, NC, SN, and DSmax denotes the given maximum value for DS, and others have a similar definition. Applying the Augmented Lagrangian method here yields, (8) where α are Lagrange factors and β is the penalty parameter. The resulting (x,y,z) is called the trade-off point. To visualise it, we compute the marginal distributions concerning three dimensions separately on 3DA as below, (9) Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Illustration of the data structure in the context of the metrics, PSNR, SSIM and CR, according to the DS, NC and SN dimensions. https://doi.org/10.1371/journal.pone.0314944.g002 There are a total of three sets of marginal distributions. Each set illustrates the PSNR bounds, SSIM bounds, and CR bounds concerning the scale at each dimension specified by the trade-off point, one after another. Theoretical equivalence is expected among these three sets of PSNR, SSIM and CR bounds at the trade-off point. 
The trade-off point indicates the tolerance of the proposed architecture in three dimensions at an expected PSNR, SSIM and CR bounds level. The area delimited by the trade-off point intuitively and quantitatively illustrates the proposed architecture’s performance.
4 Materials and experimental results
Our experiments can be categorised into two parts. The first part aims to justify the selection of each module in our proposed end-to-end architecture. The second part involves applying the trade-off point method to determine an optimal architecture that balances various considerations.
4.1 Data and implementation setup
The dataset comprises 750 multi-parametric magnetic resonance images (mp-MRI) collected from patients diagnosed with either glioblastoma or lower-grade glioma [32]. We select a T2 Fluid-Attenuated Inversion Recovery (FLAIR) 3D scan from a random patient, with a size of 155 × 240 × 240. The implementation of our architecture starts with a high-resolution 3D volumetric input, such as a medical scan, denoted as x. Initially, the input volume undergoes normalisation, scaling the voxel values to a range between 0 and 1. To streamline computations, the volume is segmented into smaller patches, each measuring 64 × 64 × 64. Patches in which 70% or more of the voxels are non-zero carry more information and are classified as High-Resolution (HR) patches. From these, one HR patch is selected as the high-resolution input for further processing. Once the data are prepared, the 3D Downsampling module applies a Fourier Transform to convert the high-resolution volume from the spatial domain to the frequency domain. A low-pass filter is then used to eliminate high-frequency components, thereby reducing resolution.
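The data preparation and frequency-domain downsampling described in this paragraph can be sketched in NumPy as follows. Several details are assumptions rather than statements from the paper: min-max normalisation, non-overlapping tiling of the volume into patches, and a low-pass filter implemented by cropping the centred spectrum before the inverse transform. The function names are illustrative.

```python
import numpy as np


def normalise(volume):
    """Min-max scale voxel values to the range [0, 1]."""
    v = volume.astype(np.float64)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def hr_patches(volume, size=64, min_nonzero=0.70):
    """Yield non-overlapping size^3 patches with >= min_nonzero non-zero voxels."""
    nz, ny, nx = volume.shape
    for z in range(0, nz - size + 1, size):
        for y in range(0, ny - size + 1, size):
            for x in range(0, nx - size + 1, size):
                patch = volume[z:z + size, y:y + size, x:x + size]
                if np.count_nonzero(patch) / patch.size >= min_nonzero:
                    yield patch


def fourier_downsample(patch, scale):
    """Keep only the low-frequency block of the spectrum, then invert."""
    out_shape = tuple(int(round(n * scale)) for n in patch.shape)
    spectrum = np.fft.fftshift(np.fft.fftn(patch))
    # Crop a centred block so the zero-frequency term stays at the centre.
    crop = tuple(
        slice(n // 2 - m // 2, n // 2 - m // 2 + m)
        for n, m in zip(patch.shape, out_shape)
    )
    low = np.fft.ifftn(np.fft.ifftshift(spectrum[crop])).real
    # Rescale so the mean intensity survives the change in voxel count.
    return low * (np.prod(out_shape) / patch.size)


# Toy volume: one empty half and one filled half along the second axis.
toy = np.zeros((64, 128, 64))
toy[:, 64:, :] = 1.0
selected = list(hr_patches(normalise(toy)))
print(len(selected), fourier_downsample(selected[0], 0.5).shape)
```

Cropping the spectrum removes the high-frequency components and reduces the grid size in a single step, which is one simple way to realise a low-pass downsampler; other filter shapes are equally possible.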
This removal process is crucial in medical imaging, as it decreases the data size while preserving essential information, ultimately easing the model’s processing load. The Inverse Fourier Transform reverts the data to the spatial domain, yielding a low-resolution version of the original volume. Next, the downsampled volume is processed by the 3D Implicit Neural Representation (INR) module. Here, a Multi-Layer Perceptron (MLP) with sinusoidal activation functions (SIREN) maps input coordinates to output voxel intensities, enabling the network to represent complex structures as a continuous function. This function is then evaluated at the voxel coordinates to obtain voxel intensities. Following this, the 3D Super-Resolution (SR) module employs a 3D Convolutional Neural Network (CNN) for feature extraction, incorporating an attention mechanism to prioritise significant features. The SR module increases the resolution of the volume, restoring it to a level close to the original. The reconstructed volume, denoted as y, is compared to the original x using an L1 loss function to assess and optimise reconstruction quality. The entire system is trained using the Adam optimiser with a learning rate of 0.0015 for 5,000 epochs on an NVIDIA A4000 16GB GPU with CUDA support in the PyTorch framework. All source code and results are available at https://github.com/asheibanifard/EndtoEndCompression.
4.2 Trade-off architecture
4.2.1 3D downsampling module. The Downsampling module does not require training. This implies that the downsampling scale is preset without consideration of the final result quality. We select three downsampling scales, 1/2, 1/4, and 1/8, in our experiments. It is therefore necessary to test the performance of the proposed architecture, particularly the INR module, at each of the three downsampling scales.
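Because the INR module is fitted to the downsampled grid, its input is simply the coordinate grid at the chosen scale. The sketch below shows a SIREN-style forward pass and the L1 objective in plain NumPy. It is not the authors' PyTorch implementation: the frequency factor w0 = 30 and the uniform weight initialisation follow the scheme commonly used with SIREN, the zero bias initialisation is a simplification, and all names are illustrative.

```python
import numpy as np


def coordinate_grid(shape):
    """Return voxel-centre coordinates in [-1, 1]^3 as an (N, 3) array."""
    axes = [np.linspace(-1.0, 1.0, n) for n in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)


def init_siren(neurons, hidden_layers=3, w0=30.0, seed=0):
    """Create (W, b) pairs for a sine-activated MLP (assumed initialisation)."""
    rng = np.random.default_rng(seed)
    widths = [3] + [neurons] * hidden_layers + [1]
    params = []
    for i, (n_in, n_out) in enumerate(zip(widths, widths[1:])):
        bound = 1.0 / n_in if i == 0 else np.sqrt(6.0 / n_in) / w0
        weights = rng.uniform(-bound, bound, size=(n_in, n_out))
        params.append((weights, np.zeros(n_out)))
    return params


def siren_forward(params, coords, w0=30.0):
    """Map coordinates to intensities: sine on hidden layers, linear output."""
    h = coords
    for weights, bias in params[:-1]:
        h = np.sin(w0 * (h @ weights + bias))
    weights, bias = params[-1]
    return h @ weights + bias


def l1_loss(prediction, target):
    """Mean absolute error between two arrays of equal shape."""
    return float(np.mean(np.abs(prediction - target)))


# One untrained forward pass per downsampling scale of a 64^3 patch.
for side in (32, 16, 8):
    coords = coordinate_grid((side, side, side))
    volume = siren_forward(init_siren(30), coords).reshape(side, side, side)
    print(side, volume.shape)
```

In training, the parameters would be updated to reduce the L1 loss between the predicted and target volumes; the loop above only illustrates how the input size changes with the downsampling scale.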
Table 2 presents a comparison of reconstruction results for the different downsampling scales, illustrating that our proposed architecture maintains a high reconstruction quality across compression levels. Decreasing the downsampling scale does not significantly degrade the quality of the reconstruction. Additionally, non-standard sampling scales such as 1/3, 1/5, or 1/7 would introduce unnecessary complexity and inconsistencies without offering meaningful improvements, making them less suitable for the architecture’s goals. Thus, these three downsampling scales are acceptable.
Table 2. Performance of the INR module and the whole end-to-end architecture. (The upper row shows the performance of a single SIREN and the lower row shows that of the whole end-to-end architecture.) https://doi.org/10.1371/journal.pone.0314944.t002
4.2.2 3D INR module. We opt for the SIREN model [9] as our INR module, focusing on two primary aspects of the SIREN structure: the number of layers and the number of neurons per layer. The goal is to use a compact SIREN model to enhance the compression rate (CR). We experiment with various configurations of the SIREN model, altering the layer count and the neuron count per layer, as detailed in Table 3. We find that a SIREN network with 3 layers and between 30 and 230 neurons per layer offers satisfactory performance, especially for small volume data inputs, while substantially cutting down on GPU memory usage. Furthermore, we compare the performance of a single SIREN model against our proposed architecture, as shown in Table 2. The notable benefit of the proposed architecture is a dramatic reduction in GPU memory consumption while maintaining comparable reconstruction quality.
Additionally, using more than 230 neurons per layer increases the model’s capacity to represent detailed structures but yields diminishing returns in reconstruction quality. Beyond 230 neurons, the gains in PSNR and SSIM are marginal, while the computational cost and GPU memory usage increase significantly, so the additional overhead is not justified. Thus, we prefer the SIREN model with 3 layers in the INR module.
Table 3. Average values for different INR layers and neurons. https://doi.org/10.1371/journal.pone.0314944.t003
4.2.3 3D super-resolution module. We utilise the SRNO [11] for the SR module due to its compact size, as evidenced by the average number of parameters of deep networks in Table 2. We also compare our end-to-end architecture with cutting-edge methods [32–37]. Table 8 reveals that (1) the SR module performs effectively, as our architecture, using a 3-layer SIREN, matches the reconstruction quality of a standalone 5-layer SIREN; and (2) our architecture surpasses other state-of-the-art image compression methods in terms of PSNR and SSIM.
4.2.4 Find a trade-off architecture by trade-off point approach. To find the trade-off point, we first test our proposed architecture on all combinations of NC, DS and SN. The results are presented in Table 4 (4 feature-extraction channels in the SRNO model), Table 5 (8 channels), and Table 6 (16 channels). The trade-off point is then calculated using Eq 8, giving (NC = 4, DS = 1/2, SN = 30). At the trade-off point, the PSNR upper bound is around 38, the SSIM upper bound is around 0.94, and the CR upper bound is around 76.6%, as shown in Table 7.
This is a good setting for the proposed architecture, as it achieves a high compression rate together with good reconstruction quality.
Table 4. The results of our proposed architecture with 4 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t004
Table 5. The results of our proposed network with 8 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t005
Table 6. The results of our proposed network with 16 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t006
Table 7. Our proposed architecture’s trade-off point. https://doi.org/10.1371/journal.pone.0314944.t007
The trade-off point is further illustrated by Eq 9. We show the three sets of marginal distributions for the dimensions NC, DS, and SN in Figs 3–5, respectively. If a lower CR is acceptable, the SIREN size (SN) or channel number (NC) can be increased. However, the reconstruction quality (i.e. PSNR or SSIM) shows only a slight improvement. Thus, enlarging the model size or channel number will not significantly improve reconstruction quality. Additionally, compared to the other existing approaches in Table 8, our architecture maintains a low Bitrate (bpp), so the compressed file size is significantly smaller. Our results (PSNR and SSIM) are still comparable with those of “3D-VOI-OMLSVD [34]”. Fig 6 further shows the reconstructed slices of volume data.
Fig 3. The trade-off point for the number of channels (NC) in the SR module with respect to the performance metrics 1/PSNR, 1-SSIM, and 1-CR.
The red dashed lines indicate the intersection where the optimal trade-off is achieved, balancing compression efficiency and reconstruction quality. https://doi.org/10.1371/journal.pone.0314944.g003
Fig 4. The trade-off point for the downsampling scale (DS), based on the performance metrics 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines highlight where the downsampling scale achieves an optimal balance between compression rate and reconstruction accuracy. https://doi.org/10.1371/journal.pone.0314944.g004
Fig 5. The trade-off point for the number of neurons (SN) in the SIREN model, plotted against the performance metrics 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines indicate the optimal configuration of neurons in the SIREN model for achieving high reconstruction quality with minimal compression loss. https://doi.org/10.1371/journal.pone.0314944.g005
Fig 6. The left column shows different original slices of the volume, with a size of (155, 240, 240); the middle column shows the labelled patches of the slices, with a size of (64, 64, 64); the right column shows the patches reconstructed by our architecture. https://doi.org/10.1371/journal.pone.0314944.g006
Additionally, Fig 7 shows a steady optimisation process over 5,000 epochs, with continuous improvements in reconstruction accuracy and structural similarity. The PSNR curve exceeds 40 dB, indicating high reconstruction quality with minimal error. The SSIM curve approaches 0.96, demonstrating the model’s effectiveness in preserving perceptual and structural fidelity. The steady decrease in the loss function, alongside the PSNR and SSIM improvements, indicates effective convergence.
These results, consistent with the final performance metrics in Table 8, highlight the architecture’s ability to balance compression efficiency and high-quality reconstruction, making it ideal for medical imaging. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 7. Training procedure of the architecture according to the trade-off point setting. https://doi.org/10.1371/journal.pone.0314944.g007 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 8. Comparison of our techniques with other state-of-the-art methods in terms of PSNR and SSIM in volume reconstruction. https://doi.org/10.1371/journal.pone.0314944.t008 Remark: The proposed trade-off point approach serves as a pragmatic optimisation strategy. In the context of the compression problem, it is essential to balance various requirements, including downsampling scales, INR module size, SR module structure, etc., rather than overemphasising one or two factors. The trade-off point approach addresses this challenge by elegantly optimising the parameters involved. 4.1 Data and implementation setup The dataset comprises 750 multi-parametric magnetic resonance images (mp-MRI) collected from patients diagnosed with either glioblastoma or lower-grade glioma [32]. We select T2 Fluid-Attenuated Inversion Recovery (FLAIR) 3D scan from a random patient with the size of 155 x 240 x 240. The implementation of our architecture starts with a high-resolution 3D volumetric input, such as a medical scan, denoted as x. Initially, the input volume undergoes normalisation, scaling the voxel values to a range between 0 and 1. To streamline computations, the volume is segmented into smaller patches, each measuring 64 × 64 × 64. Patches with 70% or more non-zero voxels containing more information are classified as High-Resolution (HR) patches. From these, one HR patch is selected as the high-resolution input for further processing. 
Once the data are prepared, the 3D Downsampling module applies a Fourier Transform to convert the high-resolution volume from the spatial domain to the frequency domain. A low-pass filter is then used to eliminate high-frequency components, thereby reducing resolution. This removal process is crucial in medical imaging, as it decreases the data size while preserving essential information, ultimately easing the model processing load. The Inverse Fourier Transform reverts the data to the spatial domain, yielding a low-resolution version of the original volume. Next, the downsampled volume is processed through the 3D Implicit Neural Representation (INR) module. Here, a Multi-Layer Perceptron (MLP) utilising Sinusoidal Activation Functions (SIREN) maps input coordinates to output voxel intensities, enabling the neural network to represent complex structures as continuous functions. These functions are then converted into voxel intensities. Following this, the 3D Super-Resolution (SR) module employs a 3D Convolutional Neural Network (CNN) for feature extraction, incorporating an Attention Mechanism to prioritise significant features. This SR module improves the resolution of the volume, restoring it to a level close to the original. The reconstructed volume, denoted as y, is compared to the original x using an L1 loss function to assess and optimise reconstruction quality. The entire system is trained using the Adam optimiser with a learning rate of 0.0015 for 5,000 epochs on an NVIDIA A4000 16GB GPU with CUDA support in the PyTorch framework. All source codes and results are available at https://github.com/asheibanifard/EndtoEndCompression. 4.2 Trade-off architecture 4.2.1 3D downsampling module. The Downsampling module does not require training. This implies that the downsampling scale is per set without consideration of the final result quality. We select three downsampling scales of 1/2, 1/4, and 1/8 in our experiments. 
It is necessary to test the performance of the proposed architecture at three downsampling scales, particularly the INR module. Table 2 presents a comprehensive comparison of reconstruction results for different downsampling scales, illustrating the effectiveness of our proposed architecture in maintaining a high reconstruction quality across various compression levels. It can be noted that decreasing the downsampling scales does not significantly degenerate the quality of the reconstruction. Additionally, non-standard sampling scales like 1/3, 1/5, or 1/7 would introduce unnecessary complexity and inconsistencies without offering meaningful improvements, making them less suitable for the architecture’s goals. Thus, these three downsampling scales are acceptable. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Performance of the INR module and the whole end-to-end architecture. (The upper row shows the performance of a single SIREN and the lower row shows that of the whole end-to-end architecture). https://doi.org/10.1371/journal.pone.0314944.t002 4.2.2 3D INR module. We opt for the SIREN model [9] as our INR module, focusing on two primary aspects of the SIREN structure: the number of layers and the number of neurons per layer. The goal is to use a compact SIREN model to enhance the compression rate (CR). We experiment with various configurations of the SIREN model, altering the layer count and neuron count per layer, as detailed in Table 3. We find that a SIREN network with 3 layers and between 30 and 230 neurons per layer offers satisfactory performance, especially for small volume data inputs, while substantially cutting down on GPU memory usage. Furthermore, we compare the performance of a single SIREN model against our proposed architecture, as shown in Table 2. The notable benefit is a dramatic reduction in GPU memory consumption while maintaining comparable reconstruction quality. 
Additionally, using more than 230 neurons per layer increases the model’s capacity to represent detailed structures but leads to diminishing returns in terms of reconstruction quality. Beyond 230 neurons, the gains in PSNR and SSIM are marginal, while the computational cost and GPU memory usage increase significantly. This increased complexity does not translate into substantial improvements in performance, making the additional computational overhead unjustified. Thus, we prefer the SIREN model with 3 layers in the INR module. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 3. Average values for different INR layers and neurons. https://doi.org/10.1371/journal.pone.0314944.t003 4.2.3 3D super-resolution module. We utilise the SRNO [11] for the SR module due to its compact size, as evidenced by the average number of parameters of deep networks in Table 2. We also compare our end-to-end architecture with cutting-edge methods [32–37]. Table 8 reveals that (1) the SR module performs effectively, as our architecture, using a 3-layer SIREN, matches the reconstruction quality of a standalone 5-layer SIREN; and (2) our architecture surpasses other state-of-the-art image compression methods in terms of PSNR and SSIM. 4.2.4 Find a trade-off architecture by trade-off point approach. To find the trade-off point for our proposed architecture, firstly, our proposed architecture is tested in terms of all combinations of NC, DS and SN, which is presented separately in Table 4 with 4 channels of feature extraction in the SRNO model, Table 5 with 8 channels of feature extraction in the SRNO model, and Table 6 with 16 channels of feature extraction in the SRNO model. The trade-off point of the proposed architecture is then calculated using Eq 8, that is, the trade-off point (NC = 4, DS = 1/2, SN = 30). At the trade-off point, the PSNR upper bound is around 38, the SSIM upper bound is around 0.94, and the CR upper bound is around 76.6%, as shown in Table 7. 
This is a good setting for the proposed architecture, as it reaches a high compression rate and good quality for reconstruction. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 4. The results of our proposed architecture with 4 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t004 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 5. The results of our proposed network with 8 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t005 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 6. The results of our proposed network with 16 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t006 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 7. Our proposed architecture’s trade-off point. https://doi.org/10.1371/journal.pone.0314944.t007 Moreover, it is further illustrated by Eq 9. We show the three sets of marginal distributions concerning dimensions (NC, DS, SN), in Figs 3–5, respectively. If CR is decreased, the SIREN size (SN) or channel number (NC) can be increased. However, the reconstruction quality (i.e. PSNR or SSIM) shows a slight improvement. Thus, enlarging the model size or channel number will not significantly improve reconstruction quality. Additionally, compared to other existing approaches in Table 8, our architecture excels in maintaining a low Bitrate(bpp), ensuring that the compressed file size is significantly smaller. Our results (PSNR and SSIM) are still comparable with those of the “3D-VOI-OMLSVD [34]”. Fig 6 further shows the reconstructed slices of volume data. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Illustrates the trade-off point for the number of channels (NC) in the SR module concerning the performance metrics, 1/PSNR, 1-SSIM, and 1-CR. 
The red dashed lines indicate the intersection where the optimal trade-off is achieved, balancing compression efficiency and reconstruction quality. https://doi.org/10.1371/journal.pone.0314944.g003 Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 4. The trade-off point for the downsampling scale (DS) is based on the performance metrics, 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines highlight where the downsampling scale achieves an optimal balance between compression rate and reconstruction accuracy. https://doi.org/10.1371/journal.pone.0314944.g004 Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 5. The trade-off point for the number of neurons (SN) in the SIREN model, plotted against the performance metrics, 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines indicate the optimal configuration of neurons in the SIREN model for achieving high reconstruction quality with minimal compression loss. https://doi.org/10.1371/journal.pone.0314944.g005 Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 6. The Left column shows the different original slices of the volume with sizes of (155, 240, 240); the middle column shows the labelled patches of the slices with sizes of (64, 64, 64); the right column shows the reconstructed patches by our architecture. https://doi.org/10.1371/journal.pone.0314944.g006 Additionally, Fig 7 shows a steady optimisation process over 5000 epochs, with continuous improvements in reconstruction accuracy and structural similarity. The PSNR curve exceeds 40 dB, indicating high reconstruction quality with minimal error. The SSIM curve approaches 0.96, demonstrating the model’s effectiveness in preserving perceptual and structural fidelity. The steady decrease in the loss function, alongside the PSNR and SSIM improvements, confirms effective convergence. 
These results, consistent with the final performance metrics in Table 8, highlight the architecture’s ability to balance compression efficiency and high-quality reconstruction, making it ideal for medical imaging. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 7. Training procedure of the architecture according to the trade-off point setting. https://doi.org/10.1371/journal.pone.0314944.g007 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 8. Comparison of our techniques with other state-of-the-art methods in terms of PSNR and SSIM in volume reconstruction. https://doi.org/10.1371/journal.pone.0314944.t008 Remark: The proposed trade-off point approach serves as a pragmatic optimisation strategy. In the context of the compression problem, it is essential to balance various requirements, including downsampling scales, INR module size, SR module structure, etc., rather than overemphasising one or two factors. The trade-off point approach addresses this challenge by elegantly optimising the parameters involved. 4.2.1 3D downsampling module. The Downsampling module does not require training. This implies that the downsampling scale is per set without consideration of the final result quality. We select three downsampling scales of 1/2, 1/4, and 1/8 in our experiments. It is necessary to test the performance of the proposed architecture at three downsampling scales, particularly the INR module. Table 2 presents a comprehensive comparison of reconstruction results for different downsampling scales, illustrating the effectiveness of our proposed architecture in maintaining a high reconstruction quality across various compression levels. It can be noted that decreasing the downsampling scales does not significantly degenerate the quality of the reconstruction. 
Additionally, non-standard sampling scales like 1/3, 1/5, or 1/7 would introduce unnecessary complexity and inconsistencies without offering meaningful improvements, making them less suitable for the architecture’s goals. Thus, these three downsampling scales are acceptable. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Performance of the INR module and the whole end-to-end architecture. (The upper row shows the performance of a single SIREN and the lower row shows that of the whole end-to-end architecture). https://doi.org/10.1371/journal.pone.0314944.t002 4.2.2 3D INR module. We opt for the SIREN model [9] as our INR module, focusing on two primary aspects of the SIREN structure: the number of layers and the number of neurons per layer. The goal is to use a compact SIREN model to enhance the compression rate (CR). We experiment with various configurations of the SIREN model, altering the layer count and neuron count per layer, as detailed in Table 3. We find that a SIREN network with 3 layers and between 30 and 230 neurons per layer offers satisfactory performance, especially for small volume data inputs, while substantially cutting down on GPU memory usage. Furthermore, we compare the performance of a single SIREN model against our proposed architecture, as shown in Table 2. The notable benefit is a dramatic reduction in GPU memory consumption while maintaining comparable reconstruction quality. Additionally, using more than 230 neurons per layer increases the model’s capacity to represent detailed structures but leads to diminishing returns in terms of reconstruction quality. Beyond 230 neurons, the gains in PSNR and SSIM are marginal, while the computational cost and GPU memory usage increase significantly. This increased complexity does not translate into substantial improvements in performance, making the additional computational overhead unjustified. Thus, we prefer the SIREN model with 3 layers in the INR module. 
Table 3. Average values for different INR layers and neurons. https://doi.org/10.1371/journal.pone.0314944.t003 4.2.3 3D super-resolution module. We utilise the SRNO [11] for the SR module due to its compact size, as evidenced by the average number of parameters of the deep networks in Table 2. We also compare our end-to-end architecture with cutting-edge methods [32–37]. Table 8 reveals that (1) the SR module performs effectively, as our architecture, using a 3-layer SIREN, matches the reconstruction quality of a standalone 5-layer SIREN; and (2) our architecture surpasses other state-of-the-art image compression methods in terms of PSNR and SSIM. 4.2.4 Find a trade-off architecture by trade-off point approach. To find the trade-off point, we first test the proposed architecture on all combinations of the number of channels in the SR module (NC), the downsampling scale (DS), and the number of neurons in the SIREN model (SN). The results are presented in Table 4 for 4 channels of feature extraction in the SRNO model, in Table 5 for 8 channels, and in Table 6 for 16 channels. The trade-off point is then calculated using Eq 8, which gives (NC = 4, DS = 1/2, SN = 30). At the trade-off point, the PSNR upper bound is around 38 dB, the SSIM upper bound is around 0.94, and the CR upper bound is around 76.6%, as shown in Table 7. This is a good setting for the proposed architecture, as it reaches a high compression rate and good reconstruction quality. Table 4. The results of our proposed architecture with 4 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t004 Table 5. The results of our proposed network with 8 channels of shallow feature extractor in SR module.
https://doi.org/10.1371/journal.pone.0314944.t005 Table 6. The results of our proposed network with 16 channels of shallow feature extractor in SR module. https://doi.org/10.1371/journal.pone.0314944.t006 Table 7. Our proposed architecture’s trade-off point. https://doi.org/10.1371/journal.pone.0314944.t007 Moreover, the trade-off point is further illustrated by Eq 9. We show the three sets of marginal distributions along the dimensions NC, DS, and SN in Figs 3–5, respectively. If CR is decreased, the SIREN size (SN) or the channel number (NC) can be increased. However, the reconstruction quality (i.e. PSNR or SSIM) shows only a slight improvement. Thus, enlarging the model size or the channel number will not significantly improve reconstruction quality. Additionally, compared to the other existing approaches in Table 8, our architecture maintains a low bitrate (bpp), so the compressed file size is significantly smaller. Our results (PSNR and SSIM) are still comparable with those of “3D-VOI-OMLSVD [34]”. Fig 6 further shows the reconstructed slices of volume data. Fig 3. The trade-off point for the number of channels (NC) in the SR module, plotted against the performance metrics 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines indicate the intersection where the optimal trade-off is achieved, balancing compression efficiency and reconstruction quality. https://doi.org/10.1371/journal.pone.0314944.g003 Fig 4. The trade-off point for the downsampling scale (DS), plotted against the performance metrics 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines highlight where the downsampling scale achieves an optimal balance between compression rate and reconstruction accuracy.
https://doi.org/10.1371/journal.pone.0314944.g004 Fig 5. The trade-off point for the number of neurons (SN) in the SIREN model, plotted against the performance metrics 1/PSNR, 1-SSIM, and 1-CR. The red dashed lines indicate the optimal configuration of neurons in the SIREN model for achieving high reconstruction quality with minimal compression loss. https://doi.org/10.1371/journal.pone.0314944.g005 Fig 6. The left column shows different original slices of the volume, with size (155, 240, 240); the middle column shows the labelled patches of the slices, with size (64, 64, 64); the right column shows the patches reconstructed by our architecture. https://doi.org/10.1371/journal.pone.0314944.g006 Additionally, Fig 7 shows a steady optimisation process over 5000 epochs, with continuous improvements in reconstruction accuracy and structural similarity. The PSNR curve exceeds 40 dB, indicating high reconstruction quality with minimal error. The SSIM curve approaches 0.96, demonstrating the model’s effectiveness in preserving perceptual and structural fidelity. The steady decrease in the loss function, alongside the PSNR and SSIM improvements, confirms effective convergence. These results, consistent with the final performance metrics in Table 8, highlight the architecture’s ability to balance compression efficiency and high-quality reconstruction, making it well suited to medical imaging. Fig 7. Training procedure of the architecture according to the trade-off point setting. https://doi.org/10.1371/journal.pone.0314944.g007 Table 8. Comparison of our techniques with other state-of-the-art methods in terms of PSNR and SSIM in volume reconstruction.
https://doi.org/10.1371/journal.pone.0314944.t008 Remark: The proposed trade-off point approach serves as a pragmatic optimisation strategy. In the context of the compression problem, it is essential to balance various requirements, including the downsampling scale, the INR module size, and the SR module structure, rather than overemphasising one or two factors. The trade-off point approach addresses this challenge by jointly optimising the parameters involved. 5 Conclusion and future work In this paper, we proposed an architecture that integrates available deep-learning techniques to compress volume data while maintaining high reconstruction fidelity. One notable aspect of our approach is its use of emerging deep learning technologies, which have developed rapidly in recent years. We emphasised the importance of carefully considering factors such as network architecture, computational efficiency, and reconstruction accuracy when designing and implementing the end-to-end solution. To this end, we proposed the end-to-end network architecture for volume data compression and developed the trade-off point approach to determine optimal settings for the individual modules, which is a practical method for balancing performance considerations in medical visualisation tasks. 5.1 Limitations 5.1.1 Generalisation to diverse medical datasets. Applying the proposed end-to-end architecture to various volume datasets requires significant retraining time for each dataset individually, as there is no fine-tuning strategy in place to speed up this process. 5.1.2 Time complexity of trade-off point approach. The trade-off point method requires sampling the model’s performance across different architecture settings, which is highly time-consuming. 5.2 Future work Beyond the realm of compression, visualising over-large medical volume data through real-time rendering is meaningful.
Compression with rendering could enable real-time visualisation of such over-large volume data. In future work, we intend to focus on volume-rendering techniques that leverage implicit neural representations. This research direction shows significant promise for advancements in the field of visualisation.
TI - An end-to-end implicit neural representation architecture for medical volume data
JF - PLoS ONE
DO - 10.1371/journal.pone.0314944
DA - 2025-01-03
UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/an-end-to-end-implicit-neural-representation-architecture-for-medical-Jp0REESLA3
SP - e0314944
VL - 20
IS - 1
DP - DeepDyve
ER -