References

Ye Ding, Xiaoguang Ren, Xiaochuan Zhang, Xin Liu, Xu Wang (2023). Multi-Phase Focused PID Adaptive Tuning with Reinforcement Learning. Electronics.
Chang-Min Lee, Byung-Gun Jung, Jae-Hyuk Choi (2023). Experimental Study on Prediction for Combustion Optimal Control of Oil-Fired Boilers of Ships Using Color Space Image Feature Analysis and Support Vector Machine. Journal of Marine Science and Engineering.
Ignacio Carlucho, Mariano Paula, G. Acosta (2020). An adaptive deep reinforcement learning approach for MIMO PID control of mobile robots. ISA Transactions.
Junghui Chen, Yu-Hsiang Chang, Yi-Cheng Cheng (2013). Performance Design of Image-Oxygen Based Cascade Control Loops for Boiler Combustion Processes. Industrial & Engineering Chemistry Research, 52.
S. Noye, Rubén Martinez, Laura Carnieletto, M. Carli, Amaia Aguirre (2022). A review of advanced ground source heat pump control: Artificial intelligence for autonomous and adaptive control. Renewable and Sustainable Energy Reviews.
Guolin Xiao, Xiaori Gao, Wei Lu, Xiao-duo Liu, A. Asghar, Liu Jiang, Wenlin Jing (2023). A physically based air proportioning methodology for optimized combustion in gas-fired boilers considering both heat release and NOx emissions. Applied Energy.
A. Zaporozhets (2020). Control of Fuel Combustion in Boilers, 287.
Zbigniew Omiotek, A. Kotyra (2021). Flame Image Processing and Classification Using a Pre-Trained VGG16 Model in Combustion Diagnosis. Sensors (Basel, Switzerland), 21.
M. Tadros, M. Ventura, Carlos Soares (2023). Review of current regulations, available technologies, and future trends in the green shipping industry. Ocean Engineering.
Song Chen (2021). Review on Supervised and Unsupervised Learning Techniques for Electrical Power Systems: Algorithms and Applications. IEEJ Transactions on Electrical and Electronic Engineering, 16.
Jiyue Wang, Yonggang Zhu, Renlong Qi, Xigui Zheng, Wei Li (2020). Adaptive PID control of multi-DOF industrial robot based on neural network. Journal of Ambient Intelligence and Humanized Computing, 11.
Junghui Chen, Yu-Hsiang Chang, Yi-Cheng Cheng, Chen-Kai Hsu (2012). Design of image-based control loops for industrial combustion processes. Applied Energy, 94.
Yingying Li, Tianpeng Zhang, Subhro Das, J. Shamma, Na Li (2023). Non-asymptotic System Identification for Linear Systems with Nonlinear Policies. arXiv, abs/2306.10369.
Ayub Lakhani, Myisha Chowdhury, Qiugang Lu (2021). Stability-Preserving Automatic Tuning of PID Control with Reinforcement Learning. arXiv, abs/2112.15187.
Hafiz Yaseen, Syed Siffat, Iftikhar Ahmad, A. Malik (2021). Nonlinear adaptive control of magnetic levitation system using terminal sliding mode and integral backstepping sliding mode controllers. ISA Transactions.
H. Nohooji (2020). Constrained neural adaptive PID control for robot manipulators. Journal of the Franklin Institute, 357.
Nathan Lawrence, G. Stewart, Philip Loewen, M. Forbes, J. Backström, R. Gopaluni (2020). Optimal PID and Antiwindup Control Design as a Reinforcement Learning Problem. arXiv, abs/2005.04539.
Ruiyun Qi, G. Tao, B. Jiang (2019). Fuzzy System Identification and Adaptive Control. Communications and Control Engineering.
Daesoo Lee, S. Lee, S. Yim (2020). Reinforcement learning-based adaptive PID controller for DPS. Ocean Engineering.
Sujatha K, V. M., Pappa N (2012). Flame Monitoring in Power Station Boilers Using Image Processing. ICTACT Journal on Image and Video Processing, 02.
Rajesh Siraskar (2020). Reinforcement Learning for Control of Valves. arXiv, abs/2012.14668.
Abhas Kanungo, Chandan Choubey, Varun Gupta, Pankaj Kumar, N. Gupta (2023). Design of an intelligent wavelet-based fuzzy adaptive PID control for brushless motor. Multimedia Tools and Applications.
Jun Zhao, Qi-feng Wei, Shanshan Wang, Xiulian Ren (2021). Progress of ship exhaust gas control technology. Science of the Total Environment, 799.
Li (2023). Non-asymptotic system identification for linear systems with nonlinear policies. IFAC-PapersOnLine, 56.
M. Nemitallah, M. Nabhan, Maad Alowaifeer, Agus Haeruman, Fahad Alzahrani, M. Habib, M. Elshafei, M. Abouheaf, M. Aliyu, M. Alfarraj (2023). Artificial intelligence for control and optimization of boilers' performance and emissions: A review. Journal of Cleaner Production.
Journal of Marine Science and Engineering, Article

Adaptive Control of Ships' Oil-Fired Boilers Using Flame Image-Based IMC-PID and Deep Reinforcement Learning

Chang-Min Lee and Byung-Gun Jung *
Division of Marine System Engineering, Korea Maritime and Ocean University, 727, Taejong-ro, Yeongdo-gu, Busan 49112, Republic of Korea; [email protected]
* Correspondence: [email protected]

J. Mar. Sci. Eng. 2024, 12, 1603. https://doi.org/10.3390/jmse12091603
Academic Editor: Pasqualino Corigliano
Received: 28 July 2024; Revised: 23 August 2024; Accepted: 5 September 2024; Published: 10 September 2024
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: The control system of oil-fired boiler units on ships plays a crucial role in reducing the emissions of atmospheric pollutants such as nitrogen oxides (NOx), sulfur dioxides (SO2), and carbon dioxide (CO2). Traditional control methods using conventional measurement sensors face limitations in real-time control due to response delays, which has led to growing interest in combustion control methods using flame images. To ensure the precision of such combustion control systems, the system model must be thoroughly considered during controller design. However, finding the optimal tuning point is challenging due to changes in the system model and nonlinearity caused by environmental variations. This study proposes a controller that integrates an internal model control (IMC)-based PID controller with the deep deterministic policy gradient (DDPG) algorithm of deep reinforcement learning to enhance the adaptability of image-based combustion control systems to environmental changes. The proposed controller adjusts the PID parameter values in real time through learning of the determination constant lambda (λ) of the IMC internal model. This approach reduces computational resources by shrinking the learning dimensions of the DDPG agent and limits transient responses through constrained learning of the control parameters. Experimental results show that the proposed controller exhibited rapid adaptive performance in the learning process for the target oxygen concentration, achieving a reward value of −0.05 within just 105 episodes. Furthermore, when compared to traditional PID tuning methods, the proposed controller demonstrated superior performance, achieving a target value error of 0.0032 and a low overshoot range of 0.0498 to 0.0631, providing the fastest response speed and minimal oscillation. Additionally, experiments conducted on an actual operating ship verified the practical feasibility of this system, highlighting its potential for real-time control and pollutant reduction in marine applications.

Keywords: combustion control; emission prediction; IMC-based PID; real-time control; image-based control; deep deterministic policy gradient algorithm

1. Introduction

Combustion boilers are widely used in the maritime industry for preheating, hot water, and steam supply and have shown continuous growth in the context of atmospheric pollutant emission restrictions [1]. During the combustion process, these boilers produce exhaust gases that contain atmospheric pollutants such as NOx, SOx, and CO2. These pollutants contribute to greenhouse gas effects and accelerate global warming, underscoring the necessity to reduce their emissions during the combustion process [2,3].
To reduce atmospheric pollutants, it is necessary to appropriately regulate the air and fuel supplied to the combustion process. Accordingly, ongoing research focuses on directly controlling the flow rates of fuel and air supplied to combustion systems to mitigate atmospheric pollutants [4,5].

However, a significant challenge with these combustion control systems, which utilize direct measurement devices, is the inherent delay between changes in the oxygen concentration of the exhaust gases and the control outputs. Additionally, disturbances such as variations in intake air temperature, fuel properties, and combustion efficiency can impact the emission of atmospheric pollutants from the combustion system in real time [6,7]. This issue can be addressed by utilizing flame images generated during the combustion process. Since flame images reflect the combustion state, they can reduce the delay in assessing the current state of the exhaust gases. By analyzing the radiative emissions and color space of the flame, it is possible to monitor the production of atmospheric pollutants in real time [8,9].

Previous studies developed a system for real-time monitoring of air pollutants and oxygen concentration by analyzing two-dimensional HSV images collected using accessible webcams, which identified spectral characteristic differences across various fuel–air ratios [10]. In subsequent research, this monitoring system was utilized as a control input to propose an oxygen concentration control system that could be easily applied to marine boilers. The proposed system models the correlation between oxygen and combustion based on operational data and uses an IMC-PI closed-loop control structure, effectively controlling exhaust gas emissions and reducing the production of air pollutants [11].

However, systems with complex combustion mechanisms, such as boilers, exhibit variable internal models due to numerous factors. Therefore, it is crucial to employ control algorithms that can adapt to a wide range of environmental changes. In the field of control engineering, extensive research has been conducted on various adaptive control methods [12–14]. Notably, adaptive tuning of PID parameters, which govern roughly 90% of control loops in industrial applications, has been widely studied to optimize performance under varying conditions [15,16].

Recently, neural network-based supervised learning techniques for tuning PID parameters have gained attention due to their ability to map high-dimensional relationships between inputs and outputs. These techniques have demonstrated superior performance compared to other intelligent methods in the context of adaptive tuning [17–19]. However, such supervised learning approaches require extensive data sources to cover a wide range of environmental changes. Multivariable systems, like oil-fired boilers, can demand significant time and human resources, making them challenging to implement in real-world engineering applications. Unlike supervised learning methods, deep reinforcement learning (DRL) learns from interaction rather than labeled data, thus overcoming some of these challenges [20,21].
Consequently, DRL methods have been widely applied in the field of PID parameter tuning. In Lee's study, an adaptive PID controller was developed to adjust PID gains in real time while adapting to environmental changes in a dynamic positioning system (DPS) [22]. Carlucho's research addressed the issue of simultaneously outputting multiple parameters from a PID controller based on reinforcement learning (RL) [23]. Additionally, Siraskar proposed an adaptive PID tuning method that features auto-tuning capabilities and high-frequency noise suppression [24]. Nevertheless, these studies have shown that excessive PID parameter outputs and integral windup can occur during the exploration process of DRL learning, potentially destabilizing the system. To address this issue, Lawrence's study aimed to improve stability by representing the PID controller as a shallow neural network in the actor network [25]. Furthermore, Lakhani proposed an RL-based stability-preserving PID adaptive tuning framework to ensure controller stability [26]. In Ding's research, the actions of the agent were constrained during a multi-stage focusing process, enabling stable PID tuning even with limited prior knowledge [27]. However, these methods required 1500, 3000, and 4000 episodes, respectively, for the system to stabilize.

Attempts have been made to apply these DRL-based PID frameworks, which require such long training runs, to an oxygen concentration control system based on flame images. However, excessive responses of the control parameters during the exploration process led to issues such as flame extinction, causing system shutdown, or accelerated contamination of the heat transfer surfaces in the boiler due to unstable combustion.

Therefore, the objective of this study is to develop an RL-based PID adaptive tuning framework that ensures improved tuning performance while minimizing the impact of the exploration episodes on the system. To achieve this, the concept of the internal model in IMC-based PID control is utilized [28]. The internal model of the system is leveraged to constrain excessive control parameter outputs. When gradual variations occur in the system due to changes in the combustion environment, such as variations in fuel and air quality or fouling of heat transfer surfaces, the proposed controller adjusts the IMC tuning constant, lambda (λ), based on the experimentally obtained internal model to ensure that the system adapts to these altered conditions. This approach ensures that each control parameter is connected by the internal model and changes within a limited range, thereby restricting excessive system responses. Moreover, reducing the control parameters to be tuned from three to one (lambda) simplifies the system's dimensionality and decreases the number of learning episodes required.

The innovative contributions of this paper are as follows:
1. Real-Time Image-Based Combustion Control: This study replaces traditional oxygen concentration measurement methods by utilizing a predictor based on flame images that reflect the combustion state in real time. This significantly reduces the delay in exhaust gas control and enables real-time control.
2. Proposed Adaptive Controller for Boiler Combustion Control: The study proposes a control system that integrates an internal model control (IMC)-based PID controller with the deep deterministic policy gradient (DDPG) deep reinforcement learning algorithm, designed to effectively adapt to changes in the combustion environment.
3. Validation of Practical Applicability: The proposed control system has been validated for its practicality through experiments conducted in actual ship operation environments, demonstrating its potential for easy integration into existing systems.

2. Image-Based Boiler Combustion Control System

The existing ship oil-fired boiler control system, S1, is a proportional combustion control system that simultaneously controls the airflow and fuel amount to maintain a constant steam pressure. This system focuses on combustion stability and follows the ratio set by the manufacturer during the commissioning process. This ratio helps maintain flame stability within a limited range of variations in the environmental conditions of the supplied air and fuel characteristics. However, during this process, changes in the combustion process may lead to varying levels of air pollutant emissions [29].

This study examines the control performance of the S2 system, which is implemented in addition to the original S1 system. The schematic diagrams of both the S1 and S2 systems are shown in Figure 1.

Figure 1. Overview of the boiler control system with the image-based combustion control system.

The S2 system is an image-based combustion control (ICC) system that uses flame images as real-time input to predict the oxygen concentration through an SEF + SVM predictor. The predicted oxygen concentration is used by the controller to adjust the damper servo motor, compensating for deviations from the target value. This controlled damper changes the amount of air supplied to the combustor, thereby controlling the air pollutants generated during the combustion process.

The saturation extraction filter (SEF) is a preprocessing method that converts flame images into features linearly related to the various combustion states. This process involves converting the RGB flame image into HSV format and then extracting the saturation component. The extracted saturation data are transformed into a histogram, removing noise and unnecessary redundant data from the original image. This process generates a feature set that more effectively represents the combustion state.

The support vector machine (SVM), a supervised learning-based classification model, is trained to predict the oxygen concentration in exhaust gases using the features extracted by the SEF. This approach builds a robust predictive model capable of handling non-linearity and complexity. The training process leverages flame image data collected under various combustion conditions to enhance the model's accuracy and reliability.
By integrating the SEF and SVM methods, the system can predict the oxygen concentration in real time from flame images, and this predicted value is used as the input for the control system. The oxygen concentration predictor used in S2 is a model trained on data collected from a gas analyzer, which offers higher accuracy than the traditional lambda probe method. This model uses flame images of quasi-instantaneous combustion states as input, ensuring high accuracy while reducing latency. According to the study, this method demonstrated its effectiveness and practicality, achieving an R value of 0.97 in oxygen concentration prediction through experiments. In addition, given that the input process for flame images may vary over time, it is important to retrain the model periodically to ensure it maintains a high level of accuracy.
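For illustration, the prediction pipeline described above can be sketched as follows. This is a minimal example assuming an OpenCV/scikit-learn toolchain and a regression variant of the SVM; the histogram size, kernel, and hyperparameters are placeholders, not the values used in this study.

```python
# Illustrative sketch of an SEF + SVM oxygen-concentration predictor.
# Feature dimensions and SVM settings are assumed, not taken from the paper.
import cv2
import numpy as np
from sklearn.svm import SVR

def sef_features(frame_bgr: np.ndarray, bins: int = 64) -> np.ndarray:
    """Saturation extraction filter: RGB(BGR) image -> HSV -> saturation histogram."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    saturation = hsv[:, :, 1]
    hist, _ = np.histogram(saturation, bins=bins, range=(0, 255), density=True)
    return hist.astype(np.float32)

def train_o2_predictor(X_train: np.ndarray, y_train: np.ndarray) -> SVR:
    """Fit an SVM regressor on SEF histograms labelled with gas-analyzer O2 readings."""
    model = SVR(kernel="rbf", C=10.0, epsilon=0.05)   # assumed hyperparameters
    model.fit(X_train, y_train)
    return model

def predict_o2(model: SVR, frame_bgr: np.ndarray) -> float:
    """Predict the oxygen concentration (%) for one flame image."""
    return float(model.predict(sef_features(frame_bgr)[None, :])[0])
```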
Therefore, the S2 system can establish a real-time control system for regulating the oxygen concentration in oil-fired boiler (OFB) exhaust gases. By controlling the predicted oxygen concentration, the system can effectively manage the air pollutants generated during the combustion process. Additionally, the S2 system enhances the existing proportional control system, S1, by adding a function to adjust the limited air supply for air pollutant control. This allows for efficient exhaust gas management while maintaining combustion stability. The S2 system also has the advantage of high accessibility: it can be applied at low cost not only to newly constructed ship OFBs but also to ships already in operation, by utilizing existing flame observation ports and making additional adjustments to the existing air control dampers.

The target oxygen concentration for the S2 control system is set at 4%. According to related studies, pollutants such as CO2, NOx, and SOx are inversely proportional to the oxygen concentration. Research by J. Chen et al. identified a correlation between NOx emissions and oxygen concentration in the range of 2% to 5% during flame image prediction. Specifically, they found that at an oxygen concentration of 4.02%, the formation of soot and graphite is minimized, resulting in the least noise during flame image recognition. Additionally, G. Xiao et al. discovered that at 80% load, an oxygen concentration of 3.5% achieves a balance between heat release and NOx formation. Their study also found that to reduce NOx emissions by a factor of two, the oxygen concentration needs to be increased by a factor of 1.14. In particular, they confirmed that near 3.5% oxygen concentration at 80% load, an optimal balance between heat release and NOx emissions is achieved. Based on a comprehensive review of these findings, setting the target oxygen concentration at 4% is considered suitable for optimizing boiler combustion through flame image analysis [30,31].

2.1. Experimental Setup for Image-Based Combustion Control (ICC) System

The experiment is conducted on a 9200 t ship, and the details of the boiler and burner of the test OFB are shown in Table 1.

Table 1. Specifications of boiler and burner for OFB.
Boiler: drum type, cylindrical water tube; steam production, 3000 kg/h; working steam pressure, 5.5~7 kg/cm².
Burner: fuel type, LSMGO, 0.1% sulfur (DMA); fuel oil consumption, min/max 68.5/205.5 kg/h; air supply volume, min/max 1650~3700 m³/h.

To implement the ICC system on the aforementioned OFB, the experimental environment is configured as shown in Figure 2.

Figure 2. Equipment configuration for boiler ICC system experiment.

The burner of the cylindrical water-tube boiler is located at the bottom of the cylinder. The burner initiates the combustion reaction between fuel and air, producing flames and exhaust gases, which contain information about the oxygen concentration within the exhaust. The flame generated by the burner is captured in high definition by a 1920 × 1080 pixel CMOS webcam in real time. The camera is placed in the flame observation port on the side of the boiler in accordance with SOLAS regulations. The collected images are transmitted to a computer via a USB 3.0 interface. These flame images are used by the computer to extract information about the exhaust gases. The computer analyzes the transmitted flame images using the SEF + SVM predictor to estimate the oxygen concentration. The predicted oxygen concentration serves as a crucial input variable for combustion control. The oxygen concentration input is converted into a control signal for the air regulation damper by the controller. The output control signal is converted through an A/D converter drive into an analog output corresponding to 0 to 90 degrees, and a servo motor attached to the damper end provides real-time control.
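A simplified outline of this acquisition-and-actuation loop is sketched below. The camera index, the predictor and PID objects, and the damper-write function are hypothetical placeholders; the actual A/D converter drive and servo interface are hardware specific.

```python
# Minimal sketch of the ICC acquisition-and-actuation loop described above.
# `predictor`, `pid`, and `write_damper_angle` are assumed interfaces.
import time
import cv2

def run_icc_loop(predictor, pid, write_damper_angle, target_o2=4.0, period_s=1.0):
    cap = cv2.VideoCapture(0)                 # 1920x1080 CMOS webcam at the flame port
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                continue
            o2 = predictor(frame)             # SEF + SVM oxygen estimate (%)
            u = pid.update(target_o2 - o2, dt=period_s)
            angle = min(max(u, 0.0), 90.0)    # damper command limited to 0-90 degrees
            write_damper_angle(angle)         # sent through the A/D converter drive
            time.sleep(period_s)
    finally:
        cap.release()
```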
2.2. Data-Driven System Modeling

The ICC system, which receives flame images as input and outputs the oxygen concentration, is difficult to model analytically due to numerous variables such as air properties, changes in fuel characteristics, and changes in heat transfer efficiency due to contamination. Therefore, the transfer function is estimated using MATLAB's System Identification Toolbox, version R2024a. This method leverages machine learning algorithms to estimate the transfer function by learning from input and output data, making it an advanced modeling technique. It is particularly suitable for irregular and nonlinear systems and has the advantage of being applicable to models with many system variables [32]. Based on the system identification results from response analysis of the input–output data, the estimated transfer function model of the ICC system, denoted as G_S2 in Equation (1), was obtained. The estimated model showed an accuracy of 99.28% and an MSE of 0.0001833.
G_S2(s) = (0.2187 s + 0.5960) / (s² + 2847.82 s + 1508.78)    (1)

To further understand the characteristics of the system, it can be represented in pole-zero form as shown in Equation (2).

G_S2(s) = 0.2187 (s + 2.728) / [(s + 0.523)(s + 2847.29)]    (2)

The system is in SOPZ (second-order-plus-zero) form. Examining the poles and zeros of the transfer function G_S2, the poles are located at s_1 ≈ −0.523 and s_2 ≈ −2847. Since the real parts of both poles are negative, they lie in the left half of the complex plane, indicating that the system is stable and controllable. The zero is also real and negative, confirming that it does not affect the system's stability. Converting Equation (2) to the system's time-constant form results in Equation (3).

G_S2(s) = 0.21866 (2.728 s + 1) / [(1.887649 s + 1)(0.000351 s + 1)]    (3)

Accordingly, the time constants of this system are found to be τ_a = 3.51 × 10⁻⁴ and τ_b = 1.887649. Examining the time constants, τ_b is much larger than τ_a, indicating that the impact of τ_a on the system is negligible. Therefore, it suggests that variables other than changes in the air supply do not significantly affect the system.
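As a quick numerical cross-check of Equations (1)-(3), the identified model can be examined with standard tools; the sketch below uses SciPy for illustration (the identification itself was performed with MATLAB's System Identification Toolbox).

```python
# Numerical cross-check of the identified model G_S2 of Equation (1):
# recovering the pole-zero and time-constant forms of Equations (2)-(3).
import numpy as np
from scipy import signal

G_s2 = signal.TransferFunction([0.2187, 0.5960], [1.0, 2847.82, 1508.78])

poles = G_s2.poles                       # expected near -0.52 and -2847 (Equation (2))
zeros = G_s2.zeros                       # expected near -2.73
tau = np.sort(-1.0 / poles.real)         # time constants: ~3.5e-4 s and ~1.89 s
print("poles:", poles, "zeros:", zeros, "time constants:", tau)
```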
3. Preliminaries

3.1. Internal Model Control-Based PID Control

Internal model control (IMC) is a control system design methodology that enhances the performance of the controller by using a process model. The basic idea of IMC is that the control system should include an internal model of the process being controlled. The fundamental structure of an IMC-based PID controller, which consists of a single PID controller with three adjustable parameters combined in parallel form, is shown in Equation (4).

u(t) = K_p [ e(t) + (1/τ_i) ∫ e(t) dt + τ_d de(t)/dt ]    (4)

In Equation (4), u(t) is the manipulated variable at time t. K_p, τ_i, and τ_d are the proportional, integral, and derivative parameters, respectively. The error e(t) is the difference between the controlled variable y(t) and the setpoint at time t.

IMC-based PID control improves control performance by tuning the PID parameters using the internal model of the control system. This method ensures effective control by directly utilizing and compensating for the dynamic characteristics of the system during controller design. The structure of the IMC controller Q(s) for controlling the target system model G(s) is shown in Equation (5).

Q(s) = G(s)⁻¹ f(s)    (5)

G(s) is the transfer function of the process being controlled, and f(s) is the IMC filter function. When the inverse of G(s) is unstable or improper, it is difficult to use the inverse model directly, so an appropriate filter must be applied. Therefore, an IMC controller is designed by applying a suitable filter f(s) to the system function. The IMC filter function f(s) for the system G(s) is given in Equation (6).

f(s) = (ηs + 1)^m / (λs + 1)^n    (6)

Here, λ and η are the IMC filter parameters, primarily used for ensuring system stability and noise reduction. The orders m and n are determined based on the system's stability and performance requirements. Generally, m is set equal to the number of zeros of the system, while n is set to match the total number of poles of the system.

3.2. Reinforcement Learning: Deep Deterministic Policy Gradient

The deep deterministic policy gradient (DDPG) algorithm is a model-free, policy-based, off-policy reinforcement learning algorithm designed to solve control problems in continuous action spaces. DDPG uses the actor–critic methodology to learn and optimize policies, and it was developed specifically to overcome the limitations of deep Q-networks (DQN). DDPG consists of two neural networks: an actor and a critic. Figure 3 illustrates the principle of the DDPG algorithm [33,34].

Figure 3. Architecture of actor–critic reinforcement learning with experience replay in DDPG.

The actor network receives the current state s_t as input and outputs continuous action values a_t. The critic network θ^Q evaluates the Q-values for the given state and action. The weights of the actor network θ^µ are updated using the deterministic policy gradient algorithm, and the weights of θ^Q are updated using the gradients derived from the temporal difference (TD) error signal.

The DDPG algorithm operates in the following steps. First, the critic Q(s, a | θ^Q) and actor µ(s | θ^µ) networks are initialized arbitrarily. Along with this, the target networks θ^Q′ and θ^µ′ for the critic and actor are initialized with the values of θ^Q and θ^µ, respectively. An appropriate buffer size for experience replay is then set to store the data output from the environment.

At the start of an episode, the initial state s_1 is observed. The actor receives the current state s_t as input and outputs the action a_t = µ(s_t | θ^µ) + ε_t, which is applied to the environment. ε_t is the exploration noise, generated by an Ornstein–Uhlenbeck process. This process is smooth and prevents abnormal system responses due to exploration.
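A minimal sketch of such an Ornstein–Uhlenbeck noise generator is given below; the parameters theta, sigma, and dt are assumed values, as the study does not report them.

```python
# Sketch of Ornstein-Uhlenbeck exploration noise (parameters are assumed).
import numpy as np

class OUNoise:
    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, x0=0.0):
        self.theta, self.sigma, self.mu, self.dt, self.x = theta, sigma, mu, dt, x0

    def sample(self) -> float:
        # Mean-reverting random walk: temporally correlated, hence a smooth signal.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn()
        self.x += dx
        return self.x
```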
The environment responds with the next state s_{t+1} and reward r_t, and the tuple (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay buffer.

A mini-batch is randomly sampled from the experience replay buffer to update θ^Q such that the loss function L(θ^Q) is minimized. The Q-function is updated as shown in Equation (7).

Q^µ(s_t, a_t) = E_{r_t, s_{t+1}} [ r(s_t, a_t) + γ Q^µ(s_{t+1}, µ(s_{t+1})) ]    (7)

θ^Q evaluates the Q-values for the given state and action, reducing the difference between the actual reward and the predicted Q-value. θ^µ is updated using the policy gradient as shown in Equation (8).

∇_{θ^µ} J ≈ E [ ∇_a Q(s, a | θ^Q) |_{a=µ(s)} ∇_{θ^µ} µ(s | θ^µ) ]    (8)

Equation (8) calculates the gradient for the current policy µ(s | θ^µ), optimizing the policy network parameters θ^µ. This allows the agent to learn actions that yield higher expected rewards. The policy gradient ∇_{θ^µ} J is used to update the policy network parameters in the direction that maximizes the expected reward J. The critic network Q(s, a | θ^Q) evaluates the Q-value for state s and action a, and through the gradient of this Q-value, it assesses the effectiveness of the current policy µ(s | θ^µ). Based on this assessment, the policy network is updated. Subsequently, the parameters θ^µ of the actor network are updated by reflecting the gradient ∇_a Q(s, a | θ^Q) of the critic network. This adjustment enables the policy network to output better actions, thereby allowing the agent to receive higher rewards. Finally, the target networks are updated using the soft update method.

Through this iterative process, the actor and critic networks gradually learn the optimal policy and Q-values. DDPG, with its actor–critic architecture, enables stable and efficient learning. It is a powerful reinforcement learning algorithm specifically designed to solve continuous-action control problems. In this paper, to effectively control the ICC system, a DPG-IMC-based PID controller, which integrates the DDPG deep reinforcement learning algorithm with an IMC-based PID controller, is proposed, and its effectiveness is verified.
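For clarity, the critic and actor updates of Equations (7) and (8), together with the soft target update, can be sketched as a single training step. The sketch below uses PyTorch; the network modules, optimizers, and replay buffer are placeholders, and the soft-update rate tau is an assumed value.

```python
# Minimal sketch of one DDPG update (Equations (7) and (8) plus soft target update).
# `actor`, `critic`, and the target copies are torch.nn.Module instances;
# `replay_buffer.sample()` is assumed to return torch tensors (s, a, r, s_next).
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, replay_buffer,
                gamma=0.9, tau=0.005, batch_size=64):
    s, a, r, s_next = replay_buffer.sample(batch_size)

    # Critic: regress Q(s, a) towards r + gamma * Q'(s', mu'(s'))   (Equation (7))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient                (Equation (8))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks: soft update with rate tau
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```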
4. Deep Deterministic Policy Gradient-Based Internal Model Control-PID Control

4.1. IMC-Based PID Controller for Image-Based Combustion Control System

To effectively control the image-based combustion system of the ICC system, it is important to use an appropriate controller. One method is to use an IMC-based PID controller. Previous research applied an IMC-based PI controller to the ICC system and obtained a significant result with an ISE value of 10.1159. However, since flame images are used as input signals, including the derivative component of the PID controller can help predict and respond to rapid changes in the combustion process, thereby improving stability and responsiveness. Therefore, a PID controller is more suitable. In this process, high-frequency noise due to intermittent prediction errors may occur, but it can be mitigated by applying appropriate filtering techniques. The derivative component enhances the system's ability to respond to dynamic changes, reducing overshoot and settling time. Furthermore, despite the increased complexity of tuning the IMC controller, the advantages of achieving more precise and robust control using a PID controller outweigh these difficulties [35].

First, to design an IMC-based PID controller, the internal model is analyzed. The internal model transfer function estimated from the data in Equation (3) is in SOPZ form and is expressed as shown in Equation (9).

G_ICC(s) = k_p (βs + 1) / [(τ_a s + 1)(τ_b s + 1)],  τ_a < τ_b    (9)

where τ_a and τ_b are the time constants of the system, k_p is the proportional gain, and β is the constant associated with the zero. Consequently, the IMC controller can be expressed as shown in Equation (10), where f_i(s) represents the IMC filter for the ICC system.

q(s) = G_ICC(s)⁻¹ f_i(s)    (10)

Since the order of f_i(s) must be equal to or greater than the order of the numerator of the inverse model to achieve a realizable controller, the order of the filter function is set to match the order of the internal model G_ICC, as shown in Equation (11).

f_i(s) = (ηs + 1) / (λs + 1)²    (11)

λ and η are the time constants of the filter, and they need to be adjusted according to the required performance of the controller. They are parameters that regulate control performance and robustness. In this context, η is set equal to λ for the design of the PID controller.

By integrating the IMC controller q(s) with the internal model G_ICC, a classic feedback controller K_ICC(s) can be formed. This can be expanded using Equations (9) and (10) and expressed in the forms shown in Equations (12a) and (12b).

K_ICC(s) = q(s) / [1 − G_ICC(s) q(s)] = G_ICC(s)⁻¹ f_i(s) / [1 − f_i(s)]    (12a)

K_ICC(s) = [1 / (k_p λ (τ_a + τ_b))] [ 1 + 1/((τ_a + τ_b)s) + (τ_a τ_b / (τ_a + τ_b)) s ] · 1/(βs + 1)    (12b)

The expanded Equation (12b) shows that K_ICC(s) takes the form of a PID controller. Here, the term 1/(βs + 1) can be considered a low-pass filter. The cutoff frequency f_c of this filter is calculated as 1/(2πβ) and, for the identified internal model, is approximately 434 Hz.

In continuous-time systems, it is important to compare the primary operating frequency range of the system with the cutoff frequency of the filter. If the system primarily operates in the low-frequency range, a filter with a cutoff frequency of 434 Hz will have little to no impact on the system's main operating frequency range. Since the filter's cutoff frequency is much higher than the system's main frequency range, the effect of the filter can be ignored. Therefore, the impact of the low-pass filter 1/(βs + 1) on the system's frequency response is negligible, and the term can be disregarded in the analysis and design of the continuous-time controller K_ICC(s).

K_ICC(s) = [1 / (k_p λ (τ_a + τ_b))] [ 1 + 1/((τ_a + τ_b)s) + (τ_a τ_b / (τ_a + τ_b)) s ] = K_p + K_i (1/s) + K_d s    (13)

Comparing Equation (12b) with Equation (13), the control parameters can be written as in Equation (14).

K_p = 1 / (k_p λ (τ_a + τ_b)),  K_i = 1 / (k_p λ (τ_a + τ_b)²),  K_d = τ_a τ_b / (k_p λ (τ_a + τ_b)²)    (14)

The control elements for the ICC system are summarized in Table 2.

Table 2. Internal model and control elements in the ICC system.
Internal model: G_ICC(s) = k_p(βs + 1)/[(τ_a s + 1)(τ_b s + 1)]
IMC filter: f(s) = (ηs + 1)/(λs + 1)²
Controller: K_ICC(s) = K_p + K_i/s + K_d s, with
K_p = 1/(k_p λ(τ_a + τ_b)), K_i = 1/(k_p λ(τ_a + τ_b)²), K_d = τ_a τ_b/(k_p λ(τ_a + τ_b)²),
where k_p = 0.21865704, β = 2.728423, τ_a = 3.51 × 10⁻⁴, τ_b = 1.887649.

Therefore, by adjusting the IMC filter constant λ, the values of K_p, K_i, and K_d can be set to optimize control performance.
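The mapping from the single tuning constant λ to the three PID gains, following Equation (14) with the identified constants of Table 2, can be written as a small helper. The function name is illustrative, and the constants are taken from the reconstruction above.

```python
# Sketch of the lambda-to-PID-gains mapping of Equation (14), using the
# identified internal-model constants from Table 2 (illustrative values).
def imc_pid_gains(lam, k_p=0.21866, tau_a=3.51e-4, tau_b=1.887649):
    """Return (Kp, Ki, Kd) for a given IMC filter constant lambda."""
    s = tau_a + tau_b
    Kp = 1.0 / (k_p * lam * s)
    Ki = 1.0 / (k_p * lam * s * s)
    Kd = (tau_a * tau_b) / (k_p * lam * s * s)
    return Kp, Ki, Kd

# For lambda in [0.1, 2] this yields gain ranges close to those quoted in Section 5.1.
```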
4.2. Proposal of the IMC-DPGA (Deep Policy Gradient Adaptive) Controller

Previous related studies have conducted empirical learning using deep reinforcement learning algorithms to select the PID parameters K_p, K_i, and K_d for optimal control performance. However, when using deep reinforcement learning to directly learn K_p, K_i, and K_d, excessive parameter fluctuations due to initial exploration and exploration noise can cause overshoot in the control output, negatively affecting the plant. In flame combustion-based systems like the ICC system, changes in the air supply during the exploration phase can lead to incomplete combustion of the flame, contaminating the heat exchange surfaces and altering the system. Additionally, excessive overshoot poses the risk of flame extinction, which can prevent further learning stages and trap the system in a non-progressive learning loop. Moreover, if the PID parameters are learned sporadically, the range of action variables can widen, potentially leading to the curse of dimensionality.
However, in internal model control, the values of K_p, K_i, and K_d are determined by the internal model G_ICC, and they vary organically within the range set by λ [36]. By learning the λ that achieves optimal control performance, the number of action variables can be reduced from three to one, and their range can be limited. This makes the learning of the deep reinforcement learning agent more stable and faster. Additionally, since the control parameters are dynamically connected by the internal model, it is possible to prevent control instability caused by sporadic parameters, thereby ensuring stable control performance even during the learning process.

However, to adjust the optimal λ manually, the control system needs to be tuned at each unit value, which consumes a significant amount of time. Additionally, it is practically difficult to verify control performance down to small increments (below 0.1), and the system must continuously respond to changes in external environmental conditions. Therefore, to apply the optimal λ to the control system in real time in response to system changes, this paper proposes an IMC-DPGA (deep policy gradient adaptive) controller using the DDPG algorithm. The structure of the proposed IMC-DPGA controller is shown in Figure 4.

Figure 4. DDPG-based architecture for the image-based combustion control system with IMC-PID integration.

The IMC-DPGA control system shown in Figure 4 illustrates the structure for dynamically updating the λ of the IMC-based PID controller using the DDPG algorithm. This system aims to achieve optimal performance of the PID controller under changing environmental conditions. The image-based combustion control system handles a continuous process in real time, where the DDPG algorithm efficiently learns the optimal policy within a continuous action space. This distinguishes it from other reinforcement learning algorithms that focus on discrete action spaces. The agent receives information from the ICC system, observes the state s_k, receives a reward r_k, and repeatedly determines and updates the action a_k, leading to a new state s_{k+1}. These updates allow the agent to adapt to the environment and estimate the appropriate value of λ to improve control performance.

The value of λ updated by the agent is applied to the control parameters of the IMC-PID at regular intervals of N steps. The optimal N for learning may vary depending on the control environment, so it should be determined through additional parameter selection experiments.

Based on this structure, the IMC-DPGA controller, which combines the DDPG agent and the IMC-PID controller, effectively adapts to dynamic changes in the process environment and can stably control the flame. By periodically updating λ through reinforcement learning, the system maintains optimal control performance, ensuring stability and efficiency. This approach provides an adaptive and intelligent control solution, overcoming the limitations of direct parameter adjustment.
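The interaction pattern of Figure 4 can be outlined as follows. This is a hypothetical sketch in which `agent`, `pid`, `icc_plant`, and `gains_from_lambda` stand in for the actual components; the agent's action is clipped to the allowed λ range before being mapped to PID gains through the internal model.

```python
# Illustrative outline of the IMC-DPGA loop of Figure 4 (interfaces are assumed).
import numpy as np

def run_interval(agent, pid, icc_plant, gains_from_lambda,
                 state, target_o2=4.0, N=100, lam_range=(0.1, 2.0)):
    """Apply one agent action (a new lambda) and run the PID loop for N steps."""
    lam = float(np.clip(agent.act(state), *lam_range))   # constrained action
    pid.set_gains(*gains_from_lambda(lam))                # gains tied by the internal model
    errors = []
    for _ in range(N):
        o2 = icc_plant.read_oxygen()                      # SEF + SVM estimate
        err = target_o2 - o2
        icc_plant.write_damper(pid.update(err))
        errors.append(err)
    return lam, errors     # errors feed the next state and the reward (Section 4.3)
```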
4.3. Agent Environment Configuration

4.3.1. State and Action of the Agent

The state vector of the IMC-DPGA for the ICC system consists of the oxygen concentration error, the rate of change of the error, the current oxygen concentration, and the current λ, as shown in Equation (15).

s(k) = [e_k, Δe_k, O_k, λ_k],  a(k) = λ_{k+1}    (15)

The oxygen concentration error e_k is defined as the difference between the target oxygen concentration and the current measured oxygen concentration, and it is used to evaluate the need for adjusting λ. The rate of change of the error Δe_k represents the rate at which the oxygen concentration error changes over time, reflecting the system's dynamic response to λ_k. The current oxygen concentration O_k directly reflects the current state of the system, and including λ_k reflects the current adjustment level of the control input. This allows for more precise prediction and control. By using this state vector, the dynamic characteristics of the system can be understood, and accurate control can be performed through predictive and adaptive control.

4.3.2. Reward

To ensure effective control, the reward function of the IMC-DPGA must be designed to have a positive correlation with performance. Specifically, the amplitude of the system output y(t) should be minimized, and the output should quickly converge to the target value. Therefore, the reward function should include both the time steps of the entire closed-loop trajectory and the error e(t). The reward function for controlling the oxygen concentration of the boiler combustion system can be designed to minimize the error, defined as error = O_target − O_current, and to reduce the system's instability through the change in error, Δerror = error_current − error_previous. The reward function reflecting this is described in Equation (16).

r(k) = − [ (1/N) Σ_{k=1}^{N} |e(k)| + (1/N) Σ_{k=1}^{N} |Δe(k)| ]    (16)

In the above equation, N represents the number of steps per episode. Adjusting N in the reward function can optimize the overall performance of the OFB system. Lowering the value of N has the advantage of quick adaptation and immediate response, but it also increases computational complexity due to frequent updates and may lead to system instability due to excessive parameter fluctuations. Conversely, increasing N reduces the computational load and maintains a certain level of stability between updates, but if the update interval is too long, accuracy may decrease due to overfitting. To review the stability of learning under this reward function, experiments are conducted with N set to 1, 50, 100, and 200. Through these experiments, the system's response and stability for each value of N are evaluated, and the optimal N is determined.
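The state vector of Equation (15) and the reward of Equation (16) can be computed from the error trace of one N-step interval as sketched below; variable names are illustrative.

```python
# Sketch of the state vector (Equation (15)) and reward (Equation (16)).
import numpy as np

def make_state(errors, o2_current, lam):
    """State: [error, change of error, current O2, current lambda]."""
    e = errors[-1]
    de = errors[-1] - errors[-2] if len(errors) > 1 else 0.0
    return np.array([e, de, o2_current, lam], dtype=np.float32)

def reward(errors):
    """Negative mean absolute error plus mean absolute change of error over N steps."""
    e = np.asarray(errors, dtype=float)
    de = np.diff(e, prepend=e[0])
    return -(np.mean(np.abs(e)) + np.mean(np.abs(de)))
```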
5. Training and Experiments

5.1. Experimental Setup

The reinforcement learning algorithm parameters set as initial conditions are presented in Table 3.

Table 3. Training parameters used for the DDPG agent.
Parameter             | Actor       | Critic
Network structure     | [50 25 1]   | [50 25 25 1]
Learning rate         | 10⁻⁴        | 10⁻³
Activation function   | Tanh        | ReLU
Optimization function | Adam        | Adam
Early stopping patience: 10; mini-batch size: 64; discount factor: 0.9; replay buffer size: 10.

The network structure, mini-batch size, and learning rates were selected based on preliminary experiments that showed optimal performance. Table 3 lists the key parameters adopted for training. When determining parameters such as the learning rate and the batch size for experience replay, DeepMind's DPG model was referenced [37], and slight adjustments were made based on benchmark values. These parameters were gradually refined through a process of trial and error. Through this process, it was found that the learning outcomes were sensitive to certain parameters, such as the learning rate and network structure, but not to others, such as the experience replay buffer size. Ultimately, parameters were selected that did not lead to overfitting and did not place excessive demands on computational resources.

The actor network uses the Tanh activation function to limit the output range to between −1 and 1, providing stability, while the critic network uses the ReLU activation function to introduce nonlinearity and increase learning speed. The Adam optimization algorithm was chosen because it provides fast and stable convergence by automatically adjusting the learning rate. The early stopping patience is set to 10 to prevent overfitting and allow early termination of the training process. The discount factor is set to 0.9 to balance future rewards against present rewards.

The range of λ updated by the DDPG agent is set to λ ∈ [0.1, 2], with the initial value of λ set to 1. This range is set considering the system's performance and stability. According to Equation (14), the corresponding range of PID parameters determined by the IMC internal model is K_p ∈ [1.21, 22.22], K_i ∈ [0.64, 12.83], and K_d ∈ [0.000425, 0.0085].
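Networks matching the layer sizes and activations of Table 3 can be sketched as follows (PyTorch, for illustration); the rescaling of the actor's Tanh output to the allowed λ range is an assumption about how the bounded action is produced, not a detail reported here.

```python
# Sketch of actor [50 25 1] (Tanh) and critic [50 25 25 1] (ReLU) networks per Table 3.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=4, lam_min=0.1, lam_max=2.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 50), nn.Tanh(),
                                 nn.Linear(50, 25), nn.Tanh(),
                                 nn.Linear(25, 1), nn.Tanh())
        self.lam_min, self.lam_max = lam_min, lam_max

    def forward(self, s):
        # Tanh output in [-1, 1] rescaled to the allowed lambda range (assumed mapping).
        u = self.net(s)
        return self.lam_min + (u + 1.0) * 0.5 * (self.lam_max - self.lam_min)

class Critic(nn.Module):
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 50), nn.ReLU(),
                                 nn.Linear(50, 25), nn.ReLU(),
                                 nn.Linear(25, 25), nn.ReLU(),
                                 nn.Linear(25, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```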
5.2. Threshold Analysis

In this section, experiments are conducted to select the optimal value of N, the number of steps per episode for the DDPG agent, by varying N. The experimental learning results for different values of N are shown in Figure 5.

Figure 5. Training results for IMC-DPGA according to the number of steps per episode, N.

Figure 5 shows the learning performance of the DDPG agent when the number of steps per episode N is 1, 50, 100, and 200. The candidate step counts were determined experimentally, starting from 1 and increasing in multiples up to the point at which overfitting occurs. The summarized results for the graph are presented in Table 4.

Table 4. Training termination episodes and rewards for different N in IMC-DPGA training.
Step count per episode, N | Termination episode | Last reward
1   | 289 | −0.205
50  | 194 | −0.135
100 | 105 | −0.05
200 | 158 | −0.67

As can be seen from the graph and table, when N = 1 (blue curve), the agent's reward shows significant fluctuations during the learning process and the lowest final reward value (−0.205) after the highest number of episodes (289). This can be interpreted as the negative impact on the reward of the agent not having sufficient opportunities to explore because of the low number of steps per episode. This variability indicates instability and inefficiency in learning.

When N = 50 (green curve), the fluctuations decrease compared to N = 1, and the final reward value (−0.135) is higher after fewer episodes (194), showing improved learning stability. This indicates that as the number of steps per episode increases, the agent has more opportunities to interact with the environment, leading to more effective exploration and increased learning efficiency.

For N = 100 (red curve), the agent achieves the highest final reward value (−0.05) in the fewest number of learning episodes (105). This suggests that exploration is more effective for the same reason as in the N = 50 case, indicating optimal learning efficiency and performance.

In the case of N = 200 (pink curve), there is an increase in the number of episodes (158) and a decrease in the reward value (−0.67). This indicates that excessive exploration leads the agent to not find the optimal actions and to spend unnecessary time, suggesting that the exploration–exploitation balance is disrupted and that this is not an appropriate value.

These results demonstrate the importance of appropriately selecting the number of steps per episode to optimize the learning performance of the DDPG agent. As N increases, the learning process tends to become smoother and more stable; however, an N value that is too large can lead to performance degradation. N = 100 is shown to be the most efficient and highest-performing number of steps, as it avoids the excessive exploration that prevents the agent from finding optimal actions and leads to unnecessary time consumption. Therefore, selecting N = 100 maximizes the performance of the DDPG controller.

5.3. Experiment and Result Analysis

In this section, experiments are conducted to apply the proposed IMC-DPGA controller to the ICC system for adaptive tuning. Figure 6 shows the learning process of λ, K_p, K_i, and K_d per episode, with the number of steps per episode N set to 100. The graph is a three-dimensional representation of the kernel density estimation of the values output at each step of each episode. Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of data. It smoothly represents the distribution of given data, making it easier to identify patterns, and is often used to understand or visualize the underlying distribution of the data. This allows for a visual understanding of the data learned at each step of each episode.
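As a small illustration of how such a per-episode KDE and its mode can be obtained, the sketch below uses SciPy's Gaussian KDE; the grid over the allowed λ range is an assumed discretization.

```python
# Illustrative per-episode KDE, as used conceptually for Figure 6 and Table 5.
import numpy as np
from scipy.stats import gaussian_kde

def episode_mode(lambda_trace, lam_range=(0.1, 2.0), n_grid=500):
    """Return the mode and peak density of a per-episode lambda trace via Gaussian KDE."""
    kde = gaussian_kde(np.asarray(lambda_trace, dtype=float))
    grid = np.linspace(*lam_range, n_grid)
    density = kde(grid)
    return grid[np.argmax(density)], density.max()
```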
5.3. Experiment and Result Analysis

In this section, experiments are conducted to apply the proposed IMC-DPGA controller to the ICC system for adaptive tuning. Figure 6 shows the learning process of λ, K_p, K_i, and K_d per episode, with the number of steps per episode N set to 100.

The graphs are 3D representations of the kernel density estimation of the values output at each step of each episode. Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of data. It smoothly represents the distribution of the given data, making it easier to identify patterns, and is often used to understand or visualize the underlying distribution of the data. This allows for a visual understanding of the data learned at each step of each episode.

When examining the overall learning trend of the control parameters, the initial episodes show high volatility and a wide density distribution, indicating exploration of various values. As learning progresses, the control parameter values concentrate within a specific range, as indicated by the points with the highest density values. The increase in density indicates convergence towards a stable optimal value. From Table 5, the final values of λ, K_p, K_i, and K_d can be determined as the mode values corresponding to the highest density in each distribution.

Figure 6. Parameter-wise KDE of IMC-DPGA training process ((A) λ, (B) K_p, (C) K_i, (D) K_d) at N = 100.
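As a concrete illustration of how the modal values reported in Table 5 can be read off such a density estimate, the snippet below fits a Gaussian KDE to a parameter trace and returns the value of highest density. It uses SciPy's `gaussian_kde` on synthetic placeholder data, not the actual training logs.

```python
# Extracting the modal (highest-density) value of a control parameter with a Gaussian KDE.
# The lambda samples below are synthetic placeholders standing in for one episode's outputs.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
lambda_samples = rng.normal(loc=0.44, scale=0.02, size=100)  # placeholder trace near 0.44

kde = gaussian_kde(lambda_samples)
grid = np.linspace(lambda_samples.min(), lambda_samples.max(), 1000)
density = kde(grid)

mode = grid[np.argmax(density)]          # value with the highest estimated density
print(f"mode = {mode:.3f}, peak density = {density.max():.2f}")
```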
Table 5. Parameter-wise KDE result details from the IMC-DPGA training.

Control Parameters    Range of Mode     Density    Mode
λ                     0.435~1           63.27      0.44
K_p                   2.41~5.59         5.16       5.48
K_i                   1.28~2.98         9.73       2.90
K_d                   0.00085~0.002     14,229     0.0019

As a result of the learning process over the episodes, the value of λ fluctuates within the range of 0.435 to 1, and accordingly, K_p varies from 2.41 to 5.59, K_i ranges from 1.28 to 2.98, and K_d changes from 0.00085 to 0.002 in a similar trend. The detailed learning process of these changes can be observed in Figure 7, which presents a cross-sectional view of the KDE graph for λ in Figure 6.

The learning process can be divided into two phases. Phase 1 is the period of rapid change from the initial value of 1 up to the 37th episode. Phase 2 runs from the 38th episode to the end of the learning process, converging to 0.44, during which the volatility is very small, ranging only from 0.456 to the final value of 0.44.

In Phase 1, λ changes rapidly within the range of 1 to 0.456. During this time, K_p changes from 2.41 to 5.4, K_i changes from 1.28 to 2.88, and K_d changes within the small range of 0.00189 to 0.002. This demonstrates that the IMC-DPGA can effectively and stably adapt to changes caused by exploration through the internal model, especially for plants such as the ICC system, where the transient response significantly impacts stability. Subsequently, in Phase 2, these parameters converge more stably, ensuring the final control performance of the system. As a result, the optimal λ converges to 0.44, and the corresponding values of K_p, K_i, and K_d converge to 5.48, 2.9, and 0.0019, respectively.

Figure 7. Cross-sectional KDE for detailed analysis of IMC-DPGA training process.

6. Compare Performance of Different Controllers

In this section, the real-time control performance of the proposed IMC-DPGA controller is evaluated by comparing it with several major PID control algorithms.
The experiments are conducted on the S2 system to maintain a consistent oxygen concentration while the OFB system S1 is operating. The experimental procedure maintains an initial 4% oxygen concentration in the S2 ICC system for 200 s, and the results are then compared to verify control performance. At the 100 s mark, the setpoint for the oxygen concentration is changed from 4% to 5%, and the control output data are collected. The collected data include the value predicted from the flame images, with a sampling time of 1 s. The transient response period and steady-state performance of each controller are compared. The algorithms selected for comparison are the Ziegler–Nichols tuning method, the Lambda tuning method, and the IMC-Maclaurin (IMC-MAC) closed-loop tuning method.

The Ziegler–Nichols tuning method is a classical approach that sets the PID parameters using the critical gain and critical period, allowing for simple and quick initial settings. The Lambda tuning method sets the PID parameters based on the system's time constant, making it practical and easy to use. The IMC-MAC closed-loop tuning method combines the IMC tuning method with the Maclaurin (MAC) approximation, providing high precision for complex systems. This comparison allows for the evaluation of the real-time control performance and suitability of the various PID tuning methods.
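For reference, the classical Ziegler–Nichols baseline mentioned above can be written compactly as below. The sketch encodes the standard closed-loop rule (K_p = 0.6 K_u, T_i = T_u/2, T_d = T_u/8); the critical gain and critical period in the usage line are illustrative placeholders, not values measured for the S2 plant.

```python
# Classical Ziegler-Nichols closed-loop tuning rule (ultimate gain K_u, ultimate period T_u).
# The example K_u and T_u below are illustrative placeholders, not values from this study.

def ziegler_nichols_pid(k_u: float, t_u: float):
    """Return (K_p, K_i, K_d) for a parallel-form PID from the classic Z-N table."""
    k_p = 0.6 * k_u
    t_i = 0.5 * t_u          # integral time
    t_d = 0.125 * t_u        # derivative time
    return k_p, k_p / t_i, k_p * t_d

if __name__ == "__main__":
    kp, ki, kd = ziegler_nichols_pid(k_u=10.0, t_u=20.0)
    print(f"K_p={kp:.2f}, K_i={ki:.3f}, K_d={kd:.2f}")
```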
The results are shown in Figure 8.

Figure 8. Comparison of control strategies for oxygen concentration step change.

The two graphs compare the performance of the various control algorithms for regulating the oxygen concentration. The graph on the right provides a detailed view of the transient response period from 85 to 125 s, allowing for an evaluation of the proposed IMC-DPGA controller's performance in comparison with the other control algorithms. Table 6 compares the maximum overshoot (M_p) and the integral square error (ISE) to evaluate the step response performance of each controller.

Table 6. Response of the tuning methods according to changes in the oxygen concentration target value.

Tuning Method    M_p       ISE
Z-N              0.1114    11.1966
λ-T              0.0819    10.0912
IMC-MAC          0.1250    8.1189
IMC-DPGA         0.0631    7.7278

Analysis of the results in Table 6 reveals differences in the step response performance of each controller. For the Z-N tuning method, the M_p is 0.1114, indicating a significantly large transient response. Additionally, the ISE is high at 11.1966, which suggests considerable residual oscillation and error in the system's response. This implies that the response of the Z-N controller is unstable and prone to oscillation.

In the case of the λ-T tuning method, the M_p is 0.0819, showing an improved transient response compared to Z-N. However, the ISE remains high at 10.0912, indicating that the residual oscillations have not been eliminated. This suggests that while the transient response has been reduced, the overall quality of the response is still lacking.

The IMC-MAC tuning method shows a significant improvement, with an ISE of 8.1189, indicating a substantial reduction in error. However, its M_p of 0.1250 is the highest among the methods, suggesting that the initial stability of the response is lacking due to the large transient response. In other words, while the error has decreased, the method exhibits a considerable transient phenomenon during the initial response.

Finally, the proposed IMC-DPGA tuning method demonstrates substantial improvements, with M_p and ISE values of 0.0631 and 7.7278, respectively. This indicates that both the overall error and the transient response have been greatly improved. Notably, the M_p is the lowest, meaning the transient response is minimized, which signifies that the system is highly stable and converges to the target value rapidly.
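As a minimal sketch of how the two metrics in Table 6 can be computed from a recorded step response, the snippet below evaluates the maximum peak above the setpoint (overshoot) and the integral of squared error with a rectangular-rule integration over the 1-s samples. The response array is a synthetic placeholder, not the experimental data.

```python
# Computing maximum overshoot and ISE from a sampled step response (1 s sampling time).
# `response` is a synthetic placeholder; in the experiment it would be the measured
# oxygen concentration after the setpoint change from 4% to 5% at t = 100 s.
import numpy as np

dt = 1.0                                               # sampling time [s]
t = np.arange(0, 100, dt)
setpoint = 5.0
response = 5.0 - np.exp(-t / 8.0) * np.cos(0.3 * t)    # placeholder under-damped response

error = setpoint - response
overshoot = max(0.0, response.max() - setpoint)        # maximum peak above the target
ise = np.sum(error**2) * dt                            # rectangular-rule integral of e^2

print(f"M_p = {overshoot:.4f}, ISE = {ise:.4f}")
```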
Additionally, Figure 9 presents the steady-state response for oxygen concentration targets of 4% and 5%.

Figure 9. Comparison of 4% and 5% steady-state responses for various controllers.

The graph is a boxplot of the output data in the steady-state regions at the control targets of 4% and 5%. This allows the stability of each controller in the steady state to be assessed. Table 7 quantifies the data from the graph, showing the median, upper adjacent (U.A), and lower adjacent (L.A) values of the output data for each controller.

Table 7. Steady-state analysis of oxygen concentration at 4% and 5% for various controllers.

                      4% Steady-State Response          5% Steady-State Response
Tuning Method         Median     U.A       L.A          Median     U.A       L.A
Z-N                   4.0269     4.1069    3.932        5.022      5.112     4.9304
λ-T                   4.0331     4.136     3.9056       4.9863     5.0819    4.8698
IMC-MAC               4.0054     4.1029    3.8696       5.024      5.125     4.9024
IMC-DPGA              3.9968     4.0498    3.9444       5.0188     5.0631    4.974
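The median and adjacent values in Table 7 can be obtained from the steady-state samples as sketched below, assuming the usual boxplot convention in which the whiskers extend to the most extreme data points within 1.5 times the interquartile range of the quartiles; the convention and the sample data here are assumptions for illustration, not taken from the experiment.

```python
# Median and upper/lower adjacent values as reported in Table 7, assuming the standard
# 1.5*IQR whisker convention. `samples` is a synthetic placeholder, not measured data.
import numpy as np

def boxplot_stats(samples):
    q1, med, q3 = np.percentile(samples, [25, 50, 75])
    iqr = q3 - q1
    upper_adjacent = samples[samples <= q3 + 1.5 * iqr].max()
    lower_adjacent = samples[samples >= q1 - 1.5 * iqr].min()
    return med, upper_adjacent, lower_adjacent

rng = np.random.default_rng(1)
samples = rng.normal(loc=4.0, scale=0.03, size=100)   # placeholder 4% steady-state trace
med, ua, la = boxplot_stats(samples)
print(f"Median={med:.4f}, U.A={ua:.4f}, L.A={la:.4f}")
```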
The Z-N controller shows similar medians and data distributions at both control targets of 4% and 5%. The λ-T controller also exhibits a similar distribution, indicating stability comparable to that of the Z-N controller. The IMC-MAC controller, however, shows a significantly larger data distribution, suggesting lower stability. This implies that while IMC-MAC demonstrates a fast response speed during the transient response period, it experiences significant oscillations in the steady state, resulting in lower stability. The IMC-DPGA, compared with the other controllers, shows the narrowest data distribution and a median that closely follows the control target, indicating the highest control stability in the steady state. This confirms that IMC-DPGA ensures faster response speeds while providing superior stability compared with the other controllers. In particular, the superior performance of the IMC-DPGA relative to the IMC-MAC demonstrates the effectiveness of the proposed tuning method, which adaptively adjusts the value of λ in real time according to the internal model by combining the DDPG algorithm with the IMC structure.

7. Conclusions

The tightening of atmospheric pollutant emission regulations in the maritime sector has spurred efforts to reduce emissions from combustion boilers. Understanding the correlation between control variables and atmospheric pollutants and controlling a calculated model can reduce these emissions. However, existing boiler combustion measurement and control systems have high time constants and struggle to achieve appropriate control in the face of dynamic changes in the model caused by various variables.

Thus, using flame images as a means to measure the oxygen concentration and employing an image-based combustion control system that can additionally control the air volume in existing combustion systems can reduce measurement delay times and excessive combustion state changes, enabling stable real-time control.

In this paper, the IMC-DPGA (internal model control–deep policy gradient adaptive) controller is proposed, which combines the IMC-PID controller, known for its excellent model-based control, with the DDPG algorithm, which allows continuous exploration learning, and is applied to an image-based combustion control system. Because the PID control parameters are linked through the internal model of the IMC, transient responses caused by sporadic changes in individual parameters during the learning phase are prevented. Additionally, unlike traditional RL-based PID parameter tuning methods, the action variable is reduced from three dimensions to one by using the IMC filter parameter lambda (λ), saving computational resources and enabling stable and fast learning.

By setting and controlling the PID parameters based on the threshold of 100 steps (N) per episode established through experimentation, a reward value of −0.05 was achieved in just 105 episodes. Furthermore, step-response comparison experiments with other controllers showed that the IMC-DPGA controller demonstrated the fastest response speed, lowest overshoot, and minimal oscillation compared with the existing PID controllers, proving its stability and effectiveness.

The experiments in this study were conducted on actual operating ships, verifying the practicality of the approach. Additionally, the image-based combustion control system can be easily integrated into existing ships at low cost, providing an immediate reduction in atmospheric pollutants.

However, increasing the target oxygen concentration can suppress atmospheric pollutants through excess air but decreases boiler efficiency. Therefore, future research must develop optimal control strategies that balance pollutant reduction and boiler efficiency.
To achieve this, combining multi-objective optimization techniques with the IMC-DPGA control algorithm will be essential to respond to real-time changes in combustion conditions and to simultaneously optimize pollutant emissions and energy efficiency. Furthermore, since any improvement in the learning agent's performance translates directly into enhanced controller performance, further research on improving the agent model through transfer learning is necessary.

Author Contributions: Conceptualization, C.-M.L. and B.-G.J.; methodology, C.-M.L.; formal analysis, C.-M.L.; writing—original draft preparation, C.-M.L.; writing—review and editing, B.-G.J. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the Research Promotion Program through the National Korea Maritime and Ocean University Research Fund in 2023.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Acknowledgments: First and foremost, we extend our deepest gratitude to everyone who has played a role in the successful completion of this work. We also wish to express our sincere thanks to the esteemed reviewers for their meticulous evaluation, insightful feedback, and expert guidance throughout the peer review process. Additionally, we would like to extend our heartfelt appreciation to Jung Byung-Gun for his invaluable mentorship, unwavering support, and profound insights that greatly contributed to this work. Lastly, we are immensely thankful to the editors for their dedication, hard work, and commitment to advancing knowledge in our field.

Conflicts of Interest: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Nomenclature
CO2        Carbon dioxide
NOx        Nitrogen oxides
SO2        Sulfur dioxide
SOx        Sulfur oxides
a          Action of the actor (time-varying)
e          Target oxygen concentration error
f(s)       IMC filter
f          Cutoff frequency
f(s)       IMC filter for the ICC system
f(s)       Low-pass filter
G(s)       Transfer function of the controlled process
G_S2       Actual system transfer function
G_ICC      Internal model transfer function of the ICC system
G_S2       ICC system transfer function
J          Expected reward
K_ICC(s)   Classic controller of the ICC system
K_d        Derivative gain
K_i        Integral gain
K_p        Proportional gain
L(θ)       Loss function
M_p        Maximum peak error
N          Step number per episode
O          Current oxygen concentration
Q(s)       IMC controller
r          Reward (time-varying)
s          State of the actor (time-varying)
s_{t+1}    Next state
u(t)       Control input
y(t)       Amplitude of the system output
Greek symbols
β          Constant associated with the zero
ε          Noise for exploration (time-varying)
θ          Critic network
θ          Target network for the critic
θ          Actor network
θ          Target network for the actor
λ, η       Time constants of the IMC filter
τ, τ       Time constants of the system
τ_d        Derivative parameter
τ_i        Integral parameter
∇_μ J      Policy gradient for the expected reward
Index
A/D        Analog-to-digital
CMOS       Complementary metal-oxide-semiconductor
DDPG       Deep deterministic policy gradient
DRL        Deep reinforcement learning
DPS        Dynamic positioning system
DQN        Deep Q-networks
HSV        Hue, saturation, and value
ICC        Image-based combustion control
ISE        Integral of squared error
IMC        Internal model control
MAC        Maclaurin
MSE        Mean squared error
KDE        Kernel density estimation
L.A        Lower adjacent
OFB        Oil-fired boiler
PID        Proportional–integral–derivative
PI         Proportional–integral
R2         R-squared
SEF        Saturation extraction filter
SOLAS      The International Convention for the Safety of Life at Sea
SOPZ       Second-order plus zero-pole
SVM        Support vector machine
TD         Time delay
USB        Universal serial bus
U.A        Upper adjacent
Z-N        Ziegler–Nichols

References
1. MarkWide Research. Global Marine Boilers Market: Analysis, Industry Size, Share, Research Report, Insights, COVID-19 Impact, Statistics, Trends, Growth, and Forecast 2024–2032; MarkWide Research: Torrance, CA, USA, 2024.
2. Shelyapina, M.G.; Rodríguez-Iznaga, I.; Petranovskii, V. Materials for CO2, SOx, and NOx Emission Reduction. In Handbook of Nanomaterials and Nanocomposites for Energy and Environmental Applications; Springer: Cham, Switzerland, 2020; pp. 2429–2458.
3. Tadros, M.; Ventura, M.; Soares, C.G. Review of current regulations, available technologies, and future trends in the green shipping industry. Ocean Eng. 2023, 280, 114670. [CrossRef]
4. Zhao, J.; Wei, Q.; Wang, S.; Ren, X. Progress of ship exhaust gas control technology. Sci. Total Environ. 2021, 799, 149437. [CrossRef] [PubMed]
5. Nemitallah, M.A.; Nabhan, M.A.; Alowaifeer, M.; Haeruman, A.; Alzahrani, F.; Habib, M.A.; Elshafei, M.; Abouheaf, M.I.; Aliyu, M.; Alfarraj, M. Artificial intelligence for control and optimization of boilers' performance and emissions: A review. J. Clean. Prod. 2023, 417, 138109. [CrossRef]
6. Chen, J.; Chang, Y.; Cheng, Y.; Hsu, C. Design of image-based control loops for industrial combustion processes. Appl. Energy 2012, 94, 13–21. [CrossRef]
7. Krishnamoorthi, M.; Agarwal, A.K. Combustion instabilities and control in compression ignition, low-temperature combustion, and gasoline compression ignition engines. In Gasoline Compression Ignition Technology: Future Prospects; Springer: Berlin/Heidelberg, Germany, 2022; pp. 183–216.
8. Sujatha, K.; Venmathi, M.; Pappa, N. Flame monitoring in power station boilers using image processing. Ictact J. Image Video Process. 2012, 2, 427–434.
9. Omiotek, Z.; Kotyra, A. Flame image processing and classification using a pre-trained VGG16 model in combustion diagnosis. Sensors 2021, 21, 500. [CrossRef]
10. Lee, C.; Jung, B.; Choi, J. Experimental Study on Prediction for Combustion Optimal Control of Oil-Fired Boilers of Ships Using Color Space Image Feature Analysis and Support Vector Machine. J. Mar. Sci. Eng. 2023, 11, 1993. [CrossRef]
11. Lee, C. Combustion Control of Ship's Oil-Fired Boilers based on Prediction of Flame Images. J. Mar. Sci. Eng. 2024, 12, 1474. [CrossRef]
12. Noye, S.; Martinez, R.M.; Carnieletto, L.; De Carli, M.; Aguirre, A.C. A review of advanced ground source heat pump control: Artificial intelligence for autonomous and adaptive control. Renew. Sustain. Energy Rev. 2022, 153, 111685. [CrossRef]
13. Qi, R.; Tao, G.; Jiang, B. Fuzzy System Identification and Adaptive Control; Springer: Cham, Switzerland, 2019.
14. Yaseen, H.M.S.; Siffat, S.A.; Ahmad, I.; Malik, A.S. Nonlinear adaptive control of magnetic levitation system using terminal sliding mode and integral backstepping sliding mode controllers. ISA Trans. 2022, 126, 121–133. [CrossRef]
15. Mahmud, M.; Motakabber, S.; Alam, A.Z.; Nordin, A.N. Adaptive PID controller using for speed control of the BLDC motor.
In Proceedings of the 2020 IEEE International Conference on Semiconductor Electronics (ICSE), Kuala Lumpur, Malaysia, 28–29 July 2020; pp. 168–171.
16. Nohooji, H.R. Constrained neural adaptive PID control for robot manipulators. J. Frankl. Inst. 2020, 357, 3907–3923. [CrossRef]
17. Wang, J.; Zhu, Y.; Qi, R.; Zheng, X.; Li, W. Adaptive PID control of multi-DOF industrial robot based on neural network. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 6249–6260. [CrossRef]
18. Dubey, V.; Goud, H.; Sharma, P.C. Role of PID control techniques in process control system: A review. In Data Engineering for Smart Systems: Proceedings of SSIC 2021; Springer: Singapore, 2022; pp. 659–670.
19. Kanungo, A.; Choubey, C.; Gupta, V.; Kumar, P.; Kumar, N. Design of an intelligent wavelet-based fuzzy adaptive PID control for brushless motor. Multimed. Tools Appl. 2023, 82, 33203–33223. [CrossRef]
20. Chen, S. Review on supervised and unsupervised learning techniques for electrical power systems: Algorithms and applications. IEEJ Trans. Electr. Electron. Eng. 2021, 16, 1487–1499. [CrossRef]
21. Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274.
22. Lee, D.; Lee, S.J.; Yim, S.C. Reinforcement learning-based adaptive PID controller for DPS. Ocean Eng. 2020, 216, 108053. [CrossRef]
23. Carlucho, I.; De Paula, M.; Acosta, G.G. An adaptive deep reinforcement learning approach for MIMO PID control of mobile robots. ISA Trans. 2020, 102, 280–294. [CrossRef]
24. Siraskar, R. Reinforcement learning for control of valves. Mach. Learn. Appl. 2021, 4, 100030. [CrossRef]
25. Lawrence, N.P.; Stewart, G.E.; Loewen, P.D.; Forbes, M.G.; Backstrom, J.U.; Gopaluni, R.B. Optimal PID and antiwindup control design as a reinforcement learning problem. IFAC-PapersOnLine 2020, 53, 236–241. [CrossRef]
26. Lakhani, A.I.; Chowdhury, M.A.; Lu, Q. Stability-preserving automatic tuning of PID control with reinforcement learning. arXiv 2021, arXiv:2112.15187. [CrossRef]
27. Ding, Y.; Ren, X.; Zhang, X.; Liu, X.; Wang, X. Multi-phase focused PID adaptive tuning with reinforcement learning. Electronics 2023, 12, 3925. [CrossRef]
28. Datta, A. Adaptive Internal Model Control; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
29. Zaporozhets, A.O. Research of the process of fuel combustion in boilers. In Control of Fuel Combustion in Boilers; Springer: Cham, Switzerland, 2020; pp. 35–60.
30. Chen, J.; Chang, Y.; Cheng, Y. Performance design of image-oxygen based cascade control loops for boiler combustion processes. Ind. Eng. Chem. Res. 2013, 52, 2368–2378. [CrossRef]
31. Xiao, G.; Gao, X.; Lu, W.; Liu, X.; Asghar, A.B.; Jiang, L.; Jing, W. A physically based air proportioning methodology for optimized combustion in gas-fired boilers considering both heat release and NOx emissions. Appl. Energy 2023, 350, 121800. [CrossRef]
32. Li, Y.; Zhang, T.; Das, S.; Shamma, J.; Li, N. Non-asymptotic system identification for linear systems with nonlinear policies. IFAC-PapersOnLine 2023, 56, 1672–1679. [CrossRef]
33. Tan, H. Reinforcement learning with deep deterministic policy gradient. In Proceedings of the 2021 International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA), Xi'an, China, 28–30 May 2021; pp. 82–85.
34. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms.
In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014; pp. 387–395.
35. Nise, N.S. Control Systems Engineering; John Wiley & Sons: Hoboken, NJ, USA, 2020.
36. Rivera, D.E. Internal Model Control: A Comprehensive View; Arizona State University: Tempe, AZ, USA, 1999; pp. 85287–86006.
37. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Journal of Marine Science and Engineering – Multidisciplinary Digital Publishing Institute
Published: Sep 10, 2024
Keywords: combustion control; emission prediction; IMC-based PID; real-time control; image-based control; deep deterministic policy gradient algorithm