FUZZY INFORMATION AND ENGINEERING 2021, VOL. 13, NO. 3, 368–390
https://doi.org/10.1080/16168658.2021.1943887

Output Feedback Controller for a Class of Unknown Nonlinear Discrete Time Systems Using Fuzzy Rules Emulated Networks and Reinforcement Learning

C. Treesatayapun
Department of Robotic and Advanced Manufacturing, CINVESTAV-Saltillo, Ramos Arizpe, Mexico

ABSTRACT
A model-free adaptive control for non-affine discrete time systems is developed by utilising output feedback and action-critic networks. A fuzzy rules emulated network (FREN) is employed as the action network, and its multi-input version (MiFREN) is implemented as the critic network. Both networks are constructed from human knowledge about the controlled plant in the form of IF–THEN rules, and the learning laws are established by reinforcement learning without any off-line learning phase. The convergence of the tracking error and of the internal signals is demonstrated theoretically. A numerical simulation and an experimental system are given to validate the proposed scheme.

KEYWORDS
Model-free adaptive control; reinforcement learning; nonlinear discrete time systems; fuzzy neural network; DC motor current control

1. Introduction

Due to the complexity of present-day controlled plants, it is commonly difficult or impossible to establish their mathematical models, especially for discrete time systems [1]. Model-free approaches that use only the input–output data of the controlled plant have therefore been developed [2, 3]. On the other hand, the performance of such controllers depends on the quality and quantity of the data [4]. In many engineering applications it is very difficult to access all state variables, so output feedback remains the preferable scheme [5, 6]. Furthermore, closed-loop analysis and stability approaches have been proposed [7–9] to guarantee the performance of controllers.
From the engineering point of view, the stability of the closed loop is only a basic minimum requirement, even for artificial intelligence controllers [10]. Optimal controllers are therefore more desirable, both for modern applications [11] and from the viewpoint of nature [12]. To ensure the closed-loop performance while optimising a predefined cost function, schemes based on adaptive dynamic programming have been utilised, but mathematical models are required for their iterative learning [13, 14]. With model-free aspects, reinforcement learning (RL) algorithms have been developed to solve optimal control [15, 16] via estimated solutions of the Hamilton–Jacobi–Bellman equation [17, 18]. To mimic the RL process, approaches based on action-critic networks have been derived with artificial neural networks (ANNs), treating the controlled plant as a black box [19, 20]. Nevertheless, even when the mathematical model is unknown, the engineer still has basic human knowledge of the controlled plant, such as 'IF higher output is required THEN more control effort should be supplied'. Thus, the controlled plant can be considered a grey box. To integrate human knowledge in IF–THEN format into the controller, fuzzy logic systems (FLSs) have been utilised in control applications [21], including optimal problems [22].

CONTACT C. Treesatayapun, treesatayapun@gmail.com
© 2021 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
By adding learning ability to FLSs, integrations of FLS and ANN have been developed, such as the fuzzy neural network (FNN) [23] and the fuzzy rules emulated network (FREN) [24, 25]. Thereafter, approaches using FNN and FREN to solve the optimal problem with RL have been proposed [26, 27], where the controlled plants are considered as a class of affine systems. The problem of non-affine systems has been studied in Ref. [28] with a critic-action network approach in which state feedback is used to gain enough information to tune the ANNs.

In this work, an output feedback model-free controller is proposed for plants whose control effort is non-affine with respect to the system dynamics. The controller is designed as an action network called FRENa, with a set of IF–THEN rules established according to the controlled plant. Thereafter, the long-term cost function is estimated by a multi-input version of FREN called MiFRENc, whose IF–THEN rules are established under the general aim of minimising both the tracking error and the control energy. Learning laws are derived with the RL approach to tune all adjustable parameters of FRENa and MiFRENc, aiming to minimise the tracking error and the estimated cost function. Furthermore, a closed-loop analysis by the Lyapunov method demonstrates the convergence of the tracking error and the internal signals.

This paper is organised as follows. Section 2 introduces the class of systems under investigation and the problem formulation. The proposed scheme is presented in Section 3, including the network architectures of FRENa and MiFRENc with their IF–THEN rules and formulations. The learning laws and the closed-loop analysis are derived in Section 4. Section 5 provides the results of the simulation and the experimental system.

2.
Controlled Plant as a Class of Nonlinear Discrete-Time Systems

In this work, the controlled plant is considered as a class of non-affine discrete time systems

y(k + 1) = f(y(k), ..., y(k − n_y), u(k), ..., u(k − n_u)) + d(k),   (1)

where y(k + 1) ∈ R is the plant's output with respect to the control effort u(k) ∈ R, f(·) is an unknown nonlinear function, n_y and n_u are unknown system orders, and d(k) denotes a bounded disturbance such that |d(k)| ≤ d_M. For further analysis, the following assumption is made on the unknown nonlinear function f(·) with respect to the control effort u(k).

Assumption 2.1: The derivative of y(k + 1) with respect to u(k) exists and is bounded such that

0 < g_m ≤ ∂y(k + 1)/∂u(k) ≤ g_M,   (2)

where g_m and g_M are positive constants.

Remark 2.2: Condition (2) indicates that the controlled plant (1) has a positive control direction. This assists the setting of IF–THEN rules relating the change of control effort Δu(k) to the change of output Δy(k + 1). Referring to (2), it is clear that the change of output Δy(k + 1) with respect to the change of control effort Δu(k) can be rewritten as

g_m^d ≤ Δy(k + 1)/Δu(k) ≤ g_M^d,   (3)

where Δu(k) > 0 and g_m^d and g_M^d are constants according to g_m and g_M, respectively. This leads to IF–THEN rules such as 'IF Δu(k) is positive-large, THEN Δy(k + 1) should be positive-large' or 'IF Δu(k) is negative-small, THEN Δy(k + 1) should be negative-small'. By utilising such IF–THEN rules, the adaptive controller based on FRENs is established in the next section.

3. RL Controller

The proposed controller is illustrated by the block diagram in Figure 1. In this work, the plant is a DC motor current control. Only the armature current is measured as the output y(k + 1) (mA), while the control effort u(k) (V) is the voltage fed to the driver unit.
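Since f(·) is unknown, condition (2) is asserted by the designer from physical knowledge rather than computed. For a known benchmark, however, it can be probed numerically. The sketch below (a side check, not part of the paper's derivation) applies a central finite difference to the simulation model used later in Section 5, Eq. (36), over a modest operating region, and reads the empirical gradient bounds off the grid:

```python
import numpy as np

# Benchmark non-affine plant from the paper's simulation study, Eq. (36):
# the control u(k) enters inside the nonlinearity, not as an affine term.
def plant(y, u):
    return np.sin(y) + (5.0 + np.cos(y * u)) * u

# Central finite-difference probe of the control-direction condition (2):
# 0 < g_m <= dy(k+1)/du(k) <= g_M over a modest operating region.
eps = 1e-6
grads = []
for y in np.linspace(-1.0, 1.0, 21):
    for u in np.linspace(-1.0, 1.0, 21):
        g = (plant(y, u + eps) - plant(y, u - eps)) / (2 * eps)
        grads.append(g)

print(round(min(grads), 3), round(max(grads), 3))  # ~4.699 and ~6.0
```

On this region the probe stays strictly positive and below 6, consistent with the choice g_m = 1 and g_M = 6 made in Section 5.1; outside the region, condition (2) would have to be re-checked.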
Thus, the IF–THEN rules mentioned in Section 2 can be rewritten according to the physical nature of the plant: IF we apply a positive-large change of control voltage Δu(k), THEN we should obtain a positive-large change of armature current Δy(k + 1).

Figure 1. Closed-loop system architecture.

According to this knowledge, the action network FRENa is first established to generate the control effort u(k), with the tracking error e(k) as its input, defined as

e(k) = r(k) − y(k),   (4)

where r(k) is the desired trajectory. Second, the critic network MiFRENc is designed to produce the estimated long-term cost function L̂(k) for the controller FRENa. The details of the two networks and their IF–THEN rules are given as follows.

3.1. Controller or Action Network

To construct the action network, IF–THEN rules relating the tracking error e(k) to the control effort u(k) are first established. Basic knowledge suggests that a positive-large e(k) means the output y(k) falls short by a positive-large amount; to compensate, the control effort u(k) should clearly be positive-large. In conclusion, we have 'IF e(k) is positive-large, THEN u(k) should be positive-large'. With seven linguistic levels, this leads to the design of the IF–THEN rules

IF e(k) is NL THEN u(k) is NL,
IF e(k) is NM THEN u(k) is NM,
IF e(k) is NS THEN u(k) is NS,
IF e(k) is Z THEN u(k) is Z,
IF e(k) is PS THEN u(k) is PS,
IF e(k) is PM THEN u(k) is PM,
IF e(k) is PL THEN u(k) is PL,

where the linguistic notations N, P, L, M, S and Z denote negative, positive, large, medium, small and zero, respectively. Employing this set of IF–THEN rules, the network architecture of FRENa is illustrated in Figure 2. According to the network architecture in Figure 2 and the function formulation of FREN in Ref.
[24], the control effort u(k) is determined by

u(k) = β_a^T(k) φ_a(k),   (5)

where

φ_a(k) = [μ_NL(e(k)) μ_NM(e(k)) ··· μ_PL(e(k))]^T   (6)

and

β_a(k) = [β_a^NL(k) β_a^NM(k) ··· β_a^PL(k)]^T.   (7)

Figure 2. Action network or controller based on FREN.

Considering FRENa as a function estimator of the unknown control effort, there exists an ideal control effort u*(k) with ideal parameters β_a* such that

u*(k) = β_a*^T φ_a(k) + ε_a(k),   (8)

where ε_a(k) is a bounded residual error, |ε_a(k)| ≤ ε_aM.

By using the dynamics (1) with the control laws (5) and (8), the tracking error e(k + 1) is rearranged as

e(k + 1) = r(k + 1) − y(k + 1)
         = f(u*) − f(u) − d(k).   (9)

Recalling Assumption 2.1 and using the mean value theorem, the error dynamics (9) can be rewritten as

e(k + 1) = [∂f(x)/∂x]|_{x = u_m(k)} [u*(k) − u(k)] − d(k)
         = g(k)[u*(k) − u(k)] − d(k),   (10)

where

g(k) = ∂f(u_m(k))/∂u_m(k)   (11)

and u_m(k) ∈ [min{u(k), u*(k)}, max{u(k), u*(k)}]. Employing the control laws (8) and (5), this yields

e(k + 1) = g(k)[β_a* − β_a(k)]^T φ_a(k) + g(k)ε_a(k) − d(k).   (12)

Let us define β̃_a(k) = β_a* − β_a(k), d_a(k) = g(k)ε_a(k) − d(k) and

Δ_a(k) = β̃_a^T(k) φ_a(k),   (13)

and we obtain

e(k + 1) = g(k)Δ_a(k) + d_a(k).   (14)

It is worth noting that the tracking error in (14) is a function of β̃_a(k) and of the unknown but bounded d_a(k), with |d_a(k)| ≤ d_aM. This relation is used for the performance analysis afterwards.

3.2. Estimated Cost-Function or Critic Network

In this work, the long-term cost function L(k) is defined over an infinite horizon of the tracking error e(k) and the control effort u(k) with a discount factor γ as

L(k) = Σ_{i=k}^{∞} γ^{i−k} l(i),   (15)

where

l(k) = p e²(k) + q u²(k),   (16)

with p and q positive constants and 0 < γ ≤ 1. L(k) is thus a function of two input arguments through the quadratic terms of e(k) and u(k).
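Before turning to the critic, a minimal numerical sketch of the FRENa computation (5)–(7) may help fix ideas: one membership grade per linguistic label of e(k), linearly combined by the adjustable consequence weights. The Gaussian membership shape and the centers and width below are illustrative assumptions (the paper only fixes the seven labels NL, ..., PL); the consequence weights are the Table 2 initialisation:

```python
import numpy as np

# Illustrative membership layout for the seven labels NL, NM, NS, Z, PS, PM, PL.
# Gaussian shapes, centers and width are assumptions for this sketch.
centers = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
width = 1.0

def phi_a(e):
    """phi_a(k) of Eq. (6): membership grades of the tracking error e(k)."""
    return np.exp(-((e - centers) / width) ** 2)

# Initial consequence weights beta_a(1), taken from Table 2 (simulation case).
beta_a = np.array([-0.25, -0.15, -0.05, 0.0, 0.05, 0.15, 0.25])

def fren_a(e, beta):
    """Control effort u(k) = beta_a^T(k) phi_a(k), Eq. (5)."""
    return float(beta @ phi_a(e))

# Symmetric weights give ~0 effort at e = 0, positive effort for e > 0.
print(fren_a(0.0, beta_a), fren_a(1.0, beta_a))
```

The antisymmetric weight initialisation makes the controller act like a smooth, rule-shaped nonlinear gain on the tracking error, which is exactly the 'positive control direction' knowledge of Remark 2.2.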
Thus, the adaptive network MiFRENc is utilised to estimate L(k), as in the block diagram in Figure 1. To design MiFRENc, the IF–THEN rules are first established as in Table 1, and the network architecture of MiFRENc is illustrated in Figure 3. By utilising the network in Figure 3 and the results in Ref. [24], the estimated cost function L̂(k) is determined by

L̂(k) = β_c^T(k) φ_c(k),   (17)

where

β_c(k) = [β_c,1(k) β_c,2(k) ··· β_c,9(k)]^T   (18)

and

φ_c(k) = [φ_1(k) φ_2(k) ··· φ_9(k)]^T.   (19)

Using the universal approximation property of MiFREN [24], there exists an ideal parameter β_c* such that

L(k) = β_c*^T φ_c(k) + ε_c(k),   (20)

where ε_c(k) is a bounded residual error such that |ε_c(k)| ≤ ε_cM. Adding and subtracting β_c*^T φ_c(k) on the right-hand side of (17) yields

L̂(k) = β̃_c^T(k) φ_c(k) + β_c*^T φ_c(k)
     = Δ_c(k) + β_c*^T φ_c(k),   (21)

where β̃_c(k) = β_c(k) − β_c* and Δ_c(k) = β̃_c^T(k) φ_c(k).

Table 1. MiFRENc: IF–THEN rules (each cell gives the rule's firing strength φ_i and the linguistic consequence of L(k)).

u(k) \ e(k)    Z         S         L
Z              φ_1; Z    φ_4; S    φ_7; L
S              φ_2; Z    φ_5; S    φ_8; L
L              φ_3; Z    φ_6; S    φ_9; L

Figure 3. Estimated cost function or critic network.

To improve the performance of FRENa and MiFRENc, the learning laws are developed in the next section.

4. Learning Algorithms and Performance Analysis

4.1. Action Network Learning Law

Considering the tracking error through Δ_a(k) in (14) and the estimated cost function L̂(k), the error function of the action network is given as

e_a(k) = [g(k)Δ_a(k) + L̂(k)] / g(k).   (22)

Thereafter, the cost function to be minimised is

E_a(k) = ½ e_a²(k).   (23)

Applying gradient descent, the tuning law for β_a is derived as

β_a(k + 1) = β_a(k) − η_a ∂E_a(k)/∂β_a(k),   (24)

where η_a is the learning rate. By using the chain rule and (13), it yields

∂E_a(k)/∂β_a(k) = [∂E_a(k)/∂e_a(k)] [∂e_a(k)/∂Δ_a(k)] [∂Δ_a(k)/∂β_a(k)]
               = −e_a(k) g(k) φ_a(k).   (25)
Recalling (24) with (25) and using e_a(k) in (22), it leads to

β_a(k + 1) = β_a(k) + η_a e_a(k) g(k) φ_a(k)
           = β_a(k) + η_a { [g(k)Δ_a(k) + L̂(k)] / g(k) } g(k) φ_a(k)
           = β_a(k) + η_a [g(k)Δ_a(k) + L̂(k)] φ_a(k).   (26)

By eliminating d_a(k) in (14), the learning law (26) is rewritten as

β_a(k + 1) = β_a(k) + η_a [e(k + 1) + L̂(k)] φ_a(k).   (27)

The final learning law (27) of FRENa is practical because all quantities required on the right-hand side are available at time index k + 1.

4.2. Critic Network Learning Law

In general, the error function of a critic network employs the estimated cost function L̂(k). In this work, the error function e_c(k) is given as

e_c(k) = δL̂(k) − L̂(k − 1) + l(k),   (28)

where δ is a positive constant. To tune β_c, the cost function E_c(k) is defined as

E_c(k) = ½ e_c²(k).   (29)

Applying gradient descent to (29) with respect to β_c(k), we have

β_c(k + 1) = β_c(k) − η_c ∂E_c(k)/∂β_c(k),   (30)

where η_c is the learning rate. Using the chain rule along E_c(k) in (29), e_c(k) in (28) and L̂(k) in (17), it yields

∂E_c(k)/∂β_c(k) = [∂E_c(k)/∂e_c(k)] [∂e_c(k)/∂L̂(k)] [∂L̂(k)/∂β_c(k)]
               = e_c(k) δ φ_c(k).   (31)

Rewriting (30) with (31) leads to

β_c(k + 1) = β_c(k) − η_c e_c(k) δ φ_c(k)
           = β_c(k) − η_c δ [l(k) − L̂(k − 1) + δL̂(k)] φ_c(k).   (32)

Finally, we have a practical tuning law for MiFRENc.

Figure 4. FRENa membership functions: simulation case.

Figure 5. MiFRENc membership functions: simulation case.

Table 2. Initial setting β(1): simulation system.

FRENa                        MiFRENc
Parameter      Value         Parameter     Value
β_a^NL(1)     −0.25          β_c,1(1)      0
β_a^NM(1)     −0.15          β_c,2(1)      0.1
β_a^NS(1)     −0.05          β_c,3(1)      0.2
β_a^Z(1)       0             β_c,4(1)      0.3
β_a^PS(1)      0.05          β_c,5(1)      0.4
β_a^PM(1)      0.15          β_c,6(1)      0.5
β_a^PL(1)      0.25          β_c,7(1)      0.6
                             β_c,8(1)      0.7
                             β_c,9(1)      0.8

Figure 6. Tracking performance y(k) and e(k): simulation system.

4.3.
Closed-Loop Analysis

In the following theorem, the closed-loop performance of the output feedback controller is demonstrated: the tracking error and the internal signals are bounded.

Theorem 4.1: For the non-affine discrete time system of Section 2, the performance of the closed-loop system configured by the structure of FRENa and MiFRENc in Section 3 is guaranteed, in terms of a bounded tracking error and bounded internal signals, when the design parameters are selected as follows:

1/2 < δ ≤ 1,   (33)

0 < η_a ≤ g_m / (ν_a² g_M²)   (34)

and

0 < η_c ≤ 1 / (ν_c² δ²),   (35)

where ν_a and ν_c are the upper limits of ||φ_a(k)|| and ||φ_c(k)||, respectively.

Proof: The proof is given in the Appendix.

Figure 7. Control effort u(k): simulation system.

Figure 8. Estimated cost function L̂(k): simulation system.

Figure 9. u(k) and e(k): simulation system.

Figure 10. FRENa membership functions: experimental system.

Figure 11. MiFRENc membership functions: experimental system.

Table 3. Initial setting β(1): experimental system.

FRENa                        MiFRENc
Parameter      Value         Parameter     Value
β_a^NL(1)     −3.25          β_c,1(1)      0
β_a^NM(1)     −2.5           β_c,2(1)      0.1
β_a^NS(1)     −1.05          β_c,3(1)      0.2
β_a^Z(1)       0             β_c,4(1)      0.3
β_a^PS(1)      1.05          β_c,5(1)      0.4
β_a^PM(1)      2.5           β_c,6(1)      0.5
β_a^PL(1)      3.25          β_c,7(1)      0.6
                             β_c,8(1)      0.7
                             β_c,9(1)      0.8

The validation of the proposed control scheme is presented in the next section, with a computer simulation of a non-affine discrete time system and a hardware implementation on a DC motor current control plant.

5. Simulation and Experimental Systems

5.1. Simulation System and Results

The controller developed in this work is first implemented on the nonlinear discrete time system

y(k + 1) = sin(y(k)) + [5 + cos(y(k)u(k))] u(k).   (36)

It is worth mentioning that the mathematical model (36) is used only to establish the simulation.
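The bounds (34) and (35) turn into concrete learning-rate limits once δ, ν_a, ν_c, g_m and g_M are fixed. A quick arithmetic check with the design values used in Section 5 (δ = 0.75, ν_a = ν_c = 1.5; g_m = 1, g_M = 6 for the simulation and g_m = 5, g_M = 10 for the experiment):

```python
# Learning-rate limits from the bounds (34)-(35):
#   eta_a <= g_m / (nu_a**2 * g_M**2),  eta_c <= 1 / (nu_c**2 * delta**2).
delta, nu_a, nu_c = 0.75, 1.5, 1.5

# Simulation case: g_m = 1, g_M = 6.
eta_c_max = 1.0 / (nu_c**2 * delta**2)
eta_a_max = 1.0 / (nu_a**2 * 6.0**2)
print(round(eta_c_max, 4), round(eta_a_max, 4))  # 0.7901 and 0.0123

# Experimental case: g_m = 5, g_M = 10.
print(round(5.0 / (nu_a**2 * 10.0**2), 4))       # 0.0222
```

The selected rates η_c = 0.5 and η_a = 0.01 sit safely inside these limits in both cases.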
In this test, the desired trajectory is given as

r(k + 1) = A_r sin(ω_r π k / k_M),   (37)

where k_M = 500 is the maximum time index, A_r = 1.0 and ω_r = 8. Following (33), δ is selected as δ = 0.75, and ν_a = ν_c = 1.5. By this setting and (35), the learning rate of MiFRENc is bounded as

0 < η_c ≤ 1/(δ² ν_c²) = 1/(0.75² × 1.5²) = 0.7901.   (38)

In this case, the learning rate for MiFRENc is selected as η_c = 0.5. To select the learning rate of FRENa, let us choose g_m and g_M as 1 and 6, respectively. By (34), the learning rate of FRENa is bounded as

0 < η_a ≤ g_m/(ν_a² g_M²) = 1/(1.5² × 6²) = 0.0123.   (39)

Thus, the learning rate for FRENa is selected as η_a = 0.01. Figures 4 and 5 illustrate the membership functions of FRENa and MiFRENc, respectively. The initial adjustable parameters β(1) of FRENa and MiFRENc are given in Table 2. Figure 6 displays the tracking performance with plots of both y(k) and e(k), and Figure 7 represents the control effort u(k). The estimated cost function L̂(k) is illustrated in Figure 8. The phase-plane trajectory of u(k) and e(k) is depicted in Figure 9 to demonstrate the closed-loop system's behaviour.

5.2. Experimental System and Results

The experimental system is a DC motor current control. The output y(k + 1) is the armature current (mA) and the input u(k) is the control voltage applied to the driver circuit depicted in Figure 1. As in the simulation, let us select δ = 0.75 and ν_a = ν_c = 1.5, and now g_m = 5 and g_M = 10. Thus, the learning rate of FRENa is bounded as

0 < η_a ≤ g_m/(ν_a² g_M²) = 5/(1.5² × 10²) = 0.0222.   (40)

In this case, we select η_a = 0.01. For MiFRENc, we use the same learning rate as in the simulation, η_c = 0.5, because the network architecture is the same.

Figure 12. Tracking performance y(k) and e(k): experimental system.

Figure 13. Control effort u(k): experimental system.

Figure 14. Estimated cost function L̂(k): experimental system.

Figure 15. u(k) and e(k): experimental system.

Figure 16. Pulse response: experimental system.

The desired trajectory is given as

r(k + 1) = I_r sin(ω_r π k / k_M),   (41)

where

I_r = 15 (mA) if 0 ≤ k < k_M/2, and 30 (mA) otherwise,   (42)

ω_r = 8 if 0 ≤ k < k_M/2, and 4 otherwise,   (43)

and k_M = 2000. Figures 10 and 11 present the membership functions of FRENa and MiFRENc, respectively. All adjustable parameters β(1) of FRENa and MiFRENc are initialised as in Table 3. Figure 12 displays the motor current y(k) and the tracking error e(k), demonstrating the performance of the closed-loop system. The maximum absolute tracking error is |e(k)|_max = 48.2936 (mA), and the average absolute tracking error at steady state (k = 1500–2000) is 0.4924 (mA). Figure 13 shows the control effort u(k), and the estimated cost function L̂(k) is illustrated in Figure 14. The phase-plane trajectory of u(k) and e(k) is plotted in Figure 15; a large variation is observed because of the back-EMF. To evaluate the proposed scheme under the back-EMF, a pulse-train trajectory is also implemented, with the response displayed in Figure 16. It is clear that the effect of the back-EMF is eliminated within the second pulse (B).

6. Conclusions

A model-free adaptive control for a class of non-affine discrete time systems has been developed by RL. The closed-loop system has been established by output feedback with two adaptive networks, FRENa and MiFRENc. The initial settings of FRENa and MiFRENc have been constructed from human knowledge of the controlled plant in the format of IF–THEN rules. The performance has been enhanced by the learning laws of both FRENa and MiFRENc, while the tracking error and the internal signals have been guaranteed to converge over reasonable compact sets.
The numerical simulation and the experimental results have been presented to verify the theoretical conjecture.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Funding

This work has been supported by Fundamental Research Funds for CINVESTAV-IPN and the Mexican Research Organization CONACyT [grant number 257253].

Notes on contributor

C. Treesatayapun received the Ph.D. in electrical engineering from Chiang-Mai University, Thailand, in 2004. He was a production engineer at SAGA Electronics (JRC-NJR) from 1998 to 2000 and head of the electrical engineering program at North Chiang-Mai University, Thailand, from 2001 to 2007. He is currently a senior researcher at the Department of Robotic and Advanced Manufacturing, Mexican Research Center and Advanced Technology, CINVESTAV-IPN, Saltillo campus, Mexico. His current research interests include automation and robotic system control and optimization, adaptive and learning algorithms, and electric machine drives.

ORCID

C. Treesatayapun http://orcid.org/0000-0002-8574-672X

References

[1] Hou ZS, Wang Z. From model-based control to data-driven control: survey, classification and perspective. Inf Sci. 2013;235:3–35.
[2] Zhu Y, Hou ZS. Data-driven MFAC for a class of discrete-time nonlinear systems with RBFNN. IEEE Trans Neural Netw Learn Syst. 2014;25(5):1013–1020.
[3] Wang X, Li X, Wang J, et al. Data-driven model-free adaptive sliding mode control for the multi degree-of-freedom robotic exoskeleton. Inf Sci. 2016;327:246–257.
[4] Lin N, Chi R, Huang B. Data-driven recursive least squares methods for non-affined nonlinear discrete-time systems. Appl Math Modell. 2020;81:787–798.
[5] Kaldmae A, Kotta U. Input–output linearization of discrete-time systems by dynamic output feedback. Eur J Control. 2014;20:73–78.
[6] Treesatayapun C. Data input–output adaptive controller based on IF–THEN rules for a class of non-affine discrete-time systems: the robotic plant. J Intell Fuzzy Syst. 2015;28:661–668.
[7] Liu YJ, Tong S. Adaptive NN tracking control of uncertain nonlinear discrete-time systems with nonaffine dead-zone input. IEEE Trans Cybern. 2015;45(3):497–505.
[8] Zhang CL, Li JM. Adaptive iterative learning control of non-uniform trajectory tracking for strict feedback nonlinear time-varying systems with unknown control direction. Appl Math Model. 2015;39:2942–2950.
[9] Precup RE, Radac MB, Roman RC, et al. Model-free sliding mode control of nonlinear systems: algorithms and experiments. Inf Sci. 2017;381:176–192.
[10] Raj R, Mohan BM. Stability analysis of general Takagi–Sugeno fuzzy two-term controllers. Fuzzy Inf Eng. 2018;10(2):196–212.
[11] Zhang X, Zhang HG, Sun QY, et al. Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence. Neurocomputing. 2012;35:48–55.
[12] Eftekhari M, Zeinalkhani M. Extracting interpretable fuzzy models for nonlinear systems using gradient-based continuous ant colony optimization. Fuzzy Inf Eng. 2013;5(3):255–277.
[13] Liu D, Wang D, Yang X. An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci. 2013;220(20):331–342.
[14] Jiang H, Zhang H. Iterative ADP learning algorithms for discrete-time multi-player games. Artif Intell Rev. 2018;50(1):75–91.
[15] Liu D, Wang D, Zhao D, et al. Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming. IEEE Trans Autom Sci Eng. 2012;9(3):628–634.
[16] Kiumarsi B, Lewis FL, Modares H, et al. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica. 2014;50(4):1167–1175.
[17] Yang Q, Jagannathan S. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Trans Syst Man Cybern B Cybern. 2012;42(2):377–390.
[18] Ha M, Wang D, Liu D. Event-triggered constrained control with DHP implementation for nonaffine discrete-time systems. Inf Sci. 2020;519:110–123.
[19] Xu B, Yang C, Shi Z. Reinforcement learning output feedback NN control using deterministic learning technique. IEEE Trans Neural Netw Learn Syst. 2014;25(3):635–641.
[20] Liu YJ, Li S, Tong S, et al. Adaptive reinforcement learning control based on neural approximation for nonlinear discrete-time systems with unknown nonaffine dead-zone input. IEEE Trans Neural Netw Learn Syst. 2019;30(1):295–305.
[21] Allam E, Elbab HF, Hady MA, et al. Vibration control of active vehicle suspension system using fuzzy logic algorithm. Fuzzy Inf Eng. 2010;2(4):361–387.
[22] Niftiyev AA, Zeynalov CI, Poormanuchehri M. Fuzzy optimal control problem with non-linear functional. Fuzzy Inf Eng. 2011;3(3):311–320.
[23] Fei J, Wang T. Adaptive fuzzy-neural-network based on RBFNN control for active power filter. Int J Mach Learn Cybern. 2019;10:1139–1150.
[24] Treesatayapun C, Uatrongjit S. Adaptive controller with fuzzy rules emulated structure and its applications. Eng Appl Artif Intell. 2005;18:603–615.
[25] Treesatayapun C. Adaptive control based on IF–THEN rules for grasping force regulation with unknown contact mechanism. Robot Comput Integr Manuf. 2014;30:11–18.
[26] Abouheaf M, Gueaieb W. Neurofuzzy reinforcement learning control schemes for optimized dynamical performance. 2019 IEEE International Symposium on Robotic and Sensors Environments (ROSE); Ontario, Canada; 2019 June. p. 17–18.
[27] Treesatayapun C. Fuzzy-rule emulated networks based on reinforcement learning for nonlinear discrete-time controllers. ISA Trans. 2008;47:362–373.
[28] Wei Q, Lewis FL, Sun Q, et al. Discrete-time deterministic Q-learning: a novel convergence analysis. IEEE Trans Cybern. 2017;47(5):1224–1237.

Appendix 1.
Proof of Theorem 4.1

Let us refer to the standard Lyapunov function

V(k) = V_1(k) + V_2(k) + V_3(k) + V_4(k)
     = ρ_1 e²(k) + (ρ_2/η_a) β̃_a^T(k)β̃_a(k) + (ρ_3/η_c) β̃_c^T(k)β̃_c(k) + ρ_4 Δ_c²(k − 1),   (A1)

where ρ_1, ρ_2, ρ_3 and ρ_4 are positive constants satisfying the following conditions:

ρ_1 > (p/4) ρ_3,   (A2)

ρ_2 > [2ρ_1 g_M² + (ρ_3/8) q] / g_m,   (A3)

ρ_3 > ρ_4 / δ²   (A4)

and

ρ_4 > ρ_3 / 4.   (A5)

Utilising (14), the first difference ΔV_1(k) is obtained as

ΔV_1(k) = ρ_1 [e²(k + 1) − e²(k)]
        = ρ_1 [(g(k)Δ_a(k) + d_a(k))² − e²(k)]
        ≤ ρ_1 [2g²(k)Δ_a²(k) + 2d_a²(k) − e²(k)]
        ≤ −ρ_1 e²(k) + 2ρ_1 g_M² Δ_a²(k) + 2ρ_1 d_aM².   (A6)

Recalling the tuning law (26), ΔV_2(k) is expressed as

ΔV_2(k) = (ρ_2/η_a) [β̃_a^T(k + 1)β̃_a(k + 1) − β̃_a^T(k)β̃_a(k)]
        = −2ρ_2 [g(k)Δ_a(k) + L̂(k)] β̃_a^T(k)φ_a(k) + ρ_2 η_a [g(k)Δ_a(k) + L̂(k)]² φ_a^T(k)φ_a(k)
        = −2ρ_2 g(k)Δ_a²(k) − 2ρ_2 Δ_a(k)L̂(k) + ρ_2 η_a ||φ_a(k)||² [g(k)Δ_a(k) + L̂(k)]².   (A7)

Applying the lower and upper bounds of g(k) and completing the square, it leads to

ΔV_2(k) ≤ −ρ_2 g_m Δ_a²(k) − ρ_2 [g_m − η_a ||φ_a(k)||² g_M²] Δ_a²(k)
          − 2ρ_2 [1 − η_a ||φ_a(k)||² g(k)] Δ_a(k)L̂(k) + ρ_2 η_a ||φ_a(k)||² L̂²(k)
        ≤ −ρ_2 g_m Δ_a²(k)
          − ρ_2 [g_m − η_a ||φ_a(k)||² g_M²] ( Δ_a(k) + [1 − η_a ||φ_a(k)||² g(k)] L̂(k) / [g_m − η_a ||φ_a(k)||² g_M²] )²
          + ρ_2 ( [1 − η_a ||φ_a(k)||² g_m] / [g_m − η_a ||φ_a(k)||² g_M²] ) L̂²(k).   (A8)

By the learning law of MiFRENc in (32), ΔV_3(k) is derived as

ΔV_3(k) = (ρ_3/η_c) [β̃_c^T(k + 1)β̃_c(k + 1) − β̃_c^T(k)β̃_c(k)]
        = −2ρ_3 δ e_c(k) β̃_c^T(k)φ_c(k) + ρ_3 η_c δ² e_c²(k) ||φ_c(k)||²
        = −2ρ_3 δ Δ_c(k) e_c(k) + ρ_3 η_c δ² ||φ_c(k)||² e_c²(k).   (A9)

Recalling e_c(k) in (28) with ±δL(k) and ±L(k − 1) and using (17) and (20), it yields

e_c(k) = δ[L̂(k) − L(k)] + δL(k) − [L̂(k − 1) − L(k − 1)] − L(k − 1) + l(k)
       = δ[β_c^T(k)φ_c(k) − β_c*^T φ_c(k) − ε_c(k)] − [β_c^T(k − 1)φ_c(k − 1) − β_c*^T φ_c(k − 1) − ε_c(k − 1)]
         + δL(k) − L(k − 1) + l(k)
       = δΔ_c(k) − Δ_c(k − 1) + δL(k) − L(k − 1) + l(k) − δε_c(k) + ε_c(k − 1)   (A10)

or

δΔ_c(k) = e_c(k) + Δ_c(k − 1) − δL(k) + L(k − 1) − l(k) + δε_c(k) − ε_c(k − 1).   (A11)

By using (A11) and (16), (A9) can be derived as

ΔV_3(k) = −2ρ_3 e_c(k) [e_c(k) + Δ_c(k − 1) − δL(k) + L(k − 1) − l(k) + δε_c(k) − ε_c(k − 1)]
          + ρ_3 η_c δ² ||φ_c(k)||² e_c²(k)
        = −ρ_3 [1 − η_c δ² ||φ_c(k)||²] e_c²(k) − ρ_3 δ² Δ_c²(k)
          + ρ_3 [δL(k) − Δ_c(k − 1) − L(k − 1) + l(k) − δε_c(k) + ε_c(k − 1)]²
        ≤ −ρ_3 [1 − η_c δ² ||φ_c(k)||²] e_c²(k) − ρ_3 δ² Δ_c²(k) + (ρ_3/4) Δ_c²(k − 1)
          + (ρ_3/4) p e²(k) + (ρ_3/8) q Δ_a²(k) + (ρ_3/8) q ||β_a*^T φ_a(k)||²
          + (ρ_3/4) [δL(k) − L(k − 1)]² + ρ_3 (1 + δ)² ε_cM².   (A12)

Finally, ΔV_4(k) is formulated as

ΔV_4(k) = ρ_4 [Δ_c²(k) − Δ_c²(k − 1)].   (A13)

Recalling (A6), (A8), (A12) and (A13), ΔV(k) is rewritten as

ΔV(k) ≤ −[ρ_1 − (ρ_3/4) p] e²(k) − [ρ_2 g_m − 2ρ_1 g_M² − (ρ_3/8) q] Δ_a²(k)
        − [ρ_3 δ² − ρ_4] Δ_c²(k) − [ρ_4 − ρ_3/4] Δ_c²(k − 1)
        − ρ_3 [1 − η_c δ² ||φ_c(k)||²] e_c²(k)
        − ρ_2 [g_m − η_a ||φ_a(k)||² g_M²] ( Δ_a(k) + [1 − η_a ||φ_a(k)||² g(k)] L̂(k) / [g_m − η_a ||φ_a(k)||² g_M²] )²
        + V_M
      = −V_e e²(k) − V_a Δ_a²(k) − V_c0 Δ_c²(k) − V_c1 Δ_c²(k − 1) − V_c e_c²(k)
        − ρ_2 [g_m − η_a ||φ_a(k)||² g_M²] ( · )² + V_M,   (A14)

where

V_M ≥ 2ρ_1 d_aM² + ρ_3 (1 + δ)² ε_cM² + (ρ_3/8) q ν_a² β_aM² + (ρ_3/4) (1 + δ)² L_M² + (ρ_2/g_m) L_M²,   (A15)

with β_aM = ||β_a*|| and L_M the upper bound of the cost function L(k), and

V_e = ρ_1 − (ρ_3/4) p,   (A16)

V_a = ρ_2 g_m − 2ρ_1 g_M² − (ρ_3/8) q,   (A17)

V_c0 = ρ_3 δ² − ρ_4,   (A18)

V_c1 = ρ_4 − ρ_3/4   (A19)

and

V_c = ρ_3 [1 − η_c δ² ||φ_c(k)||²].   (A20)

According to the conditions (A2)–(A5), V_e, V_a, V_c0 and V_c1 are always positive. Furthermore, by the setting of the membership functions of FRENa and MiFRENc, upper limits exist such that

0 < ||φ_a(k)|| ≤ ν_a   (A21)

and

0 < ||φ_c(k)|| ≤ ν_c.   (A22)

Combining these with (34) and (35) leads to

g_m − η_a ||φ_a(k)||² g_M² ≥ 0   (A23)

and

1 − η_c δ² ||φ_c(k)||² ≥ 0,   (A24)

so the completed-square term in (A14) is non-positive and V_c ≥ 0. By this means, ΔV(k) < 0 whenever

|e(k)| ≥ √(V_M / V_e),   (A25)

|Δ_a(k)| ≥ √(V_M / V_a)   (A26)

and

|Δ_c(k)| ≥ √(V_M / V_c0).   (A27)

Hence the tracking error e(k) and the internal signals Δ_a(k) and Δ_c(k) are uniformly ultimately bounded. The proof is completed here.
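The feasibility of conditions (A2)–(A5) is easy to check numerically: any ρ_1, ..., ρ_4 in the admissible region makes the coefficients of (A16)–(A19) strictly positive. In the sketch below, p, q and the sample ρ values are illustrative assumptions, and the coefficients are read as V_e = ρ_1 − (ρ_3/4)p, V_a = ρ_2 g_m − 2ρ_1 g_M² − (ρ_3/8)q, V_c0 = ρ_3 δ² − ρ_4 and V_c1 = ρ_4 − ρ_3/4; the theorem only requires the inequalities to hold.

```python
# Numerical feasibility check of the parameter conditions (A2)-(A5).
# p, q and the rho values below are illustrative assumptions.
p, q = 1.0, 1.0
g_m, g_M, delta = 1.0, 6.0, 0.75           # simulation-case values, delta > 1/2

rho3 = 1.0
rho4 = 0.4                                  # rho3/4 < rho4 < rho3*delta**2  (A4), (A5)
rho1 = 0.5                                  # rho1 > p*rho3/4                (A2)
rho2 = 40.0                                 # rho2 > (2*rho1*g_M**2 + rho3*q/8)/g_m  (A3)

V_e  = rho1 - (rho3 / 4) * p                # coefficient of e^2(k)
V_a  = rho2 * g_m - 2 * rho1 * g_M**2 - (rho3 / 8) * q   # coefficient of Delta_a^2(k)
V_c0 = rho3 * delta**2 - rho4               # coefficient of Delta_c^2(k)
V_c1 = rho4 - rho3 / 4                      # coefficient of Delta_c^2(k-1)
print(V_e, V_a, V_c0, V_c1)                 # all strictly positive
```

Note that (A4) and (A5) together force δ² > 1/4, which is why the theorem restricts δ to the interval (1/2, 1].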
Published: Jul 3, 2021