Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach machines Article Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach Ya-Ling Chen, Yan-Rou Cai and Ming-Yang Cheng * Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan * Correspondence: [email protected] Abstract: This paper focuses on developing a robotic object grasping approach that possesses the ability of self-learning, is suitable for small-volume large variety production, and has a high success rate in object grasping/pick-and-place tasks. The proposed approach consists of a computer vision- based object detection algorithm and a deep reinforcement learning algorithm with self-learning capability. In particular, the You Only Look Once (YOLO) algorithm is employed to detect and classify all objects of interest within the field of view of a camera. Based on the detection/localization and classification results provided by YOLO, the Soft Actor-Critic deep reinforcement learning algorithm is employed to provide a desired grasp pose for the robot manipulator (i.e., learning agent) to perform object grasping. In order to speed up the training process and reduce the cost of training data collection, this paper employs the Sim-to-Real technique so as to reduce the likelihood of damaging the robot manipulator due to improper actions during the training process. The V-REP platform is used to construct a simulation environment for training the deep reinforcement learning neural network. Several experiments have been conducted and experimental results indicate that the 6-DOF industrial manipulator successfully performs object grasping with the proposed approach, even for the case of previously unseen objects. Keywords: 6-DOF industrial manipulator; deep reinforcement learning; soft actor-critic; robotic object grasping; YOLO Citation: Chen, Y.-L.; Cai, Y.-R.; Cheng, M.-Y. Vision-Based Robotic Object Grasping—A Deep 1. Introduction Reinforcement Learning Approach. In most conventional approaches for vision-based pick-and-place tasks of industrial Machines 2023, 11, 275. manipulators used in a production line, a 3D model of the object to be grasped must be https://doi.org/10.3390/ known in advance. With the known 3D model, one can either analyze the geometric shape machines11020275 of the object and find a proper way for the industrial manipulator to grasp that object or Academic Editor: Shabnam exploit methods such as feature matching and shape recognition to find an appropriate Sadeghi Esfahlani pose for the industrial manipulator to perform object grasping as well as pick-and-place tasks. However, this kind of approach is sensitive to the illumination conditions and other Received: 5 January 2023 types of disturbances in the ambient environment. If the 3D model of the object to be Revised: 8 February 2023 Accepted: 10 February 2023 grasped is not known in advance or if there are a variety of objects to be grasped, the Published: 12 February 2023 aforementioned conventional approaches may fail. With the machine learning paradigm becoming popular, more and more research has been focused on applying the deep learning technique to deal with automatic object grasping tasks [1]. For example, Johns et al. [2] used a deep neural network to predict a score for the grasp pose of a parallel jaw gripper Copyright: © 2023 by the authors. for each object in a depth image, through which a physical simulator was employed to Licensee MDPI, Basel, Switzerland. obtain simulated depth images of objects as the training data set. Lenz et al. [3] used two This article is an open access article deep neural networks to detect robotic grasps from images captured by an RGBD camera. distributed under the terms and One deep neural network having a simpler structure and requiring fewer computation conditions of the Creative Commons resources was mainly used to retrieve candidate bounding rectangles for grasping. Another Attribution (CC BY) license (https:// deep neural network was used to rank the candidate bounding rectangles for a parallel creativecommons.org/licenses/by/ gripper [3]. In [4], 700 h were spent collecting data from 50,000 grasping attempts of robot 4.0/). Machines 2023, 11, 275. https://doi.org/10.3390/machines11020275 https://www.mdpi.com/journal/machines Machines 2023, 11, 275 2 of 19 manipulators, and a Convolutional Neural Network (CNN) was combined with a multi- stage learning approach to predict an appropriate grasping pose for robot manipulators. In [5], Levine et al. exploited the deep learning paradigm to train fourteen 7-DOF robot manipulators to perform object grasping using RGB images. A total of 800,000 grasp attempts by robot manipulators were recorded within two months to train the deep neural network. Experimental results indicate that the robot manipulator can successfully grasp 1100 objects of different sizes and shapes. Goldberg and his colleagues have done a series of studies on robot grasping using deep learning [6–9]. Mahler et al. proposed the Dex-Net 1.0 deep learning system for robot grasping [6]. More than 10,000 independent 3D models and 2.5 million samples of grasping data for the parallel gripper are used in Dex-Net 1.0. In order to shorten the training time, 1500 virtual cores in the Google cloud platform are used. In 2019, Mahler et al. proposed the Dex-Net 4.0 [9], for which five million depth images had been trained by a GQ-CNN. After the training is complete, the dual arm robot with a suction nozzle and a parallel-jaw gripper is able to empty a bin with an average grasping rate of 300 objects/hour [9]. Several past studies utilized CNN to produce suitable grasping poses to perform object grasping tasks [10–12]. All of the aforementioned studies demonstrated good performance in automatic object grasping, even for cases in which the objects to be grasped did not appear in the training data set. However, the subjects of these past studies all have common drawbacks, in that they are very time consuming and not cost effective in generating a grasping data set for training the deep neural network. Recently, the research topic of exploiting reinforcement learning in training robot manipulators to perform object grasping has received much attention [13–15]. Gualtieri et al. used deep reinforcement learning algorithms to solve the robotic pick-and-place problems for cases in which the geometrical models of the objects to be grasped are unknown [16]. In particular, using the deep reinforcement learning algorithm, the robot manipulator is able to determine a suitable pose (i.e., optimal action) to grasp certain types of objects. In [17], the image of an arbitrary pose of an object is used as an input to a distributed reinforcement learning algorithm. After learning, the robot is able to perform grasping tasks for objects that are either occluded or previously unseen. Deep reinforcement learning is also used in training robotic pushing/grasping/picking [18,19]. Kalashnikov et al. developed the QT-Opt algorithm and focused on scalable, off-policy deep reinforcement learning [20]. Seven real robot manipulators were used to perform and record more than 580 k grasp attempts for training a deep neural network. Once the learning process is complete, the real robot manipulator can successfully perform grasping, even for previously unseen objects [20]. Chen and Dai used a CNN to detect the image features of an object. Based on the detected image features of the object of interest, a deep Q-learning algorithm was used to determine the grasp pose corresponding to that object [21]. In [22], Chen et al. used a Mask R-CNN and PCA to estimate the 3D pose of objects to be grasped. Based on the estimated 3D object pose, a deep reinforcement learning algorithm is employed to train the control policy in a simulated environment. Once the learning process is complete, one can deploy the learned model to the real robot manipulator without further training. This paper proposes an object grasping approach that combines the YOLO algo- rithm [23–26] and the Soft Actor-Critic (SAC) algorithm [27,28]. It is well known that YOLO is capable of rapidly detecting, localizing and recognizing objects in an image. In particular, YOLO can find the location of the object of interest inside the field of view of a camera and use this location information as the input to a reinforcement learning algorithm. Since the search of an entire image in not essential, training time can therefore be substantially reduced. SAC is based on the Actor-Critic framework and exploits Off-Policy to improve the sample efficiency. SAC maximizes the expected return as well as the entropy of policy simultaneously. Since SAC exhibits excellent performance and is suitable for real-world applications, this paper employs SAC to train the robot manipulator to perform object grasping through self-learning. Machines 2023, 11, x FOR PEER REVIEW 3 of 20 Machines 2023, 11, 275 3 of 19 applications, this paper employs SAC to train the robot manipulator to perform object grasping through self-learning. 2. Framework 2. Framework This This pa paper per develops a develops a robot robotic ic object object gra grasping sping techn technique ique th that at comb combines ines compute computer r vi- vision-based sion-based ob object ject de detection/r tection/recogni ecognition/localization tion/localization anand d a d aeep re deep inforcement reinforcement lear learning ning al- algorithm with self-learning capability. Figure 1 shows the schematic diagram of the gorithm with self-learning capability. Figure 1 shows the schematic diagram of the robotic robotic pick-and-place system developed in this paper. As shown in Figure 1, YOLO will pick-and-place system developed in this paper. As shown in Figure 1, YOLO will detect detect the object of interest from the image captured by the camera. SAC will provide the the object of interest from the image captured by the camera. SAC will provide the desired desired grasping point in the image plane based on the depth image information of the grasping point in the image plane based on the depth image information of the object object bounding box. The grasping point on the 2D-image plane is converted to a desired bounding box. The grasping point on the 2D-image plane is converted to a desired 6D 6D grasping pose in the Cartesian space so as to control the robot manipulator to grasp grasping pose in the Cartesian space so as to control the robot manipulator to grasp objects objects of interest and place them at a desired position. The system will return the reward of interest and place them at a desired position. The system will return the reward infor- information based on the reward mechanism. mation based on the reward mechanism. Figure 1. Schematic diagram of robotic pick-and-place based on computer vision and deep Figure 1. Schematic diagram of robotic pick-and-place based on computer vision and deep rein- reinforcement learning. forcement learning. 3. Object Recognition and Localization Based on YOLO Algorithms 3. Object Recognition and Localization Based on YOLO Algorithms In computer vision-based object recognition/localization applications, many past stud- In computer vision-based object recognition/localization applications, many past ies have adopted a two-step approach. The first step focuses on detecting and segmenting studies have adopted a two-step approach. The first step focuses on detecting and seg- out the region that contains objects of interest within the image. The second step proceeds menting out the region that contains objects of interest within the image. The second step to object recognition/localization based on the region detected in the first step. Such an proceeds to object recognition/localization based on the region detected in the first step. approach often consumes enormous computation resources and time. Unlike the two-step Such an approach often consumes enormous computation resources and time. Unlike the approach, YOLO can simultaneously detect and recognize objects of interest [23–26]. The two-step approach, YOLO can simultaneously detect and recognize objects of interest [23– schematic diagram of the YOLO employed in this paper is shown in Figure 2, where 26]. The schematic diagram of the YOLO employed in this paper is shown in Figure 2, “Input” is the image input, “Conv” is the convolution layer, “Res_Block” is the residual where “Input” is the image input, “Conv” is the convolution layer, “Res_Block” is the block, and “Upsample” is the upsampling of image features. YOLO uses the Darknet-53 residual block, and “Upsample” is the upsampling of image features. YOLO uses the network structure to extract image features. In general, Darknet-53 consists of a series of Darknet-53 network structure to extract image features. In general, Darknet-53 consists of 1  1 and 3  3 convolution layers. Each convolution layer has a Leaky ReLU activation a series of 1 × 1 and 3 × 3 convolution layers. Each convolution layer has a Leaky ReLU function, a batch normalization unit and a residual block to cope with the problem of gradient disappearance/explosion caused by the large number of layers in the deep neural network. In addition, to improve the detection accuracy of small objects, YOLO adopts the Machines 2023, 11, x FOR PEER REVIEW 4 of 20 activation function, a batch normalization unit and a residual block to cope with the prob- Machines 2023, 11, 275 4 of 19 lem of gradient disappearance/explosion caused by the large number of layers in the deep neural network. In addition, to improve the detection accuracy of small objects, YOLO adopts the Feature Pyramid Network structure to perform multi-scale detection. The im- Feature Pyramid Network structure to perform multi-scale detection. The image input after age input after processing by the Darknet-53 will output three different sizes of image processing by the Darknet-53 will output three different sizes of image features—13  13, features—13 × 13, 26 × 26 and 52 × 52. Object detection will be performed on these image 26  26 and 52  52. Object detection will be performed on these image features and the features and the anchor box will then be equally distributed to three outputs. The final anchor box will then be equally distributed to three outputs. The final detection results will detection results will be the sum of the detection results of these three image features of be the sum of the detection results of these three image features of different sizes. different sizes. Figure 2. Schematic diagram of YOLO. Figure 2. Schematic diagram of YOLO. 4. Object Pick-and-Place Policy Based on SAC Algorithms 4. Object Pick-and-Place Policy Based on SAC Algorithms SAC is a deep reinforcement learning algorithm [27,28] that can enable a robot to SAC is a deep reinforcement learning algorithm [27,28] that can enable a robot to learn in the real world. The attractive features of SAC include: (1) it is based on the learn in the real world. The attractive features of SAC include: (1) it is based on the Actor- Actor-Critic framework; (2) it can learn based on past experience, i.e., off-policy, to achieve Critic framework; (2) it can learn based on past experience, i.e., off-policy, to achieve im- improved efficiency in sample usage; (3) it belongs to the category of Maximum Entropy proved efficiency in sample usage; (3) it belongs to the category of Maximum Entropy Reinforcement Learning and can improve stability and exploration; and (4) it requires Reinforcement Learning and can improve stability and exploration; and (4) it requires fewer parameters. fewer parameters. In this paper, both the state and action are defined in the continuous space. Therefore, SAC uses neural networks to parametrize the soft-action value function and the policy function as Q (s , a ) and p (a s ) , respectively. A total of five neural networks are q t t f t t constructed—two soft action-value networks Q (s , a ) and Q (s , a ); two target soft t t t t q q 1 2 0 0 action-value networks Q (s , a ) and Q (s , a ); and one policy network p (a s ) , 0 0 t t t t f t t q q 1 2 Machines 2023, 11, x FOR PEER REVIEW 5 of 20 In this paper, both the state and action are defined in the continuous space. Therefore, SAC uses neural networks to parametrize the soft-action value function and the policy function as and π (|as) , respectively. A total of five neural networks are Qs(,a ) θ tt φ tt Qs(,a) Qs(,a) constructed—two soft action-value networks and ; two target soft ac- θ tt θ tt 1 2 Machines 2023, 11, 275 5 of 19 tion-value networks Qs ′ (,a) and Qs ′ (,a) ; and one policy network , where π (|as ) θ tt tt φ tt 1 ′ ′ θ θ 1 2 ′ ′ , , θ , θ and are the parameter vectors of the neural networks as shown in Figure θ φ 2 1 2 0 0 wher 3. In p e qart ,qicu ,qla,rq, the and polic f ar y f e uthe nction parameter and the so vectors ft action of-va the lue neural functnetworks ion are theas acshown tor and in the 1 2 1 2 Figur critic e in 3. the A In particular ctor-Cri ,tthe ic frpolicy amework, re function spective and the ly. Und soft action-value er state s, the soft function action- are vthe alue f actor unc- and tion w the i critic ll outp inut the the Actor exp-Critic ected rewa framework, rd respectively for sele .ct Under ing acstate tion as, , the thus g soft uidin action-value g the pol- Qs(,a ) θ tt function will output the expected reward Q (s , a ) for selecting action a, thus guiding icy function to learn. Based on theq cur t ret nt state, the policy function will output π (|as ) φ tt the policy function p (a s ) to learn. Based on the current state, the policy function f t t an action to yield the system state for the next moment. By repeating these procedures, will output an action to yield the system state for the next moment. By repeating these one can collect past experience to be used in training the soft action-value function. Since procedures, one can collect past experience to be used in training the soft action-value SAC is a random policy, the outputs of SAC are therefore the mean and standard devia- function. Since SAC is a random policy, the outputs of SAC are therefore the mean and tion of probability distribution of the action space. standard deviation of probability distribution of the action space. Figure 3. Neural network architecture of SAC. Figure 3. Neural network architecture of SAC. The objective function for the Soft Action-Value Network is described by Equation (1), The objective function for the Soft Action-Value Network is described by Equation while Equation (2) is the learning target. The Mean-Square Error (MSE) is employed (1), while Equation (2) is the learning target. The Mean-Square Error (MSE) is employed to update the network parameters. The action-value network Q (s , a ) and the target t t to update the network parameters. The action-value network and the target ac- Qs(,a ) θ tt action-value network Q (s , a ) have the same network structure. The action-value network t t tion-value network ′ have the same network structure. The action-value network Qs(,a ) tt is used to predict the expected reward for executing action a under state s. The target is used to predict the expected reward for executing action a under state s. The target ac- action-value network is used to update the target so as to help train the action-value tion-value network is used to update the target so as to help train the action-value net- network. During training, only the action-value network will be trained, while the target work. During training, only the action-value network will be trained, while the target ac- action-value network will remain unchanged. In short, the target will change if the target tion-value network will remain unchanged. In short, the target will change if the target action-value network updates, which will make it difficult for the learning of the neural action-value network updates, which will make it difficult for the learning of the neural network to converge. network to converge. 1 2 J (q) = E (Q (s , a ) Q(s , a )) (1) t t t t Q (s ,a )D q t t 1  JE () θ=− (Q (s ,a )Q(s ,a )) (1) Q (,s a )~D θ tt tt  tt  Q(s , a ) = r(s , a )+gE [V 0(s )]  (2) t t t t s p t+1 t+1 In this paper, the Stochastic Gradient ˆ Descent (SGD) method is employed to calculate the Qs ( ,a )=r (s ,a )+γE [V (s )] ′ (2) tt tt s~1 p θ t + t +1 derivative of the objective function, as described by Equation (3): r J (q) = r Q (s , a )(Q (s , a ) r(s , a ) + g(Q 0(s , a ) a log p (a js )))) (3) q Q q q t t q t t t t q t+1 t+1 f t+1 t+1 The weights of the target soft action value network are updated using Equation (4), where t is a constant: 0 0 q tq + (1 t)q (4) t+1 t t The objective function of the policy network is described by Equation (5). To improve the policy, one should maximize the sum of action value and entropy: Machines 2023, 11, x FOR PEER REVIEW 6 of 20 In this paper, the Stochastic Gradient Descent (SGD) method is employed to calculate the derivative of the objective function, as described by Equation (3): ∇=JQ ()θγ ∇ (s ,a )(Q (s ,a )−r(s ,a )+ (Q (s ,a )−αlogπ (a |s )))) (3) θθQtθ t θtt tt θ ′t++ 11t φt+1t+1 The weights of the target soft action value network are updated using Equation (4), where 𝜏 is a constant: ′′ θτ←+θ (1−τ )θ (4) tt + 1 t The objective function of the policy network is described by Equation (5). To improve the policy, one should maximize the sum of action value and entropy: Machines 2023, 11, 275 6 of 19 J (φ)=− Ea [απ log( ( |s ))Q (s ,a )] ππ s ~, Da ~ φ t t θ t t tt θ (5) af = (; ε s ) tt φ t J (f) = E [a log(p (a s )) Q (s , a )] p s D,a p f t t q t t t t (5) a = f (# ; s ) t f t t where 𝜀 is the noise and Equation (5) can be rewritten as Equation (6): where # is the noise and Equation (5) can be rewritten as Equation (6): J (φ)=− Ef [απ log( ( (;εs )|s ))Q (s,f (;εs ))] (6) πε sD ~, ~N φφ t t t θ tφ t t tt J (f) = E [a log(p ( f (# ; s ) s )) Q (s , f (# ; s ))] (6) p s D,# N f f t t t q t f t t t t The derivative of the objective function of the policy network is described by Equa- The derivative of the objective function of the policy network is described by tion (7): Equation (7): ˆ ˆ r J (f) = r a log(p (a js )) + (r a log(p (a js )) Q(s , a ))r f ( # ; s ) (7) f p ∇= J () fφα∇ lfog(πt (tas | ))+(∇ a αlog(π f(as t| t))−Q(s ,a t))∇t f (εf;sf) t t (7) φπ φ φ tt ta φ t t t t φ φ t t The SAC reinforcement learning algorithm is illustrated in Figure 4. The SAC reinforcement learning algorithm is illustrated in Figure 4. Soft Actor-Critic Input: θθ,, φ 1. Initial parameters ′′ θθ←← ,θ θ 2. Initial target network weights 11 2 2 D ←∅ 3. Initial empty replay buffer 4. for each iteration do 5. for each environment step do aa ~( π |s) tt φ t 6. Sample action from the policy s ~(ps | s ,a ) 7. tt ++ 11 t t Sample transition from the environment DD ←∪ (,s a ,r(s ,a ),s ) { } 8. tt tt t +1 Store the transition in the replay buffer 9. end for 10. for each gradient step do θθ ←−λ ∇Ji (θ ) for ∈ 1, 2 { } ii Q θ Q i 11. Update the Q-function parameters φφ ←− λ ∇ J () φ πφ π 12. Update policy weights ′′ ′ θτ←+θ (1−τ )θ for i∈{1, 2} ii i 13. Update target network weights 14. end for 15. end for Output: θθ,, φ 16. Optimized parameters Figure 4. SAC reinforcement learning algorithm. Figure 4. SAC reinforcement learning algorithm. 4.1. Policy This paper applies SAC to robotic object grasping. The learning agent is the 6-DOF robot manipulator, while the policy output is the coordinate (u,v) of the object grasping point on the image plane. The state, action and reward mechanism are designed as follows. 4.1.1. State (State s) By exploiting YOLO, one can detect the objects of interest. The state of the SAC algorithm is defined to be the depth image of the object of interest. The state input designed in this paper is the depth information. Therefore, after obtaining the position of the object of interest in the RGB image, one needs to find its corresponding position in the depth Machines 2023, 11, x FOR PEER REVIEW 7 of 20 4.1. Policy This paper applies SAC to robotic object grasping. The learning agent is the 6-DOF robot manipulator, while the policy output is the coordinate (u,v) of the object grasping point on the image plane. The state, action and reward mechanism are designed as fol- lows. 4.1.1. State (State s) By exploiting YOLO, one can detect the objects of interest. The state of the SAC algo- Machines 2023, 11, 275 rithm is defined to be the depth image of the object of interest. The state input designed 7 of 19 in this paper is the depth information. Therefore, after obtaining the position of the object of interest in the RGB image, one needs to find its corresponding position in the depth image. Note that this depth image will be scaled to a size of 64 × 64. To be precise, the state image. Note that this depth image will be scaled to a size of 64 64. To be precise, the state used in this paper is a 64 × 64 × 1 depth image as shown in Figure 5. used in this paper is a 64  64  1 depth image as shown in Figure 5. Figure 5. Illustrative diagram of state acquisition. Figure 5. Illustrative diagram of state acquisition. 4.1.2. 4.1.2. Act Action ion ( (Action Action a) a) The The action of action of SAC SAC is define is defined d to be the to be thein input put displacement vector o displacement vectorfof the objec the object t of in- of terest on the image plane as described by Equation (8), for which its unit is a pixel. The interest on the image plane as described by Equation (8), for which its unit is a pixel. The length length and andwidth widthof of the the bo bounding unding box box o obtained btained by by YOLO YOLO ar are e den denoted oted as as x x and and y y,, res respec- pec- tively tively. In . In addition, addition, the the coor coordinate dinate o offthe the center center o offthe the bounding bounding bo box x is deno is denoted ted as as ( (u𝑢 ,,𝑣v ). ). c c Equation (9) gives the displacement vector of the object of interest on the image plane Equation (9) gives the displacement vector of the object of interest on the image plane corresponding to the action by the SAC. The coordinates of the object grasping point on the corresponding to the action by the SAC. The coordinates of the object grasping point on image plane as shown in Figure 6 are calculated using Equation (10). With the calculated the image plane as shown in Figure 6 are calculated using Equation (10). With the calcu- image coordinates of the object grasping point, by using coordinate transformation, depth lated image coordinates of the object grasping point, by using coordinate transformation, information and inverse kinematics, one can obtain the joint command for the 6-DOF robot depth information and inverse kinematics, one can obtain the joint command for the 6- manipulator to perform object grasping. DOF robot manipulator to perform object grasping. a 2 [1, 1] a ∈− [1,1] a = (a , a ), (8) 1 2 aa =(,a ), a2 [1, 1] (8) 12 2 a ∈− [1,1]  2 Du = 1 + (a  x/2 0.99) Machines 2023, 11, x FOR PEER REVIEW 1 8 of 20 (9) Δ=ua 1( + */x 2−0.99) Dv = 1 + (a  1 y/2 0.99) (9) Δ=va 1+( */y 20 − .99) u = Du + u (10) v = Dv + v uu =Δ +u (10) vv =Δ +v Figure 6. The displacement vector and the object grasping point on the image plane. Figure 6. The displacement vector and the object grasping point on the image plane. 4.1.3. Reward (Reward, r) A positive reward of 1 will be given if a successful object grasping occurs. In contrast, a negative reward −0.1 (i.e., penalty) will be given if failure occurs. As a result, the accu- mulated reward for an episode will be negative if the first ten attempts of object grasping fail. In order to help the learning agent find the optimal object grasping point as soon as possible, an extra positive reward 0.5 will be given if the first attempt of object grasping is successful. In addition, two termination conditions are adopted for the learning of SAC. To prevent the learning agent from continuously learning the wrong policy, if none of the first 100 object grasping attempts is successful, this episode will be terminated immedi- ately. In addition, when the learning agent successfully performs object grasping, this ep- isode will also be terminated immediately. The reward mechanism is described by Equa- tion (11). +1 , if successful r== +1.5 , if successful and the number of attempts in object grasping1  (11) −0.1 , for each failure attempt in object grasping 4.2. Architecture Design of SAC Neural Network Since state s adopted in this paper is a 64 × 64 × 1 depth image, a CNN is amended to the SAC so that the SAC can learn directly from the depth image. The hyperparameters of SAC are listed in Table 1 and its network architecture is shown in Figure 7. The input to the policy network is the depth image of the object of interest as detected by YOLO. The inputs to the soft action-value network and the target soft action-value network are com- prised of the depth image of the object of interest as detected by YOLO and the policy outputted by the policy network. As shown in Figure 7, the policy network, the soft action- value network and the target soft action-value network all consist of three CNNs and four full connected neural networks. The activation functions used in the soft action-value net- work and the target soft action-value network are ReLU. As for the policy network, the activation functions for the three CNNs and the first three full connected neural networks are ReLU. The output of the last layer of the policy network is the displacement vector on the image plane, having both positive and negative values. Therefore, the hyperbolic tan- gent function (i.e., Tanh) is chosen as the activation function for the last layer of the policy network. Note that the three CNNs and the first fully connected neural network are used to extract image features. Machines 2023, 11, 275 8 of 19 4.1.3. Reward (Reward, r) A positive reward of 1 will be given if a successful object grasping occurs. In contrast, a negative reward 0.1 (i.e., penalty) will be given if failure occurs. As a result, the accumulated reward for an episode will be negative if the first ten attempts of object grasping fail. In order to help the learning agent find the optimal object grasping point as soon as possible, an extra positive reward 0.5 will be given if the first attempt of object grasping is successful. In addition, two termination conditions are adopted for the learning of SAC. To prevent the learning agent from continuously learning the wrong policy, if none of the first 100 object grasping attempts is successful, this episode will be terminated immediately. In addition, when the learning agent successfully performs object grasping, this episode will also be terminated immediately. The reward mechanism is described by Equation (11). +1 , if successful r = +1.5 , if successful and the number of attempts in object grasping = 1 (11) 0.1 , for each failure attempt in object grasping 4.2. Architecture Design of SAC Neural Network Since state s adopted in this paper is a 64 64 1 depth image, a CNN is amended to the SAC so that the SAC can learn directly from the depth image. The hyperparameters of SAC are listed in Table 1 and its network architecture is shown in Figure 7. The input to the policy network is the depth image of the object of interest as detected by YOLO. The inputs to the soft action-value network and the target soft action-value network are comprised of the depth image of the object of interest as detected by YOLO and the policy outputted by the policy network. As shown in Figure 7, the policy network, the soft action-value network and the target soft action-value network all consist of three CNNs and four full connected neural networks. The activation functions used in the soft action-value network and the target soft action-value network are ReLU. As for the policy network, the activation functions for the three CNNs and the first three full connected neural networks are ReLU. The output of the last layer of the policy network is the displacement vector on the image plane, having both positive and negative values. Therefore, the hyperbolic tangent function (i.e., Tanh) is chosen as the activation function for the last layer of the policy network. Note that the three CNNs and the first fully connected neural network are used to extract image features. Table 1. Hyperparameters of SAC neural network. Hyperparameter Title 2 optimizer Adam learning rate 0.001 replay buffer size 200,000 batch size 64 discount factor ( ) 0.99 target smoothing coefficient () 0.005 entropy temperature parameter ( ) 0.01 Machines 2023, 11, x FOR PEER REVIEW 9 of 20 Machines 2023, 11, 275 9 of 19 (a) (b) (c) Figure 7. Architecture of SAC neural network. (a) Policy Network; (b) Soft Action-Value Network; Figure 7. Architecture of SAC neural network. (a) Policy Network; (b) Soft Action-Value Network; (c) Target Soft Action-Value Network. (c) Target Soft Action-Value Network. Table 1. Hyperparameters of SAC neural network. 5. Experimental Setup and Results The real experimental environment used in this paper is shown in Figure 8a, while Hyperparameter Title 2 Figure 8b shows the simulated environment constructed using the simulation platform optimizer Adam V-REP. The simulated environment is mainly used to train and test the deep neural network. learning rate 0.001 The 6-DOF A7 industrial articulated robot manipulator used in the real experiment is replay buffer size 200,000 manufactured by ITRI. The Mitsubishi AC servomotors installed at each joint of the robot batch size 64 manipulator are equipped with absolute encoders and are set to torque mode. A vacuum discount factor (γ) 0.99 sucker (maximum payload 3 kg) manufactured by Schmalz is mounted on the end-effector target smoothing coefficient (τ) 0.005 of the robot manipulator. The vision sensor used in the experiment is a Kinect v2 RGBD entropy temperature parameter (α) 0.01 camera (30 Hz frame rate) manufactured by Microsoft. The maximum resolution for the RGB camera is 1920  1080 pixels, while the maximum resolution for the depth 5. Experimental Setup and Results camera is 512  424 pixels. The Kinect v2 camera is located at the upper right side of the 6-DOF robot manipulator to capture the images of the objects. These object images will The real experimental environment used in this paper is shown in Figure 8a, while be used for YOLO to classify their categories. Two desktop computers are used in the Figure 8b shows the simulated environment constructed using the simulation platform V- experiment. The computer for controlling the 6-DOF robot manipulator and the vacuum REP. The simulated environment is mainly used to train and test the deep neural network. sucker is equipped with Intel(R) Core TM i7-2600 CPU @3.40 Ghz and 12 GB RAM. It runs The 6-DOF A7 industrial articulated robot manipulator used in the real experiment is under Microsoft Windows 7 and uses Microsoft Visual Studio 2015 as its programming manufactured by ITRI. The Mitsubishi AC servomotors installed at each joint of the robot development platform. The computer responsible for computer vision, the training of the manipulator are equipped with absolute encoders and are set to torque mode. A vacuum sucker (maximum payload 3 kg) manufactured by Schmalz is mounted on the end-effector Machines 2023, 11, x FOR PEER REVIEW 10 of 20 of the robot manipulator. The vision sensor used in the experiment is a Kinect v2 RGBD camera (30 Hz frame rate) manufactured by Microsoft. The maximum resolution for the RGB camera is 1920 × 1080 pixels, while the maximum resolution for the depth camera is 512 × 424 pixels. The Kinect v2 camera is located at the upper right side of the 6-DOF robot manipulator to capture the images of the objects. These object images will be used for YOLO to classify their categories. Two desktop computers are used in the experiment. The computer for controlling the 6-DOF robot manipulator and the vacuum sucker is Machines 2023, 11, 275 10 of 19 equipped with Intel(R) Core TM i7-2600 CPU @3.40 Ghz and 12 GB RAM. It runs under Microsoft Windows 7 and uses Microsoft Visual Studio 2015 as its programming devel- opment platform. The computer responsible for computer vision, the training of the deep deep reinforcement learning network, and the V-REP robot simulator is equipped with a reinforcement learning network, and the V-REP robot simulator is equipped with a NVIDIA GeForce RTX 2080 Ti and 26.9 GB RAM. It runs under Microsoft Windows 10 and NVIDIA GeForce RTX 2080 Ti and 26.9 GB RAM. It runs under Microsoft Windows 10 uses PyCharm as its development platform. The Python and the tool kit of the PyTorch are and uses PyCharm as its development platform. The Python and the tool kit of the used in training the deep reinforcement learning network. PyTorch are used in training the deep reinforcement learning network. (a) (b) Figure 8. Experimental and simulated environment: (a) real experimental environment; (b) simu- Figure 8. Experimental and simulated environment: (a) real experimental environment; (b) simu- lated environment. lated environment. 5.1. Training Results of YOLO 5.1. Training Results of YOLO As shown in Figure 9, the objects of interest used in the experiment included apples, As shown in Figure 9, the objects of interest used in the experiment included apples, Machines 2023, 11, x FOR PEER REVIEW 11 of 20 oranges, a banana, a cup, a box and building blocks. oranges, a banana, a cup, a box and building blocks. Figure 9. Objects of interest used in the experiment. Figure 9. Objects of interest used in the experiment. The COCO Dataset was used to train the YOLOv3 in this paper. However, the COCO Dataset does not include objects such as the building blocks used in the experiment. As a result, it was necessary to collect a training data set for the building blocks. In particular, a total of 635 images of the building blocks were taken. The transfer learning technique [29] was employed in this paper to speed up the training process, in which the weights provided by the authors of YOLO were adopted as the initial weights for training the YOLOv3. Figure 10 shows the training results of YOLO. The total number of iterations was 45,000. The value of the loss function converged to 0.0391. To test the performance of the trained YOLOv3, several objects were randomly placed on the table, with the detection results shown in Figure 11. Clearly, YOLOv3 can successfully detect and classify the ob- jects of interest. Figure 10. Training results of YOLOv3. Machines 2023, 11, x FOR PEER REVIEW 11 of 20 Machines 2023, 11, 275 11 of 19 Figure 9. Objects of interest used in the experiment. The COCO Dataset was used to train the YOLOv3 in this paper. However, the COCO The COCO Dataset was used to train the YOLOv3 in this paper. However, the COCO Dataset does not include objects such as the building blocks used in the experiment. As a Dataset does not include objects such as the building blocks used in the experiment. As a result, it was necessary to collect a training data set for the building blocks. In particular, result, it was necessary to collect a training data set for the building blocks. In particular, a a total of 635 images of the building blocks were taken. The transfer learning technique total of 635 images of the building blocks were taken. The transfer learning technique [29] [29] was employed in this paper to speed up the training process, in which the weights was employed in this paper to speed up the training process, in which the weights provided provided by the authors of YOLO were adopted as the initial weights for training the by the authors of YOLO were adopted as the initial weights for training the YOLOv3. YOLOv3. Figure 10 shows the training results of YOLO. The total number of iterations Figure 10 shows the training results of YOLO. The total number of iterations was 45,000. The was value 45,000. T of the he value o loss function f the loss conver func ged tionto conv 0.0391. erged to To test 0.0the 391. To performance test the pe of rforma the trained nce of YOLOv3, the trained several YOLOv objects 3, sever wer al ob e jec randomly ts were rplaced andomlon y pthe laced table, on the with tabthe le, wi detect th the ion detec results tion shown in Figure 11. Clearly, YOLOv3 can successfully detect and classify the objects results shown in Figure 11. Clearly, YOLOv3 can successfully detect and classify the ob- of interest. jects of interest. Machines 2023, 11, x FOR PEER REVIEW 12 of 20 Figure 10. Training results of YOLOv3. Figure 10. Training results of YOLOv3. (a) (b) (c) (d) (e) (f) Figure 11. Detection/classification results of YOLOv3 after training (a) 1st test (b) 2nd test (c) 3rd Figure 11. Detection/classification results of YOLOv3 after training (a) 1st test (b) 2nd test (c) 3rd test (d) 4th test (e) 5th test (f) 6th test. test (d) 4th test (e) 5th test (f) 6th test. 5.2. Training and Simulation Results of Object Grasping Policy Based on SAC Figure 12 illustrates the flowchart of the training process for the proposed object grasping approach based on SAC. At the beginning of each episode, the experimental/sim- ulation environment was reset, namely, the robot manipulator was returned to the home position, objects were placed on the table, and the camera took images of the environment. Based on the image captured by the camera, the object recognition/localization approach based on YOLO developed in Section 3 was used to find the position of the object of in- terest so as to obtain its current state (s) (detailed procedures are indicated by the red dash block in Figure 12). According to its current state, the SAC would output an action (a), i.e., the input displacement vector of the object of interest on the image plane. The joint com- mand of the robot manipulator could be obtained by using coordinate transformation, depth information and inverse kinematics. According to the obtained joint command, the end-effector was controlled to move to a desired position and a suction nozzle was turned on to perform object grasping. A positive reward was given for a successful grasp. The termination conditions for an episode occurred either when the total number of object grasping attempts was more than 100, or when an object grasping attempt was successful. In the real world, objects to be grasped are randomly placed. However, if the objects to be grasped are randomly placed for each episode in the training initially, the training time for learning object grasping successfully could be very long. In order to speed up the learning process, the idea of incremental learning is exploited in this paper to set up the learning environment. For instance, a building block was the object of interest for grasp- ing. Firstly, the pose of the building block on the table was fixed and the deep reinforce- ment neural network was trained over 1000 episodes in the simulated environment con- structed by the V-REP robot simulator. The training results are shown in Figure 13. Machines 2023, 11, 275 12 of 19 5.2. Training and Simulation Results of Object Grasping Policy Based on SAC Figure 12 illustrates the flowchart of the training process for the proposed object grasp- ing approach based on SAC. At the beginning of each episode, the experimental/simulation environment was reset, namely, the robot manipulator was returned to the home position, objects were placed on the table, and the camera took images of the environment. Based on the image captured by the camera, the object recognition/localization approach based on YOLO developed in Section 3 was used to find the position of the object of interest so as to obtain its current state (s) (detailed procedures are indicated by the red dash block in Figure 12). According to its current state, the SAC would output an action (a), i.e., the input displacement vector of the object of interest on the image plane. The joint command of the robot manipulator could be obtained by using coordinate transformation, depth information and inverse kinematics. According to the obtained joint command, the end-effector was controlled to move to a desired position and a suction nozzle was turned on to perform object grasping. A positive reward was given for a successful grasp. The termination conditions for an episode occurred either when the total number of object Machines 2023, 11, x FOR PEER REVIEW 13 of 20 grasping attempts was more than 100, or when an object grasping attempt was successful. Figure 12. Flowchart of the training process for the proposed object grasping approach based on Figure 12. Flowchart of the training process for the proposed object grasping approach based on SAC. SAC. As described in Equation (11), a positive reward of 1 will be given if the robot suc- cessfully grasps an object. In contrast, a negative reward −0.1 (i.e., penalty) will be given if the robot fails to grasp an object. That is, the accumulated reward for an episode will be negative if the robot needs more than ten attempts to successfully grasp an object. In ad- dition, since an extra positive reward 0.5 will be given if the robot successfully grasps an object on its first attempt, the maximum accumulated reward for an episode will be 1.5. Machines 2023, 11, 275 13 of 19 In the real world, objects to be grasped are randomly placed. However, if the objects to be grasped are randomly placed for each episode in the training initially, the training Machines 2023, 11, x FOR PEER REVIEW 14 of 20 time for learning object grasping successfully could be very long. In order to speed up the learning process, the idea of incremental learning is exploited in this paper to set up the learning environment. For instance, a building block was the object of interest for grasping. Firstly, the pose of the building block on the table was fixed and the deep reinforcement From the results shown in Figure 13, it was found that after 100 episodes of training, the neural network was trained over 1000 episodes in the simulated environment constructed 6-DOF robot manipulator was able to find a correct grasping pose for the case of a building by the V-REP robot simulator. The training results are shown in Figure 13. block with a fixed pose. (a) (b) Figure 13. Training results of a fixed pose building block (a) accumulated reward for each episode Figure 13. Training results of a fixed pose building block (a) accumulated reward for each episode (b) number of grasping attempts for each episode. (b) number of grasping attempts for each episode. After the 6-DOF robot manipulator could successfully grasp the building block with As described in Equation (11), a positive reward of 1 will be given if the robot suc- a fixed pose, the deep reinforcement neural network was retrained for another 1000 epi- cessfully grasps an object. In contrast, a negative reward 0.1 (i.e., penalty) will be given sodes. This time, the building block as well as other objects (used as the environmental if the robot fails to grasp an object. That is, the accumulated reward for an episode will disturbance) were randomly placed on a table. By exploiting the paradigm of transfer be negative if the robot needs more than ten attempts to successfully grasp an object. In learning, the weights of the deep reinforcement neural network after learning for the case addition, since an extra positive reward 0.5 will be given if the robot successfully grasps of fixed object poses were used as the initial weights for the deep reinforcement neural an object on its first attempt, the maximum accumulated reward for an episode will be 1.5. network in the retraining process. By taking into account the fact that objects of the same From the results shown in Figure 13, it was found that after 100 episodes of training, the category may have different sizes or colors, for every 100 episodes in the retraining pro- 6-DOF robot manipulator was able to find a correct grasping pose for the case of a building cess, the colors block and sizes of with a fixed objects in pose. each category were changed. This strategy served to enhance the robustne After ss o the f the trained p 6-DOF robotomanipulator licy toward environmental could successfully uncer grasp tainty dur the building ing block with a verification in the real world. Figure 14 shows the training results for the case of randomly fixed pose, the deep reinforcement neural network was retrained for another 1000 episodes. placed objects, wher This tie th me,e y the ebui llow lding line rep block resen as well ts the r as other esults of explo objects (used iting astr the ansfer envir learning onmental disturbance) (i.e., using the weigh were randomly ts for the c placed ase ofon fixed objec a table. By t po exploiting ses as the the initparadigm ial weightsof ) and transfer the learning, the weights of the deep reinforcement neural network after learning for the case of fixed object purple line shows the results without using transfer learning. The results shown in Figure poses were used as the initial weights for the deep reinforcement neural network in the 14b indicate that the number of grasping attempts required to find correct grasping points retraining process. By taking into account the fact that objects of the same category may without using transfer learning was much larger than that for using transfer learning over have different sizes or colors, for every 100 episodes in the retraining process, the colors the first 200 episodes. Table 2 shows similar results in total training time and total number and sizes of objects in each category were changed. This strategy served to enhance the of grasping attempts for 1000 episodes. robustness of the trained policy toward environmental uncertainty during verification in the real world. Figure 14 shows the training results for the case of randomly placed objects, where the yellow line represents the results of exploiting transfer learning (i.e., using the weights for the case of fixed object poses as the initial weights) and the purple line shows the results without using transfer learning. The results shown in Figure 14b indicate that the number of grasping attempts required to find correct grasping points without using transfer learning was much larger than that for using transfer learning over the first 200 episodes. Table 2 shows similar results in total training time and total number of grasping attempts for 1000 episodes. (a) (b) Figure 14. Training results for the case of randomly placed objects: the yellow line represents the results of exploiting transfer learning (i.e., use the weights for the case of fixed object poses as the Machines 2023, 11, x FOR PEER REVIEW 14 of 20 From the results shown in Figure 13, it was found that after 100 episodes of training, the 6-DOF robot manipulator was able to find a correct grasping pose for the case of a building block with a fixed pose. (a) (b) Figure 13. Training results of a fixed pose building block (a) accumulated reward for each episode (b) number of grasping attempts for each episode. After the 6-DOF robot manipulator could successfully grasp the building block with a fixed pose, the deep reinforcement neural network was retrained for another 1000 epi- sodes. This time, the building block as well as other objects (used as the environmental disturbance) were randomly placed on a table. By exploiting the paradigm of transfer learning, the weights of the deep reinforcement neural network after learning for the case of fixed object poses were used as the initial weights for the deep reinforcement neural network in the retraining process. By taking into account the fact that objects of the same category may have different sizes or colors, for every 100 episodes in the retraining pro- cess, the colors and sizes of objects in each category were changed. This strategy served to enhance the robustness of the trained policy toward environmental uncertainty during verification in the real world. Figure 14 shows the training results for the case of randomly placed objects, where the yellow line represents the results of exploiting transfer learning (i.e., using the weights for the case of fixed object poses as the initial weights) and the purple line shows the results without using transfer learning. The results shown in Figure 14b indicate that the number of grasping attempts required to find correct grasping points without using transfer learning was much larger than that for using transfer learning over Machines 2023, 11, 275 14 of 19 the first 200 episodes. Table 2 shows similar results in total training time and total number of grasping attempts for 1000 episodes. (a) (b) Figure 14. Training results for the case of randomly placed objects: the yellow line represents the Figure 14. Training results for the case of randomly placed objects: the yellow line represents the results of exploiting transfer learning (i.e., use the weights for the case of fixed object poses as the results of exploiting transfer learning (i.e., use the weights for the case of fixed object poses as the initial weights), while the purple line shows the results without using transfer learning. (a) Accumulated reward for each episode; (b) number of grasping attempts for each episode. Table 2. Total training time and total number of grasping attempts. Machines 2023, 11, x FOR PEER REVIEW 15 of 20 Pre_Train No_Pre_Train Without_YOLO (Use Transfer Learning) Training time 6443 (s) 15,076 (s) 102,580 (s) initial weights), while the purple line shows the results without using transfer learning. (a) Accu- Number of grasping attempts 1323 (attempts) 3635 (attempts) 38,066 (attempts) mulated reward for each episode; (b) number of grasping attempts for each episode. Figure 15 shows the results of directly using the entire image (rather than using the Figure 15 shows the results of directly using the entire image (rather than using the object of interest detected by YOLOv3) as the input state for the deep reinforcement learning object of interest detected by YOLOv3) as the input state for the deep reinforcement learn- network. The results shown in Figure 15 indicate that correct grasping points cannot be ing network. The results shown in Figure 15 indicate that correct grasping points cannot obtained after 1000 episodes of training. Table 2 indicates that the training time for the be obtained after 1000 episodes of training. Table 2 indicates that the training time for the case of using the entire image as the input is 15.9 times longer than that for using the case of using the entire image as the input is 15.9 times longer than that for using the proposed approach (i.e., transfer learning + YOLO + SAC). In addition, the number of proposed approach (i.e., transfer learning + YOLO + SAC). In addition, the number of grasping attempts for the case of using the entire image as the input is 28.8 times larger grasping attempts for the case of using the entire image as the input is 28.8 times larger than that for using the proposed approach. The above simulation results reveal that the than that for using the proposed approach. The above simulation results reveal that the proposed approach indeed can effectively reduce the total training time and total number proposed approach indeed can effectively reduce the total training time and total number of grasping attempts. of grasping attempts. (a) (b) Figure 15. Results of the V-REP robot simulator without combining YOLOv3: (a) accumulated re- Figure 15. Results of the V-REP robot simulator without combining YOLOv3: (a) accumulated reward ward for each episode; (b) number of grasping attempts for each episode. for each episode; (b) number of grasping attempts for each episode. Table 2. Total training time and total number of grasping attempts. Pre_Train No_Pre_Train Without_YOLO (Use Transfer Learning) Training time 6443 (s) 15,076 (s) 102,580 (s) Number of grasping 1323 (attempts) 3635 (attempts) 38,066 (attempts) attempts 5.3. Object Grasping Using a Real Robot Manipulator As mentioned previously, the input to the proposed deep reinforcement learning- based object grasping approach is the depth image (provided by Kinect v2) of the objects of interest detected by YOLOv3. Since YOLOv3 uses the RGB image (provided by Kinect v2) to detect the objects of interest, there is a need to construct the correspondence between the depth image and the RGB image so that the depth information of a point on the object of interest can be retrieved. In this paper, such a correspondence is constructed by using SDK accompanied with Kinect v2. In addition, with camera calibration [30] and the ob- tained depth information, the 3D information of a point on the object of interest in the camera frame can be retrieved. Hand-eye calibration [31] is then conducted to obtain the coordination transformation relationship between the camera frame and the end-effector fame. Using the results of hand-eye calibration and robot kinematics, the 3D information of a point on the object of interest in the camera frame can be converted into 3D infor- mation in the robot base frame. Moreover, using robot inverse kinematics, the joint com- mands for the robot to perform the task of grasping the object of interest can be obtained. Machines 2023, 11, 275 15 of 19 5.3. Object Grasping Using a Real Robot Manipulator As mentioned previously, the input to the proposed deep reinforcement learning- based object grasping approach is the depth image (provided by Kinect v2) of the objects of interest detected by YOLOv3. Since YOLOv3 uses the RGB image (provided by Kinect v2) to detect the objects of interest, there is a need to construct the correspondence between the depth image and the RGB image so that the depth information of a point on the object of interest can be retrieved. In this paper, such a correspondence is constructed by using SDK accompanied with Kinect v2. In addition, with camera calibration [30] and the ob- tained depth information, the 3D information of a point on the object of interest in the Machines 2023, 11, x FOR PEER REVIEW 16 of 20 camera frame can be retrieved. Hand-eye calibration [31] is then conducted to obtain the coordination transformation relationship between the camera frame and the end-effector Figure 16 illustrates the flowchart for grasping a specific object. In this experiment, fame. Using the results of hand-eye calibration and robot kinematics, the 3D information of several different types of objects were randomly placed on a table. Note that the vacuum a point on the object of interest in the camera frame can be converted into 3D infor-mation sucker mounted on the end-effector rather than a gripper is used in this paper to grasp in the robot base frame. Moreover, using robot inverse kinematics, the joint com-mands for the object of interest. In order to perform a successful grasp, the suction force needs to the robot to perform the task of grasping the object of interest can be obtained. overcome the gravity force of the object of interest. As a result, the rim of the cup is not facing up in t Figure h16 e experimen illustrates t. The the Kinec flowchart t v2 camera for toograsping k an image of a the env specific ironm object. ent. The In this experiment, user assigned a specific object of interest for the robot manipulator to grasp. The SAC several different types of objects were randomly placed on a table. Note that the vacuum outputted a prediction of the position coordinate of the assigned object to be grasped. The sucker mounted on the end-effector rather than a gripper is used in this paper to grasp joint command of the robot manipulator was obtained by using coordinate transfor- the object of interest. In order to perform a successful grasp, the suction force needs to mation, depth information and inverse kinematics. According to the obtained joint com- overcome the gravity force of the object of interest. As a result, the rim of the cup is not mand, the end-effector was controlled to move to a desired position and a suction nozzle facing was turned o up innthe to pexperiment. erform object gr The asping. If Kinect thev2 atte camera mpt for ob took ject gr anasp image ing faof iled, the the environment. The Kinect v2 camera took an image at the environment again and the object grasping process user assigned a specific object of interest for the robot manipulator to grasp. The SAC was repeated. If the attempts of object grasping failed three consecutive times, the task for outputted a prediction of the position coordinate of the assigned object to be grasped. The grasping an assigned specific object was regarded as a failure. joint command of the robot manipulator was obtained by using coordinate transformation, In particular, SAC was employed to train a 6-DOF robot manipulator to grasp build- depth ing block information s and bananas and in a sim inverse ulated ekinematics. nvironment constr Accor ucted by ding a V- toREP robot the obtained simu- joint command, lator. By exploiting the concept of Sim-to-Real [32], the trained network was deployed to the end-effector was controlled to move to a desired position and a suction nozzle was the real 6-DOF robot manipulator to perform object grasping in the real world. In addi- turned on to perform object grasping. If the attempt for object grasping failed, the Kinect tion, in real-world experiments, objects such as apples, oranges and cups which are not in v2 camera took an image at the environment again and the object grasping process was the training data set were added to the list of objects of interest. From the experimental repeated. If the attempts of object grasping failed three consecutive times, the task for results shown in Figure 17, it is evident that the trained SAC can indeed provide correct object grasping points for objects of interest in real-world environments. Experimental grasping an assigned specific object was regarded as a failure. results for the success rate of grasping different objects are listed in Table 3. Figure 16. Flowchart for grasping a specific object. Figure 16. Flowchart for grasping a specific object. Machines 2023, 11, 275 16 of 19 In particular, SAC was employed to train a 6-DOF robot manipulator to grasp building blocks and bananas in a simulated environment constructed by a V-REP robot simulator. By exploiting the concept of Sim-to-Real [32], the trained network was deployed to the real 6-DOF robot manipulator to perform object grasping in the real world. In addition, in real-world experiments, objects such as apples, oranges and cups which are not in the training data set were added to the list of objects of interest. From the experimental results shown in Figure 17, it is evident that the trained SAC can indeed provide correct object Machines 2023, 11, x FOR PEER REVIEW 17 of 20 grasping points for objects of interest in real-world environments. Experimental results for the success rate of grasping different objects are listed in Table 3. Figure 17. Object grasping point provided by SAC for different objects of interest (red point inside Figure 17. Object grasping point provided by SAC for different objects of interest (red point inside the the bounding box in the upper environment image; white point in the lower depth image); the “ar- bounding box in the upper environment image; white point in the lower depth image); the “arrow” row” sign is used to indicate the object of interest. sign is used to indicate the object of interest. Table 3. Rate of successful grasping for different objects. Table 3. Rate of successful grasping for different objects. Building Object of Interest Apple Banana Orange Cup Object of Interest Building Block Apple Banana Orange Cup Block Rate of successful grasping 19/20 6/10 6/10 8/10 9/10 Rate of successful grasping 19/20 6/10 6/10 8/10 9/10 Object is in the training set yes no yes no no Object is in the training set yes no yes no no The results listed in Table 3 indicate that for the objects in the training set, the building The results listed in Table 3 indicate that for the objects in the training set, the build- block has a much higher rate of being successfully grasped than the banana. The reason ing block has a much higher rate of being successfully grasped than the banana. The rea- for this discrepancy is that in the simulated environment, the banana has a fixed shape son for this discrepancy is that in the simulated environment, the banana has a fixed shape and smooth surface. However, the bananas used in real-world experiments have different and smooth surface. However, the bananas used in real-world experiments have different shapes/sizes and their surfaces are not smooth enough. Therefore, the significant differ- shapes/sizes and their surfaces are not smooth enough. Therefore, the significant differ- ences between the simulated environment and that of the real-world experiment lead to ences between the simulated environment and that of the real-world experiment lead to a a lower rate of successful grasping for bananas. As for the objects not in the training set, lower rate of successful grasping for bananas. As for the objects not in the training set, the the apples had the lowest rate of being successfully grasped. One possibility is that the apples had the lowest rate of being successfully grasped. One possibility is that the two two apples used in the real-world experiments have significant differences in size/shapes. apples used in the real-world experiments have significant differences in size/shapes. In In addition, in real-world experiments, hand-eye calibration error and robot calibration addition, in real-world experiments, hand-eye calibration error and robot calibration er- errors all contribute to the fact that the end-effector cannot 100% accurately move to the rors all contribute to the fact that the end-effector cannot 100% accurately move to the grasping position determined by the proposed deep reinforcement learning-based object grasping position determined by the proposed deep reinforcement learning-based object grasping approach. Since bananas and apples require a more accurate grasping point, it is grasping approach. Since bananas and apples require a more accurate grasping point, it not surprising that their rates of being successfully grasped are lower. is not surprising that their rates of being successfully grasped are lower. In summary, there are several interesting observations from the experimental results. In summary, there are several interesting observations from the experimental results. First of all, the suction nozzle used in this paper requires a smooth object surface to achieve First of all, the suction nozzle used in this paper requires a smooth object surface to successful grasping. That explains why apples and bananas have lower successful grasping achieve successful grasping. That explains why apples and bananas have lower successful rates. Secondly, without further training, the proposed approach exhibits decent grasping grasping rates. Secondly, without further training, the proposed approach exhibits decent performance, even for cases in which the objects of interest are previously unseen. Thirdly, grasping performance, even for cases in which the objects of interest are previously un- experimental results indicate that the SAC can be trained in the robot simulator and the seen. Thirdly, experimental results indicate that the SAC can be trained in the robot sim- trained SAC can be deployed to the real 6-DOF robot manipulator to successfully perform ulator and the trained SAC can be deployed to the real 6-DOF robot manipulator to suc- object grasping in the real world. cessfully perform object grasping in the real world. The next experiment was to grasp and classify all the objects randomly placed on the table and to put the grasped objects into the bin where they belonged. First of all, several objects were randomly placed on the table, after which YOLOv3 detected and classified all of the objects on the table. The SAC then provided information for the grasping points corresponding to all the objects of interest to the robot manipulator. The 6-DOF robot ma- nipulator then performed the grasping task and put the grasped objects into their respec- tive bins. Note that during the grasping process, the robot manipulator may collide with other objects so that their poses may change and result in grasping failures. In order to deal with the aforementioned problem, after performing the object grasping task, if some Machines 2023, 11, 275 17 of 19 The next experiment was to grasp and classify all the objects randomly placed on the table and to put the grasped objects into the bin where they belonged. First of all, several objects were randomly placed on the table, after which YOLOv3 detected and classified all of the objects on the table. The SAC then provided information for the grasping points corresponding to all the objects of interest to the robot manipulator. The 6-DOF robot manipulator then performed the grasping task and put the grasped objects into their Machines 2023, 11, x FOR PEER REVIEW respective bins. Note that during the grasping process, the robot manipulator may18 of collide 20 with other objects so that their poses may change and result in grasping failures. In order to deal with the aforementioned problem, after performing the object grasping task, if some objects remained on the table, the object grasping tasks were repeated until all of the objects objects remained on the table, the object grasping tasks were repeated until all of the ob- on the table had been grasped and correctly put into the bin. Figure 18 shows an image jects on the table had been grasped and correctly put into the bin. Figure 18 shows an sequence of the object grasping/classification experiment. image sequence of the object grasping/classification experiment. Figure 18. Image sequence of object grasping/classification experiment (a) original image (b) clas- Figure 18. Image sequence of object grasping/classification experiment (a) original image (b) classifi- sification results of YOLOv3. cation results of YOLOv3. 6. Conclusions 6. Conclusions This paper proposes an approach that combines YOLO and deep reinforcement This paper proposes an approach that combines YOLO and deep reinforcement learning learning SAC algorithms for the 6-DOF robot manipulator to perform object grasp- SAC algorithms for the 6-DOF robot manipulator to perform object grasping/classification thr ing/c ough lass self-learning. ification through In particular self-learn,ing. the Iobjects n particu oflar inter , the objec est in this ts of paper interest in are detected this paper by are detected by YOLOv3. By considering the fact that objects of the same type may have YOLOv3. By considering the fact that objects of the same type may have different colors, different colors, only their depth images provided by Kinect v2 are thus used as the inputs only their depth images provided by Kinect v2 are thus used as the inputs for the proposed deep for th reinfor e proposed deep r cement learning- einforcemen based object t learn grasping ing-base appr d object gr oach. Inasping appro this way, theaexploration ch. In this way, the exploration space can be substantially reduced so as to improve the success rate space can be substantially reduced so as to improve the success rate and enable SAC to conver and enab ge quickly le SAC .to conver Moreover ge , quick to spe led y. M up oreover the training , to speed process, up the a training V-REP r proce obot simulator ss, a V-REP is robot simulator is employed to construct a simulated environment to train the SAC. Sim- employed to construct a simulated environment to train the SAC. Simulation results indicate that ulation re the prsults ind oposed appr icateoach that the proposed can indeed ef approach fectively can reduce indeed the effec totalttraining ively redtime uce the to and tal the training time and the total number of grasping attempts compared with an approach that total number of grasping attempts compared with an approach that directly uses the entire image directas ly the uses the inputen state tirefor imthe age as the in deep reinfor put cement state for learning the deep re network. inforcement In addition, learning ne to further t- speed work. In add up the training ition, to fur process, ther speed the paradigms up the of trainin transfer g process, the paradigm learning and incremental s of tr learning ansfer learning and incremental learning are employed in the proposed approach. Moreover, the are employed in the proposed approach. Moreover, the trained SAC was transferred to a real trained SAC was tr 6-DOF robot manipulator ansferred to a for r re eal-world al 6-DOF verification. robot manipu Experimental lator for real-wor results ld ve indicate rificatithat on. Experimental results indicate that in using the proposed approach, the real 6-DOF robot manipulator successfully performed object grasping/classification, even for previously unseen objects. Author Contributions: Conceptualization, Y.-L.C., Y.-R.C. and M.-Y.C.; methodology, Y.-L.C. and Y.-R.C.; software, Y.-L.C. and Y.-R.C.; validation, Y.-L.C. and Y.-R.C.; formal analysis, Y.-L.C. and Y.-R.C.; writing original draft preparation, Y.-L.C., Y.-R.C. and M.-Y.C.; writing review and editing, M.-Y.C.; Project administration, M.-Y.C.; Funding acquisition, M.-Y.C.; supervision, M.-Y.C. All au- thors have read and agreed to the published version of the manuscript. Machines 2023, 11, 275 18 of 19 in using the proposed approach, the real 6-DOF robot manipulator successfully performed object grasping/classification, even for previously unseen objects. Author Contributions: Conceptualization, Y.-L.C., Y.-R.C. and M.-Y.C.; methodology, Y.-L.C. and Y.-R.C.; software, Y.-L.C. and Y.-R.C.; validation, Y.-L.C. and Y.-R.C.; formal analysis, Y.-L.C. and Y.-R.C.; writing original draft preparation, Y.-L.C., Y.-R.C. and M.-Y.C.; writing review and editing, M.-Y.C.; project administration, M.-Y.C.; Funding acquisition, M.-Y.C.; supervision, M.-Y.C. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the Ministry of Science and Technology, Taiwan, grant number MOST 108-2221-E-006-217-MY2. Data Availability Statement: The data presented in this study are available on request from the corresponding author. Conflicts of Interest: The authors declare no conflict of interest. References 1. Kyprianou, G.; Doitsidis, L.; Chatzichristofis, S.A. Collaborative Viewpoint Adjusting and Grasping via Deep Reinforcement Learning in Clutter Scenes. Machines 2022, 10, 1135. [CrossRef] 2. Johns, E.; Leutenegger, S.; Davison, A.J. Deep learning a grasp function for grasping under gripper pose uncertainty. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, Daejeon, Republic of Korea, 9–14 October 2016; pp. 4461–4468. [CrossRef] 3. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [CrossRef] 4. Pinto, L.; Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 3406–3413. [CrossRef] 5. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, 37, 421–436. [CrossRef] 6. Mahler, J.; Pokorny, F.T.; Hou, B.; Roderick, M.; Laskey, M.; Aubry, M.; Kohlhoff, K.; Kröger, T.; Kuffner, J.; Goldberg, K. Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 1957–1964. [CrossRef] 7. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv 2017, arXiv:1703.09312. [CrossRef] 8. Mahler, J.; Matl, M.; Liu, X.; Li, A.; Gealy, D.; Goldberg, K. Dex-Net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation, Brisbane, QLD, Australia, 21–25 May 2018; pp. 5620–5627. [CrossRef] 9. Mahler, J.; Matl, M.; Satish, V.; Danielczuk, M.; DeRose, B.; McKinley, S.; Goldberg, K. Learning ambidextrous robot grasping policies. Sci. Robot. 2019, 4, eaau4984. [CrossRef] 10. Zhang, H.; Peeters, J.; Demeester, E.; Kellens, K. A CNN-Based Grasp Planning Method for Random Picking of Unknown Objects with a Vacuum Gripper. J. Intell. Robot. Syst. 2021, 103, 1–19. [CrossRef] 11. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [CrossRef] 12. Fang, K.; Zhu, Y.; Garg, A.; Kurenkov, A.; Mehta, V.; Li, F.F.; Savarese, S. Learning task-oriented grasping for tool manipulation from simulated self-supervision. Int. J. Robot. Res. 2020, 39, 202–216. [CrossRef] 13. Ji, X.; Xiong, F.; Kong, W.; Wei, D.; Shen, Z. Grasping Control of a Vision Robot Based on a Deep Attentive Deterministic Policy Gradient. IEEE Access 2021, 10, 867–878. [CrossRef] 14. Horng, J.R.; Yang, S.Y.; Wang, M.S. Self-Correction for Eye-In-Hand Robotic Grasping Using Action Learning. IEEE Access 2021, 9, 156422–156436. [CrossRef] 15. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to train your robot with deep reinforcement learning: Lessons we have learned. Int. J. Robot. Res. 2021, 40, 698–721. [CrossRef] 16. Gualtieri, M.; Ten Pas, A.; Platt, R. Pick and place without geometric object models. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation, Brisbane, QLD, Australia, 21–25 May 2018; pp. 7433–7440. [CrossRef] 17. Fujita, Y.; Uenishi, K.; Ummadisingu, A.; Nagarajan, P.; Masuda, S.; Castro, M.Y. Distributed reinforcement learning of targeted grasping with active vision for mobile manipulators. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9712–9719. [CrossRef] 18. Zeng, A.; Song, S.; Welker, S.; Lee, J.; Rodriguez, A.; Funkhouser, T. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 4238–4245. [CrossRef] Machines 2023, 11, 275 19 of 19 19. Deng, Y.; Guo, X.; Wei, Y.; Lu, K.; Fang, B.; Guo, D.; Liu, H.; Sun, F. Deep reinforcement learning for robotic pushing and picking in cluttered environment. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, 3–8 November 2019; pp. 619–626. [CrossRef] 20. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. QT-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv 2018, arXiv:1806.10293. [CrossRef] 21. Chen, R.; Dai, X.Y. Robotic grasp control policy with target pre-detection based on deep q-learning. In Proceedings of the 2018 3rd International Conference on Robotics and Automation Engineering, Guangzhou, China, 17–19 November 2018; pp. 29–33. [CrossRef] 22. Chen, Z.; Lin, M.; Jia, Z.; Jian, S. Towards generalization and data efficient learning of deep robotic grasping. arXiv 2020, arXiv:2007.00982. [CrossRef] 23. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [CrossRef] 24. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [CrossRef] 25. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [CrossRef] 26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [CrossRef] 27. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. 28. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2019, arXiv:1812.05905. [CrossRef] 29. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [CrossRef] 30. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [CrossRef] 31. Cai, C.; Somani, N.; Nair, S.; Mendoza, D.; Knoll, A. Uncalibrated stereo visual servoing for manipulators using virtual impedance control. In Proceedings of the 13th International Conference on Control Automation Robotics & Vision, Singapore, 10–12 December 2014; pp. 1888–1893. 32. Peng, X.B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018. [CrossRef] Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Machines Multidisciplinary Digital Publishing Institute

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach

Machines , Volume 11 (2) – Feb 12, 2023

Loading next page...
 
/lp/multidisciplinary-digital-publishing-institute/vision-based-robotic-object-grasping-a-deep-reinforcement-learning-W61NED6yc2

References (32)

Publisher
Multidisciplinary Digital Publishing Institute
Copyright
© 1996-2024 MDPI (Basel, Switzerland) unless otherwise stated Disclaimer Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. Terms and Conditions Privacy Policy
ISSN
2075-1702
DOI
10.3390/machines11020275
Publisher site
See Article on Publisher Site

Abstract

machines Article Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach Ya-Ling Chen, Yan-Rou Cai and Ming-Yang Cheng * Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan * Correspondence: [email protected] Abstract: This paper focuses on developing a robotic object grasping approach that possesses the ability of self-learning, is suitable for small-volume large variety production, and has a high success rate in object grasping/pick-and-place tasks. The proposed approach consists of a computer vision- based object detection algorithm and a deep reinforcement learning algorithm with self-learning capability. In particular, the You Only Look Once (YOLO) algorithm is employed to detect and classify all objects of interest within the field of view of a camera. Based on the detection/localization and classification results provided by YOLO, the Soft Actor-Critic deep reinforcement learning algorithm is employed to provide a desired grasp pose for the robot manipulator (i.e., learning agent) to perform object grasping. In order to speed up the training process and reduce the cost of training data collection, this paper employs the Sim-to-Real technique so as to reduce the likelihood of damaging the robot manipulator due to improper actions during the training process. The V-REP platform is used to construct a simulation environment for training the deep reinforcement learning neural network. Several experiments have been conducted and experimental results indicate that the 6-DOF industrial manipulator successfully performs object grasping with the proposed approach, even for the case of previously unseen objects. Keywords: 6-DOF industrial manipulator; deep reinforcement learning; soft actor-critic; robotic object grasping; YOLO Citation: Chen, Y.-L.; Cai, Y.-R.; Cheng, M.-Y. Vision-Based Robotic Object Grasping—A Deep 1. Introduction Reinforcement Learning Approach. In most conventional approaches for vision-based pick-and-place tasks of industrial Machines 2023, 11, 275. manipulators used in a production line, a 3D model of the object to be grasped must be https://doi.org/10.3390/ known in advance. With the known 3D model, one can either analyze the geometric shape machines11020275 of the object and find a proper way for the industrial manipulator to grasp that object or Academic Editor: Shabnam exploit methods such as feature matching and shape recognition to find an appropriate Sadeghi Esfahlani pose for the industrial manipulator to perform object grasping as well as pick-and-place tasks. However, this kind of approach is sensitive to the illumination conditions and other Received: 5 January 2023 types of disturbances in the ambient environment. If the 3D model of the object to be Revised: 8 February 2023 Accepted: 10 February 2023 grasped is not known in advance or if there are a variety of objects to be grasped, the Published: 12 February 2023 aforementioned conventional approaches may fail. With the machine learning paradigm becoming popular, more and more research has been focused on applying the deep learning technique to deal with automatic object grasping tasks [1]. For example, Johns et al. [2] used a deep neural network to predict a score for the grasp pose of a parallel jaw gripper Copyright: © 2023 by the authors. for each object in a depth image, through which a physical simulator was employed to Licensee MDPI, Basel, Switzerland. obtain simulated depth images of objects as the training data set. Lenz et al. [3] used two This article is an open access article deep neural networks to detect robotic grasps from images captured by an RGBD camera. distributed under the terms and One deep neural network having a simpler structure and requiring fewer computation conditions of the Creative Commons resources was mainly used to retrieve candidate bounding rectangles for grasping. Another Attribution (CC BY) license (https:// deep neural network was used to rank the candidate bounding rectangles for a parallel creativecommons.org/licenses/by/ gripper [3]. In [4], 700 h were spent collecting data from 50,000 grasping attempts of robot 4.0/). Machines 2023, 11, 275. https://doi.org/10.3390/machines11020275 https://www.mdpi.com/journal/machines Machines 2023, 11, 275 2 of 19 manipulators, and a Convolutional Neural Network (CNN) was combined with a multi- stage learning approach to predict an appropriate grasping pose for robot manipulators. In [5], Levine et al. exploited the deep learning paradigm to train fourteen 7-DOF robot manipulators to perform object grasping using RGB images. A total of 800,000 grasp attempts by robot manipulators were recorded within two months to train the deep neural network. Experimental results indicate that the robot manipulator can successfully grasp 1100 objects of different sizes and shapes. Goldberg and his colleagues have done a series of studies on robot grasping using deep learning [6–9]. Mahler et al. proposed the Dex-Net 1.0 deep learning system for robot grasping [6]. More than 10,000 independent 3D models and 2.5 million samples of grasping data for the parallel gripper are used in Dex-Net 1.0. In order to shorten the training time, 1500 virtual cores in the Google cloud platform are used. In 2019, Mahler et al. proposed the Dex-Net 4.0 [9], for which five million depth images had been trained by a GQ-CNN. After the training is complete, the dual arm robot with a suction nozzle and a parallel-jaw gripper is able to empty a bin with an average grasping rate of 300 objects/hour [9]. Several past studies utilized CNN to produce suitable grasping poses to perform object grasping tasks [10–12]. All of the aforementioned studies demonstrated good performance in automatic object grasping, even for cases in which the objects to be grasped did not appear in the training data set. However, the subjects of these past studies all have common drawbacks, in that they are very time consuming and not cost effective in generating a grasping data set for training the deep neural network. Recently, the research topic of exploiting reinforcement learning in training robot manipulators to perform object grasping has received much attention [13–15]. Gualtieri et al. used deep reinforcement learning algorithms to solve the robotic pick-and-place problems for cases in which the geometrical models of the objects to be grasped are unknown [16]. In particular, using the deep reinforcement learning algorithm, the robot manipulator is able to determine a suitable pose (i.e., optimal action) to grasp certain types of objects. In [17], the image of an arbitrary pose of an object is used as an input to a distributed reinforcement learning algorithm. After learning, the robot is able to perform grasping tasks for objects that are either occluded or previously unseen. Deep reinforcement learning is also used in training robotic pushing/grasping/picking [18,19]. Kalashnikov et al. developed the QT-Opt algorithm and focused on scalable, off-policy deep reinforcement learning [20]. Seven real robot manipulators were used to perform and record more than 580 k grasp attempts for training a deep neural network. Once the learning process is complete, the real robot manipulator can successfully perform grasping, even for previously unseen objects [20]. Chen and Dai used a CNN to detect the image features of an object. Based on the detected image features of the object of interest, a deep Q-learning algorithm was used to determine the grasp pose corresponding to that object [21]. In [22], Chen et al. used a Mask R-CNN and PCA to estimate the 3D pose of objects to be grasped. Based on the estimated 3D object pose, a deep reinforcement learning algorithm is employed to train the control policy in a simulated environment. Once the learning process is complete, one can deploy the learned model to the real robot manipulator without further training. This paper proposes an object grasping approach that combines the YOLO algo- rithm [23–26] and the Soft Actor-Critic (SAC) algorithm [27,28]. It is well known that YOLO is capable of rapidly detecting, localizing and recognizing objects in an image. In particular, YOLO can find the location of the object of interest inside the field of view of a camera and use this location information as the input to a reinforcement learning algorithm. Since the search of an entire image in not essential, training time can therefore be substantially reduced. SAC is based on the Actor-Critic framework and exploits Off-Policy to improve the sample efficiency. SAC maximizes the expected return as well as the entropy of policy simultaneously. Since SAC exhibits excellent performance and is suitable for real-world applications, this paper employs SAC to train the robot manipulator to perform object grasping through self-learning. Machines 2023, 11, x FOR PEER REVIEW 3 of 20 Machines 2023, 11, 275 3 of 19 applications, this paper employs SAC to train the robot manipulator to perform object grasping through self-learning. 2. Framework 2. Framework This This pa paper per develops a develops a robot robotic ic object object gra grasping sping techn technique ique th that at comb combines ines compute computer r vi- vision-based sion-based ob object ject de detection/r tection/recogni ecognition/localization tion/localization anand d a d aeep re deep inforcement reinforcement lear learning ning al- algorithm with self-learning capability. Figure 1 shows the schematic diagram of the gorithm with self-learning capability. Figure 1 shows the schematic diagram of the robotic robotic pick-and-place system developed in this paper. As shown in Figure 1, YOLO will pick-and-place system developed in this paper. As shown in Figure 1, YOLO will detect detect the object of interest from the image captured by the camera. SAC will provide the the object of interest from the image captured by the camera. SAC will provide the desired desired grasping point in the image plane based on the depth image information of the grasping point in the image plane based on the depth image information of the object object bounding box. The grasping point on the 2D-image plane is converted to a desired bounding box. The grasping point on the 2D-image plane is converted to a desired 6D 6D grasping pose in the Cartesian space so as to control the robot manipulator to grasp grasping pose in the Cartesian space so as to control the robot manipulator to grasp objects objects of interest and place them at a desired position. The system will return the reward of interest and place them at a desired position. The system will return the reward infor- information based on the reward mechanism. mation based on the reward mechanism. Figure 1. Schematic diagram of robotic pick-and-place based on computer vision and deep Figure 1. Schematic diagram of robotic pick-and-place based on computer vision and deep rein- reinforcement learning. forcement learning. 3. Object Recognition and Localization Based on YOLO Algorithms 3. Object Recognition and Localization Based on YOLO Algorithms In computer vision-based object recognition/localization applications, many past stud- In computer vision-based object recognition/localization applications, many past ies have adopted a two-step approach. The first step focuses on detecting and segmenting studies have adopted a two-step approach. The first step focuses on detecting and seg- out the region that contains objects of interest within the image. The second step proceeds menting out the region that contains objects of interest within the image. The second step to object recognition/localization based on the region detected in the first step. Such an proceeds to object recognition/localization based on the region detected in the first step. approach often consumes enormous computation resources and time. Unlike the two-step Such an approach often consumes enormous computation resources and time. Unlike the approach, YOLO can simultaneously detect and recognize objects of interest [23–26]. The two-step approach, YOLO can simultaneously detect and recognize objects of interest [23– schematic diagram of the YOLO employed in this paper is shown in Figure 2, where 26]. The schematic diagram of the YOLO employed in this paper is shown in Figure 2, “Input” is the image input, “Conv” is the convolution layer, “Res_Block” is the residual where “Input” is the image input, “Conv” is the convolution layer, “Res_Block” is the block, and “Upsample” is the upsampling of image features. YOLO uses the Darknet-53 residual block, and “Upsample” is the upsampling of image features. YOLO uses the network structure to extract image features. In general, Darknet-53 consists of a series of Darknet-53 network structure to extract image features. In general, Darknet-53 consists of 1  1 and 3  3 convolution layers. Each convolution layer has a Leaky ReLU activation a series of 1 × 1 and 3 × 3 convolution layers. Each convolution layer has a Leaky ReLU function, a batch normalization unit and a residual block to cope with the problem of gradient disappearance/explosion caused by the large number of layers in the deep neural network. In addition, to improve the detection accuracy of small objects, YOLO adopts the Machines 2023, 11, x FOR PEER REVIEW 4 of 20 activation function, a batch normalization unit and a residual block to cope with the prob- Machines 2023, 11, 275 4 of 19 lem of gradient disappearance/explosion caused by the large number of layers in the deep neural network. In addition, to improve the detection accuracy of small objects, YOLO adopts the Feature Pyramid Network structure to perform multi-scale detection. The im- Feature Pyramid Network structure to perform multi-scale detection. The image input after age input after processing by the Darknet-53 will output three different sizes of image processing by the Darknet-53 will output three different sizes of image features—13  13, features—13 × 13, 26 × 26 and 52 × 52. Object detection will be performed on these image 26  26 and 52  52. Object detection will be performed on these image features and the features and the anchor box will then be equally distributed to three outputs. The final anchor box will then be equally distributed to three outputs. The final detection results will detection results will be the sum of the detection results of these three image features of be the sum of the detection results of these three image features of different sizes. different sizes. Figure 2. Schematic diagram of YOLO. Figure 2. Schematic diagram of YOLO. 4. Object Pick-and-Place Policy Based on SAC Algorithms 4. Object Pick-and-Place Policy Based on SAC Algorithms SAC is a deep reinforcement learning algorithm [27,28] that can enable a robot to SAC is a deep reinforcement learning algorithm [27,28] that can enable a robot to learn in the real world. The attractive features of SAC include: (1) it is based on the learn in the real world. The attractive features of SAC include: (1) it is based on the Actor- Actor-Critic framework; (2) it can learn based on past experience, i.e., off-policy, to achieve Critic framework; (2) it can learn based on past experience, i.e., off-policy, to achieve im- improved efficiency in sample usage; (3) it belongs to the category of Maximum Entropy proved efficiency in sample usage; (3) it belongs to the category of Maximum Entropy Reinforcement Learning and can improve stability and exploration; and (4) it requires Reinforcement Learning and can improve stability and exploration; and (4) it requires fewer parameters. fewer parameters. In this paper, both the state and action are defined in the continuous space. Therefore, SAC uses neural networks to parametrize the soft-action value function and the policy function as Q (s , a ) and p (a s ) , respectively. A total of five neural networks are q t t f t t constructed—two soft action-value networks Q (s , a ) and Q (s , a ); two target soft t t t t q q 1 2 0 0 action-value networks Q (s , a ) and Q (s , a ); and one policy network p (a s ) , 0 0 t t t t f t t q q 1 2 Machines 2023, 11, x FOR PEER REVIEW 5 of 20 In this paper, both the state and action are defined in the continuous space. Therefore, SAC uses neural networks to parametrize the soft-action value function and the policy function as and π (|as) , respectively. A total of five neural networks are Qs(,a ) θ tt φ tt Qs(,a) Qs(,a) constructed—two soft action-value networks and ; two target soft ac- θ tt θ tt 1 2 Machines 2023, 11, 275 5 of 19 tion-value networks Qs ′ (,a) and Qs ′ (,a) ; and one policy network , where π (|as ) θ tt tt φ tt 1 ′ ′ θ θ 1 2 ′ ′ , , θ , θ and are the parameter vectors of the neural networks as shown in Figure θ φ 2 1 2 0 0 wher 3. In p e qart ,qicu ,qla,rq, the and polic f ar y f e uthe nction parameter and the so vectors ft action of-va the lue neural functnetworks ion are theas acshown tor and in the 1 2 1 2 Figur critic e in 3. the A In particular ctor-Cri ,tthe ic frpolicy amework, re function spective and the ly. Und soft action-value er state s, the soft function action- are vthe alue f actor unc- and tion w the i critic ll outp inut the the Actor exp-Critic ected rewa framework, rd respectively for sele .ct Under ing acstate tion as, , the thus g soft uidin action-value g the pol- Qs(,a ) θ tt function will output the expected reward Q (s , a ) for selecting action a, thus guiding icy function to learn. Based on theq cur t ret nt state, the policy function will output π (|as ) φ tt the policy function p (a s ) to learn. Based on the current state, the policy function f t t an action to yield the system state for the next moment. By repeating these procedures, will output an action to yield the system state for the next moment. By repeating these one can collect past experience to be used in training the soft action-value function. Since procedures, one can collect past experience to be used in training the soft action-value SAC is a random policy, the outputs of SAC are therefore the mean and standard devia- function. Since SAC is a random policy, the outputs of SAC are therefore the mean and tion of probability distribution of the action space. standard deviation of probability distribution of the action space. Figure 3. Neural network architecture of SAC. Figure 3. Neural network architecture of SAC. The objective function for the Soft Action-Value Network is described by Equation (1), The objective function for the Soft Action-Value Network is described by Equation while Equation (2) is the learning target. The Mean-Square Error (MSE) is employed (1), while Equation (2) is the learning target. The Mean-Square Error (MSE) is employed to update the network parameters. The action-value network Q (s , a ) and the target t t to update the network parameters. The action-value network and the target ac- Qs(,a ) θ tt action-value network Q (s , a ) have the same network structure. The action-value network t t tion-value network ′ have the same network structure. The action-value network Qs(,a ) tt is used to predict the expected reward for executing action a under state s. The target is used to predict the expected reward for executing action a under state s. The target ac- action-value network is used to update the target so as to help train the action-value tion-value network is used to update the target so as to help train the action-value net- network. During training, only the action-value network will be trained, while the target work. During training, only the action-value network will be trained, while the target ac- action-value network will remain unchanged. In short, the target will change if the target tion-value network will remain unchanged. In short, the target will change if the target action-value network updates, which will make it difficult for the learning of the neural action-value network updates, which will make it difficult for the learning of the neural network to converge. network to converge. 1 2 J (q) = E (Q (s , a ) Q(s , a )) (1) t t t t Q (s ,a )D q t t 1  JE () θ=− (Q (s ,a )Q(s ,a )) (1) Q (,s a )~D θ tt tt  tt  Q(s , a ) = r(s , a )+gE [V 0(s )]  (2) t t t t s p t+1 t+1 In this paper, the Stochastic Gradient ˆ Descent (SGD) method is employed to calculate the Qs ( ,a )=r (s ,a )+γE [V (s )] ′ (2) tt tt s~1 p θ t + t +1 derivative of the objective function, as described by Equation (3): r J (q) = r Q (s , a )(Q (s , a ) r(s , a ) + g(Q 0(s , a ) a log p (a js )))) (3) q Q q q t t q t t t t q t+1 t+1 f t+1 t+1 The weights of the target soft action value network are updated using Equation (4), where t is a constant: 0 0 q tq + (1 t)q (4) t+1 t t The objective function of the policy network is described by Equation (5). To improve the policy, one should maximize the sum of action value and entropy: Machines 2023, 11, x FOR PEER REVIEW 6 of 20 In this paper, the Stochastic Gradient Descent (SGD) method is employed to calculate the derivative of the objective function, as described by Equation (3): ∇=JQ ()θγ ∇ (s ,a )(Q (s ,a )−r(s ,a )+ (Q (s ,a )−αlogπ (a |s )))) (3) θθQtθ t θtt tt θ ′t++ 11t φt+1t+1 The weights of the target soft action value network are updated using Equation (4), where 𝜏 is a constant: ′′ θτ←+θ (1−τ )θ (4) tt + 1 t The objective function of the policy network is described by Equation (5). To improve the policy, one should maximize the sum of action value and entropy: Machines 2023, 11, 275 6 of 19 J (φ)=− Ea [απ log( ( |s ))Q (s ,a )] ππ s ~, Da ~ φ t t θ t t tt θ (5) af = (; ε s ) tt φ t J (f) = E [a log(p (a s )) Q (s , a )] p s D,a p f t t q t t t t (5) a = f (# ; s ) t f t t where 𝜀 is the noise and Equation (5) can be rewritten as Equation (6): where # is the noise and Equation (5) can be rewritten as Equation (6): J (φ)=− Ef [απ log( ( (;εs )|s ))Q (s,f (;εs ))] (6) πε sD ~, ~N φφ t t t θ tφ t t tt J (f) = E [a log(p ( f (# ; s ) s )) Q (s , f (# ; s ))] (6) p s D,# N f f t t t q t f t t t t The derivative of the objective function of the policy network is described by Equa- The derivative of the objective function of the policy network is described by tion (7): Equation (7): ˆ ˆ r J (f) = r a log(p (a js )) + (r a log(p (a js )) Q(s , a ))r f ( # ; s ) (7) f p ∇= J () fφα∇ lfog(πt (tas | ))+(∇ a αlog(π f(as t| t))−Q(s ,a t))∇t f (εf;sf) t t (7) φπ φ φ tt ta φ t t t t φ φ t t The SAC reinforcement learning algorithm is illustrated in Figure 4. The SAC reinforcement learning algorithm is illustrated in Figure 4. Soft Actor-Critic Input: θθ,, φ 1. Initial parameters ′′ θθ←← ,θ θ 2. Initial target network weights 11 2 2 D ←∅ 3. Initial empty replay buffer 4. for each iteration do 5. for each environment step do aa ~( π |s) tt φ t 6. Sample action from the policy s ~(ps | s ,a ) 7. tt ++ 11 t t Sample transition from the environment DD ←∪ (,s a ,r(s ,a ),s ) { } 8. tt tt t +1 Store the transition in the replay buffer 9. end for 10. for each gradient step do θθ ←−λ ∇Ji (θ ) for ∈ 1, 2 { } ii Q θ Q i 11. Update the Q-function parameters φφ ←− λ ∇ J () φ πφ π 12. Update policy weights ′′ ′ θτ←+θ (1−τ )θ for i∈{1, 2} ii i 13. Update target network weights 14. end for 15. end for Output: θθ,, φ 16. Optimized parameters Figure 4. SAC reinforcement learning algorithm. Figure 4. SAC reinforcement learning algorithm. 4.1. Policy This paper applies SAC to robotic object grasping. The learning agent is the 6-DOF robot manipulator, while the policy output is the coordinate (u,v) of the object grasping point on the image plane. The state, action and reward mechanism are designed as follows. 4.1.1. State (State s) By exploiting YOLO, one can detect the objects of interest. The state of the SAC algorithm is defined to be the depth image of the object of interest. The state input designed in this paper is the depth information. Therefore, after obtaining the position of the object of interest in the RGB image, one needs to find its corresponding position in the depth Machines 2023, 11, x FOR PEER REVIEW 7 of 20 4.1. Policy This paper applies SAC to robotic object grasping. The learning agent is the 6-DOF robot manipulator, while the policy output is the coordinate (u,v) of the object grasping point on the image plane. The state, action and reward mechanism are designed as fol- lows. 4.1.1. State (State s) By exploiting YOLO, one can detect the objects of interest. The state of the SAC algo- Machines 2023, 11, 275 rithm is defined to be the depth image of the object of interest. The state input designed 7 of 19 in this paper is the depth information. Therefore, after obtaining the position of the object of interest in the RGB image, one needs to find its corresponding position in the depth image. Note that this depth image will be scaled to a size of 64 × 64. To be precise, the state image. Note that this depth image will be scaled to a size of 64 64. To be precise, the state used in this paper is a 64 × 64 × 1 depth image as shown in Figure 5. used in this paper is a 64  64  1 depth image as shown in Figure 5. Figure 5. Illustrative diagram of state acquisition. Figure 5. Illustrative diagram of state acquisition. 4.1.2. 4.1.2. Act Action ion ( (Action Action a) a) The The action of action of SAC SAC is define is defined d to be the to be thein input put displacement vector o displacement vectorfof the objec the object t of in- of terest on the image plane as described by Equation (8), for which its unit is a pixel. The interest on the image plane as described by Equation (8), for which its unit is a pixel. The length length and andwidth widthof of the the bo bounding unding box box o obtained btained by by YOLO YOLO ar are e den denoted oted as as x x and and y y,, res respec- pec- tively tively. In . In addition, addition, the the coor coordinate dinate o offthe the center center o offthe the bounding bounding bo box x is deno is denoted ted as as ( (u𝑢 ,,𝑣v ). ). c c Equation (9) gives the displacement vector of the object of interest on the image plane Equation (9) gives the displacement vector of the object of interest on the image plane corresponding to the action by the SAC. The coordinates of the object grasping point on the corresponding to the action by the SAC. The coordinates of the object grasping point on image plane as shown in Figure 6 are calculated using Equation (10). With the calculated the image plane as shown in Figure 6 are calculated using Equation (10). With the calcu- image coordinates of the object grasping point, by using coordinate transformation, depth lated image coordinates of the object grasping point, by using coordinate transformation, information and inverse kinematics, one can obtain the joint command for the 6-DOF robot depth information and inverse kinematics, one can obtain the joint command for the 6- manipulator to perform object grasping. DOF robot manipulator to perform object grasping. a 2 [1, 1] a ∈− [1,1] a = (a , a ), (8) 1 2 aa =(,a ), a2 [1, 1] (8) 12 2 a ∈− [1,1]  2 Du = 1 + (a  x/2 0.99) Machines 2023, 11, x FOR PEER REVIEW 1 8 of 20 (9) Δ=ua 1( + */x 2−0.99) Dv = 1 + (a  1 y/2 0.99) (9) Δ=va 1+( */y 20 − .99) u = Du + u (10) v = Dv + v uu =Δ +u (10) vv =Δ +v Figure 6. The displacement vector and the object grasping point on the image plane. Figure 6. The displacement vector and the object grasping point on the image plane. 4.1.3. Reward (Reward, r) A positive reward of 1 will be given if a successful object grasping occurs. In contrast, a negative reward −0.1 (i.e., penalty) will be given if failure occurs. As a result, the accu- mulated reward for an episode will be negative if the first ten attempts of object grasping fail. In order to help the learning agent find the optimal object grasping point as soon as possible, an extra positive reward 0.5 will be given if the first attempt of object grasping is successful. In addition, two termination conditions are adopted for the learning of SAC. To prevent the learning agent from continuously learning the wrong policy, if none of the first 100 object grasping attempts is successful, this episode will be terminated immedi- ately. In addition, when the learning agent successfully performs object grasping, this ep- isode will also be terminated immediately. The reward mechanism is described by Equa- tion (11). +1 , if successful r== +1.5 , if successful and the number of attempts in object grasping1  (11) −0.1 , for each failure attempt in object grasping 4.2. Architecture Design of SAC Neural Network Since state s adopted in this paper is a 64 × 64 × 1 depth image, a CNN is amended to the SAC so that the SAC can learn directly from the depth image. The hyperparameters of SAC are listed in Table 1 and its network architecture is shown in Figure 7. The input to the policy network is the depth image of the object of interest as detected by YOLO. The inputs to the soft action-value network and the target soft action-value network are com- prised of the depth image of the object of interest as detected by YOLO and the policy outputted by the policy network. As shown in Figure 7, the policy network, the soft action- value network and the target soft action-value network all consist of three CNNs and four full connected neural networks. The activation functions used in the soft action-value net- work and the target soft action-value network are ReLU. As for the policy network, the activation functions for the three CNNs and the first three full connected neural networks are ReLU. The output of the last layer of the policy network is the displacement vector on the image plane, having both positive and negative values. Therefore, the hyperbolic tan- gent function (i.e., Tanh) is chosen as the activation function for the last layer of the policy network. Note that the three CNNs and the first fully connected neural network are used to extract image features. Machines 2023, 11, 275 8 of 19 4.1.3. Reward (Reward, r) A positive reward of 1 will be given if a successful object grasping occurs. In contrast, a negative reward 0.1 (i.e., penalty) will be given if failure occurs. As a result, the accumulated reward for an episode will be negative if the first ten attempts of object grasping fail. In order to help the learning agent find the optimal object grasping point as soon as possible, an extra positive reward 0.5 will be given if the first attempt of object grasping is successful. In addition, two termination conditions are adopted for the learning of SAC. To prevent the learning agent from continuously learning the wrong policy, if none of the first 100 object grasping attempts is successful, this episode will be terminated immediately. In addition, when the learning agent successfully performs object grasping, this episode will also be terminated immediately. The reward mechanism is described by Equation (11). +1 , if successful r = +1.5 , if successful and the number of attempts in object grasping = 1 (11) 0.1 , for each failure attempt in object grasping 4.2. Architecture Design of SAC Neural Network Since state s adopted in this paper is a 64 64 1 depth image, a CNN is amended to the SAC so that the SAC can learn directly from the depth image. The hyperparameters of SAC are listed in Table 1 and its network architecture is shown in Figure 7. The input to the policy network is the depth image of the object of interest as detected by YOLO. The inputs to the soft action-value network and the target soft action-value network are comprised of the depth image of the object of interest as detected by YOLO and the policy outputted by the policy network. As shown in Figure 7, the policy network, the soft action-value network and the target soft action-value network all consist of three CNNs and four full connected neural networks. The activation functions used in the soft action-value network and the target soft action-value network are ReLU. As for the policy network, the activation functions for the three CNNs and the first three full connected neural networks are ReLU. The output of the last layer of the policy network is the displacement vector on the image plane, having both positive and negative values. Therefore, the hyperbolic tangent function (i.e., Tanh) is chosen as the activation function for the last layer of the policy network. Note that the three CNNs and the first fully connected neural network are used to extract image features. Table 1. Hyperparameters of SAC neural network. Hyperparameter Title 2 optimizer Adam learning rate 0.001 replay buffer size 200,000 batch size 64 discount factor ( ) 0.99 target smoothing coefficient () 0.005 entropy temperature parameter ( ) 0.01 Machines 2023, 11, x FOR PEER REVIEW 9 of 20 Machines 2023, 11, 275 9 of 19 (a) (b) (c) Figure 7. Architecture of SAC neural network. (a) Policy Network; (b) Soft Action-Value Network; Figure 7. Architecture of SAC neural network. (a) Policy Network; (b) Soft Action-Value Network; (c) Target Soft Action-Value Network. (c) Target Soft Action-Value Network. Table 1. Hyperparameters of SAC neural network. 5. Experimental Setup and Results The real experimental environment used in this paper is shown in Figure 8a, while Hyperparameter Title 2 Figure 8b shows the simulated environment constructed using the simulation platform optimizer Adam V-REP. The simulated environment is mainly used to train and test the deep neural network. learning rate 0.001 The 6-DOF A7 industrial articulated robot manipulator used in the real experiment is replay buffer size 200,000 manufactured by ITRI. The Mitsubishi AC servomotors installed at each joint of the robot batch size 64 manipulator are equipped with absolute encoders and are set to torque mode. A vacuum discount factor (γ) 0.99 sucker (maximum payload 3 kg) manufactured by Schmalz is mounted on the end-effector target smoothing coefficient (τ) 0.005 of the robot manipulator. The vision sensor used in the experiment is a Kinect v2 RGBD entropy temperature parameter (α) 0.01 camera (30 Hz frame rate) manufactured by Microsoft. The maximum resolution for the RGB camera is 1920  1080 pixels, while the maximum resolution for the depth 5. Experimental Setup and Results camera is 512  424 pixels. The Kinect v2 camera is located at the upper right side of the 6-DOF robot manipulator to capture the images of the objects. These object images will The real experimental environment used in this paper is shown in Figure 8a, while be used for YOLO to classify their categories. Two desktop computers are used in the Figure 8b shows the simulated environment constructed using the simulation platform V- experiment. The computer for controlling the 6-DOF robot manipulator and the vacuum REP. The simulated environment is mainly used to train and test the deep neural network. sucker is equipped with Intel(R) Core TM i7-2600 CPU @3.40 Ghz and 12 GB RAM. It runs The 6-DOF A7 industrial articulated robot manipulator used in the real experiment is under Microsoft Windows 7 and uses Microsoft Visual Studio 2015 as its programming manufactured by ITRI. The Mitsubishi AC servomotors installed at each joint of the robot development platform. The computer responsible for computer vision, the training of the manipulator are equipped with absolute encoders and are set to torque mode. A vacuum sucker (maximum payload 3 kg) manufactured by Schmalz is mounted on the end-effector Machines 2023, 11, x FOR PEER REVIEW 10 of 20 of the robot manipulator. The vision sensor used in the experiment is a Kinect v2 RGBD camera (30 Hz frame rate) manufactured by Microsoft. The maximum resolution for the RGB camera is 1920 × 1080 pixels, while the maximum resolution for the depth camera is 512 × 424 pixels. The Kinect v2 camera is located at the upper right side of the 6-DOF robot manipulator to capture the images of the objects. These object images will be used for YOLO to classify their categories. Two desktop computers are used in the experiment. The computer for controlling the 6-DOF robot manipulator and the vacuum sucker is Machines 2023, 11, 275 10 of 19 equipped with Intel(R) Core TM i7-2600 CPU @3.40 Ghz and 12 GB RAM. It runs under Microsoft Windows 7 and uses Microsoft Visual Studio 2015 as its programming devel- opment platform. The computer responsible for computer vision, the training of the deep deep reinforcement learning network, and the V-REP robot simulator is equipped with a reinforcement learning network, and the V-REP robot simulator is equipped with a NVIDIA GeForce RTX 2080 Ti and 26.9 GB RAM. It runs under Microsoft Windows 10 and NVIDIA GeForce RTX 2080 Ti and 26.9 GB RAM. It runs under Microsoft Windows 10 uses PyCharm as its development platform. The Python and the tool kit of the PyTorch are and uses PyCharm as its development platform. The Python and the tool kit of the used in training the deep reinforcement learning network. PyTorch are used in training the deep reinforcement learning network. (a) (b) Figure 8. Experimental and simulated environment: (a) real experimental environment; (b) simu- Figure 8. Experimental and simulated environment: (a) real experimental environment; (b) simu- lated environment. lated environment. 5.1. Training Results of YOLO 5.1. Training Results of YOLO As shown in Figure 9, the objects of interest used in the experiment included apples, As shown in Figure 9, the objects of interest used in the experiment included apples, Machines 2023, 11, x FOR PEER REVIEW 11 of 20 oranges, a banana, a cup, a box and building blocks. oranges, a banana, a cup, a box and building blocks. Figure 9. Objects of interest used in the experiment. Figure 9. Objects of interest used in the experiment. The COCO Dataset was used to train the YOLOv3 in this paper. However, the COCO Dataset does not include objects such as the building blocks used in the experiment. As a result, it was necessary to collect a training data set for the building blocks. In particular, a total of 635 images of the building blocks were taken. The transfer learning technique [29] was employed in this paper to speed up the training process, in which the weights provided by the authors of YOLO were adopted as the initial weights for training the YOLOv3. Figure 10 shows the training results of YOLO. The total number of iterations was 45,000. The value of the loss function converged to 0.0391. To test the performance of the trained YOLOv3, several objects were randomly placed on the table, with the detection results shown in Figure 11. Clearly, YOLOv3 can successfully detect and classify the ob- jects of interest. Figure 10. Training results of YOLOv3. Machines 2023, 11, x FOR PEER REVIEW 11 of 20 Machines 2023, 11, 275 11 of 19 Figure 9. Objects of interest used in the experiment. The COCO Dataset was used to train the YOLOv3 in this paper. However, the COCO The COCO Dataset was used to train the YOLOv3 in this paper. However, the COCO Dataset does not include objects such as the building blocks used in the experiment. As a Dataset does not include objects such as the building blocks used in the experiment. As a result, it was necessary to collect a training data set for the building blocks. In particular, result, it was necessary to collect a training data set for the building blocks. In particular, a a total of 635 images of the building blocks were taken. The transfer learning technique total of 635 images of the building blocks were taken. The transfer learning technique [29] [29] was employed in this paper to speed up the training process, in which the weights was employed in this paper to speed up the training process, in which the weights provided provided by the authors of YOLO were adopted as the initial weights for training the by the authors of YOLO were adopted as the initial weights for training the YOLOv3. YOLOv3. Figure 10 shows the training results of YOLO. The total number of iterations Figure 10 shows the training results of YOLO. The total number of iterations was 45,000. The was value 45,000. T of the he value o loss function f the loss conver func ged tionto conv 0.0391. erged to To test 0.0the 391. To performance test the pe of rforma the trained nce of YOLOv3, the trained several YOLOv objects 3, sever wer al ob e jec randomly ts were rplaced andomlon y pthe laced table, on the with tabthe le, wi detect th the ion detec results tion shown in Figure 11. Clearly, YOLOv3 can successfully detect and classify the objects results shown in Figure 11. Clearly, YOLOv3 can successfully detect and classify the ob- of interest. jects of interest. Machines 2023, 11, x FOR PEER REVIEW 12 of 20 Figure 10. Training results of YOLOv3. Figure 10. Training results of YOLOv3. (a) (b) (c) (d) (e) (f) Figure 11. Detection/classification results of YOLOv3 after training (a) 1st test (b) 2nd test (c) 3rd Figure 11. Detection/classification results of YOLOv3 after training (a) 1st test (b) 2nd test (c) 3rd test (d) 4th test (e) 5th test (f) 6th test. test (d) 4th test (e) 5th test (f) 6th test. 5.2. Training and Simulation Results of Object Grasping Policy Based on SAC Figure 12 illustrates the flowchart of the training process for the proposed object grasping approach based on SAC. At the beginning of each episode, the experimental/sim- ulation environment was reset, namely, the robot manipulator was returned to the home position, objects were placed on the table, and the camera took images of the environment. Based on the image captured by the camera, the object recognition/localization approach based on YOLO developed in Section 3 was used to find the position of the object of in- terest so as to obtain its current state (s) (detailed procedures are indicated by the red dash block in Figure 12). According to its current state, the SAC would output an action (a), i.e., the input displacement vector of the object of interest on the image plane. The joint com- mand of the robot manipulator could be obtained by using coordinate transformation, depth information and inverse kinematics. According to the obtained joint command, the end-effector was controlled to move to a desired position and a suction nozzle was turned on to perform object grasping. A positive reward was given for a successful grasp. The termination conditions for an episode occurred either when the total number of object grasping attempts was more than 100, or when an object grasping attempt was successful. In the real world, objects to be grasped are randomly placed. However, if the objects to be grasped are randomly placed for each episode in the training initially, the training time for learning object grasping successfully could be very long. In order to speed up the learning process, the idea of incremental learning is exploited in this paper to set up the learning environment. For instance, a building block was the object of interest for grasp- ing. Firstly, the pose of the building block on the table was fixed and the deep reinforce- ment neural network was trained over 1000 episodes in the simulated environment con- structed by the V-REP robot simulator. The training results are shown in Figure 13. Machines 2023, 11, 275 12 of 19 5.2. Training and Simulation Results of Object Grasping Policy Based on SAC Figure 12 illustrates the flowchart of the training process for the proposed object grasp- ing approach based on SAC. At the beginning of each episode, the experimental/simulation environment was reset, namely, the robot manipulator was returned to the home position, objects were placed on the table, and the camera took images of the environment. Based on the image captured by the camera, the object recognition/localization approach based on YOLO developed in Section 3 was used to find the position of the object of interest so as to obtain its current state (s) (detailed procedures are indicated by the red dash block in Figure 12). According to its current state, the SAC would output an action (a), i.e., the input displacement vector of the object of interest on the image plane. The joint command of the robot manipulator could be obtained by using coordinate transformation, depth information and inverse kinematics. According to the obtained joint command, the end-effector was controlled to move to a desired position and a suction nozzle was turned on to perform object grasping. A positive reward was given for a successful grasp. The termination conditions for an episode occurred either when the total number of object Machines 2023, 11, x FOR PEER REVIEW 13 of 20 grasping attempts was more than 100, or when an object grasping attempt was successful. Figure 12. Flowchart of the training process for the proposed object grasping approach based on Figure 12. Flowchart of the training process for the proposed object grasping approach based on SAC. SAC. As described in Equation (11), a positive reward of 1 will be given if the robot suc- cessfully grasps an object. In contrast, a negative reward −0.1 (i.e., penalty) will be given if the robot fails to grasp an object. That is, the accumulated reward for an episode will be negative if the robot needs more than ten attempts to successfully grasp an object. In ad- dition, since an extra positive reward 0.5 will be given if the robot successfully grasps an object on its first attempt, the maximum accumulated reward for an episode will be 1.5. Machines 2023, 11, 275 13 of 19 In the real world, objects to be grasped are randomly placed. However, if the objects to be grasped are randomly placed for each episode in the training initially, the training Machines 2023, 11, x FOR PEER REVIEW 14 of 20 time for learning object grasping successfully could be very long. In order to speed up the learning process, the idea of incremental learning is exploited in this paper to set up the learning environment. For instance, a building block was the object of interest for grasping. Firstly, the pose of the building block on the table was fixed and the deep reinforcement From the results shown in Figure 13, it was found that after 100 episodes of training, the neural network was trained over 1000 episodes in the simulated environment constructed 6-DOF robot manipulator was able to find a correct grasping pose for the case of a building by the V-REP robot simulator. The training results are shown in Figure 13. block with a fixed pose. (a) (b) Figure 13. Training results of a fixed pose building block (a) accumulated reward for each episode Figure 13. Training results of a fixed pose building block (a) accumulated reward for each episode (b) number of grasping attempts for each episode. (b) number of grasping attempts for each episode. After the 6-DOF robot manipulator could successfully grasp the building block with As described in Equation (11), a positive reward of 1 will be given if the robot suc- a fixed pose, the deep reinforcement neural network was retrained for another 1000 epi- cessfully grasps an object. In contrast, a negative reward 0.1 (i.e., penalty) will be given sodes. This time, the building block as well as other objects (used as the environmental if the robot fails to grasp an object. That is, the accumulated reward for an episode will disturbance) were randomly placed on a table. By exploiting the paradigm of transfer be negative if the robot needs more than ten attempts to successfully grasp an object. In learning, the weights of the deep reinforcement neural network after learning for the case addition, since an extra positive reward 0.5 will be given if the robot successfully grasps of fixed object poses were used as the initial weights for the deep reinforcement neural an object on its first attempt, the maximum accumulated reward for an episode will be 1.5. network in the retraining process. By taking into account the fact that objects of the same From the results shown in Figure 13, it was found that after 100 episodes of training, the category may have different sizes or colors, for every 100 episodes in the retraining pro- 6-DOF robot manipulator was able to find a correct grasping pose for the case of a building cess, the colors block and sizes of with a fixed objects in pose. each category were changed. This strategy served to enhance the robustne After ss o the f the trained p 6-DOF robotomanipulator licy toward environmental could successfully uncer grasp tainty dur the building ing block with a verification in the real world. Figure 14 shows the training results for the case of randomly fixed pose, the deep reinforcement neural network was retrained for another 1000 episodes. placed objects, wher This tie th me,e y the ebui llow lding line rep block resen as well ts the r as other esults of explo objects (used iting astr the ansfer envir learning onmental disturbance) (i.e., using the weigh were randomly ts for the c placed ase ofon fixed objec a table. By t po exploiting ses as the the initparadigm ial weightsof ) and transfer the learning, the weights of the deep reinforcement neural network after learning for the case of fixed object purple line shows the results without using transfer learning. The results shown in Figure poses were used as the initial weights for the deep reinforcement neural network in the 14b indicate that the number of grasping attempts required to find correct grasping points retraining process. By taking into account the fact that objects of the same category may without using transfer learning was much larger than that for using transfer learning over have different sizes or colors, for every 100 episodes in the retraining process, the colors the first 200 episodes. Table 2 shows similar results in total training time and total number and sizes of objects in each category were changed. This strategy served to enhance the of grasping attempts for 1000 episodes. robustness of the trained policy toward environmental uncertainty during verification in the real world. Figure 14 shows the training results for the case of randomly placed objects, where the yellow line represents the results of exploiting transfer learning (i.e., using the weights for the case of fixed object poses as the initial weights) and the purple line shows the results without using transfer learning. The results shown in Figure 14b indicate that the number of grasping attempts required to find correct grasping points without using transfer learning was much larger than that for using transfer learning over the first 200 episodes. Table 2 shows similar results in total training time and total number of grasping attempts for 1000 episodes. (a) (b) Figure 14. Training results for the case of randomly placed objects: the yellow line represents the results of exploiting transfer learning (i.e., use the weights for the case of fixed object poses as the Machines 2023, 11, x FOR PEER REVIEW 14 of 20 From the results shown in Figure 13, it was found that after 100 episodes of training, the 6-DOF robot manipulator was able to find a correct grasping pose for the case of a building block with a fixed pose. (a) (b) Figure 13. Training results of a fixed pose building block (a) accumulated reward for each episode (b) number of grasping attempts for each episode. After the 6-DOF robot manipulator could successfully grasp the building block with a fixed pose, the deep reinforcement neural network was retrained for another 1000 epi- sodes. This time, the building block as well as other objects (used as the environmental disturbance) were randomly placed on a table. By exploiting the paradigm of transfer learning, the weights of the deep reinforcement neural network after learning for the case of fixed object poses were used as the initial weights for the deep reinforcement neural network in the retraining process. By taking into account the fact that objects of the same category may have different sizes or colors, for every 100 episodes in the retraining pro- cess, the colors and sizes of objects in each category were changed. This strategy served to enhance the robustness of the trained policy toward environmental uncertainty during verification in the real world. Figure 14 shows the training results for the case of randomly placed objects, where the yellow line represents the results of exploiting transfer learning (i.e., using the weights for the case of fixed object poses as the initial weights) and the purple line shows the results without using transfer learning. The results shown in Figure 14b indicate that the number of grasping attempts required to find correct grasping points without using transfer learning was much larger than that for using transfer learning over Machines 2023, 11, 275 14 of 19 the first 200 episodes. Table 2 shows similar results in total training time and total number of grasping attempts for 1000 episodes. (a) (b) Figure 14. Training results for the case of randomly placed objects: the yellow line represents the Figure 14. Training results for the case of randomly placed objects: the yellow line represents the results of exploiting transfer learning (i.e., use the weights for the case of fixed object poses as the results of exploiting transfer learning (i.e., use the weights for the case of fixed object poses as the initial weights), while the purple line shows the results without using transfer learning. (a) Accumulated reward for each episode; (b) number of grasping attempts for each episode. Table 2. Total training time and total number of grasping attempts. Machines 2023, 11, x FOR PEER REVIEW 15 of 20 Pre_Train No_Pre_Train Without_YOLO (Use Transfer Learning) Training time 6443 (s) 15,076 (s) 102,580 (s) initial weights), while the purple line shows the results without using transfer learning. (a) Accu- Number of grasping attempts 1323 (attempts) 3635 (attempts) 38,066 (attempts) mulated reward for each episode; (b) number of grasping attempts for each episode. Figure 15 shows the results of directly using the entire image (rather than using the Figure 15 shows the results of directly using the entire image (rather than using the object of interest detected by YOLOv3) as the input state for the deep reinforcement learning object of interest detected by YOLOv3) as the input state for the deep reinforcement learn- network. The results shown in Figure 15 indicate that correct grasping points cannot be ing network. The results shown in Figure 15 indicate that correct grasping points cannot obtained after 1000 episodes of training. Table 2 indicates that the training time for the be obtained after 1000 episodes of training. Table 2 indicates that the training time for the case of using the entire image as the input is 15.9 times longer than that for using the case of using the entire image as the input is 15.9 times longer than that for using the proposed approach (i.e., transfer learning + YOLO + SAC). In addition, the number of proposed approach (i.e., transfer learning + YOLO + SAC). In addition, the number of grasping attempts for the case of using the entire image as the input is 28.8 times larger grasping attempts for the case of using the entire image as the input is 28.8 times larger than that for using the proposed approach. The above simulation results reveal that the than that for using the proposed approach. The above simulation results reveal that the proposed approach indeed can effectively reduce the total training time and total number proposed approach indeed can effectively reduce the total training time and total number of grasping attempts. of grasping attempts. (a) (b) Figure 15. Results of the V-REP robot simulator without combining YOLOv3: (a) accumulated re- Figure 15. Results of the V-REP robot simulator without combining YOLOv3: (a) accumulated reward ward for each episode; (b) number of grasping attempts for each episode. for each episode; (b) number of grasping attempts for each episode. Table 2. Total training time and total number of grasping attempts. Pre_Train No_Pre_Train Without_YOLO (Use Transfer Learning) Training time 6443 (s) 15,076 (s) 102,580 (s) Number of grasping 1323 (attempts) 3635 (attempts) 38,066 (attempts) attempts 5.3. Object Grasping Using a Real Robot Manipulator As mentioned previously, the input to the proposed deep reinforcement learning- based object grasping approach is the depth image (provided by Kinect v2) of the objects of interest detected by YOLOv3. Since YOLOv3 uses the RGB image (provided by Kinect v2) to detect the objects of interest, there is a need to construct the correspondence between the depth image and the RGB image so that the depth information of a point on the object of interest can be retrieved. In this paper, such a correspondence is constructed by using SDK accompanied with Kinect v2. In addition, with camera calibration [30] and the ob- tained depth information, the 3D information of a point on the object of interest in the camera frame can be retrieved. Hand-eye calibration [31] is then conducted to obtain the coordination transformation relationship between the camera frame and the end-effector fame. Using the results of hand-eye calibration and robot kinematics, the 3D information of a point on the object of interest in the camera frame can be converted into 3D infor- mation in the robot base frame. Moreover, using robot inverse kinematics, the joint com- mands for the robot to perform the task of grasping the object of interest can be obtained. Machines 2023, 11, 275 15 of 19 5.3. Object Grasping Using a Real Robot Manipulator As mentioned previously, the input to the proposed deep reinforcement learning- based object grasping approach is the depth image (provided by Kinect v2) of the objects of interest detected by YOLOv3. Since YOLOv3 uses the RGB image (provided by Kinect v2) to detect the objects of interest, there is a need to construct the correspondence between the depth image and the RGB image so that the depth information of a point on the object of interest can be retrieved. In this paper, such a correspondence is constructed by using SDK accompanied with Kinect v2. In addition, with camera calibration [30] and the ob- tained depth information, the 3D information of a point on the object of interest in the Machines 2023, 11, x FOR PEER REVIEW 16 of 20 camera frame can be retrieved. Hand-eye calibration [31] is then conducted to obtain the coordination transformation relationship between the camera frame and the end-effector Figure 16 illustrates the flowchart for grasping a specific object. In this experiment, fame. Using the results of hand-eye calibration and robot kinematics, the 3D information of several different types of objects were randomly placed on a table. Note that the vacuum a point on the object of interest in the camera frame can be converted into 3D infor-mation sucker mounted on the end-effector rather than a gripper is used in this paper to grasp in the robot base frame. Moreover, using robot inverse kinematics, the joint com-mands for the object of interest. In order to perform a successful grasp, the suction force needs to the robot to perform the task of grasping the object of interest can be obtained. overcome the gravity force of the object of interest. As a result, the rim of the cup is not facing up in t Figure h16 e experimen illustrates t. The the Kinec flowchart t v2 camera for toograsping k an image of a the env specific ironm object. ent. The In this experiment, user assigned a specific object of interest for the robot manipulator to grasp. The SAC several different types of objects were randomly placed on a table. Note that the vacuum outputted a prediction of the position coordinate of the assigned object to be grasped. The sucker mounted on the end-effector rather than a gripper is used in this paper to grasp joint command of the robot manipulator was obtained by using coordinate transfor- the object of interest. In order to perform a successful grasp, the suction force needs to mation, depth information and inverse kinematics. According to the obtained joint com- overcome the gravity force of the object of interest. As a result, the rim of the cup is not mand, the end-effector was controlled to move to a desired position and a suction nozzle facing was turned o up innthe to pexperiment. erform object gr The asping. If Kinect thev2 atte camera mpt for ob took ject gr anasp image ing faof iled, the the environment. The Kinect v2 camera took an image at the environment again and the object grasping process user assigned a specific object of interest for the robot manipulator to grasp. The SAC was repeated. If the attempts of object grasping failed three consecutive times, the task for outputted a prediction of the position coordinate of the assigned object to be grasped. The grasping an assigned specific object was regarded as a failure. joint command of the robot manipulator was obtained by using coordinate transformation, In particular, SAC was employed to train a 6-DOF robot manipulator to grasp build- depth ing block information s and bananas and in a sim inverse ulated ekinematics. nvironment constr Accor ucted by ding a V- toREP robot the obtained simu- joint command, lator. By exploiting the concept of Sim-to-Real [32], the trained network was deployed to the end-effector was controlled to move to a desired position and a suction nozzle was the real 6-DOF robot manipulator to perform object grasping in the real world. In addi- turned on to perform object grasping. If the attempt for object grasping failed, the Kinect tion, in real-world experiments, objects such as apples, oranges and cups which are not in v2 camera took an image at the environment again and the object grasping process was the training data set were added to the list of objects of interest. From the experimental repeated. If the attempts of object grasping failed three consecutive times, the task for results shown in Figure 17, it is evident that the trained SAC can indeed provide correct object grasping points for objects of interest in real-world environments. Experimental grasping an assigned specific object was regarded as a failure. results for the success rate of grasping different objects are listed in Table 3. Figure 16. Flowchart for grasping a specific object. Figure 16. Flowchart for grasping a specific object. Machines 2023, 11, 275 16 of 19 In particular, SAC was employed to train a 6-DOF robot manipulator to grasp building blocks and bananas in a simulated environment constructed by a V-REP robot simulator. By exploiting the concept of Sim-to-Real [32], the trained network was deployed to the real 6-DOF robot manipulator to perform object grasping in the real world. In addition, in real-world experiments, objects such as apples, oranges and cups which are not in the training data set were added to the list of objects of interest. From the experimental results shown in Figure 17, it is evident that the trained SAC can indeed provide correct object Machines 2023, 11, x FOR PEER REVIEW 17 of 20 grasping points for objects of interest in real-world environments. Experimental results for the success rate of grasping different objects are listed in Table 3. Figure 17. Object grasping point provided by SAC for different objects of interest (red point inside Figure 17. Object grasping point provided by SAC for different objects of interest (red point inside the the bounding box in the upper environment image; white point in the lower depth image); the “ar- bounding box in the upper environment image; white point in the lower depth image); the “arrow” row” sign is used to indicate the object of interest. sign is used to indicate the object of interest. Table 3. Rate of successful grasping for different objects. Table 3. Rate of successful grasping for different objects. Building Object of Interest Apple Banana Orange Cup Object of Interest Building Block Apple Banana Orange Cup Block Rate of successful grasping 19/20 6/10 6/10 8/10 9/10 Rate of successful grasping 19/20 6/10 6/10 8/10 9/10 Object is in the training set yes no yes no no Object is in the training set yes no yes no no The results listed in Table 3 indicate that for the objects in the training set, the building The results listed in Table 3 indicate that for the objects in the training set, the build- block has a much higher rate of being successfully grasped than the banana. The reason ing block has a much higher rate of being successfully grasped than the banana. The rea- for this discrepancy is that in the simulated environment, the banana has a fixed shape son for this discrepancy is that in the simulated environment, the banana has a fixed shape and smooth surface. However, the bananas used in real-world experiments have different and smooth surface. However, the bananas used in real-world experiments have different shapes/sizes and their surfaces are not smooth enough. Therefore, the significant differ- shapes/sizes and their surfaces are not smooth enough. Therefore, the significant differ- ences between the simulated environment and that of the real-world experiment lead to ences between the simulated environment and that of the real-world experiment lead to a a lower rate of successful grasping for bananas. As for the objects not in the training set, lower rate of successful grasping for bananas. As for the objects not in the training set, the the apples had the lowest rate of being successfully grasped. One possibility is that the apples had the lowest rate of being successfully grasped. One possibility is that the two two apples used in the real-world experiments have significant differences in size/shapes. apples used in the real-world experiments have significant differences in size/shapes. In In addition, in real-world experiments, hand-eye calibration error and robot calibration addition, in real-world experiments, hand-eye calibration error and robot calibration er- errors all contribute to the fact that the end-effector cannot 100% accurately move to the rors all contribute to the fact that the end-effector cannot 100% accurately move to the grasping position determined by the proposed deep reinforcement learning-based object grasping position determined by the proposed deep reinforcement learning-based object grasping approach. Since bananas and apples require a more accurate grasping point, it is grasping approach. Since bananas and apples require a more accurate grasping point, it not surprising that their rates of being successfully grasped are lower. is not surprising that their rates of being successfully grasped are lower. In summary, there are several interesting observations from the experimental results. In summary, there are several interesting observations from the experimental results. First of all, the suction nozzle used in this paper requires a smooth object surface to achieve First of all, the suction nozzle used in this paper requires a smooth object surface to successful grasping. That explains why apples and bananas have lower successful grasping achieve successful grasping. That explains why apples and bananas have lower successful rates. Secondly, without further training, the proposed approach exhibits decent grasping grasping rates. Secondly, without further training, the proposed approach exhibits decent performance, even for cases in which the objects of interest are previously unseen. Thirdly, grasping performance, even for cases in which the objects of interest are previously un- experimental results indicate that the SAC can be trained in the robot simulator and the seen. Thirdly, experimental results indicate that the SAC can be trained in the robot sim- trained SAC can be deployed to the real 6-DOF robot manipulator to successfully perform ulator and the trained SAC can be deployed to the real 6-DOF robot manipulator to suc- object grasping in the real world. cessfully perform object grasping in the real world. The next experiment was to grasp and classify all the objects randomly placed on the table and to put the grasped objects into the bin where they belonged. First of all, several objects were randomly placed on the table, after which YOLOv3 detected and classified all of the objects on the table. The SAC then provided information for the grasping points corresponding to all the objects of interest to the robot manipulator. The 6-DOF robot ma- nipulator then performed the grasping task and put the grasped objects into their respec- tive bins. Note that during the grasping process, the robot manipulator may collide with other objects so that their poses may change and result in grasping failures. In order to deal with the aforementioned problem, after performing the object grasping task, if some Machines 2023, 11, 275 17 of 19 The next experiment was to grasp and classify all the objects randomly placed on the table and to put the grasped objects into the bin where they belonged. First of all, several objects were randomly placed on the table, after which YOLOv3 detected and classified all of the objects on the table. The SAC then provided information for the grasping points corresponding to all the objects of interest to the robot manipulator. The 6-DOF robot manipulator then performed the grasping task and put the grasped objects into their Machines 2023, 11, x FOR PEER REVIEW respective bins. Note that during the grasping process, the robot manipulator may18 of collide 20 with other objects so that their poses may change and result in grasping failures. In order to deal with the aforementioned problem, after performing the object grasping task, if some objects remained on the table, the object grasping tasks were repeated until all of the objects objects remained on the table, the object grasping tasks were repeated until all of the ob- on the table had been grasped and correctly put into the bin. Figure 18 shows an image jects on the table had been grasped and correctly put into the bin. Figure 18 shows an sequence of the object grasping/classification experiment. image sequence of the object grasping/classification experiment. Figure 18. Image sequence of object grasping/classification experiment (a) original image (b) clas- Figure 18. Image sequence of object grasping/classification experiment (a) original image (b) classifi- sification results of YOLOv3. cation results of YOLOv3. 6. Conclusions 6. Conclusions This paper proposes an approach that combines YOLO and deep reinforcement This paper proposes an approach that combines YOLO and deep reinforcement learning learning SAC algorithms for the 6-DOF robot manipulator to perform object grasp- SAC algorithms for the 6-DOF robot manipulator to perform object grasping/classification thr ing/c ough lass self-learning. ification through In particular self-learn,ing. the Iobjects n particu oflar inter , the objec est in this ts of paper interest in are detected this paper by are detected by YOLOv3. By considering the fact that objects of the same type may have YOLOv3. By considering the fact that objects of the same type may have different colors, different colors, only their depth images provided by Kinect v2 are thus used as the inputs only their depth images provided by Kinect v2 are thus used as the inputs for the proposed deep for th reinfor e proposed deep r cement learning- einforcemen based object t learn grasping ing-base appr d object gr oach. Inasping appro this way, theaexploration ch. In this way, the exploration space can be substantially reduced so as to improve the success rate space can be substantially reduced so as to improve the success rate and enable SAC to conver and enab ge quickly le SAC .to conver Moreover ge , quick to spe led y. M up oreover the training , to speed process, up the a training V-REP r proce obot simulator ss, a V-REP is robot simulator is employed to construct a simulated environment to train the SAC. Sim- employed to construct a simulated environment to train the SAC. Simulation results indicate that ulation re the prsults ind oposed appr icateoach that the proposed can indeed ef approach fectively can reduce indeed the effec totalttraining ively redtime uce the to and tal the training time and the total number of grasping attempts compared with an approach that total number of grasping attempts compared with an approach that directly uses the entire image directas ly the uses the inputen state tirefor imthe age as the in deep reinfor put cement state for learning the deep re network. inforcement In addition, learning ne to further t- speed work. In add up the training ition, to fur process, ther speed the paradigms up the of trainin transfer g process, the paradigm learning and incremental s of tr learning ansfer learning and incremental learning are employed in the proposed approach. Moreover, the are employed in the proposed approach. Moreover, the trained SAC was transferred to a real trained SAC was tr 6-DOF robot manipulator ansferred to a for r re eal-world al 6-DOF verification. robot manipu Experimental lator for real-wor results ld ve indicate rificatithat on. Experimental results indicate that in using the proposed approach, the real 6-DOF robot manipulator successfully performed object grasping/classification, even for previously unseen objects. Author Contributions: Conceptualization, Y.-L.C., Y.-R.C. and M.-Y.C.; methodology, Y.-L.C. and Y.-R.C.; software, Y.-L.C. and Y.-R.C.; validation, Y.-L.C. and Y.-R.C.; formal analysis, Y.-L.C. and Y.-R.C.; writing original draft preparation, Y.-L.C., Y.-R.C. and M.-Y.C.; writing review and editing, M.-Y.C.; Project administration, M.-Y.C.; Funding acquisition, M.-Y.C.; supervision, M.-Y.C. All au- thors have read and agreed to the published version of the manuscript. Machines 2023, 11, 275 18 of 19 in using the proposed approach, the real 6-DOF robot manipulator successfully performed object grasping/classification, even for previously unseen objects. Author Contributions: Conceptualization, Y.-L.C., Y.-R.C. and M.-Y.C.; methodology, Y.-L.C. and Y.-R.C.; software, Y.-L.C. and Y.-R.C.; validation, Y.-L.C. and Y.-R.C.; formal analysis, Y.-L.C. and Y.-R.C.; writing original draft preparation, Y.-L.C., Y.-R.C. and M.-Y.C.; writing review and editing, M.-Y.C.; project administration, M.-Y.C.; Funding acquisition, M.-Y.C.; supervision, M.-Y.C. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the Ministry of Science and Technology, Taiwan, grant number MOST 108-2221-E-006-217-MY2. Data Availability Statement: The data presented in this study are available on request from the corresponding author. Conflicts of Interest: The authors declare no conflict of interest. References 1. Kyprianou, G.; Doitsidis, L.; Chatzichristofis, S.A. Collaborative Viewpoint Adjusting and Grasping via Deep Reinforcement Learning in Clutter Scenes. Machines 2022, 10, 1135. [CrossRef] 2. Johns, E.; Leutenegger, S.; Davison, A.J. Deep learning a grasp function for grasping under gripper pose uncertainty. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, Daejeon, Republic of Korea, 9–14 October 2016; pp. 4461–4468. [CrossRef] 3. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [CrossRef] 4. Pinto, L.; Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 3406–3413. [CrossRef] 5. Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, 37, 421–436. [CrossRef] 6. Mahler, J.; Pokorny, F.T.; Hou, B.; Roderick, M.; Laskey, M.; Aubry, M.; Kohlhoff, K.; Kröger, T.; Kuffner, J.; Goldberg, K. Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 1957–1964. [CrossRef] 7. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv 2017, arXiv:1703.09312. [CrossRef] 8. Mahler, J.; Matl, M.; Liu, X.; Li, A.; Gealy, D.; Goldberg, K. Dex-Net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation, Brisbane, QLD, Australia, 21–25 May 2018; pp. 5620–5627. [CrossRef] 9. Mahler, J.; Matl, M.; Satish, V.; Danielczuk, M.; DeRose, B.; McKinley, S.; Goldberg, K. Learning ambidextrous robot grasping policies. Sci. Robot. 2019, 4, eaau4984. [CrossRef] 10. Zhang, H.; Peeters, J.; Demeester, E.; Kellens, K. A CNN-Based Grasp Planning Method for Random Picking of Unknown Objects with a Vacuum Gripper. J. Intell. Robot. Syst. 2021, 103, 1–19. [CrossRef] 11. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [CrossRef] 12. Fang, K.; Zhu, Y.; Garg, A.; Kurenkov, A.; Mehta, V.; Li, F.F.; Savarese, S. Learning task-oriented grasping for tool manipulation from simulated self-supervision. Int. J. Robot. Res. 2020, 39, 202–216. [CrossRef] 13. Ji, X.; Xiong, F.; Kong, W.; Wei, D.; Shen, Z. Grasping Control of a Vision Robot Based on a Deep Attentive Deterministic Policy Gradient. IEEE Access 2021, 10, 867–878. [CrossRef] 14. Horng, J.R.; Yang, S.Y.; Wang, M.S. Self-Correction for Eye-In-Hand Robotic Grasping Using Action Learning. IEEE Access 2021, 9, 156422–156436. [CrossRef] 15. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to train your robot with deep reinforcement learning: Lessons we have learned. Int. J. Robot. Res. 2021, 40, 698–721. [CrossRef] 16. Gualtieri, M.; Ten Pas, A.; Platt, R. Pick and place without geometric object models. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation, Brisbane, QLD, Australia, 21–25 May 2018; pp. 7433–7440. [CrossRef] 17. Fujita, Y.; Uenishi, K.; Ummadisingu, A.; Nagarajan, P.; Masuda, S.; Castro, M.Y. Distributed reinforcement learning of targeted grasping with active vision for mobile manipulators. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9712–9719. [CrossRef] 18. Zeng, A.; Song, S.; Welker, S.; Lee, J.; Rodriguez, A.; Funkhouser, T. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 4238–4245. [CrossRef] Machines 2023, 11, 275 19 of 19 19. Deng, Y.; Guo, X.; Wei, Y.; Lu, K.; Fang, B.; Guo, D.; Liu, H.; Sun, F. Deep reinforcement learning for robotic pushing and picking in cluttered environment. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, 3–8 November 2019; pp. 619–626. [CrossRef] 20. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. QT-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv 2018, arXiv:1806.10293. [CrossRef] 21. Chen, R.; Dai, X.Y. Robotic grasp control policy with target pre-detection based on deep q-learning. In Proceedings of the 2018 3rd International Conference on Robotics and Automation Engineering, Guangzhou, China, 17–19 November 2018; pp. 29–33. [CrossRef] 22. Chen, Z.; Lin, M.; Jia, Z.; Jian, S. Towards generalization and data efficient learning of deep robotic grasping. arXiv 2020, arXiv:2007.00982. [CrossRef] 23. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [CrossRef] 24. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [CrossRef] 25. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [CrossRef] 26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [CrossRef] 27. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. 28. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2019, arXiv:1812.05905. [CrossRef] 29. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [CrossRef] 30. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [CrossRef] 31. Cai, C.; Somani, N.; Nair, S.; Mendoza, D.; Knoll, A. Uncalibrated stereo visual servoing for manipulators using virtual impedance control. In Proceedings of the 13th International Conference on Control Automation Robotics & Vision, Singapore, 10–12 December 2014; pp. 1888–1893. 32. Peng, X.B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018. [CrossRef] Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Journal

MachinesMultidisciplinary Digital Publishing Institute

Published: Feb 12, 2023

Keywords: 6-DOF industrial manipulator; deep reinforcement learning; soft actor-critic; robotic object grasping; YOLO

There are no references for this article.