Introduction

Augmented reality (AR) is a technology based on graphics and image processing. It overlays rich virtual objects on the real scene, making the description of real things more intuitive, detailed and interesting [1, 2]. AR technology improves how quickly people perceive and accept real things [3, 4]. The practical application of augmented reality began in 1968 and has developed rapidly in recent years, with relatively mature research results in many fields such as education [5–7], medicine [8–10], human-machine cooperation [11–13], venue experience [14–17] and advertising [18, 19]. AR marketing revenue in the United States is expected to reach $8.02 billion by 2024. AR advertising carries more human emotion [20], improving the relationship between customers and brands, encouraging consumers to purchase more frequently and making them more passionate about brands [21]. Marketers embrace AR technology because it increases visual attention and curiosity and enhances memory encoding compared with non-AR equivalents [22]. Interactive modalities such as gestures and body positions can strengthen users' sense of self-presence and psychological engagement, and thereby positively affect their satisfaction, purchase intention and memory [23].

This paper proposes an advertising video display system based on augmented reality. The videos are not displayed on the normal screen plane but on the faces of a virtual cube. The cube is created from a planar marker captured by the camera, so its faces form certain angles with the screen plane. When the orientation of the marker changes, the orientation of the displayed video relative to the display plane changes correspondingly in real time; when multiple cameras capture the marker from different directions at the same time, customers observe different videos. In short, customers can see different advertising content by observing the marker from different positions. In addition, we have added gesture operation, which allows users to move, rotate and zoom the virtual cube in or out with only one hand.

Foundations

External library

We use several excellent open-source libraries and frameworks, such as OpenCV, OpenGL, Qt and MediaPipe, to implement the system and its algorithms. These libraries and frameworks spare us from implementing every algorithm from scratch and make the implementation more convenient.

Marker design

A marker is the foundation of a marker-based augmented reality system, and this system ultimately creates the virtual object on the marker. The marker can be chosen according to the actual requirements; it can be a text [11], a specific image [12], a face image [13], etc., and a corresponding recognition algorithm can be designed for each type of marker. In this system, the marker is a grid of 5 rows and 5 columns of white cells on a black background. As shown in Fig 1, a white cell represents the number 1 and a black cell represents the number 0, so the content of the marker can be regarded as coded information. By identifying the image color of each cell, the information of each cell is identified, and the matrix information of the whole marker is decoded immediately. For easier recognition, the marker content is surrounded by a black border.
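As a concrete illustration, the marker content can be stored as a 5×5 bit matrix in which white cells map to 1 and black cells to 0. The pattern below is a hypothetical example for illustration only, not the marker shown in Fig 1.

```cpp
#include <array>

// A 5x5 marker pattern: 1 = white cell, 0 = black cell.
// Hypothetical example pattern; the actual markers are defined by the system designer.
using MarkerMatrix = std::array<std::array<int, 5>, 5>;

const MarkerMatrix kExampleMarker = {{
    {1, 0, 1, 1, 0},
    {0, 0, 0, 1, 0},
    {1, 1, 0, 1, 0},
    {0, 1, 0, 0, 1},
    {0, 0, 1, 1, 1},
}};
```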
Fig 1. Matrix of markers and information.

An appropriate marker must satisfy the following preconditions. First, to eliminate the rotational symmetry of the marker and ensure the uniqueness of its matrix information, the marker has one and only one isolated cell among its four corner cells, where isolated means that all cells surrounding that corner cell are black. A marker whose isolated cell is in the upper left corner is regarded as a standard marker; a marker whose isolated cell lies at one of the other three corners is treated as a planar rotation of the standard marker. Second, to avoid errors when recognizing the whole marker area, at least one white cell is required in the last row and in the last column, i.e. neither the sum of the last row nor the sum of the last column can be 0.

Camera calibration

When calculating the pose matrix that maps the marker from three-dimensional space to the two-dimensional image, the camera parameters are required as input; they can be obtained through camera calibration [24, 25]. Chessboard calibration is one of the most popular calibration methods for planar information acquisition and offers high accuracy, so this system uses the chessboard method for camera calibration.
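A rough sketch of this calibration step is given below; the board size, square size and image paths are illustrative assumptions, not values from the paper.

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// Chessboard calibration sketch: collects corner detections from several views
// and estimates the camera intrinsic matrix and distortion coefficients.
int main() {
    const cv::Size boardSize(9, 6);     // inner corners per row/column (assumed)
    const float squareSize = 25.0f;     // chessboard square size in mm (assumed)

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;

    // One reference set of 3D corner positions on the chessboard plane (Z = 0).
    std::vector<cv::Point3f> corners3d;
    for (int r = 0; r < boardSize.height; ++r)
        for (int c = 0; c < boardSize.width; ++c)
            corners3d.emplace_back(c * squareSize, r * squareSize, 0.0f);

    cv::Size imageSize;
    for (int i = 0; i < 15; ++i) {      // hypothetical calibration images
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();

        std::vector<cv::Point2f> corners2d;
        if (cv::findChessboardCorners(img, boardSize, corners2d)) {
            cv::cornerSubPix(img, corners2d, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
            imagePoints.push_back(corners2d);
            objectPoints.push_back(corners3d);
        }
    }

    cv::Mat cameraMatrix, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     cameraMatrix, distCoeffs, rvecs, tvecs);
    std::cout << "RMS reprojection error: " << rms << "\n"
              << "Camera matrix:\n" << cameraMatrix << std::endl;
    return 0;
}
```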
System design

A video is composed of a sequence of images, generally called frames. The system obtains frames from the camera in real time and detects whether any predefined marker appears in them. If a marker is found, its posture is calculated from the marker's corner coordinates in the 3D world coordinate system and the corresponding coordinates in the 2D image coordinate system. The coordinate system of the current OpenGL model view is then set, and a virtual cube is created in that coordinate system; the frames of the advertising video are immediately set on its faces. By repeating these operations, the frames are continuously updated and the videos are displayed. The detailed design is given in the following sections. The system supports the recognition of multiple markers; only one of them is presented and explained here.

Major processes

Step 1 Basic data preparation.

Camera calibration: obtain the internal parameters of the camera through the camera calibration method described above.

Marker size setting: for the subsequent perspective transformation and marker recognition, the size of the marker and the coordinate matrix Mc of its four corners in the real-world coordinate system must first be specified. The coordinate matrix Mc can be expressed as

Mc = [Xci, Yci, Zci], i = 1, …, 4    (1)

where i is the index of the four corners in the real-world coordinate system and Xci, Yci and Zci are the coordinate values of each corner. Mc is needed for the perspective transformation and for extracting the image area where the marker is located after the transformation. The corner coordinates of the marker are determined by the side length of the white cells and the position of the marker. In this paper, the side length of the white cell LC is set to 10 and the upper left corner is placed at the origin. Therefore, the side length of the marker is

LM = 5 × LC = 50    (2)

and the coordinates of the marker's corners can be set to the four points upper left (0, 0), lower left (0, 50), lower right (50, 50) and upper right (50, 0).

Step 2 Calculate the coordinate matrix Mw of the marker's corners in the image coordinate system. The coordinate matrix Mw can be expressed as

Mw = [Xwi, Ywi, Zwi], i = 1, …, 4    (3)

where i is the index of the four corners in that coordinate system and Xwi, Ywi and Zwi are the coordinate values of each corner. The details of this step are elaborated in the Marker image recognition section.

Step 3 Calculate the rotation and translation parameters that transform Mc into Mw. They satisfy the simplified relationship

[Xwi, Ywi, Zwi]^T = MT [Xci, Yci, Zci, 1]^T    (4)

MT = [R | T]    (5)

where MT is the transform matrix, R is the 3×3 rotation matrix to be solved and T is the 3×1 translation matrix to be solved. In this paper, the OpenCV library function cv::solvePnP is used to compute R and T together with the camera internal parameters. The model view transformation matrix is then converted to the matrix format expected by OpenGL.
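A minimal sketch of this pose computation is shown below; the axis-flip convention between OpenCV and OpenGL is an assumption for illustration, since the paper does not give its exact conversion code.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Computes the marker pose from 3D-2D corner correspondences and converts it
// into a column-major 4x4 model-view matrix for OpenGL.
void computeModelView(const std::vector<cv::Point3f>& markerCorners3d,   // Mc, e.g. (0,0,0)...(50,0,0)
                      const std::vector<cv::Point2f>& markerCorners2d,   // Mw, detected in the image
                      const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                      double modelView[16]) {
    cv::Mat rvec, tvec;
    cv::solvePnP(markerCorners3d, markerCorners2d, cameraMatrix, distCoeffs, rvec, tvec);

    cv::Mat R;
    cv::Rodrigues(rvec, R);                 // rotation vector -> 3x3 rotation matrix

    // OpenCV's camera looks down +Z with +Y pointing down, OpenGL looks down -Z
    // with +Y pointing up, so the Y and Z axes are negated (assumed convention).
    cv::Mat flip = (cv::Mat_<double>(3, 3) << 1, 0, 0, 0, -1, 0, 0, 0, -1);
    R = flip * R;
    cv::Mat t = flip * tvec;

    // Fill the 4x4 model-view matrix in OpenGL's column-major order.
    for (int col = 0; col < 3; ++col) {
        for (int row = 0; row < 3; ++row)
            modelView[col * 4 + row] = R.at<double>(row, col);
        modelView[col * 4 + 3] = 0.0;
    }
    modelView[12] = t.at<double>(0);
    modelView[13] = t.at<double>(1);
    modelView[14] = t.at<double>(2);
    modelView[15] = 1.0;
}
```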
Step 4 Set the OpenGL projection matrix according to the projection relationship. There are two types of projection, perspective projection and orthogonal projection, and this system adopts perspective projection. When the viewing frustum is symmetric, the projection matrix is

[ 2n/w    0        0              0           ]
[ 0       2n/h     0              0           ]
[ 0       0       -(f+n)/(f-n)   -2fn/(f-n)   ]
[ 0       0       -1              0           ]    (6)

where, as shown in Fig 2, w is the width of the near clipping plane, h is the height of the near clipping plane, n is the distance between the near clipping plane and the camera, and f is the distance between the far clipping plane and the camera.

Fig 2. Schematic diagram of the viewing volume.

Step 5 Set the model view matrix and projection matrix of OpenGL calculated in Step 3 and Step 4, respectively.

Step 6 Create the virtual cube. After Step 5, the local coordinate system of the model is in effect. Draw the virtual cube with OpenGL functions in the current coordinate system; the side length of the cube equals the marker's side length LM.

Step 7 Set an image texture for each face of the virtual cube. The form of the videos can be configured as needed, and the design method is detailed in the Texture creation section.

The major flow of the system is shown in Fig 3. After the above processing, the virtual cube is attached to the marker and displayed together with the original camera frame, as shown in Fig 4. The top face of the cube shows the advertising video while the four vertical faces show static images.

Fig 3. Major processing flow.

Fig 4. Original image and result after being augmented.

Marker image recognition

Before identifying different markers, a recognition algorithm has to be designed according to the characteristics of the markers. For the markers used in this paper, the recognition process is as follows.

Step 1 Convert the original image to grayscale. Grayscale conversion is a frequently used image processing operation because it reduces the amount of image data.

Step 2 Binarize the image to highlight the target area. The contrast between the foreground and background may vary greatly when markers are captured in different scenes. Generally, a single global threshold cannot separate the foreground from the background well, so an adaptive threshold is used for binarization. The adaptive threshold algorithm computes a local threshold for each pixel from a weighted average of its neighborhood [26] and applies that threshold to the pixel; it can be expressed as

t(x, y) = a·μ(x, y) + b·σ(x, y)    (7)

where a and b are non-negative constants and μ(x, y) and σ(x, y) are the mean and standard deviation of the neighborhood of pixel (x, y), respectively. This paper uses the OpenCV function cv::adaptiveThreshold, whose local threshold t(x, y) can be abstracted as

t(x, y) = F(x, y) − c    (8)

where F(x, y) is the convolution value at pixel (x, y) produced by the filter kernel and c is a constant that can be determined by experience or by tuning the system. In this paper, a Gaussian filter kernel is used. In addition, inverse binarization is applied because the marker has low pixel values overall while its surroundings have high pixel values; as a result, pixels above the threshold are set as background and the others as foreground. The binarization result is shown in Fig 5(B).

Fig 5. Marker recognition. (a) Original image. (b) Binarization image. (c) Feature extraction. (d) Marker. (e) Binary image of marker. (f) Corner location. (g) Augmented results.
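A minimal sketch of this binarization step is given below; the block size and the constant c are illustrative values, not the paper's tuned parameters.

```cpp
#include <opencv2/opencv.hpp>

// Adaptive inverse binarization of the camera frame (steps 1-2 above).
cv::Mat binarizeFrame(const cv::Mat& frameBgr) {
    cv::Mat gray, binary;
    cv::cvtColor(frameBgr, gray, cv::COLOR_BGR2GRAY);

    // Gaussian-weighted local mean minus a constant c, inverted so that the
    // dark marker becomes foreground (white) and bright surroundings background.
    const int blockSize = 31;   // neighborhood size, must be odd (assumed value)
    const double c = 7.0;       // constant subtracted from the weighted mean (assumed value)
    cv::adaptiveThreshold(gray, binary, 255,
                          cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                          cv::THRESH_BINARY_INV, blockSize, c);
    return binary;
}
```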
Step 3 Remove image noise and extract features. The morphological opening operation is adopted, that is, the image is processed by an erosion followed by a dilation. The result is shown in Fig 5(C).

Step 4 Detect candidate corners; the results are a series of point groups. This paper uses the OpenCV library function cv::findContours to find the candidate groups. For each point group, judge whether it satisfies the basic conditions of a marker's posture, for example whether it is a convex quadrilateral and whether the lengths of the four edges lie within a reasonable interval. If the basic conditions are not met, exclude the point group; otherwise continue to Step 5.

Step 5 Obtain the image perspective transformation matrix. For a point group that meets the basic conditions of a predefined marker, the perspective transformation matrix of the marker image is obtained with the OpenCV library function cv::getPerspectiveTransform from the coordinate matrices Mc and Mw.

Step 6 Use the OpenCV library function cv::warpPerspective to apply the perspective transformation to the gray image and extract the image area where the marker is located (hereinafter called the marker image). The side length of the extracted area equals LM.

Step 7 Binarize the marker image. As can be seen from Fig 5(A), owing to camera parameters, ambient light and other factors, the pixels of the marker image are not the two extreme values of black and white, so the marker image is binarized after it is obtained. Unlike the image in Step 2, the object processed here is the marker image alone, excluding other areas. The marker image can be binarized with a single global threshold because its histogram has two typical peaks [27]. Let the proportion of foreground pixels of the image be ωf with average gray value μ, and the proportion of background pixels be ωb with average gray value μ0. All four quantities are functions of the segmentation threshold t between foreground and background. The between-class variance can then be expressed as

σ²(t) = ωf·ωb·(μ − μ0)²    (9)

The proportions and average gray values of foreground and background change as the threshold t changes, which in turn changes the between-class variance. When the between-class variance reaches its maximum, the corresponding t is the best threshold. This is the maximum between-class variance algorithm, also known as the OTSU algorithm; it adaptively determines the optimal binarization threshold from the information of the image itself. The corresponding OpenCV library function cv::threshold provides an input argument option for the OTSU algorithm. The gray marker image and its binary image are shown in Fig 5(D) and 5(E).
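A condensed sketch of steps 4-7 follows; the quadrilateral test via cv::approxPolyDP and the area bound are illustrative choices, since the paper only names the functions listed above.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Finds quadrilateral marker candidates in the binarized frame, rectifies each
// candidate from the grayscale frame and binarizes it with OTSU (steps 4-7).
std::vector<cv::Mat> extractMarkerCandidates(const cv::Mat& binary, const cv::Mat& gray,
                                             float sideLengthLM = 50.0f) {
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_LIST, cv::CHAIN_APPROX_SIMPLE);

    // Target corners of the rectified marker image, matching Mc.
    std::vector<cv::Point2f> dst = {
        {0, 0}, {0, sideLengthLM}, {sideLengthLM, sideLengthLM}, {sideLengthLM, 0}};

    std::vector<cv::Mat> markerImages;
    for (const auto& contour : contours) {
        std::vector<cv::Point> approx;
        cv::approxPolyDP(contour, approx, 0.03 * cv::arcLength(contour, true), true);

        // Keep only convex quadrilaterals of a plausible size (illustrative bounds).
        if (approx.size() != 4 || !cv::isContourConvex(approx)) continue;
        if (cv::contourArea(approx) < 400) continue;

        // Candidate Mw; the marker's rotation is resolved later via the isolated corner.
        std::vector<cv::Point2f> src;
        for (const cv::Point& p : approx)
            src.push_back(cv::Point2f((float)p.x, (float)p.y));

        cv::Mat H = cv::getPerspectiveTransform(src, dst);
        cv::Mat marker, markerBin;
        cv::warpPerspective(gray, marker, H, cv::Size((int)sideLengthLM, (int)sideLengthLM));
        cv::threshold(marker, markerBin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
        markerImages.push_back(markerBin);
    }
    return markerImages;
}
```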
Step 8 Identify the marker content. Even though the optimal binarization threshold is found adaptively in Step 7, not every pixel of each cell is as expected after binarization; there are black pixels inside white cells and white pixels inside black cells, as shown in Fig 5(E). Consequently, count the number of non-zero pixels Nr in each cell of the marker with the OpenCV function cv::countNonZero, and set a threshold Nt derived from the total number of pixels in the cell Nw, for example three quarters of Nw, to decide whether the cell is white or black. The color of the cell is white if

Nr ≥ Nt    (10)

or black if

Nr < Nt    (11)

This yields the information of each cell (white means 1, black means 0) and the information matrix of the whole marker. Continue to Step 9 if the definition of the marker is satisfied; otherwise, exclude the corner group.

Step 9 Place the isolated cell at the upper left corner, according to the definition of the marker, to determine the planar rotation state of the current marker. The information matrix of the marker is updated while the planar rotation is being resolved.

Step 10 The coordinates and order of the marker's four corner points are now known, and the corner coordinates can be refined to sub-pixel accuracy to obtain more exact values. Fig 5(F) shows the corner coordinates on the original image; the black frame is displayed together with the marker for ease of elucidation. The algorithm flow of marker recognition is shown in Fig 6, and Fig 5(G) shows the final augmented result, from which we can see that the virtual cube is created accurately on the marker.

Fig 6. Marker recognition algorithm flow.
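A sketch of the cell decoding and rotation normalization in steps 8-9 is shown below; the helper names and the 90-degree rotation routine are illustrative, not taken from the paper.

```cpp
#include <opencv2/opencv.hpp>
#include <array>
#include <optional>

using MarkerMatrix = std::array<std::array<int, 5>, 5>;

// Decodes the binarized 5x5 marker image into a 0/1 matrix (step 8).
MarkerMatrix decodeCells(const cv::Mat& markerBin) {
    const int cell = markerBin.rows / 5;            // pixels per cell
    const int Nw = cell * cell;
    MarkerMatrix m{};
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 5; ++c) {
            cv::Mat roi = markerBin(cv::Rect(c * cell, r * cell, cell, cell));
            int Nr = cv::countNonZero(roi);
            m[r][c] = (Nr >= 3 * Nw / 4) ? 1 : 0;   // Eq (10)/(11) with Nt = 3/4 Nw
        }
    return m;
}

// True if the corner cell at (r, c) is white and all its neighboring cells are black.
bool isIsolatedCorner(const MarkerMatrix& m, int r, int c) {
    if (m[r][c] != 1) return false;
    for (int dr = -1; dr <= 1; ++dr)
        for (int dc = -1; dc <= 1; ++dc) {
            int nr = r + dr, nc = c + dc;
            if ((dr || dc) && nr >= 0 && nr < 5 && nc >= 0 && nc < 5 && m[nr][nc] == 1)
                return false;
        }
    return true;
}

// Rotates the matrix 90 degrees clockwise (used in step 9 to bring the isolated
// cell to the upper left corner).
MarkerMatrix rotate90(const MarkerMatrix& m) {
    MarkerMatrix out{};
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 5; ++c)
            out[c][4 - r] = m[r][c];
    return out;
}

// Returns the matrix rotated so that an isolated cell sits at the upper left
// corner, or nothing if no rotation achieves this (marker definition not met).
std::optional<MarkerMatrix> normalizeMarker(MarkerMatrix m) {
    for (int rot = 0; rot < 4; ++rot) {
        if (isIsolatedCorner(m, 0, 0)) return m;
        m = rotate90(m);
    }
    return std::nullopt;
}
```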
Texture creation

The system can display both image and video data. The data can come from existing files or be collected in real time through cameras; the application obtains the images or videos to display from a configuration file. To display images, an image texture is simply set on the faces of the virtual cube. To display existing advertising videos on the faces of the virtual cube, or to display the video stream of the current camera in real time (so that customers can see themselves), the frame data of the video stream is fetched whenever the OpenGL window is updated. Image textures are created from the frames with OpenGL library functions and set on the faces of the virtual cube. As the frames are continuously updated, the textures displayed on the cube faces change accordingly, which achieves the effect of playing videos on the faces of the virtual cube. Since image textures are created continuously from the video frames, the application must delete them after use; otherwise they keep occupying memory and degrade the operation of the system.
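A minimal sketch of this per-frame texture handling is shown below, uploading a cv::Mat frame with OpenGL calls and deleting the texture after drawing; the exact calls used by the authors are not given in the paper.

```cpp
#include <opencv2/opencv.hpp>
#include <GL/gl.h>

// Creates an OpenGL texture from one video frame; the caller must delete it
// after the cube face has been drawn to avoid leaking GPU memory.
GLuint createFrameTexture(const cv::Mat& frameBgr) {
    cv::Mat rgb;
    cv::cvtColor(frameBgr, rgb, cv::COLOR_BGR2RGB);   // OpenCV frames are BGR

    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);            // rows may not be 4-byte aligned
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, rgb.cols, rgb.rows, 0,
                 GL_RGB, GL_UNSIGNED_BYTE, rgb.data);
    return tex;
}

// Usage per frame:
//   GLuint tex = createFrameTexture(frame);
//   ... bind tex and draw the cube face ...
//   glDeleteTextures(1, &tex);   // release the texture once the face is drawn
```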
Interface design

The interactive window displays the final result. If a predefined marker appears in the frame collected by the camera, the interactive window displays the original image together with the virtual object; otherwise, only the original frame captured by the camera is displayed in the background. The implementation creates a graphics window inheriting from QGLWidget and overrides its initializeGL, paintGL and other relevant functions. Initialization data such as the camera parameters and the standard marker definitions are set in the initializeGL function; the real-time marker detection, the creation of virtual objects and the combined display of virtual and real data are carried out in the paintGL function.
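A skeleton of such a widget is sketched below; only the QGLWidget inheritance and the initializeGL/paintGL split follow the description above, while the timer-driven repaint and member names are illustrative assumptions.

```cpp
#include <QGLWidget>
#include <QTimer>
#include <opencv2/opencv.hpp>

// Skeleton of the interactive window: initializeGL loads static data once,
// paintGL grabs a frame, runs marker detection and draws the augmented scene.
class ARWidget : public QGLWidget {
public:
    explicit ARWidget(QWidget* parent = nullptr) : QGLWidget(parent), capture(0) {
        auto* timer = new QTimer(this);                    // repaint roughly 30 times per second
        connect(timer, &QTimer::timeout, this, QOverload<>::of(&ARWidget::update));
        timer->start(33);
    }

protected:
    void initializeGL() override {
        glEnable(GL_DEPTH_TEST);
        // Load the camera parameters and the standard marker definitions here,
        // e.g. from the calibration result and the marker configuration.
    }

    void paintGL() override {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        cv::Mat frame;
        if (!capture.read(frame)) return;

        // 1. Draw the camera frame as the background.
        // 2. Run marker recognition on the frame (see the steps above).
        // 3. If a marker is found, set the projection and model-view matrices
        //    and draw the textured virtual cube on top of it.
    }

private:
    cv::VideoCapture capture;                              // default camera
    cv::Mat cameraMatrix, distCoeffs;                      // from calibration
};
```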
Interactive gesture operation

Motivation

The system can display different advertising videos on the six faces of the virtual cube, so the videos on some faces are inevitably blocked by other faces. Moreover, in most cases the video display area occupies only a small part of the screen, and the advertising content may not be shown clearly because the playback area is too small. Therefore, the system supports gesture operations that move, rotate and zoom the virtual cube in or out. The virtual cube always follows the marker while the marker is in the camera's view; otherwise it keeps its last posture and waits for a gesture instruction.

There are many mature frameworks for gesture recognition [28–32], but most of their gestures cannot be used directly by this system. We select MediaPipe Hands [31] to extract the hand landmarks and, on this basis, design a set of gestures suitable for this system. MediaPipe Hands was open-sourced by Google Research in 2019. It supports five-finger and gesture tracking and can infer the 3D coordinates of the 21 hand landmarks shown in Fig 7 from a single frame. It remains robust even when the palm is only partially visible or the hands are self-occluded, its overall recognition accuracy reaches 95.7%, and it offers high performance with low time consumption. For these reasons, it has been applied in much research [33–37].

Fig 7. The 21 landmarks of a hand.
Mapping of operation area and screen area

As shown in Fig 8, the range of the screen is M0 and the active area of the index fingertip is M1. The hand recognition rate decreases if the index fingertip moves out of M1, because the hand is then only partially visible in a single frame. The fingertip coordinates in M1 therefore have to be mapped to M0 to obtain the coordinates of the hand relative to the whole screen. The X coordinate of the mouse pointer satisfies

x = 0,                   xc < xr
x = W (xc − xr) / w,     xr ≤ xc ≤ xr + w    (12)
x = W,                   xc > xr + w

where W is the screen width in pixels, w is the width of the effective fingertip area, xc is the current fingertip coordinate and xr is the X coordinate of the starting point of the fingertip area.

Fig 8. Operation area and screen area.

When the fingertip is to the left of the effective area, the X coordinate of the mouse pointer is 0, meaning the pointer is considered to be at the leftmost edge of the screen. Similarly, when the fingertip is to the right of the effective area, the X coordinate of the mouse pointer equals the screen width. When the fingertip is inside the effective area, the X coordinate of the mouse pointer is the linear mapping value. For the same reason, the Y coordinate of the mouse pointer satisfies

y = 0,                   yc < yr
y = H (yc − yr) / h,     yr ≤ yc ≤ yr + h    (13)
y = H,                   yc > yr + h

where H is the screen height in pixels, h is the height of the fingertip active area, yc is the current fingertip coordinate and yr is the Y coordinate of the starting point of the fingertip area. In Fig 8, yr = 0 since M1 is at the top of M0.

The size of the area M1 has to balance several factors. If the area is too large, the swing amplitude of the user's arm increases, which may increase fatigue. The area cannot be too small either, because the finger active area is mapped to the whole screen area; otherwise small changes in the effective area M1 cause large movements on the whole screen M0, which degrades positioning accuracy. Our tests show that it works best when the operation area M1 is about half of the screen area M0.

Mouse sensitivity design

There are many sample and mature applications on https://www.github.com, but most of them use the absolute position of one fingertip as the mouse pointer, which has two disadvantages. First, human hands cannot remain absolutely stationary in front of the camera, so slight hand shaking during gesture operation may cause irregular transformations of the virtual cube. The shaking error then has to be suppressed: if the distance moved lies within a certain error range, the movement is treated as an invalid signal caused by hand shaking. This approach generally filters out hand shaking, but when the user really wants to make a small adjustment within the error range, the signal is ignored as an error. Second, because of this shake filtering, the movement of the mouse pointer is not continuous but jumps, with a step size related to the chosen error value, which makes the interaction feel insufficiently fluent.

To set the mouse pointer coordinates, this system detects the coordinates of the index fingertip on the screen and takes a weighted combination of the fingertip coordinates and the last mouse pointer position as the new pointer position. The weighted result may lie beyond the screen range, so the following constraints are imposed to obtain the final mouse pointer position

x = min(max(x, 0), W)    (14)
y = min(max(y, 0), H)    (15)
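A minimal sketch of this pointer computation follows; the smoothing weight alpha is an illustrative value, since the paper does not state its weighting factor.

```cpp
#include <algorithm>

// Maps the index fingertip position inside the operation area M1 to screen
// coordinates (Eq 12/13), smooths it against the previous pointer position and
// clamps the result to the screen (Eq 14/15).
struct Pointer { double x = 0.0, y = 0.0; };

Pointer updatePointer(double xc, double yc,                // fingertip position in the frame
                      double xr, double yr,                // origin of the operation area M1
                      double w, double h,                  // size of M1
                      double W, double H,                  // screen resolution
                      const Pointer& last) {
    // Linear mapping from M1 to the full screen, clamped at the borders of M1.
    double x = W * std::clamp((xc - xr) / w, 0.0, 1.0);
    double y = H * std::clamp((yc - yr) / h, 0.0, 1.0);

    // Weighted combination with the previous pointer position to suppress hand
    // shaking without discarding small intentional movements.
    const double alpha = 0.3;                              // assumed smoothing weight
    x = alpha * x + (1.0 - alpha) * last.x;
    y = alpha * y + (1.0 - alpha) * last.y;

    // Constrain the final position to the screen (Eq 14/15).
    return { std::clamp(x, 0.0, W), std::clamp(y, 0.0, H) };
}
```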
Gesture unit definition

We define two states for each finger, extended and closed. The state represents a signal of that finger, and every finger has its own independent signal; the final gesture is the combination of all five finger signals. The gesture of a single finger is called a gesture unit. For example, the V-shaped gesture is formed by combining the extended index finger and middle finger at an angle.

For the thumb, we first estimate the orientation of the hand, namely whether the front or the back of the palm faces the camera. The thumb is then judged to be extended by comparing the X coordinates of its landmarks, using condition (16) if the front of the palm faces the camera and the opposite condition (17) otherwise, where X(P2) and X(P4) refer to the thumb landmarks P2 and P4 shown in Fig 7.

For the four fingers other than the thumb, finger extension is determined by

Y(Pi) < Y(Pi−2)    (18)

where i ∈ {8, 12, 16, 20} is the index of the fingertip landmarks shown in Fig 7 and Pi is the corresponding fingertip. For example, the condition for judging the extension of the index finger is that the Y coordinate of the 8th landmark is less than the Y coordinate of the 6th landmark, namely Y(P8) < Y(P6).
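A sketch of this gesture-unit decision for the four non-thumb fingers is given below, assuming the 21 MediaPipe Hands landmarks have already been extracted into pixel coordinates; the container type is an illustrative choice, and the thumb conditions (16)/(17) are not reproduced here.

```cpp
#include <array>
#include <bitset>

// One hand landmark in image coordinates (indices follow MediaPipe Hands, 0-20).
struct Landmark { float x, y; };
using HandLandmarks = std::array<Landmark, 21>;

// Eq (18): a non-thumb finger counts as extended when its fingertip landmark
// lies above (has a smaller Y coordinate than) the landmark two positions below it.
bool isFingerExtended(const HandLandmarks& lm, int tipIndex) {
    return lm[tipIndex].y < lm[tipIndex - 2].y;
}

// Collects the gesture units of index, middle, ring and little finger into a
// 4-bit signal; the thumb would be handled separately with Eq (16)/(17).
std::bitset<4> fingerSignals(const HandLandmarks& lm) {
    const int tips[4] = {8, 12, 16, 20};
    std::bitset<4> signal;
    for (int i = 0; i < 4; ++i)
        signal[i] = isFingerExtended(lm, tips[i]);
    return signal;
}

// Example: the V-shaped gesture corresponds to extended index and middle fingers
// with ring and little finger closed, i.e. signal == 0b0011 in this encoding.
```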