Introduction

Due to the high cost of three-dimensional (3D) reconstruction technology over the last two decades, virtual environments have been largely restricted to research institutes, medicine, and a few other specialized fields. Technological advances in consumer-grade depth sensors have recently pushed this technology closer to the consumer market and expanded research in the field exponentially. Research in 3D reconstruction is a subset of computer vision and is also closely connected to computer graphics research in several ways [1]. The goal of 3D reconstruction is to create a digital replica of a target object or environment that exists in the real world. Accordingly, applications of 3D reconstruction have emerged in a variety of disciplines, including health care [2], archaeology [3, 4], and telepresence [5–8]. 3D reconstruction is one of the fundamental requirements for truly immersive telepresence [9]. Telepresence could benefit applications such as remote collaboration, entertainment, advertising, teaching, hazardous-site exploration, and rehabilitation [7, 10, 11]. Communication over long distances is essential to our everyday lives and work: family members and friends relocate to live and work elsewhere, and many companies send their staff on international business trips. Video conferencing is a common communication mode that lets us instantly see and hear friends and colleagues from anywhere, and remote expert advice through video is well established in academia, health care, and industry. Despite its appeal, video is a relatively restricted means of communication compared with face-to-face encounters, since the interlocutors are perceived as flat and remote. In addition, video conferencing confines sharing to a restricted view area and the fixed perspective of the local user [12], leading to weak interactivity and a relatively poor user experience [13]. Immersive applications such as telepresence therefore represent a new generation of interactive services that provide end users with a rich and immersive experience. Telepresence technology enables a local user to connect with a remote user, so it is necessary to consider how the local user may capture and transmit his or her surroundings to the remote user. Current video calls suffer from several drawbacks, including a restricted shared view area and the fixed perspective of the local user [12]. Allowing the remote user to see an overview of the local user through advanced display technology could overcome these limitations, giving the user a more expansive viewing experience than a conventional phone or monitor [14–17]. By integrating telepresence and 3D reconstruction technology, there is an opportunity to eliminate various constraints of traditional video-based communication, and this advancement opens doors to new possibilities for remote collaboration [18, 19]. By utilizing realistic 3D user representations, modern telepresence systems enable individuals who are far apart to convene in virtual environments and interact with each other. However, it has been challenging for researchers, programmers, and innovators to find a survey of previous work, as few systematic reviews of 3D reconstruction for telepresence have been published in recent years.
We therefore consider it essential to produce a comprehensive review describing the most current methods and research findings in 3D reconstruction for telepresence systems. This report accordingly examines, analyzes, and answers the research questions. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement offers three primary advantages. First, it identifies the research issues that lead to systematic research, as shown in the PRISMA flow in Fig 1. Second, it helps to specify inclusion and exclusion criteria for systematic reviews. Third, it supports the analysis, within a specified time frame, of a broad scientific literature database [20]. The PRISMA Statement can assist the authors in thoroughly searching for terms relevant to 3D reconstruction methods for telepresence systems. Through the analysis and summarization of many dimensions, we hope this report can provide researchers with a systematic and more in-depth understanding of real-time 3D reconstruction methods for telepresence systems, along with references for this field of study. Given the metaverse's ability to extend the physical world using augmented reality (AR) and virtual reality (VR) technologies, allowing users to engage seamlessly between real and simulated surroundings through reconstructed representations and holograms, we hope the technical progress covered in this report can serve as a guide to the trends, strengths, and weaknesses of the implemented 3D reconstruction methods.

Fig 1. PRISMA flow of this report. https://doi.org/10.1371/journal.pone.0287155.g001

Background

Real-time 3D reconstruction can be defined as a process in which the scene, or the shape of an object in the physical world, is captured and a virtual representation of that scene or object is created in real time. In computer vision, the term 3D reconstruction refers to the process of recovering a 3D scene, or a target object within the scene, from either a single view or multiple views. A 3D representation of an entire scene, as classified in Fig 2, can be created using either a single photograph or multiple images captured from various perspectives as input. The past few years have seen multi-image 3D reconstruction addressed with several traditional algorithms, including stereo vision, structure from motion (SfM), and bundle adjustment. 3D reconstruction from a single image has been a long-standing and challenging task due to the large amount of information lost in going from two-dimensional (2D) images to 3D. With the advancement of neural networks and deep learning, it became clear that neural networks can be trained to learn the 3D structure of objects within a single image [20]. Red Green Blue-Depth (RGB-D) sensors produce a detailed real-time measurement of 3D surfaces as a four-channel signal: the RGB colour channels characterize the appearance of the surface, while the fourth depth channel provides local surface geometry.

Fig 2. Classification of 3D reconstruction into image-based and RGB-Depth sensor-based. https://doi.org/10.1371/journal.pone.0287155.g002
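To make the four-channel RGB-D signal concrete, the short sketch below back-projects a depth image into a coloured 3D point cloud with a pinhole camera model. It is a minimal illustration only; the intrinsic parameters (fx, fy, cx, cy) are placeholder values rather than the calibration of any reviewed system.

import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    # Back-project a depth image (metres) into an N x 6 array of XYZRGB points.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))      # pixel coordinates
    x = (u - cx) * depth / fx                           # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colours = rgb.reshape(-1, 3)
    valid = points[:, 2] > 0                            # drop pixels with no depth reading
    return np.hstack([points[valid], colours[valid]])

# Toy example: a flat synthetic depth map and placeholder intrinsics.
depth = np.full((480, 640), 1.5, dtype=np.float32)      # every pixel 1.5 m from the camera
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
cloud = depth_to_point_cloud(depth, rgb, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)                                      # (307200, 6)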
Since its introduction to the market roughly a decade ago, RGB-D sensor hardware, shown in Fig 3, has played a crucial role in developing advanced mapping and 3D reconstruction systems, and its significance remains unchanged as it continues to contribute to these technologies. RGB-D cameras, such as the Microsoft Kinect, Intel RealSense, and Stereolabs ZED, are sensing systems that include an RGB camera, an infrared projector, and an infrared sensor; they can collect RGB data and the depth map at the same time [21]. 3D reconstruction with a single sensor can be accomplished in a variety of ways, including moving the sensor around a static target object or environment, capturing the target object or environment with a static sensor, moving the target object in front of a static sensor, or moving the sensor around a moving object. 3D reconstruction using multiple sensors, in contrast, requires a suitable arrangement of the capturing depth sensors, taking into account the number of RGB-D sensors used and the field of view of the devices.

Fig 3. Type of sensor. https://doi.org/10.1371/journal.pone.0287155.g003

The fundamental technology behind today's structured-light and time-of-flight depth cameras has existed for several decades. The release of consumer-grade sensors that pack this technology into mass-produced, compact devices has made RGB-D cameras a commodity available to a broader consumer base. Several other RGB-D cameras, such as the Intel RealSense, PrimeSense Carmine, Google Tango, and Occipital's Structure Sensor, have followed in the wake of the Kinect, which was introduced by Microsoft in 2010. These cameras are affordable, and their lightweight sensors can capture per-pixel colour and depth images at sufficient resolution and rapid frame rates. These characteristics put them ahead of far more expensive 3D scanning systems, which is especially important when creating solutions for consumer-grade applications. The potential of these new sensors in the field of visual computing was swiftly recognized.

Methods

Within this section, we describe the process used to identify articles relevant to 3D reconstruction, specifically for telepresence systems. In this report, we employ the PRISMA technique, drawing on resources for systematic literature reviews (the ACM Digital Library, IEEE Xplore, Springer Link, Scopus, and ProQuest journals). The inclusion and exclusion criteria, the review process steps (identification, screening, and eligibility), and the abstraction and analysis of information are also carried out in compliance with the PRISMA approach.

Defining the research questions

The primary objective of the systematic literature review (SLR) is to understand, identify, and summarize the 3D reconstruction methods implemented in telepresence, guided by the research questions (RQs). The study topics and domains used to compare the performance of existing methods are also employed. A total of three RQs were formulated to achieve this objective:
RQ1: What are the input data for the 3D reconstruction method?
RQ2: What are the real-time 3D reconstruction methods implemented in telepresence systems?
RQ3: How can the real-time 3D reconstruction method be evaluated for the telepresence system or application?

Inclusion and exclusion criteria

The inclusion and exclusion criteria were decided as shown in Table 1.
Regarding literature type, we selected journal articles and conference proceedings that specifically concentrate on the study or design of 3D reconstruction methods or techniques employed in telepresence systems. Only literature with available full text was included. Review articles, book series, and book chapters were excluded. Non-English publications were also excluded to avoid misunderstanding and confusion over translated works. Finally, in terms of chronology (2010 to 2022), a period of thirteen years was chosen as long enough to capture the evolution of the research and the related publications. Because the review concentrated on real-time 3D reconstruction for telepresence systems, articles on 3D reconstruction that did not specifically target telepresence were removed from consideration.

Table 1. The inclusion and exclusion criteria. https://doi.org/10.1371/journal.pone.0287155.t001

Source and search study

The search was carried out as an online electronic search of several scientific journal databases. These online resources were selected because they were deemed to be the most relevant databases for delivering comprehensive information in the field of 3D reconstruction at the time of selection. As a peer-reviewed literature database for electrical engineering, computer science, and electronics, IEEE Xplore gives web access to more than five million full-text documents from some of the world's most highly cited journals. Scopus is one of the world's largest and most well-regarded abstract and citation databases of academic literature; its collection contains more than 40,000 titles from more than 10,000 international publishers, with almost 35,000 of these publications subjected to rigorous peer review. Scopus covers various forms of material, including books, journals, and conference papers. Springer also holds many relevant records on 3D reconstruction for telepresence systems, which is an additional advantage.

Study selection

All studies were recorded in a reference manager system, and duplicates were eliminated once the search was completed. The remaining studies were then assessed against the inclusion and exclusion criteria on the basis of their titles and abstracts. Where no judgment could be made on inclusion, the entire document was read to reach a final decision.

Data extractions

Data were retrieved from the included studies using a data extraction form. The form was constructed specifically for this study and included six data items, as seen in Table 2.

Table 2. Data extraction form items. https://doi.org/10.1371/journal.pone.0287155.t002

Synthesis of results

Data analysis of the investigations was carried out after data extraction. The data gathered were evaluated according to predetermined topics, in narrative format, derived from the research questions and discussed under the following headings: introduction of telepresence technology; 3D reconstruction methods for telepresence; and evaluation of 3D reconstruction methods for telepresence.

Comprehensive science mapping analysis

A comprehensive science mapping analysis, as in [22–24], was performed to produce a bibliometric measurement of the included studies' annual and country-specific production.
The relationship between the production of research work on 3D reconstruction for telepresence systems and the year of publication is illustrated in Fig 4. As can be seen, the largest number of these studies were published in 2016 and 2021, with five publications out of the thirty-eight selected papers. Fig 5 shows the geographical distribution of the included studies by country. Most of these studies (22, 48%) were published in North America (USA = 21, Canada = 1). After North America, the most common publication areas were Europe, with 16 studies (35%) (UK = 4, France = 1, Netherlands = 2, Germany = 4, Greece = 1, Finland = 1, Switzerland = 1, Sweden = 1, Italy = 1), and Asia, with seven studies (India = 2, Malaysia = 2, Korea = 1, Japan = 1, Russia = 1, China = 1).

Fig 4. Annual production. https://doi.org/10.1371/journal.pone.0287155.g004

Fig 5. Country-specific production. https://doi.org/10.1371/journal.pone.0287155.g005
Results

A total of 662 documents were collected from the six sources (the ACM Digital Library, Google Scholar, IEEE Xplore, Springer Link, Scopus, and ProQuest journals) together with candidate documents. The overall number of publications on these platforms is not an indicator of their relevance but rather of whether they cover the respective field.
We analyzed each study to assess whether it proposes a 3D reconstruction method for a telepresence system that addresses the limitations mentioned above. Finally, a total of 48 documents were chosen for the survey. This systematic review employed a standardized methodology (S1 Checklist).

Telepresence technology

Marvin Minsky proposed the concept of "telepresence" to refer to the capability to control the tools of a remote robot as though using one's own hands directly [25]; in this context, the term referred to remote manipulation paired with high-quality sensory feedback. Later, Bill Buxton applied telepresence as a concept to the field of telecommunications [26]. In collaborative work, he distinguished between task space and person space and stated that "successful telepresence is reliant on the sharing of both in a high-quality manner." In the context of a shared environment with several users, Buxton et al. proposed that each teleconference participant be represented by a personal station equipped with audio and visual capabilities, while additional networked electronic whiteboards created a shared task environment [27]. Since then, significant progress has been made toward a shared person space, with the concept of a shared and integrated environment for groups of people and tasks. Modern visual communication systems emphasize either visual or spatial aspects and can introduce temporal disruption. This can influence the flow and pace of a discussion and its meaning, how one perceives the other person, and the interaction between the participants over time. The visual and spatial properties can be balanced by merging a 3D reconstruction method and a display technology that are both free-viewpoint capable. Telepresence is a developing technology that attempts to deliver spatial immersion and virtual representation in a conventional, non-physical environment. Several telepresence technologies have been proposed to provide end users with immersive and functional interactions [28]. The design of interactive environments using perspective views will enhance and integrate the co-space experience [29]. The use of audio-visual data and other stimuli to better convey co-location between users in the same virtual area is also being explored further. The concept of telepresence combined with 3D reconstruction has motivated researchers for decades, although prototypes evolved slowly in the 1980s and 1990s due to technological constraints. In such systems, several cameras are deployed and their imagery is constantly updated, including the moving user, to build a 3D reconstruction of the room-scale scene [30]. Telepresence of this kind began in 1994 as a system for collecting photometric and depth data from many fixed, stationary cameras. Virtualized reality [31] allows real-world events and their continuous movement to be captured as an image sequence; nevertheless, the movement of the simulated imagery is not smooth and is frequently disrupted. Mulligan et al. [32] proposed a hybrid motion-and-stereo system to boost speed and power, even though the 3D environment must be obtained through transmission from the remote location. Towles et al. [33] stated that the sense of being there in the reconstructed scene is still feasible without relying on specialized hardware or software technology; however, it was difficult to develop a complete duplex system because of hardware and monitor configuration constraints.
Tanikawa [34] proposed a technique in which photos of a person were collected from various cameras on the network and displayed on a revolving flat-panel monitor in a range of image positions. However, due to the limitations of display technology, numerous positions overlap when a viewer moves around the viewing system. Kurillo et al. [35] presented an immersive VR system designed for remote interaction and the study of physical activity; to react to real-world events, a multi-camera configuration is required to execute the 3D reconstruction. As low-cost depth sensors such as the Microsoft Kinect became available, the number of studies and initiatives involving 3D reconstruction and its application in telepresence systems grew at a rapid pace. Beck et al. [6] introduced a novel immersive telepresence system that enables remote groups of users to interact in a shared virtual three-dimensional world created using multiple Kinect devices. The purpose of telepresence is to create the illusion of being present, or physically co-present, with a remote individual. Telepresence can result from humans being projected at their natural size, from 3D representations of the user or their environment, or from mutual interaction between the remote site and the user. A variety of paradigms have been realized, including a remote user who appears in the local location [36], a remote space appearing to extend beyond the local environment [4, 37], and a local user who is immersed in a remote area [38]. High-speed data transmission and real-time visualization of the transmitted data are essential in a 3D telepresence system, both to transmit the 3D representation of the user or environment and to ensure that the interaction provides immediate feedback [39]. Communication technology has developed rapidly with advancements in imaging and display technologies [40], and various 3D communication systems are emerging, bringing more vivid and immersive experiences [4, 6, 41, 42].

Streaming or transmission of data.

To achieve the crucial success elements of an immersive telepresence experience, 3D telepresence systems have demanding requirements for reconstruction, streaming speed, and the visual quality of the captured scene. For such a system to accomplish its intended purpose and be used effectively, multiple requirements must be met simultaneously. One of the most important criteria is low and dependable system latency; network latency is also essential for practical use. Similar requirements apply to videoconferencing services on 2D displays: if they are not met, audio/video synchronization can be impaired, unnatural breaks in visual continuity can occur, and the overall user experience is degraded [43]. It is essential to minimize memory use in data transmission, as the more data stored over an extended period, the larger the generated data set [44–46]. A potential solution is data compression [47–49]. Data compression is a technique that reduces the size of data relative to its original size, making storage and transmission more efficient [50]. Compression techniques are commonly employed in telepresence systems to ensure that the data transmitted to the remote site arrives at the appropriate timestamp and in real time [16, 51–53]. However, new source-selection challenges have arisen for real-time telepresence with 3D model reconstruction across a network with constrained bandwidth. The first is the combination of shared bandwidth and real-time requirements: the bandwidth required to support the massive amount of video data from the cameras is anticipated to surge, and the channel quality of each camera may affect the transmission rate [52, 53]. The transmission latency is also critical to allow real-time interaction in VR systems, so the bandwidth demands and the real-time requirements need to be assessed jointly [54, 55]. For data transmission in a telepresence system, [56] uses a server that compresses the voxel blocks and sends them to the client; the client listens for incoming volumetric data and assembles it once received, so that it holds an exact copy of the server-side model. [15] arranges for the incoming data from the reconstruction client to first be integrated concurrently into the truncated signed distance function (TSDF) voxel block model and then used to update the appropriate blocks and their seven neighbors in the negative direction in the Marching Cubes voxel block representation. Maintaining such a collection for each connected client enables the advanced streaming tactics required for a lag-free viewing experience and improves performance. [8] found that even after compression, depth images contribute most of the network traffic, whereas colour images are comparatively small with JPEG compression, and showed that adding temporal delta compression to the integrated lossless depth compression techniques increased the compression ratios. All compression and streaming systems must balance bandwidth and computing speed.
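As a rough, hedged illustration of the bandwidth trade-offs discussed above, and of the temporal delta compression of depth frames reported in [8], the sketch below losslessly compresses a 16-bit depth frame either on its own or as a difference against the previous frame. The frame size, the millimetre depth units, and the use of zlib are illustrative assumptions rather than the codec of any particular reviewed system.

import zlib
import numpy as np

def compress_depth_frame(depth, prev_depth=None):
    # Losslessly compress a 16-bit depth frame, optionally as a temporal delta.
    if prev_depth is not None:
        payload = (depth.astype(np.int32) - prev_depth.astype(np.int32)).astype(np.int16)
    else:
        payload = depth.astype(np.int16)
    return zlib.compress(payload.tobytes(), 1)           # fast compression level for real-time use

# Two synthetic 640 x 480 frames (depth in millimetres) that differ only in a small moving region.
rng = np.random.default_rng(0)
prev = (2000 + rng.integers(0, 40, (480, 640))).astype(np.uint16)
curr = prev.copy()
curr[200:240, 300:340] += 50                             # simulated motion in one image patch
full = compress_depth_frame(curr)
delta = compress_depth_frame(curr, prev)
print(len(curr.tobytes()), len(full), len(delta))        # the delta stream is much smaller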
Visualization.

Another absolute requirement for a telepresence system is sufficient resolution, which here includes both spatial and angular resolution. Low spatial resolution can introduce a certain degree of blur, which distorts the visual experience and makes it difficult or impossible to extract essential visual information such as the individual's facial expression [57]. Insufficient angular resolution can be worse, resulting in disrupted horizontal motion parallax; in such circumstances, visual phenomena include the crosstalk effect and sudden view jumps, the mortal enemies of glasses-free 3D vision, as seen in [58]. However, high-end extremes should also be avoided, as the total system latency is also determined by a specific system's processing demands and bandwidth utilization [43]. 3D imaging and display technologies are significant technical elements of 3D communication, and choosing appropriate 3D imaging and display technologies is essential when constructing a 3D communication system. 3D display methods can be categorized as binocular vision displays [59], volume displays [36], light field (LF) displays [43, 60], and holographic displays [61, 62]. The holographic display is a promising method for giving the human eyes all the depth information [26–30]. Under coherent illumination, computer-generated holography can reconstruct 3D intensity patterns from computer-generated holograms (CGHs) [63–65]. Recent studies [66–68] have developed holographic displays for computer-generated objects; however, few studies on holographic displays process the gathered 3D data into a real-time display [40]. [69] noted that the 3D display technology implemented in telepresence systems can be divided into two main device classes: projectors [62, 70–73] and head-mounted devices (HMDs) [45, 52, 58, 59, 74–77].
The projector-based 3D display technologies are on-stage holograms, autostereoscopic displays, and holographic projection, while HMDs can be classified into MR headsets and VR headsets. Before selecting the appropriate 3D display technology for a telepresence system, it is necessary to determine the number of users who will be displayed or projected and the number of users who will be perceiving the other user. The focus and purpose of the telepresence application should also play a role in determining the optimal 3D display.

Real-time 3D reconstruction methods for telepresence

Real-time 3D reconstruction is a crucial element of many immersive telepresence applications [9]; hence it is essential to identify which real-time 3D reconstruction methods are employed in telepresence systems. The general process of real-time 3D reconstruction consists of depth data acquisition, geometry extraction, surface generation, and fusion to generate a 3D model represented as point cloud or mesh data. The 3D reconstruction methods applied in telepresence systems depend on the input data: for image or video data, which consist of image frames, an additional process is required to obtain the depth data, and traditional methods such as shape-from-silhouette compute the surface of the visual hull of a scene object in the form of a polyhedron. For 3D reconstruction using RGB-D sensors, the obtained depth data can be pre-processed or used directly as input to compute the 3D representation of the target scene or object using a point cloud, mesh, or volumetric fusion approach. The included studies that were analyzed to extract information on 3D reconstruction methods for telepresence systems are presented in Table 3.

Table 3. Classified 3D reconstruction method for telepresence system of the selected primary studies. https://doi.org/10.1371/journal.pone.0287155.t003

Visibility method: Shape-from-silhouette.

The shape-from-silhouette approach creates a shape model called the visual hull to obtain a three-dimensional geometric representation of objects and people in the acquisition space. This method generates shape models for use in subsequent stages of the process, such as texture mapping or real-time interaction. The visual hull is defined geometrically as the intersection of the viewing cones, the generalized cones whose apices are the projective centers of the cameras and whose cross-sections overlap with the silhouettes of the scene, as illustrated in Fig 6. When piecewise-linear image contours are used as silhouettes, the visual hull becomes a regular polyhedron. Although a visual hull cannot model concavities, it can be computed efficiently and yields a very good approximation of the human shape; the main disadvantage of shape-from-silhouette techniques, as noted by [102], is that they cannot reconstruct concave regions adequately.

Fig 6. Visual hull from 4 views [11]. https://doi.org/10.1371/journal.pone.0287155.g006
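The exact polyhedral visual hull (EPVH) algorithm discussed next operates directly on the silhouette contours. As a simpler, hedged illustration of the underlying visual-hull idea, the sketch below approximates the hull by carving a voxel grid with binary silhouettes from a few calibrated views; the projection matrices and the all-foreground silhouettes are placeholders, not data from any reviewed system.

import numpy as np

def carve_visual_hull(voxels, silhouettes, projections):
    # Keep only voxels whose projection falls inside every silhouette (visual hull approximation).
    keep = np.ones(len(voxels), dtype=bool)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])        # homogeneous voxel centres
    for sil, P in zip(silhouettes, projections):                   # one pass per camera view
        pix = homog @ P.T                                          # project into the image plane
        u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
        v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        keep &= inside
        keep[inside] &= sil[v[inside], u[inside]]                  # voxel must lie in this silhouette
    return voxels[keep]

# Placeholder setup: a coarse 32^3 grid and two dummy views whose silhouettes cover the whole frame.
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 32)] * 3), axis=-1).reshape(-1, 3)
P = np.hstack([np.eye(3), [[0.0], [0.0], [4.0]]])                  # toy 3 x 4 projection matrix
hull = carve_visual_hull(grid, [np.ones((480, 640), bool)] * 2, [P, P])
print(len(grid), "->", len(hull))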
The EPVH algorithm has the unique ability to retrieve an exact 3D geometry with respect to the silhouettes obtained, which is a significant advantage over other algorithms. When models need to be textured, this is an important feature because it allows textures from the silhouette views to be mapped directly onto the 3D model. The 2D polygonal outline of the object in the scene is obtained for each view. A discrete polygonal description of the silhouettes of this type results in a unique polyhedral representation of the visual hull, whose structure is retrieved using the EPVH algorithm. Three steps are needed to execute this. First, a specific subset of polyhedron edges is generated: the viewing edges, which are the visual-hull edges induced by the viewing lines of contour vertices. The second step recovers all the other edges of the polyhedron mesh via a sequence of recursive geometric deductions; the positions of vertices that have not yet been computed are gradually inferred from those that have, with the viewing edges serving as the starting set of vertices. In the third step, the mesh is traversed repeatedly in a consistent manner to identify each face of the polyhedron.

Volumetric method: Truncated signed distance function (TSDF).

The volumetric surface representation based on the TSDF describes an environment in 3D using a voxel grid in which every voxel records the distance to the nearest surface. This representation has been widely used in current depth-camera-based environment mapping and localization systems. An n-dimensional world is represented by an n-dimensional grid of voxels of equal size, with each voxel's location specified by its center. Two significant values are stored for each voxel. First, sdfi(x) is the signed distance between the voxel center and the closest object surface in the direction of the current measurement; values are defined to be positive in free space in front of an object, and distances behind the surface, i.e., within the object, are negative. Likewise, each voxel has a weight wi(x) that quantifies the uncertainty associated with the corresponding sdfi(x). The subscript i indicates that this is the i-th observation. As illustrated in Fig 7, sdfi(x) is defined as sdfi(x) = depthi(pic(x)) − camz(x), where pic(x) is the projection of the voxel center x into the depth image. Thus, depthi(pic(x)) denotes the depth measured between the camera and the closest object surface point p on the ray crossing x, and camz(x) is the distance along the optical axis between the voxel and the camera. As a result, sdfi(x) is also a distance along the optical axis.

Fig 7. Truncated signed distance function (TSDF) (a) and the interpolated TSDF (b). Every voxel is represented as a dot in the grid, while the black line represents the surface. Positive distances are indicated by colours from blue to green; negative distances are indicated by colours from green to red [103]. https://doi.org/10.1371/journal.pone.0287155.g007

Truncating the SDF to ±t is advantageous because large distances are irrelevant for surface reconstruction, and the restricted value range can be used to reduce memory usage; tsdfi(x) denotes the truncated variant of sdfi(x). Fig 8(a) shows the tsdfi(x) of the voxel grid expressed using colour, and Fig 8(b) shows the TSDF sampled along a viewing ray. Observations from multiple views can be integrated into a single TSDF to combine data from multiple perspectives, increasing accuracy and filling in missing regions of the surface.
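The per-voxel quantities defined above, and the weighted integration described in the next paragraph, can be sketched compactly in code. The following is a minimal, assumption-laden illustration: it projects voxel centres (already expressed in the camera frame) into one depth image, computes the projective signed distance sdfi(x) = depthi(pic(x)) − camz(x), truncates it to ±t, and folds it into a running weighted average; the intrinsics, truncation distance, and unit observation weight are placeholders.

import numpy as np

def integrate_depth_frame(centers, tsdf, weights, depth, fx, fy, cx, cy, trunc=0.05):
    # Fuse one depth image (metres, camera frame) into a running per-voxel TSDF.
    # Assumes voxel centres lie in front of the camera (positive z).
    u = np.round(centers[:, 0] * fx / centers[:, 2] + cx).astype(int)   # pi_c(x): project voxel centre
    v = np.round(centers[:, 1] * fy / centers[:, 2] + cy).astype(int)
    h, w = depth.shape
    seen = (centers[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.full(len(centers), np.nan)
    sdf[seen] = depth[v[seen], u[seen]] - centers[seen, 2]    # sdf_i(x) = depth_i(pi_c(x)) - cam_z(x)
    update = seen & (sdf > -trunc)                            # skip voxels far behind the surface
    t = np.clip(sdf[update], -trunc, trunc)                   # tsdf_i(x): truncate to [-t, +t]
    w_new = 1.0                                               # simple constant per-observation weight
    tsdf[update] = (weights[update] * tsdf[update] + w_new * t) / (weights[update] + w_new)
    weights[update] += w_new
    return tsdf, weights

# Toy volume: 1000 voxel centres in front of the camera, fused with a flat synthetic depth image.
centers = np.random.default_rng(0).uniform([-0.5, -0.5, 0.5], [0.5, 0.5, 1.5], (1000, 3))
tsdf, weights = np.zeros(1000), np.zeros(1000)
depth = np.full((480, 640), 1.0, dtype=np.float32)
tsdf, weights = integrate_depth_frame(centers, tsdf, weights, depth, 525.0, 525.0, 319.5, 239.5)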
In practice, this integration is accomplished through a weighted running average, implemented as iterative TSDF updates. Here TSDFi(x) denotes the integration of all observations tsdfj(x) with 1 ≤ j ≤ i, and Wi(x) quantifies the uncertainty of TSDFi(x). For each voxel x in the grid, a new observation is incorporated with the update TSDFi(x) = (Wi−1(x)·TSDFi−1(x) + wi(x)·tsdfi(x)) / (Wi−1(x) + wi(x)) and Wi(x) = Wi−1(x) + wi(x), with the grid initialized to TSDF0(x) = 0 and W0(x) = 0.

Fig 8. Reference image of the TSDF algorithm. (a) The camera's field of view, optical axis, and ray (blue), as well as the TSDF grid (unseen voxels are white). (b) A TSDF sample taken along the ray [104]. https://doi.org/10.1371/journal.pone.0287155.g008

The server of the framework in [105] executes the scene reconstruction and employs the internal data structure Voxel Block Hashing (VBH) [106] for scene representation. VBH only stores a voxel block if at least one of its voxels lies within the TSDF's truncation band; the blocks are addressed by a spatial hash function that converts a 3D voxel position in world space to a hash table entry. [107] likewise implements VBH to hold the fused colour and depth information for a hashed volumetric representation.

Triangulation method: Mesh generation.

To some degree, the purpose of 3D reconstruction is to make the reconstructed scene or object visible. 3D reconstruction methods can generate a set of 3D points, but a point set by itself cannot reflect surface details. As a result, these spatial coordinates should be triangulated, so that a simulated surface composed of multiple triangles can be employed to approximate the actual surface. This process establishes a connected structure over the scattered 3D point sets, and the object's 3D model is then built from the triangular facets produced by the triangulation. The final model can be obtained by extracting the texture from the image and projecting it onto the 3D model. Direct meshing of point clouds is possible using Delaunay triangulation and its variants [21]; these methods are, however, susceptible to noise and to inconsistencies in the distances between points. Maimone and Fuchs [4, 41] independently construct triangle meshes for multiple cameras by connecting adjacent range-image pixels; the meshes are not blended together but are rendered separately, and the frames are then combined. Alexiadis et al. [57] take the concept further by merging the triangle meshes before rendering. While these techniques can achieve high frame rates, the output quality could be improved. Delaunay triangulation, first presented by Delaunay in 1934, is one of the most frequently used triangulation methods because of its optimality properties. There are three primary approaches to computing it: the incremental (point insertion) method, the divide-and-conquer method, and the triangulation growth method; the last was largely abandoned in the mid-1980s, while the other two remain common. The Delaunay triangulation D(P) of P is the dual of the Voronoi diagram in the following sense: it is defined on the same point set as the Voronoi diagram.
A simplex with vertices p1, …, pk belongs to the Delaunay triangulation if the corresponding Voronoi cells V1, …, Vk have a nonempty intersection. The Delaunay triangulation is a simplicial complex that decomposes the convex hull of the points in P. That is, if the common intersection of the corresponding Voronoi cells is not empty, the convex hull of four points in P defines a Delaunay cell (a tetrahedron); similarly, if the intersection of the corresponding Voronoi cells of three or two points is not empty, their convex hull defines a Delaunay face or edge. The Delaunay triangulation and Voronoi diagram are shown in Fig 9.

Fig 9. Delaunay triangulation and Voronoi diagram [108]. https://doi.org/10.1371/journal.pone.0287155.g009

The Voronoi diagram V(P) of P is a decomposition of R3 into convex polyhedral cells. Each Voronoi cell contains exactly one sample point together with all points of R3 that are not closer to any other sample point; that is, the Voronoi cell corresponding to p ∈ P is Vp = {x ∈ R3 : ‖x − p‖ ≤ ‖x − q‖ for all q ∈ P}. Voronoi facets are closed facets shared by two Voronoi cells, Voronoi edges are closed edges shared by three Voronoi cells, and Voronoi vertices are closed points shared by four Voronoi cells. In addition, for TSDF-based approaches the mesh can be computed using Marching Cubes [91].

Point-based method: Point cloud representation.

A point cloud representation refers to a group of recorded depth maps. Point clouds are the native output of numerous 3D sensors, such as laser scanners, and a common technique for representing 3D scenes. When a point-based input is converted into a continuous implicit function, discretized, and then transformed into an explicit form through costly polygonization [80] or ray casting [81], the computational overhead of switching between several data representations becomes apparent. Additionally, a regular voxel grid, which represents empty space and surfaces equally densely and thus severely limits the size of the reconstruction volume, imposes memory overheads. These memory limitations led to moving-volume systems [51, 57], which operate in very small volumes but release voxels as the sensor moves, and to volumetric hierarchical data structures [82], which incur further computational and data-structure complexity for a limited spatial gain. Simpler representations have also been investigated in addition to volumetric techniques, and the input obtained from depth/range sensors is well suited to point-based representations. In the case of real-time 3D reconstruction, [33] used a point-based technique and a custom structured-light sensor. In addition to reducing computational complexity, point-based methods lower the overall memory cost associated with volumetric (regular grid) approaches, as long as overlapping points are combined; such strategies have therefore been employed in larger reconstructions, although an obvious compromise between scale, speed, and quality becomes apparent. The flow of data in point-based surface rendering, as in Fig 10, starts with 3D points carrying attributes such as position, normal, and radius. The 3D points are then projected into scattered pixel data carrying depth, normal, or radius values, and an interpolation and shading process produces the image of the surface with depth and colour information. [85] demonstrated real-time 3D reconstruction using a point-based approach and a customized structured-light sensor.
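The memory saving mentioned above hinges on combining overlapping points. A minimal way to do this, sketched below under the assumption of a simple uniform cell size, is voxel-grid downsampling: every point falling into the same small cell is averaged into a single representative point.

import numpy as np

def merge_overlapping_points(points, cell=0.01):
    # Average all points that fall into the same voxel cell (simple merging of overlapping points).
    idx = np.floor(points / cell).astype(np.int64)
    idx -= idx.min(axis=0)                                    # shift cell indices to be non-negative
    flat = np.ravel_multi_index(idx.T, idx.max(axis=0) + 1)   # one integer key per occupied cell
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    merged = np.zeros((len(counts), points.shape[1]))
    np.add.at(merged, inverse, points)                        # sum the points that share a cell
    return merged / counts[:, None]                           # centroid of each occupied cell

pts = np.random.default_rng(1).normal(size=(50000, 3)) * 0.2  # synthetic dense point cloud
print(len(pts), "->", len(merge_overlapping_points(pts, cell=0.02)))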
Apart from lowering computational complexity, point-based methods have a lower memory footprint than volumetric (regular grid) alternatives, provided that overlapping points are combined. As a result, these techniques have been applied to larger-scale reconstructions; nevertheless, a clear trade-off between scale, speed, and quality becomes apparent. Point-based methods may also be more demanding in terms of storage than compact index-based volume representations built on Marching Cubes. The point-based approach is compact and easy to manage, which can benefit telepresence, since telepresence requires immediate transmission and fast, compact data structures to reconstruct and deliver a virtual 3D model to remote users in real time.

Fig 10. Flow of data in the process of point-based surface rendering [109]. https://doi.org/10.1371/journal.pone.0287155.g010

Evaluation of 3D reconstruction method for telepresence system

Evaluating a 3D reconstruction method is a challenging task, not only because of the complexity of the problem but also because of the absence of widely acknowledged, standardized testing procedures. A performance evaluation framework in this area is incomplete without consideration of the design of experimental test beds and analysis methodologies, as well as the definition of the ground truth. Furthermore, to establish valid objective comparisons, performance must be quantified and qualified in some way.

Performance analysis.

For performance assessment, [6, 16, 51, 55, 57, 70, 74, 77, 90] measured application frame rates, and network latency was measured by [6, 17, 91]. [58, 78] monitor the refresh rate of the 3D reconstruction method, and the latency of the overall system is measured in [15, 42, 51, 70, 75, 76, 78, 92, 93]. [79] measures the processing time for encoding silhouette images, and [81] compares average kernel times of IBVH. [41, 84, 86] measure the rendering rate, [4] measures the display frame rate, and [40, 54, 89] evaluate the speed of modelling. [43, 55, 74, 77, 86, 94] measure the processing time and frame rate of their enhancement methods, and [16, 42, 51, 52, 70, 75, 83, 88, 93] measure the computational time. The number of mesh faces is calculated by [57, 85], together with a sequential alignment comparison. [87] measures the root mean square error (RMSE) of the measurements, and [70] calculates the number of resulting vertices. The bandwidth for streaming the data to the remote site was measured by [17, 91].

Visual quality.

For visual quality evaluation, [79] measures the temporal qualities of the acquisition result, while [80, 83] compare the quality of the obtained results against a dataset. [4, 41] compare the temporal noise of their results. [19, 37, 42, 57, 94, 110] report comparative results with respect to the visual quality of the reconstructed model. [53] evaluates the quality of the rendering result, and [16, 88] make qualitative comparisons with other state-of-the-art methods. The peak signal-to-noise ratio (PSNR) is used to quantify the quality of the reconstructed compressed image; a higher PSNR value indicates a better-quality reconstructed image [108].
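For reference, PSNR is computed from the mean squared error between a reference image and its reconstructed or compressed counterpart. The sketch below assumes 8-bit images (peak value 255) and purely synthetic data, just to illustrate the metric.

import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    # Peak signal-to-noise ratio in dB; higher means the reconstruction is closer to the reference.
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (480, 640), dtype=np.uint8)          # stand-in reference frame
noisy = np.clip(ref + rng.normal(0.0, 5.0, ref.shape), 0, 255).astype(np.uint8)
print(round(psnr(ref, noisy), 2), "dB")                         # roughly 34 dB for this noise level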
User test.

Usability testing has been conducted by [6, 55]. [6] conducted an experiment in which, at the end of the session, participants answered a questionnaire consisting of 10 topics, each covered by groups of two to four separate questions answered on Likert scales with varying orientations, regarding the overall experience, the usage experience, the comprehensibility of body language and gaze communication, the acceptance of the apparatus used, and the illusion of physical co-presence. [55] let the participants experience two separate prototypes and asked them to rate and compare both systems in terms of which one was more accessible, more preferable, or made the participant feel more present and better perceive the presence of the other remote user. User studies [43, 58, 70, 93] and pilot studies [15] have also been conducted and evaluated. [91] evaluates the practicality of the framework for telepresence in live-captured scenes, while [58, 59] evaluate the user experience of the system.
Visualization. Another absolute requirement for a telepresence system is sufficient resolution. In this context, resolution includes both spatial and angular resolution. Low spatial resolution can result in a certain degree of blur, which distorts the visual experience and makes it difficult or impossible to extract essential visual information, such as the individual's facial expression [57]. Insufficient angular resolution can make matters worse by disrupting horizontal motion parallax. In such circumstances, visual phenomena such as the crosstalk effect and sudden view jumps, the mortal enemies of glasses-free 3D vision, appear, as seen in [58]. However, high-end extremes should also be avoided, as the total system latency is also determined by a specific system's processing demands and bandwidth utilization [43]. 3D imaging and display technologies are significant technical elements of 3D communication. When constructing a 3D communication system, choosing appropriate 3D imaging and display technologies is essential. The 3D display methods can be categorized as binocular vision display [59], volume display [36], light field (LF) display [43, 60], and holographic display [61, 62]. Holographic display is a promising method for giving human eyes all the depth information [26–30]. Under coherent illumination, computer-generated holography may reconstruct 3D intensity patterns from computer-generated holograms (CGHs) [63–65]. In recent studies [66–68], holographic displays for computer-generated objects have been developed. However, few studies on holographic displays process captured 3D data into a real-time display [40]. [69] mentioned that the 3D display technology implemented in telepresence systems can be divided into two main device classes: projectors [62, 70–73] and head-mounted displays (HMDs) [45, 52, 58, 59, 74–77]. Projector-based 3D display technologies include on-stage holograms, autostereoscopic displays, and holographic projection, while HMDs can be classified into MR headsets and VR headsets. Before selecting the appropriate 3D display technology for a telepresence system, it is necessary to determine the number of users who will be displayed or projected and the number of users who will be perceiving the other user. The focus and purpose of the telepresence technology usage should also play a role in determining the optimal 3D display.
Real-time 3D reconstruction methods for telepresence
Real-time 3D reconstruction is a crucial element for many immersive telepresence applications [9]; hence it is essential to identify which real-time 3D reconstruction methods are employed in telepresence systems.
The general process involved in real-time 3D reconstruction can be identified as depth data acquisition, geometry extraction, surface generation, and fusion, generating a 3D model represented as point cloud or mesh data. Several 3D reconstruction methods are applied in telepresence systems depending on the input data; for image or video data, which consist of image frames, an additional process is required to obtain the depth data. Traditional methods, such as the shape-from-silhouette method, compute the surface of the visual hull of a scene object in the form of a polyhedron. For 3D reconstruction using RGB-D sensors, the acquired depth data can be pre-processed or used directly as input to compute the 3D representation of the target scene or object using a point cloud, mesh, or volumetric fusion approach. The included studies that were analyzed to extract the information regarding 3D reconstruction methods for telepresence systems are presented in Table 3. Table 3. Classified 3D reconstruction method for telepresence system of the selected primary studies. https://doi.org/10.1371/journal.pone.0287155.t003 Visibility method: Shape-from-silhouette. The shape-from-silhouette approach creates a shape model called the visual hull to obtain a three-dimensional geometric representation of objects and people in the acquisition space. This method generates shape models for use in subsequent stages of the process, such as texture mapping or real-time interaction. The visual hull is defined geometrically as the intersection of the viewing cones, which are generalized cones whose apices are the projective centers of the cameras and whose cross-sections overlap with the silhouettes of the scene, as illustrated in Fig 6. When piecewise-linear image contours are considered for the silhouettes, the visual hull becomes a regular polyhedron. Although a visual hull cannot model concavities, it can be computed efficiently, resulting in a good approximation of the human shape. The disadvantage of shape-from-silhouette techniques, as mentioned in [102], is that they are not capable of reconstructing concave regions adequately. Fig 6. Visual hull from 4 views [11]. https://doi.org/10.1371/journal.pone.0287155.g006 The exact polyhedral visual hull (EPVH) algorithm has the unique ability to retrieve an exact 3D geometry consistent with the silhouettes obtained, which is a significant advantage over other algorithms. When models need to be textured, this is an important feature because it allows textures from the silhouette views to be mapped directly onto the 3D model. The 2D polygonal outline of the scene object is obtained for each view. A discrete polygonal description of silhouettes of this type results in a unique polyhedral representation of the visual hull, the structure of which is retrieved using the EPVH algorithm. Three steps are required to execute this. First, a specific subset of polyhedron edges is generated: the viewing edges, which are the edges induced by the viewing lines of contour vertices. The second step recovers all the other edges of the polyhedron mesh via a sequence of recursive geometric deductions; the positions of vertices that have not yet been computed are gradually inferred from those already computed, with the viewing edges serving as the starting set of vertices. In the third step, the mesh is traversed repeatedly in a consistent manner to identify each face of the polyhedron.
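The EPVH algorithm recovers an exact polyhedral hull; as a simpler illustration of the visual hull principle itself, the following sketch carves a voxel grid by keeping only the voxels whose projections fall inside every silhouette. The 3x4 projection matrices and binary silhouette masks are assumed to be given by calibration and segmentation, and the function is a hypothetical helper, not an algorithm from the surveyed studies.

```python
# Hedged sketch: voxel-based space carving of a visual hull (not EPVH).
# A voxel survives only if it projects inside the silhouette of every view,
# i.e. it lies in the intersection of all viewing cones.
import numpy as np

def carve_visual_hull(voxel_centers, projections, silhouettes):
    """voxel_centers: (N, 3) float array; projections: list of 3x4 camera matrices;
    silhouettes: list of HxW boolean masks. Returns a boolean keep-mask of length N."""
    keep = np.ones(len(voxel_centers), dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])   # (N, 4)
    for P, sil in zip(projections, silhouettes):
        uvw = homog @ P.T                                  # homogeneous image points
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxel_centers), dtype=bool)
        hit[inside] = sil[v[inside], u[inside]]            # inside this silhouette?
        keep &= hit                                        # intersection over all views
    return keep
```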
Volumetric method: Truncated signed distance function (TSDF). The volumetric surface representation based on the TSDF describes an environment in 3D using a voxel grid in which every voxel records the distance to the nearest surface. This representation is widely used in current depth-camera-based environment mapping and localization systems. An n-dimensional world is represented by an n-dimensional grid of voxels of equal size. A voxel's location is specified by its center, and each voxel stores two significant values. First, sdf_i(x) is the signed distance between the voxel center x and the closest object surface in the direction of the current measurement. Values are defined to be positive in free space in front of an object; distances behind the surface, i.e., inside the object, are negative. Likewise, each voxel has a weight w_i(x) that quantifies the uncertainty associated with the corresponding sdf_i(x). The subscript i indicates that this is the i-th observation. sdf_i(x) is defined as in Fig 7 and the following equation: sdf_i(x) = depth_i(pic(x)) - cam_z(x), where pic(x) is the projection of the voxel center x into the depth image. Thus, depth_i(pic(x)) denotes the depth measured between the camera and the closest object surface point p on the ray crossing x, and cam_z(x) is the distance along the optical axis between the voxel and the camera. As a result, sdf_i(x) is also a distance along the optical axis. Fig 7. Truncated signed distance function (TSDF) (a) and the interpolated TSDF (b). Every voxel is represented as a dot in the grid, while the black line represents the surface. Positive distances are indicated by colours from blue to green; negative distances are indicated by colours from green to red [103]. https://doi.org/10.1371/journal.pone.0287155.g007 Truncating the SDF to ±t is advantageous since large distances are irrelevant for surface reconstruction, and the restricted value range can be used to reduce memory usage. tsdf_i(x) denotes the truncated variant of sdf_i(x). Fig 8(a) shows the tsdf_i(x) values of the voxel grid expressed using colour, and the TSDF is sampled along a viewing ray in Fig 8(b). Observations from multiple views can be integrated into a single TSDF to increase accuracy or to fill in missing spots on the surface. This is accomplished through weighted summation, typically via iterative TSDF updates: TSDF_i(x) denotes the integration of all observations tsdf_j(x) with 1 ≤ j ≤ i, and W_i(x) quantifies the uncertainty of TSDF_i(x). The following update step incorporates a new observation for each voxel x in the grid: TSDF_i(x) = (W_{i-1}(x) TSDF_{i-1}(x) + w_i(x) tsdf_i(x)) / (W_{i-1}(x) + w_i(x)), with W_i(x) = W_{i-1}(x) + w_i(x). The grid is initialized with TSDF_0(x) = 0 and W_0(x) = 0. Fig 8. Reference image of the TSDF algorithm. (a) The camera's field of view, optical axis, and ray (blue), as well as the TSDF grid (unseen voxels are white). (b) A TSDF sample taken along the ray [104]. https://doi.org/10.1371/journal.pone.0287155.g008
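As an illustration of the weighted integration described above, a minimal sketch of the per-voxel TSDF update, assuming the standard running weighted average used in TSDF fusion, could look as follows; the grid size, truncation band, and weights are arbitrary placeholders.

```python
# Hedged sketch of TSDF fusion: truncate each observation to [-t, t] and fold it
# into the global grid with a running weighted average, starting from
# TSDF_0 = 0 and W_0 = 0 as described in the text.
import numpy as np

def truncate(sdf, t):
    """Clamp signed distances to the truncation band [-t, t]."""
    return np.clip(sdf, -t, t)

def integrate(TSDF, W, tsdf_i, w_i):
    """Fuse one observation (tsdf_i, w_i) into the global grid (TSDF, W)."""
    W_new = W + w_i
    TSDF_new = np.where(W_new > 0,
                        (W * TSDF + w_i * tsdf_i) / np.maximum(W_new, 1e-9),
                        TSDF)
    return TSDF_new, W_new

TSDF = np.zeros((64, 64, 64))          # TSDF_0(x) = 0
W = np.zeros_like(TSDF)                # W_0(x) = 0
observation = truncate(np.random.randn(64, 64, 64) * 0.05, t=0.1)  # placeholder frame
TSDF, W = integrate(TSDF, W, observation, w_i=1.0)
```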
The server in the framework of [105] executes scene reconstruction and employs Voxel Block Hashing (VBH) [106] as its internal data structure for scene representation. VBH only stores a voxel block if at least one of its voxels lies within the TSDF's truncation band. The blocks are addressed by a spatial hash function, which converts a 3D voxel position in world space into a hash table entry. [107] implements VBH to hold the fused colour and depth information in a hashed volumetric representation.
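A minimal sketch of such a spatial hash, with illustrative constants rather than the exact implementation of [105–107], could look as follows: world-space positions map to integer block coordinates, which index a hash table, so only blocks near the surface ever need to be allocated.

```python
# Hedged sketch of voxel block hashing: block size, voxel size, table size and
# the hashing primes are assumptions chosen for illustration only.
import numpy as np

P1, P2, P3 = 73856093, 19349669, 83492791   # commonly used spatial-hashing primes
BLOCK_SIZE = 8                               # 8^3 voxels per block (assumption)
VOXEL_SIZE = 0.005                           # 5 mm voxels (assumption)
TABLE_SIZE = 1 << 20

def block_coord(p):
    """World-space point (x, y, z) in metres -> integer voxel-block coordinate."""
    return tuple(np.floor(np.asarray(p) / (VOXEL_SIZE * BLOCK_SIZE)).astype(int))

def block_hash(b):
    """Integer block coordinate -> hash-table bucket index."""
    x, y, z = b
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % TABLE_SIZE

table = {}                                   # bucket index -> list of (coord, voxel data)

def allocate(p):
    """Allocate (if needed) the TSDF block containing world-space point p."""
    b = block_coord(p)
    bucket = table.setdefault(block_hash(b), [])
    if all(c != b for c, _ in bucket):
        bucket.append((b, np.zeros((BLOCK_SIZE,) * 3, dtype=np.float32)))

allocate((0.10, 0.02, 0.51))                 # example: block containing this point
```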
Triangulation method: Mesh generation. To some degree, the purpose of 3D reconstruction is to make the reconstructed scene or object visible. 3D reconstruction algorithms can generate a set of 3D points; however, a point set alone cannot reflect surface details. As a result, these spatial coordinates should be triangulated, and a simulated surface composed of multiple triangles can be employed to approximate the actual surface. This process establishes a connected structure over the scattered 3D point sets. Following triangulation, the object's 3D model is created from triangular planar patches, and the textured model is then obtained by extracting the texture from the image and projecting it onto the 3D model. The direct meshing of point clouds is possible using Delaunay triangulation and its variants [21]. These methods are susceptible to noise and to inconsistencies in the distances between points. Maimone and Fuchs [4, 41] independently construct triangle meshes for multiple cameras by connecting adjacent range image pixels. The meshes are not blended together but are rendered separately, and the frames are then combined. Alexiadis et al. [57] take the concept further by merging triangle meshes before rendering. While these techniques can achieve high frame rates, the output quality could be improved. Delaunay triangulation is one of the most frequently used triangulation methods since it is characterized by optimality; Delaunay presented it for the first time in 1934. There are three primary approaches to Delaunay triangulation: the incremental method (incremental insertion), the divide-and-conquer algorithm (segmentation-merger algorithm), and the triangulation growth algorithm; the last was largely abandoned in the mid-1980s, and the other two techniques are particularly common. The Delaunay triangulation D(P) of P is the dual of the Voronoi diagram in the following sense: a simplex with vertices p1, …, pk belongs to the Delaunay triangulation if the Voronoi cells V1, …, Vk corresponding to p1, …, pk have a nonempty common intersection. It is a simplicial complex derived from the convex hull of the points in P. That is, if the common intersection of the corresponding Voronoi cells is not empty, the convex hull of four points in P defines a Delaunay cell (tetrahedron); similarly, the convex hull of three or two points whose corresponding Voronoi cells have a nonempty intersection defines a Delaunay face or edge. The Delaunay triangulation and Voronoi diagram are shown in Fig 9. Fig 9. Delaunay triangulation and Voronoi diagram [108]. https://doi.org/10.1371/journal.pone.0287155.g009 The Voronoi diagram V(P) of P is a decomposition of R3 into convex polyhedral cells. Each Voronoi cell contains exactly one sample point and all points of R3 that are not closer to any other sample point; that is, the Voronoi cell corresponding to p ∈ P is V_p = {x ∈ R3 : ||x - p|| ≤ ||x - q|| for all q ∈ P}. Voronoi facets are closed facets shared by two Voronoi cells, Voronoi edges are closed edges shared by three Voronoi cells, and Voronoi vertices are closed points shared by four Voronoi cells. In addition, for TSDF-based approaches, the mesh can also be computed using Marching Cubes [91]. Point-based method: Point cloud representation. A point cloud representation refers to a group of recorded depth maps. Point clouds are both the output of numerous 3D sensors, such as laser scanners, and a technique for representing 3D scenes. When a point-based input is turned into a continuous implicit function, discretized, and then transformed into an explicit form through costly polygonization [80] or ray casting [81], the computational overhead of switching between several data representations becomes apparent. Additionally, using a regular voxel grid, which densely represents both empty space and surfaces and thus severely limits the size of the reconstruction volume, imposes memory overheads. As a result of these memory limitations, moving volume systems [51, 57], which operate on very small volumes but release voxels as the sensor moves, and hierarchical volumetric data structures [82], which incur additional computational and data structure complexity for a limited spatial gain, have been developed. Simpler representations have also been investigated in addition to volumetric techniques. The input obtained from depth/range sensors is better suited to point-based representations. In the case of real-time 3D reconstruction, [33] used a point-based technique with a custom structured-light sensor. In addition to reducing computational complexity, point-based methods lower the overall memory footprint associated with volumetric approaches (regular grids) as long as overlapping points are combined; therefore, such strategies have been employed for larger reconstructions. However, an obvious compromise between scale, speed, and quality becomes apparent. The data flow of point-based surface rendering, as in Fig 10, starts with 3D points carrying attributes such as position, normal, and radius. Projecting the 3D points into scattered pixel data gives per-pixel depth, normal, or radius values, and the subsequent interpolation and shading process yields an image of the surface with depth and colour information. [85] demonstrated real-time 3D reconstruction using a point-based approach and a customized structured-light sensor. Apart from lowering computational complexity, point-based methods have a lower memory footprint than volumetric (regular grid) alternatives if the points that overlap are combined; as a result, these techniques have been applied to larger-scale reconstructions. Nevertheless, a clear trade-off between scale, speed, and quality becomes apparent. Point-based methods might also be more demanding in terms of storage than compact index-based volume representations based on Marching Cubes. The approach is compact and readily manages data, which can benefit telepresence, where instant transmission and fast, compact data structures are required to reconstruct and deliver a virtual 3D model to remote users in real time.
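To make the point-based rendering flow of Fig 10 concrete, the following sketch, an assumption rather than the pipeline of any surveyed system, projects coloured 3D points through pinhole intrinsics and splats them into depth and colour buffers with a per-pixel z-test.

```python
# Hedged sketch of point splatting: the closest point wins each pixel; interpolation
# and shading of the resulting buffers are omitted for brevity.
import numpy as np

def splat(points, colours, K, img_shape):
    """points: (N, 3) in camera coordinates; colours: (N, 3) uint8; K: 3x3 intrinsics."""
    h, w = img_shape
    depth = np.full((h, w), np.inf)
    image = np.zeros((h, w, 3), dtype=np.uint8)
    z = points[:, 2]
    valid = z > 0                                   # keep points in front of the camera
    uvw = points[valid] @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inb = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi, ci in zip(u[inb], v[inb], z[valid][inb], colours[valid][inb]):
        if zi < depth[vi, ui]:                      # z-test: keep the closest point
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return depth, image
```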
Fig 10. Flow of data on the process of point-based surface rendering [109]. https://doi.org/10.1371/journal.pone.0287155.g010
Evaluation of 3D reconstruction method for telepresence system
Evaluating the 3D reconstruction method is a challenging task. This is not just owing to the increased complexity of the problem but also the absence of widely acknowledged standardized testing procedures. A performance evaluation system in this area is lacking without consideration of the design of experimental test beds and analysis methodologies, as well as the definition of the ground truth. Furthermore, to establish valid objective comparisons, the performance must be quantified and qualified in some way. Performance analysis.
For performance assessment, [6, 16, 51, 55, 57, 70, 74, 77, 90] measured application frame rates, while network latency was measured by [6, 17, 91]. [58, 78] monitor the refresh rates of the 3D reconstruction method, and the latency of the overall system is measured in [15, 42, 51, 70, 75, 76, 78, 92, 93]. [79] measure the processing time for encoding silhouette images. [81] compares average kernel times of IBVH. [41, 84, 86] measure the rendering rate. [4] measure the frame rate for the display. [40, 54, 89] evaluate the speed of modelling. [43, 55, 74, 77, 86, 94] measure the processing time and frame rate of the enhancement method. [16, 42, 51, 52, 70, 75, 83, 88, 93] measure the computational time. The number of mesh faces is calculated by [57], and [85] record the sequential alignment comparison. [87] measure the root mean square error (RMSE) of measurements, and [70] calculate the number of resultant vertices. The bandwidth for streaming data to the remote site was measured by [17, 91]. Visual quality. The visual quality evaluation conducted by [79] measures the temporal quality of the acquisition result, while [80, 83] compare the quality of their results against a dataset. [4, 41] made comparisons of temporal noise for their results. [19, 37, 42, 57, 94, 110] compare the visual quality of the reconstructed model. [53] evaluates the quality of the rendering result, and [16, 88] made a qualitative comparison with other state-of-the-art methods. Peak signal-to-noise ratio (PSNR) is used to quantify the quality of the reconstructed compressed image; a higher PSNR value indicates a better quality of the recreated image [108]. User test. Usability testing has been conducted by [6, 55]. [6] conducted an experiment and, at the end of the session, administered a questionnaire consisting of 10 topics, each covered by groups of two to four separate questions answered on Likert scales with varying orientations, regarding the overall experience, usage experience, comprehensibility of body language and gaze communication, acceptance of the apparatus used, and the illusion of physical co-presence. [55] let participants experience two separate prototypes and asked them to rate and compare both systems in terms of which was more accessible, more preferable, or made the participant feel more present and better perceive the presence of the remote user. User studies [43, 58, 70, 93] and pilot studies [15] have been conducted and evaluated. [91] evaluate the practicality of the framework for telepresence in live-captured scenes, while [58, 59] evaluate the user experience of the system.
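For reference, minimal sketches of two of the quantitative measures mentioned above, RMSE and PSNR, are given below; these are the standard definitions and are not tied to any particular surveyed system.

```python
# Hedged sketch: RMSE between an estimate and ground truth, and PSNR of a
# reconstructed or compressed image (higher PSNR indicates better quality).
import numpy as np

def rmse(estimate, ground_truth):
    diff = np.asarray(estimate, dtype=float) - np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(reference, reconstructed, peak=255.0):
    mse = np.mean((np.asarray(reference, float) - np.asarray(reconstructed, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)          # placeholder images
rec = np.clip(ref + np.random.randint(-3, 4, ref.shape), 0, 255).astype(np.uint8)
print(f"RMSE = {rmse(rec, ref):.2f}, PSNR = {psnr(ref, rec):.1f} dB")
```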
Discussion
The publication trend indicates an increasing interest in integrating 3D reconstruction with telepresence systems. However, given the importance of the topic, relatively few reports summarizing this field of study have been found. Hopefully, this systematic literature review can be helpful and valuable for other researchers. Overall, this systematic review of the 48 studies helped answer our three research questions. RQ1: What are the input data for the 3D reconstruction method? The input data used for the 3D reconstruction methods are images and video captured using regular cameras, or depth and colour streams acquired using RGB-D sensors. The input device types are detailed as illustrated in Fig 10. Over the last decade, a new class of cameras has emerged that enables detailed measurement of the three-dimensional geometry of the scanned scene, overcoming the limitations of conventional colour cameras. These sensors take a dense per-pixel depth measurement of the scene, i.e., the distance to points in the scene, and store this information. In most cases, these estimated depth values are provided as a depth image of the visible areas of the scene, in a two-and-a-half-dimensional form. RGB-D is the combination of an RGB camera with a depth sensor of this type, enabling the simultaneous capture of scene appearance and scene geometry at acceptable frame rates as a stream of colour and depth images. Structured light and active infrared (IR) are two different methods used for depth sensing.
Structured light involves projecting known patterns onto a scene and analyzing their deformations to calculate depth information, while active infrared (IR) uses emitted and reflected infrared light to obtain depth information. Time of flight (TOF) and stereo depth sensing are two further techniques used in computer vision to determine depth: TOF measures the time it takes for light to travel and return, while stereo depth sensing compares images from two cameras to calculate depth. A wide range of RGB-D products, such as the Microsoft Kinect for Xbox 360, Kinect V2, Azure Kinect, Intel RealSense, Structure Sensor, and the Asus Xtion Pro, have been created over the last ten years. Although earlier sensors were costly and only available to a few subject specialists, range sensors are now everywhere and are even available on the newest generation of mobile devices. Current sensors are tiny, cheap, and accessible to a large audience in everyday use. The availability of inexpensive sensor technology led to a significant leap in research, particularly with regard to more robust static and dynamic reconstruction methodologies, from 3D scanning applications to precise facial and body tracking systems to be integrated with telepresence systems. Table 4 summarizes the details of the various types of depth sensors. Table 4. Comparison of consumer-grade RGB-D sensors. https://doi.org/10.1371/journal.pone.0287155.t004 RQ2: What are the real-time 3D reconstruction methods implemented in telepresence systems? The 3D reconstruction methods for telepresence can be classified into four categories. First is the visibility method, which is most suitable for the traditional computer vision approach, using image or video frames as input and applying the shape-from-silhouette algorithm. The second is the volumetric method, which executes a Truncated Signed Distance Function (TSDF) to generate the surface representation of the captured object or environment as a voxel grid in which every voxel records the distance to the nearest surface. The third is the triangulation method, which is used for mesh generation; the Delaunay triangulation algorithm is the most widely used algorithm for generating the mesh of the reconstructed model. Last but not least is the point-based method. This method is mainly preferred as it helps to reduce computational complexity and lowers the overall memory associated with volumetric approaches. The telepresence system then represents the target objects as sets of 3D volume pixels, or voxels, in a 3D box. The actual environment is produced dynamically from any viewing angle at the local endpoint, inserting the point cloud object into a scene or rendering many concurrent point cloud objects. Consequently, it requires complicated preprocessing and rendering, including setups with many camera angles and RGB and depth cameras. Moreover, volumetric media is extremely dense because each voxel is transmitted only once. Therefore, a higher level of compression exchanges computation and latency for the bandwidth and latency required for networking, and vice versa. The accuracy of the reconstructed model can affect the telepresence system, as noted by [56, 58], where the quality of the reconstruction and the offset of visual cues directly influence the user experience during remote communication through telepresence. There is an apparent trade-off between scale, speed, and quality.
By comparing the earlier and the most recent studies analyzed in this report, it is apparent that each of the methods has been continuously improved, aided by the gradual advancement of the devices and machines made available to researchers. It is vital to adopt the appropriate reconstruction method to ensure that the accuracy and computational requirements of the reconstructed model remain advantageous when integrated with a telepresence system, resulting in positive user interaction. RQ3: How can the real-time 3D reconstruction method be evaluated for the telepresence system or application? There are several ways to evaluate the overall system when a 3D reconstruction method is integrated with telepresence technology. The performance of the 3D reconstruction and telepresence components can be quantified using performance analysis, visual quality comparison, and data gathered from user testing. The evaluation of a 3D telepresence system mostly depends on the research's main objective. If the researchers' work mainly focuses on improving the quality of the reconstructed model or the 3D reconstruction method itself, then the visual quality comparison and the performance of the implemented 3D reconstruction method are measured as the evaluation. When the research is directed more towards improving the user experience of the system, user testing is the appropriate evaluation.
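As a simple illustration of the performance-analysis side of RQ3, the following sketch logs per-frame processing time and the resulting frame rate for an arbitrary reconstruction or rendering step; the profiled workload is a placeholder, not a component of any surveyed system.

```python
# Hedged sketch: wall-clock profiling of a per-frame step to report mean frame
# time (ms) and frames per second, the measures most often cited above.
import time
import statistics

def profile(step, n_frames=100):
    """Run `step` n_frames times and return (mean frame time in ms, FPS)."""
    times = []
    for _ in range(n_frames):
        t0 = time.perf_counter()
        step()                                   # e.g. fuse or render one frame
        times.append(time.perf_counter() - t0)
    mean_s = statistics.mean(times)
    return mean_s * 1000.0, 1.0 / mean_s

frame_ms, fps = profile(lambda: sum(i * i for i in range(20000)))   # dummy workload
print(f"{frame_ms:.2f} ms per frame, {fps:.1f} FPS")
```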
Conclusion
This work conducted a comprehensive systematic literature survey to detect and examine the various 3D reconstruction methods scientists use for telepresence, and we also present their advantages and disadvantages in the report. Forty-eight publications were selected and analyzed through several phases of the systematic review process. The literature under evaluation has certain restrictions, as only articles published between 2010 and 2022 were considered in the review. This restricts the scope of the study and leaves future research more room to examine the devices and methods that preceded 2010. From this systematic literature review, researchers may gain an in-depth understanding and use this material to advance their studies in this field and its application to real-time 3D reconstruction. Supporting information S1 Checklist. PRISMA 2009 checklist. https://doi.org/10.1371/journal.pone.0287155.s001 (DOC) S1 Fig. PRISMA 2009 flow diagram. https://doi.org/10.1371/journal.pone.0287155.s002 (DOC) Acknowledgments We extend our heartfelt gratitude and deepest appreciation to the Mixed and Virtual Reality Laboratory (mivielab) and ViCubeLab at the University of Technology Malaysia (UTM) for their invaluable support, unwavering dedication, and provision of exceptional facilities throughout the course of this research. Their technical assistance and resources have been instrumental in ensuring the successful execution and outcomes of our study. TI - A systematic literature review: Real-time 3D reconstruction method for telepresence system JF - PLoS ONE DO - 10.1371/journal.pone.0287155 DA - 2023-11-15 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/a-systematic-literature-review-real-time-3d-reconstruction-method-for-x3WMH7JKpr SP - e0287155 VL - 18 IS - 11 DP - DeepDyve ER -