Multilevel active registration for kinect human body scans: from low quality to high quality

Multimedia Systems (2018) 24:257–270
DOI 10.1007/s00530-017-0541-1
REGULAR PAPER

Zongyi Xu¹ · Qianni Zhang¹ · Shiyang Cheng²

Received: 12 September 2016 / Accepted: 7 February 2017 / Published online: 10 March 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com

Communicated by P. Pala.

* Zongyi Xu, zongyi.xu@qmul.ac.uk
Qianni Zhang, qianni.zhang@qmul.ac.uk
Shiyang Cheng, shiyang.cheng11@imperial.ac.uk

¹ Queen Mary University of London, Mile End Rd, London E1 4NS, UK
² Imperial College London, Kensington, London SW7 2AZ, UK

Abstract Registration of the 3D human body has been a challenging research topic for decades. Most traditional human body registration methods require manual assistance or other auxiliary information such as texture and markers. The majority of these methods are tailored for high-quality scans from expensive scanners. Following the introduction of low-quality scans from cost-effective devices such as Kinect, 3D data capture of the human body has become more convenient. However, due to the inevitable holes, noise and outliers in low-quality scans, the registration of the human body becomes even more challenging. To address this problem, we propose a fully automatic active registration method which deforms a high-resolution template mesh to match low-quality human body scans. Our registration method operates on two levels of statistical shape models: (1) the first level is a holistic body shape model that defines the basic figure of the human; (2) the second level includes a set of shape models for every body part, aiming at capturing more body details. Our fitting procedure follows a coarse-to-fine approach that is robust and efficient. Experiments show that our method is comparable with the state-of-the-art methods for high-quality meshes in terms of accuracy, and that it outperforms them in the case of low-quality scans where noise, holes and obscured parts are prevalent.

Keywords Human body modeling · Statistical shape model · Non-rigid registration

1 Introduction

Modeling an accurate 3D human body is a fundamental problem for many applications such as design, animation, and virtual reality. The modeling of human body meshes is performed on a corpus of registered scans. However, both the acquisition of high-quality human body meshes and the registration of meshes are challenging. Current publicly available high-quality human body datasets, such as SCAPE [3], FAUST [5] and TOSCA [8], are either built from costly laser scanners or need other assistance (e.g., markers, texture or professional tools). With the appearance of low-cost scanners such as Kinect, it is now possible for an object, a room or even a person to be quickly scanned, modeled and tracked [12, 14, 27, 28, 30, 31, 44]. Nowadays, human body meshes can be captured for different identities in different poses in a few minutes. However, the prevalent noise, outliers and holes in scans acquired with low-cost scanners bring more challenges for mesh registration.

To register 3D scans, several 3D fitting methods have been proposed [1, 2, 5, 14, 32, 49]. The invertible finite volume method [14] is used to drive a template tetrahedral mesh towards the target point clouds. The stitched puppet model [49] adopts the DPMP algorithm, a particle-based method, to align a graphical model to target meshes. More effort has gone into nonrigid ICP (iterative closest point) [1, 2], which computes an affine transformation at each vertex of the template to allow non-rigid registration of template and scans. Although these ICP-based nonrigid registration methods demonstrate high accuracy, they are sensitive to missing data, which might lead to an erroneous fitting result. For Kinect-like scanners, holes and distortion on the mesh are inevitable because of self-occluded parts such as the crotch and armpits.

To faithfully register body scans captured with low-cost scanners such as Kinect, we present a multilevel active body registration (MABR) approach that builds a watertight, high-fidelity virtual human body automatically. We aim to align a template mesh with the target scans acquired with Kinect as closely as possible. Here, the template mesh is the mean shape learned from an existing high-quality human body mesh dataset. In our method, multilevel registration is performed. In the first level, the overall template and target are roughly aligned. In the second level, a region-based registration is performed where the template is divided into 16 parts and each part is fitted to the target separately. For the main body parts where the scan is complete and rich in detail, such as the torso, legs and arms, the local affine transformation for each vertex is computed. For impaired parts such as the feet and hands, we deform the corresponding parts of the template at a coarse-grained level for completeness.

With the proposed method, we are able to automatically reconstruct a high-quality 3D mesh from low-quality scans or point clouds. This technique can be employed in a variety of applications. In virtual dressing applications, it can show clothes from different stereo views and help customers choose the best-fitting clothes. In virtual games, systems can instantly generate realistic full-body avatars from rough scans of the users, benefiting from the algorithm's robustness to the missing data that commonly exists in scans from low-cost scanners. The approach avoids the tedious manual work of building high-fidelity 3D models with professional tools and is capable of building a complete, high-quality mesh within 2 min automatically, which can be beneficial to television production. The method may also be integrated in software as a tool for preprocessing raw scans, filling in missing parts automatically and registering scans.

Our main contributions reported in this paper are:

• First, we propose a fully automatic registration method which performs well even on noisy low-quality data. Our method follows the region-based approach to register human body scans, which improves the accuracy of registration. According to the nature of different body parts, our approach adopts particular registration strategies, which makes the method robust to noisy Kinect scans.
• Second, we provide a dataset of 250 real human body scans acquired with Microsoft Kinect for Xbox 360. This dataset can be used to evaluate the robustness of registration algorithms in the case of low-quality scans. The dataset is available for research purposes at http://www.eecs.qmul.ac.uk/~zx300/k3d-hub.html.

The rest of this paper is structured as follows. In Sect. 2, the literature on mesh registration is reviewed. Our proposed method is described in detail in Sect. 3, and the Kinect scanning platform used to build our K3D-Hub dataset is introduced in Sect. 4. The experimental evaluation results are shown in Sect. 5 and a brief summary is given in Sect. 6.

2 Related work

Although shape matching has been deeply researched, finding full correspondences for non-rigid and articulated meshes is still challenging. Geometry information is usually used to extract local features. Histogram of Oriented Normal Vectors [40] and Local Normal Binary Patterns (LNBPs) [37] are descriptors based on surface normals. Since colour information cannot represent a unique feature in the 3D mesh domain, it is usually used as auxiliary information to other features [5]. Besides using local geometric features, many works extend existing 2D features to the 3D domain [13, 33, 38]. 3D-Harris [33] is the 3D extension of the 2D corner detection method with the Harris operator. Local depth SIFT (LD-SIFT) [13] extends the SIFT feature by representing the vicinity of each interest point as a depth map and estimating its dominant angle using principal component analysis to achieve rotation invariance. MeshSIFT [38] characterizes the neighbourhood of salient points with a feature vector consisting of concatenated histograms of shape indices and slant angles. MeshSIFT is robust to expression variations, missing data and outliers when it is used for 3D face shape matching. Clearly, all of these methods rely on local shape features such as curvature or angles. Since they are not pose independent, they cannot be used for shapes undergoing affine transformation, such as human body shapes in different poses.

Since the human body is an isometric shape, many works make use of isometry to find correspondences. If two shapes are perfectly isometric, then there exists an isometry, i.e., a distance-preserving mapping, between these shapes such that the geodesic distance between any two points on one shape is exactly the same as the geodesic distance between their correspondences on the other [36]. Different approaches have been proposed to exploit isometry for shape correspondences [14, 15, 20, 29, 35]. One way is to embed the shape into a different domain where geodesic distances are replaced by Euclidean distances so that isometric deviation can be measured and optimized in the embedding space [15]. Euclidean embedding can be achieved using various techniques such as classical MDS (Multidimensional Scaling) [20, 35], least-squares MDS [15], and spectral analysis of the graph Laplacian [29] or of the Laplace–Beltrami operator [14]. However, for meshes from low-cost scanners, the above isometry-based methods are not applicable, as they usually require watertight meshes and suffer from the self-symmetry of the human body shape.

Another approach is to fit a common template mesh to noisy scans. Once fitted, these scans share a common topology with the template and are fully registered. By removing noise and completing holes in the low-quality scans, a high-quality mesh is built straightforwardly. To perform registration, traditional methods tend to rely on auxiliary modeling tools, such as Maya (http://www.autodesk.co.uk/products/maya/overview) and Blender (https://www.blender.org/), manual markers and texture information. Recently, the authors of [26] deform a high-quality template mesh to scans from a stereo scanning system consisting of multiple RGB-D cameras arranged in a circle. Various non-rigid ICP algorithms [2, 16, 17, 22, 24] have been proposed to register 3D meshes. They usually combine the classic ICP with regularization terms to make the surface deformation smooth. However, ICP-based methods are sensitive to missing data and outliers. When they are used on noisy Kinect scans, the hand/foot parts and the top of the head are usually severely distorted.

Besides the ICP-based registration methods mentioned above, statistical shape models are employed to improve smoothness and robustness, as prior knowledge is embedded in them. SCAPE [3] learns a shape model with PCA to describe body shape variations using 45 instances in a similar pose. It also builds a pose model, which is a mapping from posture parameters to the body shape, with a dataset that includes 70 poses of one subject. With the learnt model, it builds a human body dataset, but only the pose dataset is released, which contains meshes of 70 different poses of a particular person. Since the body shapes of different people vary greatly for a particular pose (for example, for the same arm-lifting pose, the muscle variations of an average person and of an athlete are clearly different), TenBo [9] proposes to model the 3D human body with variations in both pose and body shape. The TenBo model is trained with the dataset from [18]. The model is used to estimate shape and pose parameters from the depth map and skeleton provided by Microsoft Kinect sensors. FAUST [5] contains 300 scans of 10 people in 30 different poses. The authors make use of texture information to assist the alignment of the meshes. The registered mesh has 6890 vertices and 13,776 faces. Compared with the SCAPE dataset, the resolution is lower but the mesh is still realistic. Nonetheless, its registration method is not fully automatic because it relies on texture information that is added by hand. The CAESAR dataset [34] contains 2400 male and female laser scans with texture information and hand-placed landmarks. Each range scan in the dataset has about 150,000–200,000 vertices and 73 markers. Unfortunately, this dataset does not provide correspondences and contains many holes. The MPI dataset [18] captures 114 subjects in a subset of 35 poses using a 3D laser scanner. All the aforementioned models are captured with expensive scanners or with complex, large-scale scanning platforms. Compared with scans acquired with low-cost scanners, they have much less noise floating on the surface, no big holes and no hierarchical outliers. Methods that work on these high-quality meshes might not give satisfactory results when applied directly to low-quality scans from cheap scanners such as Kinect.

The statistical shape model has also been introduced for 3D face reconstruction, face modeling and face animation [4, 41–43]. Unlike the human body, facial landmarks [21, 45–47] can be detected accurately and used as reliable constraints to initialize the fitting of a morphable model. In [25], for aligning two faces, the authors extract facial features before performing ICP registration. Accurate landmarks are extracted in [7] to guide face modeling from a large-scale facial dataset. In [23], a preprocessing algorithm is proposed to fill holes and smooth the noisy depth data from Kinect before performing face recognition. A high-resolution face model is constructed in [6] from low-resolution depth frames acquired with a Kinect sensor. That work generates high-resolution face models from Kinect depth sequences in two steps: an initial denoising operation based on the anisotropic nature of the error distribution with respect to the viewing direction of the acquired frames, followed by a manifold estimation approach, based on the lowess nonparametric regression method, that removes outliers from the data. However, these approaches either detect landmarks with the help of RGB or depth images as a clear and strong initialization, or perform preprocessing steps to fill holes or smooth the data. In the case of human body registration, where texture information is often missing, accurate initial landmarks are hard to detect automatically. Compared with human faces, the magnitude of changes of the human body surface is larger even when the subjects are asked to perform the same pose, which brings more challenges to human body registration.

3 Region-based human body registration

Region-based modeling has been prevalent in face [10, 41] and human body [49] modelling, as it allows for richer shape representation and enables the fitting of different parts to be specifically tailored. Inspired by [10], we combine the statistical shape model with the non-rigid iterative closest point algorithm. However, the direct application of this fitting method to low-cost, noisy and incomplete Kinect scans could lead to inconsistent and erroneous results. This happens particularly often for hand and foot fitting (examples of failed fitting are shown in Fig. 9). The main reason is that the Kinect scan of the feet can barely be separated from the stand, while slight movement of the hands is inevitable during data capturing, causing serious artifacts in the hand scan. Even if we perform the coarse level registration, the distance between source and target in these parts might be large, the nearest neighbors tend to be incorrect, and non-rigid ICP easily gets trapped in local minima [19]. Therefore, we propose a different fitting method that takes special care of foot and hand modeling.

The pipeline of the proposed MABR method is shown in Fig. 1. First, a 3D morphable shape model is trained from 200 pre-aligned high-quality meshes. The mean shape is used as the template. Second, coarse registration is employed to roughly align the template and target. Then different non-rigid deformation techniques are applied to the main body parts and the hand/foot parts, respectively, with our trained morphable shape model.

Fig. 1 The work flow of the proposed method. We first train a statistical shape model from 200 aligned meshes in the SPRING dataset using PCA techniques. The mean shape is used as the template mesh. The registration between template and target mesh includes two levels, illustrated as follows. In the coarse registration level, we deform the template mesh non-rigidly into the target, making the template overlap with the target in most parts. In the fine registration level, a region-based deformation is used to deform the template more accurately

3.1 Rigid registration

The target mesh is captured with a Kinect scanner and the template mesh is the mean shape from the public dataset. The goal of rigid registration is to unify their coordinate systems. In traditional rigid transformation, correspondences are needed to compute the rigid transformation matrix. Some works use markers to establish correspondences manually. Some 3D mesh features, like the Heat Kernel Signature [39], are based on surface properties such as geodesic distance, curvature, or face normals. These features work well on public human mesh datasets because those meshes are processed to share topology and are of high quality, without noise or folding faces, so that the geometric distance is measurable. However, in our case, the number of vertices of the targets varies while the template mesh has a fixed number of vertices. Moreover, the physique of the template, such as its height and muscle properties, differs from that of the scans. Lastly, noise and holes exist in our data. Therefore, features that work on high-quality surfaces cannot be used in our work.

Without using correspondences, we choose to build a shape-aware coordinate system for each model and transform the source to align its origin and axes with the target. PCA is used to identify the most important directions of the vertex set, and PCA-based alignment aligns the principal directions of the two vertex sets. First, given a set of vertices $S_p = \{\mathbf{p}_i\}$ and its centroid location $\mathbf{c}$, we form the matrix $\mathbf{P}$ as in Eq. 1.

$$
\mathbf{P} = \begin{bmatrix}
p_{1x} - c_x & p_{2x} - c_x & \cdots & p_{nx} - c_x \\
p_{1y} - c_y & p_{2y} - c_y & \cdots & p_{ny} - c_y \\
p_{1z} - c_z & p_{2z} - c_z & \cdots & p_{nz} - c_z
\end{bmatrix}, \tag{1}
$$

where $p_{ix}, p_{iy}, p_{iz}$ and $c_x, c_y, c_z$ are the coordinates of vertex $\mathbf{p}_i$ and centroid $\mathbf{c}$, respectively. The covariance matrix $\boldsymbol{\Sigma}$ is formulated as:

$$
\boldsymbol{\Sigma} = \mathbf{P}\mathbf{P}^{T}. \tag{2}
$$

The eigenvectors of the covariance matrix $\boldsymbol{\Sigma}$ represent the principal directions of shape variation. They are orthogonal to each other.
They are orthogonal 1 3 Multilevel active registration for kinect human body scans: from low quality to high quality 261 4N×1 where  ∈ ℜ are the 3D coordinates (x,  y,  z) plus corresponding homogeneous coordinates of all N verti- 4N×k ces;  ∈ ℜ are the eigenvectors of the PCA model, 4N×1 k×1 ∈ ℜ is the mean shape, and  ∈ ℜ contains the non-rigid parameters for shape deformation. Apart from a holistic body shape model, to further describe the large amount of shape variability in human body, we model each region of the body with its own PCA model. In this paper, we employ the body segmentation model provided by the SCAPE [3] dataset. Assume that we have p independent parts in the segmented template Fig. 2 One of the examples of initial rigid alignment  ={ } , and the ith part  can also be modeled using i=1 Eq. 6: to each other while the eigenvalues indicate the amount of i i i i =   +  . (6) variation along each eigenvector. Therefore, the eigenvec- i i i Here,  ,  and  are the shape coordinates, eigenbasis tor with largest eigenvalue is the direction where the mesh and mean shape of the model for ith region, respectively, shape varies the most. In the human body mesh, the princi- and  is the latent variable controlling deformation of the ple direction should be along the height direction. The next model. As a result, we trained two levels of shape model: two directions should be along the width and thickness of the first level is a holistic model for the entire body and the the human body, respectively. second ones is region-based model that models each body Given two human body meshes S and S , their covariance a b part separately. matrices  and  can be computed with Eq.  2. We form two matrices  and  where columns are the eigenvectors of  and  , respectively. 
To align these two orthogonal 3.3 Coarse level registration matrices, we compute the rotation  such that The main goal of this registration is to overlap the template =  , (3) and target scan, while minor details of the body can be Finally, the PCA-based alignment can be performed with ignored in this level. After rigid transformation, we apply the following formula. the holistic PCA model trained in the Sect.  3.2 to get the deformed template that would sit closer to the target point =  + ( −  ), (4) clouds. Here, with target point clouds  retrieved by near- where  and  are the centroids of S and S correspond- a b est neighbors search using the k-d tree algorithm, the cost ingly. After we perform the rigid registration, S should be function to be minimized can be formulated as: aligned with S in terms of main directions. One of the ini- 2 2 tially rigid alignment examples is shown in Fig. 2. We can E()=  −  = ( + )−  . (7) see that both of meshes look forward after we align them To solve this equation, we take the partial derivative with rigidly. However, in terms of height, body shapes, they still regard to  and take the minimum when it approaches to differ a lot. zero: T T 3.2 Morphable shape models +  ( − )= , (8) and get the closed-form solution, In this part, we introduce the statistical body shape model T −1 T = −( )  ( − ). (9) trained from 200 entire human body meshes using PCA technique. The training set is from the SPRING dataset [48] 3.4 Fine level registration which includes 3038 high-resolution body models and each mesh has 12,500 vertices and 25,000 faces. All the meshes After the coarse level registration, to capture the non-rigid have been placed in point to point correspondence. This nature of body surface and provide an accurate fitted mesh, large aligned dataset allows for a reliable model to be learnt we make use of the region-based statistical shape model robustly. Given a set of training shapes, the statistical shape described in Sect.  
3.2 and combine it with non-rigid itera- model can be represented as: tive closest points (NICP) algorithm [2]. Note that during scanning, the subject is unlikely to hold the exact pose =  + , (5) like template, especially in the parts of arm and leg, thus the hand and foot parts could easily appear as outliers. 1 3 262 Z. Xu et al. Distance term The distance term is used to minimize the Euclidean distances between source and the target. We assume each part has n points and the cost function is denoted as the sum of error of each pair of vertices: i i i 2 E ()=   −   , (10) j j j i=1 j=1 where X is the transformation matrix for jth vertex in the ith part. Since each part is modeled by the shape model i i i i Fig. 3 The summary of our matching framework. Our target is =   +  , based on Eq.  6, the distance term could be j j j j to find a set of affine transformations X and local PCA parameters rewritten and rearranged as: C , such that, when applied to the vertices v of the template mesh i i S, result in a new surface S that matches the target surface T. This p n � � diagram shows the match in progress; S is moving towards the tar- i i i i i 2 E ()= �� ( +  )−  �� get but has not reached it. The whole vertices are divided into three j j j j j F i=1 j=1 parts which are controlled by three local PCAs. The transformation of each vertex is controlled by affine transformation as well as the local 2 � � i ⎡ ̂ ⎤ i (11) parameters of the part which the vertex belongs to p � � ⎡ ⎤  ⎡ ⎤ � 1  1 � ⎢ ⎥ � ⎢ ⎥ ⎢ ⎥� = ⋱ ⋮ − ⋮ . ⎢ ⎥ � � ⎢ ⎥ ⎢ ⎥ i i � � i=1 ⎢ ̂ ⎥ � ⎣ ⎦ ⎣ ⎦� ni ni ⎣ ⎦ � � Although the first level fitting alleviates this effect, original NICP algorithm still might not generate satisfactory fitting We can see that the above equation is not in the standard result. Therefore, for the parts of hand and foot, we use only linear form of  −  = . 
To differentiate, we need to ̂ ̂ T the non-rigid parameters to control the deformation in a swap the position of the unknown  and  =[ , … ,  ] . coarse grained level and add a regularization term to make Therefore, we obtain the following form. it smooth on the boundary. In this way, we can recover the hand and foot parts which are impaired in the scanning pro- i i i 2 cess. The clear and semantic hands and feet allow for the E ()=   −   , (12) i=1 shape statistical modeling in the next stage. T T T i i i i where the term  = diag( ,  , ...,  ), and the set of 1 2 i i i i T closest points  =[ ,  , ...,  ] . 1 2 n 3.4.1 Main body registration Stiffness term The stiffness term penalizes the difference between the transformation matrices of neighboring vertices. We define the body parts that exclude feet and hands as Similar to [2], it is defined as: the main body. For the main body parts, we combine the statistical shape model with NICP algorithm. Our goal is i i i 2 to find a set of affine matrices X ={ } and non-rigid E ()= ( ⊗  )  , i=1 (13) parameters C ={ } such that the sum of Euclidean dis- i=1 i=1 i i i tances between pair of points of each region is minimal. here, for the ith body part,  =(1, 1, 1,  ), where  is used Here,  is a 3 × 4n matrix that consists of affine matrix for to balance the scale of rotational and skew factor against every template vertex in the ith part. As shown in Fig.  3, the translational factor. It depends on the units of the data we describe our technique for fitting a template S to tar - and the deformation type to be expressed.  is the node- get mesh T. Each of these surface is represented as a tri- arc incidence matrix of the template mesh topology [2]. angle mesh. Each vertex v is influenced by a 4 × 3 affine Complete cost function: We combine Eqs.  11 and 13 to matrix X and non-rigid parameter C . We define data error i i obtain the complete cost function: with these two parameters. 
The data error, indicated by the E()= E ()+ E () d s arrows in Fig. 3, is a weighted sum of the squared distances between template surface S and target surface T. Besides (14) =  − . data error, to deform the template smoothly, we also define i i i i=1 a stiffness term to constraint the vertices, which do not Equation 14 is not a quadratic function and it is difficult to move directly towards the target, but may move parallelly obtain the optimal local affine transformation  and non- along it. These error terms are summarized in Fig.  3 and rigid parameters  simultaneously. In [10], an alternating described in detail in the following. 1 3 Multilevel active registration for kinect human body scans: from low quality to high quality 263 optimization scheme is employed to solve this problem. In this paper, we use the same optimization method to find the optimal set of parameters. For details of the solution, please refer to [11]. 3.4.2 Hands and feet registration Although the main body parts are roughly aligned after the first level registration, the distance between source hands/feet and corresponding target is large in most cases. In this situation, the ICP-based methods easily get trapped in local minima [19]. To address this problem, we perform a PCA-based fitting for the individual part of hand/foot. Given one particular part model of hand/foot ∗ ∗ that has eigenbasis  and mean shape  , we define our objective function that consists of a distance term and a regularization term, and try to obtain the optimal non- rigid parameters  by minimizing it. Distance term It is defined similar to Eq.  11, but with- out the affine transformation matrix, Fig. 4 The top view of spatial arrangement of offline 3D capturing ∗ ∗ ∗ ∗ ∗ 2 E ( )= (  +  )−   . 
(15) platform Boundary smoothness term To stitch hand/foot with its neighboring part smoothly, we define a boundary smooth- ness term as follows: 4 Kinect scanning platform ∗ ∗ ∗ ∗ ∗ ∗ 2 E ( )=  (  +  )−   , (16) In this part, we introduce the 3D scanning platform with where  is the selection matrix of hand/foot parts that single Microsoft Kinect for Xbox 360. The setup is shown picks out the boundary points.  is the boundary points in Fig. 4. The platform is built upon ReconstructMe appli- of the neighboring part. By enforcing the boundary con- cation which is based on Kinect Fusion [30]. straints between two parts, we can regulate the part fitting To get the best mesh, we choose to keep the Kinect process to avoid erroneous result caused by outlier. position still at three different heights when the subject is Complete cost function The fitting objective function standing on a running turntable at a certain speed (30  s can be formulated as: per round). After we scan one round at the first height, we adjust the height of the Kinect to the second height ∗ ∗ ∗ E( )= E ( )+(1 − )E ( ) d b and scan the second round around the subject. The self- occluded parts such as armpit and crotch are rescanned ∗ ∗ ∗ = (  +  )− ∗ ∗ (17) (1 − ) (1 − ) if the Kinect does not see them in the first time. For each mesh, from our experience, it takes about 90 s to build with ∗ ∗ ∗ ∗ ∗ =  (  +  )−  , this platform. During data capturing, we require the partici- where  is the weighting factor between two terms, pants to wear tight clothes. Each person is captured 5 poses ∗ ∗ T ∗ ∗ ∗ T =[, (1 − ) ] and  =[ , (1 − ) ] . This is which include a natural pose, and other 4 poses (The pose a well-known linear least square problem. The minimum examples are shown in Fig.  13). The capturing process is occurs where the gradient vanishes, that is E ∕ = . ∗ displayed in Fig. 5. Thus, Eq. 
17 has closed-form solution: Although some occluded parts can be rescanned, holes still exist on the top of the head and the soles of the feet ∗ ∗ ∗ T ∗ ∗ −1 ∗ ∗ T ∗ ∗ ∗ = −[(  ) (  )] (  ) (  −  ). (18) which the Kinect cannot see. To the best of our knowl- The proposed fitting method for hand and feet has a nice edge, there is no public Kinect-based human body mesh convergence property, we show one example of residual dataset. Therefore, we utilise the platform above to build a error curve for all iterations of fitting in Fig. 11. http://reconstructme.net/. 1 3 264 Z. Xu et al. Fig. 6 The comparison of 3D shape RMS error of ANICP, NICP, PCA and our MABR Fig. 5 The screenshot of our capturing process low-quality mesh dataset, named Kinect-based 3D Human Body (K3D-Hub) Dataset. So far, our K3D-Hub dataset contains 50 different identities and 5 poses for each person. We show examples of our dataset in Fig. 13. 5 Performance evaluation To evaluate the performance of our MABR method, we conducted experiments on both high- and low-quality meshes, and showed the shape root mean square (RMS) error curve as well as some fitting results for visualization purpose. 5.1 High‑quality mesh evaluation Fig. 7 The front view of fitted results of ANICP, NICP and MABR For the evaluation on high-quality data, we use the in the case that the shape of general template differs a lot from the SPRING [48] dataset that contains 3038 meshes with vari- target mesh ous human body shapes. These good quality meshes are complete and points are evenly distributed. Furthermore, it has the point-to-point correspondences with each other, which means it can be used as our ground truth for quan- titative analysis. Also, the SPRING dataset is divided into male and female subsets. To train a model whose muscle and tissue properties are specific to female and male, we separately train male and female shape models. 
For each gender, 200 meshes from the SPRING dataset are used as the training set and the remaining meshes as the testing set. To show the superior performance of our method, we compare MABR with NICP [2], ANICP [11], and PCA deformation on the SPRING dataset. We compute the 3D shape root mean square error (RMS error) with Eq. 19 to measure the accuracy of the four methods:

$$\mathrm{RMSError} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - \hat{p}_i)^2}{n}}, \qquad (19)$$

where $p_i$ and $\hat{p}_i$ are corresponding points of the ground truth and the fitted result, and $n$ is the number of points in the 3D template mesh. As shown in Fig. 6, the accuracy of our method is comparable with ANICP and much higher than that of NICP and PCA. In the PCA method, the whole model is controlled only by the trained orthogonal basis, which cannot cover all the shape variations. Consequently, the accuracy of PCA is the lowest.

Moreover, when the body shape of the template is very different from that of the target, MABR produces better fitting results, with more complete and meaningful limb parts. When the shapes of the template and the target scan differ greatly, ICP-based algorithms are more likely to find incorrect nearest neighbors, and more meaningful vertices are regarded as outliers. We show the front view of some fitted results in Fig. 7, which illustrates the robustness of MABR to outliers. Due to the poor accuracy of the PCA method, we do not show the PCA results here. In Fig. 7, we compare the fitting results for two different body shapes from a general template. The first target mesh differs a lot from the template in the arm part: neither ANICP nor NICP can obtain a complete arm for the given target, while the proposed MABR method not only produces a meaningful arm but also makes the result similar to the target mesh (as reflected by the face). What is more, the contour of the hands from MABR is clearer than from ANICP and NICP, as shown by the fact that the hole of the fist is visible in the MABR results. In the second row, the left arm of the target mesh is bent. Due to the low degree of overlap, the limb parts are regarded as outliers in ANICP and NICP. As a result, the fitting results of NICP and ANICP in the hand and foot parts are distorted and erroneous, while our MABR method successfully fits the target mesh and keeps complete arm shapes at the same time.

Fig. 8 The detail comparison of fitting results on the SPRING dataset. The side view of the raw scan and the fitted results of ANICP (column 2), NICP (column 3), PCA (column 4) and MABR (column 5). Besides the comparison of the full body, the details of the face, hand and elbow from each method are also compared subsequently

We also compare the details of the fitted results from the above four methods in Fig. 8. For the hand and elbow parts, the PCA method and the proposed MABR method are much better than the other two approaches. In Fig. 8, compared with PCA and MABR, the hand parts of ANICP are severely distorted and the fitted hand of NICP is obscure. As for the elbow, the results of ANICP and NICP are broken, while PCA and MABR preserve the continuity of the fitted mesh. Although PCA can produce meaningful results, MABR outperforms it in terms of accuracy, as reflected by the fitted results in the face parts. The face from MABR is clearly much more similar to the raw scan than the face from PCA. Basically, MABR successfully recovers the shape of the target mesh. In the elbow part, the curvature of the mesh from MABR is much closer to the target than that of the PCA result.

5.2 Low-quality scans evaluation

We evaluate the proposed method on low-quality scans captured with a Microsoft Kinect for Xbox 360. A Kinect is used to scan the person standing on a running turntable from three different heights. The scans are preprocessed to remove the background. We compare our proposed MABR method with NICP [2] and ANICP [11]. Fitting results of these three methods are shown in Fig. 9.

Fig. 9 Fitting results from Kinect scans. Column 1 shows the raw body scans; the second to the last columns illustrate the shapes from ANICP, NICP, PCA and the proposed MABR method, respectively

Fig. 10 The comparison of NICP, ANICP and MABR in the case of hierarchical noises

Fig. 11 Example of residual error changes as the fitting of left hand in the second level progresses

The proposed MABR method is clearly the only one that models the hand and foot parts completely while keeping high accuracy of the fitting results. The raw scans contain a lot of noise close to the surface, and large holes exist on top of the head. All these challenges require the registration method to be robust to noise, outliers and holes at the same time. From the results, we can see that neither ANICP nor NICP is robust enough to obtain a complete and accurate registered mesh. The hand parts from ANICP and NICP tend to be distorted and incomplete, while the MABR method produces meaningful and complete hands.

The reason for the failure of NICP and ANICP is that, in real scans, the human pose is hard to control, so the limbs usually do not completely overlap with the template. Therefore, the closest points found for the limbs cannot keep the limb shape of the scans, resulting in unexpected fitted shapes. Since our fitting procedures are active, the limb parts of the template can be stretched along the directions of the PCA basis before performing non-rigid ICP, roughly recovering the size of the hand and foot. In this way, our MABR method is not only able to keep a good shape of the scan but is also robust to noise.
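The failure mode discussed in this section, where poorly overlapping limbs end up with unreliable closest points and are treated as outliers, can be illustrated with a toy closest-point search. The sketch below is only an assumption-laden illustration: the brute-force search, the fixed distance threshold `max_dist` and the function name are hypothetical choices made for the example, and the actual outlier criteria of NICP and ANICP may differ.

```python
import math


def closest_point_matches(template, target, max_dist):
    """For each template vertex, find the nearest target point (brute force)
    and flag the pair as an outlier when it is farther away than max_dist."""
    matches = []
    for i, p in enumerate(template):
        j, d = min(
            ((k, math.dist(p, q)) for k, q in enumerate(target)),
            key=lambda pair: pair[1],
        )
        matches.append((i, j, d, d > max_dist))
    return matches


# Toy example: a straight template "arm" against a target arm raised by 45 degrees.
template_arm = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
target_arm = [(0.0, 0.0, 0.0), (0.7, 0.7, 0.0), (1.4, 1.4, 0.0), (2.1, 2.1, 0.0)]

matches = closest_point_matches(template_arm, target_arm, max_dist=0.8)
outliers = [i for i, _, _, is_outlier in matches if is_outlier]  # [2, 3]
```

In this toy case the two template vertices nearest the shoulder find acceptable matches, but the two farthest along the arm exceed the threshold and are discarded, and template vertices 1 and 2 both snap to the same target point. The end of the limb is therefore left unconstrained by the data, which is the situation a shape prior is meant to handle.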
Fig. 12 The comparison of hole tolerance

Fig. 13 Examples of the K3D-Hub human body scans dataset. We invited both male and female subjects. The subjects' ages range from 18 to 40. The subjects are mainly from Asia and Europe. Each subject performs 5 different poses

Hierarchical noise is common in Kinect scans. On one hand, some subtle movements are inevitable when the subjects try to hold a certain pose for a few minutes. In this case, hierarchical noise may appear around the arms. On the other hand, the subjects stand on a running turntable, and the resulting body movements may cause hierarchical noise around the body surface. In Fig. 10, we compare the fitting results of NICP, ANICP and the proposed MABR in the case of hierarchical noise. The noise is distributed over the face, hands and chest in the raw scans. The fitting results of NICP are ambiguous and do not present the shape of the hands, while the results of ANICP and MABR keep the hand shapes. MABR also produces fitting results that are more similar to the scan in the face and hands.

We also show the robustness of MABR to holes in Fig. 12. Even though there are big holes on top of the head in the raw scan, MABR and ANICP fill the hole smoothly, which benefits from the trained prior knowledge. NICP relies merely on finding the nearest points on the target, which is sensitive to holes. Therefore, as illustrated in Fig. 12, the fitted result of NICP is uneven.

In addition, our MABR method has very good convergence properties. In Fig. 11, we show one example of how the residual error changes as the fitting of the left hand progresses. As can be seen, the residual error decreases monotonically and gradually converges to a minimum value.

All the fitting results in the above experiments are deformed from the same template mesh, the mean shape of the training set. As shown in Figs. 9 and 10, the lifting angles of the arms in the test data are not the same, and in Fig. 8 the arms of the test data are bent while the arms of the template are straight. Arms in these scans tend to be regarded as outliers by the NICP and ANICP methods, but the proposed method keeps a meaningful shape during registration, which shows that our method can be applied to scans with different poses to some degree. However, when the target's posture differs from that of the template, as in columns 2–6 of Fig. 13, it is still challenging for our method to obtain a reasonable fitting result from the natural standing template, because the searched nearest points cannot keep the shape of the hands during fitting. The resulting fits will be erroneous. Hence, we regard fitting on various poses as future work.

6 Conclusions

In this paper, we propose a multilevel active registration method which combines non-rigid ICP with the statistical shape model to automatically fit the body template model to the target point clouds. Since the PCA shape model is trained with 200 registered meshes, the combination with PCA makes our method robust to noise, outliers and holes. We have shown that the performance of the proposed algorithm is comparable with state-of-the-art non-rigid registration methods and that it outperforms them in the alignment of the hand and foot parts. Experiments verify that our approach is robust on both noisy Kinect scans and high-quality meshes. Besides the robust MABR method, we collected a Kinect-based human body dataset, named K3D-Hub, which is the first publicly available dataset of low-quality human body scans.

Limitations Our registration algorithm manages to register a high-quality template mesh to noisy Kinect scans with similar poses. However, when the initial poses of the template and the target scan differ greatly, it is still challenging for our method to generate a reasonable fitting result, particularly in the limb parts. We believe that fitting on various poses will be part of our future work. What is more, our targets are scans captured from subjects wearing tight clothes and hats. It is still challenging to deform the template to meshes with loose clothes such as dresses or skirts. This is because the deformation of loose clothes and hair does not follow the deformation of human body muscle. Unexpected results will appear if we apply our trained morphable model to loose clothes and hair. Moreover, in the future, we plan to speed up the fitting algorithm to support real-time applications.

Acknowledgements Funding was provided by China Scholarship Council (Grant No. 201406070079).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Allen, B., Curless, B., Popović, Z.: The space of human body shapes: reconstruction and parameterization from range scans. In: ACM Transactions on Graphics (TOG), vol. 22, pp. 587–594. ACM (2003)
2. Amberg, B., Romdhani, S., Vetter, T.: Optimal step nonrigid icp algorithms for surface registration. In: Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1–8. IEEE (2007)
3. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: shape completion and animation of people. In: ACM Transactions on Graphics (TOG), vol. 24, pp. 408–416. ACM (2005)
4. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–194. ACM Press/Addison-Wesley Publishing Co., New York (1999)
5. Bogo, F., Romero, J., Loper, M., Black, M.: Faust: Dataset and evaluation for 3d mesh registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3794–3801 (2014)
6. Bondi, E., Pala, P., Berretti, S., Del Bimbo, A.: Reconstructing high-resolution face models from kinect depth sequences. IEEE Trans. Inf. Forens. Secur. 11(12), 2843 (2016)
7. Booth, J., Roussos, A., Zafeiriou, S., Ponniah, A., Dunaway, D.: A 3D morphable model learnt from 10,000 faces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5543–5552 (2016)
8. Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Numerical geometry of non-rigid shapes. Springer, New York (2008)
9. Chen, Y., Liu, Z., Zhang, Z.: Tensor-based human body modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 105–112 (2013)
10. Cheng, S., Marras, I., Zafeiriou, S., Pantic, M.: Active nonrigid icp algorithm. In: Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1, pp. 1–8. IEEE (2015)
11. Cheng, S., Marras, I., Zafeiriou, S., Pantic, M.: Statistical non-rigid ICP algorithm and its application to 3D face alignment. Image Vis. Comput. 58, 3–12 (2016)
12. Cui, Y., Chang, W., Nöll, T., Stricker, D.: Kinectavatar: Fully automatic body capture using a single kinect. In: ACCV Workshops (2), pp. 133–147. Citeseer (2012)
13. Darom, T., Keller, Y.: Scale-invariant features for 3-d mesh models. IEEE Trans. Image Process. 21(5), 2758–2769 (2012)
14. Dey, T.K., Fu, B., Wang, H., Wang, L.: Automatic posing of a meshed human model using point clouds. Comput. Gr. 46, 14–24 (2015)
15. Elad, A., Kimmel, R.: On bending invariant signatures for surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1285–1295 (2003)
16. Fechteler, P., Hilsmann, A., Eisert, P.: Kinematic icp for articulated template fitting. In: Proceedings of International Workshop on Vision, Modeling and Visualization, pp. 12–14 (2012)
17. Haehnel, D., Thrun, S., Burgard, W.: An extension of the icp algorithm for modeling nonrigid objects with mobile robots. IJCAI 3, 915–920 (2003)
18. Hasler, N., Stoll, C., Sunkel, M., Rosenhahn, B., Seidel, H.P.: A statistical model of human pose and body shape. In: Computer Graphics Forum, vol. 28, pp. 337–346. Wiley Online Library (2009)
19. Huang, Q.X., Adams, B., Wicke, M., Guibas, L.J.: Non-rigid registration under isometric deformations. In: Computer Graphics Forum, vol. 27, pp. 1449–1457. Wiley Online Library (2008)
20. Jain, V., Zhang, H.: Robust 3d shape correspondence in the spectral domain. In: Shape Modeling and Applications, 2006. SMI 2006. IEEE International Conference on, pp. 19–19. IEEE (2006)
21. Jourabloo, A., Liu, X.: Pose-invariant 3d face alignment. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3694–3702 (2015)
22. Kou, Q., Yang, Y., Du, S., Luo, S., Cai, D.: A modified non-rigid icp algorithm for registration of chromosome images. In: International Conference on Intelligent Computing, pp. 503–513. Springer (2016)
23. Li, B.Y., Mian, A.S., Liu, W., Krishna, A.: Using kinect for face recognition under varying poses, expressions, illumination and disguise. In: 2013 IEEE Workshop on Applications of Computer Vision (WACV), pp. 186–192. IEEE (2013)
24. Li, H., Sumner, R.W., Pauly, M.: Global correspondence optimization for non-rigid registration of depth scans. In: Computer graphics forum, vol. 27, pp. 1421–1430. Wiley Online Library (2008)
25. Li, W., Li, X., Goldberg, M., Zhu, Z.: Face recognition by 3d registration for the visually impaired using a rgb-d sensor. In: European Conference on Computer Vision, pp. 763–777. Springer (2014)
26. Liu, Z., Huang, J., Bu, S., Han, J., Tang, X., Li, X.: Template deformation-based 3-d reconstruction of full human body scans from low-cost depth cameras. IEEE Trans. Cybern. 47(3), 695–708 (2016)
27. Liu, Z., Huang, J., Han, J., Bu, S., Lv, J.: Human motion tracking by multiple RGBD cameras. IEEE Trans. Circuits Syst. Video Technol. (2016). doi:10.1109/TCSVT.2016.2564878
28. Liu, Z., Qin, H., Bu, S., Yan, M., Huang, J., Tang, X., Han, J.: 3D real human reconstruction via multiple low-cost depth cameras. Signal Process. 112, 162–179 (2015)
29. Mateus, D., Horaud, R., Knossow, D., Cuzzolin, F., Boyer, E.: Articulated shape matching using laplacian eigenfunctions and unsupervised point registration. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008. pp. 1–8. IEEE (2008)
30. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 127–136. IEEE (2011)
31. Palasek, P., Yang, H., Xu, Z., Hajimirza, N., Izquierdo, E., Patras, I.: A flexible calibration method of multiple kinects for 3d human reconstruction. In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–4. IEEE (2015)
32. Pishchulin, L., Wuhrer, S., Helten, T., Theobalt, C., Schiele, B.: Building statistical shape spaces for 3d human modeling. arXiv:1503.05860 (2015)
33. Pratikakis, I., Spagnuolo, M., Theoharis, T., Veltkamp, R.: A robust 3d interest points detector based on harris operator. In: Eurographics Workshop on 3D Object Retrieval, vol. 5. Citeseer (2010)
34. Robinette, K.M., Daanen, H., Paquet, E.: The caesar project: a 3-d surface anthropometry survey. In: Second International Conference on 3-D Digital Imaging and Modeling, 1999. Proceedings. pp. 380–386 (1999). doi:10.1109/IM.1999.805368
35. Sahillioğlu, Y., Yemez, Y.: 3d shape correspondence by isometry-driven greedy optimization. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 453–458. IEEE (2010)
36. Sahillioglu, Y., Yemez, Y.: Minimum-distortion isometric shape correspondence using em algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2203–2215 (2012)
37. Sandbach, G., Zafeiriou, S., Pantic, M.: Local normal binary patterns for 3d facial action unit detection. In: 2012 19th IEEE International Conference on Image Processing (ICIP), pp. 1813–1816. IEEE (2012)
38. Smeets, D., Keustermans, J., Vandermeulen, D., Suetens, P.: meshsift: local surface features for 3d face recognition under expression variations and partial data. Comput. Vis. Image Underst. 117(2), 158–169 (2013)
39. Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably informative multi-scale signature based on heat diffusion. In: Computer graphics forum, vol. 28, pp. 1383–1392. Wiley Online Library (2009)
40. Tang, S., Wang, X., Lv, X., Han, T.X., Keller, J., He, Z., Skubic, M., Lao, S.: Histogram of oriented normal vectors for object recognition with a depth sensor. In: Computer Vision–ACCV 2012, pp. 525–538. Springer (2012)
41. Tena, J.R., De la Torre, F., Matthews, I.: Interactive region-based linear 3d face models. In: ACM SIGGRAPH, pp. 76:1–76:10. New York, NY, USA (2011). doi:10.1145/1964921.1964971
42. Thies, J., Zollhöfer, M., Nießner, M., Valgaerts, L., Stamminger, M., Theobalt, C.: Real-time expression transfer for facial reenactment. ACM Trans. Gr. (TOG) 34(6), 183 (2015)
43. Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of rgb videos. In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2016)
44. Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H.: Scanning 3d full human bodies using kinects. IEEE Trans. Vis. Comput. Gr. 18(4), 643–650 (2012)
45. Yang, H., He, X., Jia, X., Patras, I.: Robust face alignment under occlusion via regional predictive power estimation. IEEE Trans. Image Process. 24(8), 2393–2403 (2015)
46. Yang, H., Patras, I.: Privileged information-based conditional regression forest for facial feature detection. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6. IEEE (2013)
47. Yang, H., Patras, I.: Sieving regression forest votes for facial feature detection in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1936–1943 (2013)
48. Yang, Y., Yu, Y., Zhou, Y., Du, S., Davis, J., Yang, R.: Semantic parametric reshaping of human body models. In: 2014 2nd International Conference on 3D Vision (3DV), vol. 2, pp. 41–48. IEEE (2014)
49. Zuffi, S., Black, M.J.: The stitched puppet: A graphical model of 3d human shape and pose. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3537–3546. IEEE (2015)

Publisher: Springer Journals
Copyright: © 2017 by The Author(s)
Subject: Computer Science; Multimedia Information Systems; Computer Communication Networks; Operating Systems; Data Storage Representation; Data Encryption; Computer Graphics
ISSN: 0942-4962
eISSN: 1432-1882
DOI: 10.1007/s00530-017-0541-1

In the fine registration level, a region- registration between template and target mesh includes two levels based deformation is used to deform the template more accurately systems. In traditional rigid transformation, correspond- 3 Region‑based human body registration ences are needed to compute the rigid transformation matrix. Some works use markers to establish correspond- Region-based modeling technique has been prevalent in ences manually. Some 3D mesh features like Heat Kernel face [10, 41] and human body [49] modelling, as it allows Signature [39] are based on surface properties like geodesic for richer shape representation and enables the fitting of distance, curvature, or face normals. These features work different parts to be specifically tailored. Inspired by [10], well on public human mesh dataset as they are processed we combine the statistical shape model with non-rigid to share topology and high-quality without noises or fold- iterative closest point algorithm. However, the direct appli- ing faces so that the geometry distance is measurable. cation of this fitting method to low-cost, noisy and incom- However, in our case, the number of vertices of targets var- plete Kinect scans could lead to inconsistent and erroneous ies while the template mesh has fixed number of vertices. results. This happens particularly often when it comes to Moreover, it is obvious that the physique such as height and hands and feet fitting (examples of failed fitting shown in muscle properties of the template are different from those Fig. 9). The main reason is that Kinect scan of the feet can of scans. Lastly, noises and holes exist in our data. There- barely be separated from the stand; while, during data cap- fore, the feature which works on the high-quality surface turing, negligible movement of hands is inevitable, caus- cannot be used in our work. ing serious artifacts in hand scan. 
Even if we perform the Without using correspondence, we choose to build a coarse level registration, the distance of these parts between shape-aware coordinate system for each model and trans- source and target might be large, the nearest neighbors tend form the source to align its origin and axes with the target. to be incorrect and non-rigid ICP easily gets trapped in PCA is used to identify the most important parts from the local minima [19]. Therefore, we propose a different fitting vertex set. PCA-based alignment is to align the principle method that takes special care of foot and hand modeling. directions of the vertex set. First, given a set of vertices The pipeline of the proposed MABR method is shown in S ={p } and its centroid location , we have  formulated Fig.  1. First, a 3D morphable shape model is trained from p i as Eq. 1. 200 pre-aligned high-quality meshes. The mean shape is used as template. Second, coarse registration is employed to roughly align the template and target. Then different ⎡p − c , p − c ,… , p − c ⎤ 1x x 2x x nx x ⎢ ⎥ = p − c , p − c ,… , p − c , non-rigid deformation techniques are applied on the main (1) 1y y 2y y ny y ⎢ ⎥ p − c , p − c ,… , p − c body parts and hand/foot parts, respectively, with our ⎣ ⎦ 1z z 2z z nz z trained morphable shape model. where p , p , p and c , c , c are the coordinates of vertex ix iy iz x y z and centroid  respectively. The covariance matrix  is 3.1 Rigid registration formulated as: =  . (2) The target mesh is captured from a Kinect scanner and the template mesh is the mean shape from the public dataset. The eigenvectors of the covariance matrix  represent The goal of rigid registration is to unify their coordinate principle directions of shape variation. 
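The construction in Eqs. 1 and 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `principal_axes` and the synthetic point cloud are assumptions made for the example.

```python
import numpy as np

def principal_axes(vertices):
    """Return (centroid, axes, eigenvalues) for an (n, 3) array of vertices.

    axes[:, k] is the k-th principal direction, sorted by decreasing eigenvalue.
    The function name is hypothetical; it is not taken from the paper.
    """
    vertices = np.asarray(vertices, dtype=float)
    centroid = vertices.mean(axis=0)
    P = (vertices - centroid).T            # 3 x n centered coordinates (Eq. 1)
    M = P @ P.T                            # 3 x 3 covariance matrix (Eq. 2)
    eigvals, eigvecs = np.linalg.eigh(M)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder to descending eigenvalues
    return centroid, eigvecs[:, order], eigvals[order]

# Synthetic stand-in for a body scan: a point cloud stretched along the y axis
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) * np.array([0.3, 1.7, 0.15])
centroid, axes, eigvals = principal_axes(cloud)
```

For this elongated cloud the first column of `axes` points along the y axis, which plays the role of the height direction of a body mesh. Note that the sign of each eigenvector is arbitrary, so a practical alignment needs an extra convention to resolve flipped axes; the text does not describe that step, and it is left out here.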
They are orthogonal to each other, while the eigenvalues indicate the amount of variation along each eigenvector. Therefore, the eigenvector with the largest eigenvalue is the direction in which the mesh shape varies the most. For a human body mesh, this principal direction should be along the height of the body. The next two directions should be along the width and thickness of the body, respectively.

Given two human body meshes $S_a$ and $S_b$, their covariance matrices $\mathbf{M}_a$ and $\mathbf{M}_b$ can be computed with Eq. 2. We form two matrices $\mathbf{A}$ and $\mathbf{B}$ whose columns are the eigenvectors of $\mathbf{M}_a$ and $\mathbf{M}_b$, respectively. To align these two orthogonal matrices, we compute the rotation $\mathbf{R}$ such that

$$\mathbf{B} = \mathbf{R}\mathbf{A}, \tag{3}$$

that is, $\mathbf{R} = \mathbf{B}\mathbf{A}^T$, because $\mathbf{A}$ is orthogonal. Finally, the PCA-based alignment maps each vertex $\mathbf{p}$ of $S_a$ to

$$\mathbf{p}' = \mathbf{R}\mathbf{p} + (\mathbf{c}_b - \mathbf{R}\mathbf{c}_a), \tag{4}$$

where $\mathbf{c}_a$ and $\mathbf{c}_b$ are the centroids of $S_a$ and $S_b$, respectively. After the rigid registration, $S_a$ should be aligned with $S_b$ in terms of main directions. One example of the initial rigid alignment is shown in Fig. 2. We can see that both meshes face forward after being aligned rigidly. However, in terms of height and body shape, they still differ a lot.

Fig. 2 One of the examples of initial rigid alignment

3.2 Morphable shape models

In this part, we introduce the statistical body shape model trained from 200 entire human body meshes using PCA. The training set is from the SPRING dataset [48], which includes 3038 high-resolution body models; each mesh has 12,500 vertices and 25,000 faces. All the meshes have been placed in point-to-point correspondence. This large aligned dataset allows a reliable model to be learnt robustly. Given a set of training shapes, the statistical shape model can be represented as:

$$\mathbf{s} = \boldsymbol{\Phi}\mathbf{c} + \bar{\mathbf{s}}, \tag{5}$$

where $\mathbf{s} \in \Re^{4N \times 1}$ are the 3D coordinates (x, y, z) plus the corresponding homogeneous coordinates of all N vertices; $\boldsymbol{\Phi} \in \Re^{4N \times k}$ are the eigenvectors of the PCA model, $\bar{\mathbf{s}} \in \Re^{4N \times 1}$ is the mean shape, and $\mathbf{c} \in \Re^{k \times 1}$ contains the non-rigid parameters for shape deformation (not to be confused with the centroid of Sect. 3.1).

Apart from a holistic body shape model, to further describe the large amount of shape variability in the human body, we model each region of the body with its own PCA model. In this paper, we employ the body segmentation model provided by the SCAPE [3] dataset. Assume that we have p independent parts in the segmented template $\mathcal{S} = \{\mathbf{s}^i\}_{i=1}^{p}$; the ith part $\mathbf{s}^i$ can also be modeled using Eq. 6:

$$\mathbf{s}^i = \boldsymbol{\Phi}^i\mathbf{c}^i + \bar{\mathbf{s}}^i. \tag{6}$$

Here, $\mathbf{s}^i$, $\boldsymbol{\Phi}^i$ and $\bar{\mathbf{s}}^i$ are the shape coordinates, eigenbasis and mean shape of the model for the ith region, respectively, and $\mathbf{c}^i$ is the latent variable controlling the deformation of the model. As a result, we train two levels of shape model: the first level is a holistic model for the entire body, and the second level is a region-based model that models each body part separately.

3.3 Coarse level registration

The main goal of this registration is to overlap the template and the target scan, while minor details of the body can be ignored at this level. After the rigid transformation, we apply the holistic PCA model trained in Sect. 3.2 to get a deformed template that sits closer to the target point cloud. Here, with the target points $\mathbf{u}$ retrieved by nearest-neighbor search using the k-d tree algorithm, the cost function to be minimized can be formulated as:

$$E(\mathbf{c}) = \|\mathbf{s} - \mathbf{u}\|^2 = \|(\boldsymbol{\Phi}\mathbf{c} + \bar{\mathbf{s}}) - \mathbf{u}\|^2. \tag{7}$$

To solve this equation, we take the partial derivative with respect to $\mathbf{c}$ and set it to zero at the minimum:

$$\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{c} + \boldsymbol{\Phi}^T(\bar{\mathbf{s}} - \mathbf{u}) = \mathbf{0}, \tag{8}$$

and get the closed-form solution,

$$\mathbf{c} = -(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\bar{\mathbf{s}} - \mathbf{u}). \tag{9}$$

3.4 Fine level registration

After the coarse level registration, to capture the non-rigid nature of the body surface and provide an accurately fitted mesh, we make use of the region-based statistical shape model described in Sect. 3.2 and combine it with the non-rigid iterative closest points (NICP) algorithm [2]. Note that during scanning, the subject is unlikely to hold exactly the same pose as the template, especially in the arms and legs, so the hand and foot parts could easily appear as outliers. Although the first level fitting alleviates this effect, the original NICP algorithm still might not generate a satisfactory fitting result. Therefore, for the hand and foot parts, we use only the non-rigid parameters to control the deformation at a coarse-grained level and add a regularization term to make it smooth on the boundary. In this way, we can recover the hand and foot parts which are impaired in the scanning process. The clear and semantic hands and feet allow for statistical shape modeling in the next stage.

3.4.1 Main body registration

We define the body parts that exclude feet and hands as the main body. For the main body parts, we combine the statistical shape model with the NICP algorithm. Our goal is to find a set of affine matrices $\mathbf{X} = \{\mathbf{X}^i\}_{i=1}^{p}$ and non-rigid parameters $\mathbf{C} = \{\mathbf{c}^i\}_{i=1}^{p}$ such that the sum of Euclidean distances between pairs of points of each region is minimal. Here, $\mathbf{X}^i$ is a $4n_i \times 3$ matrix that stacks the transposed $3 \times 4$ affine matrices of all template vertices in the ith part, as in [2]. As shown in Fig. 3, we describe our technique for fitting a template S to a target mesh T. Each of these surfaces is represented as a triangle mesh. Each vertex $v_i$ is influenced by a $3 \times 4$ affine matrix $X_i$ and a non-rigid parameter $C_i$. We define the data error with these two parameters. The data error, indicated by the arrows in Fig. 3, is a weighted sum of the squared distances between the template surface S and the target surface T. Besides the data error, to deform the template smoothly, we also define a stiffness term to constrain vertices that do not move directly towards the target but may move parallel to it. These error terms are summarized in Fig. 3 and described in detail in the following.

Fig. 3 The summary of our matching framework. Our target is to find a set of affine transformations $X_i$ and local PCA parameters $C_i$ such that, when applied to the vertices $v_i$ of the template mesh S, they result in a new surface S′ that matches the target surface T. This diagram shows the match in progress; S′ is moving towards the target but has not reached it. The vertices are divided into three parts which are controlled by three local PCAs. The transformation of each vertex is controlled by an affine transformation as well as the local parameters of the part to which the vertex belongs

Distance term The distance term is used to minimize the Euclidean distances between the source and the target. We assume that the ith part has $n_i$ points, and the cost function is the sum of the errors of all pairs of vertices:

$$E_d(\mathbf{X}) = \sum_{i=1}^{p}\sum_{j=1}^{n_i} \|\mathbf{X}^i_j\mathbf{s}^i_j - \mathbf{u}^i_j\|^2, \tag{10}$$

where $\mathbf{X}^i_j$ is the transformation matrix for the jth vertex in the ith part. Since each part is modeled by the shape model $\mathbf{s}^i_j = \boldsymbol{\Phi}^i_j\mathbf{c}^i + \bar{\mathbf{s}}^i_j$, based on Eq. 6, the distance term can be rewritten and rearranged as:

$$E_d(\mathbf{X}, \mathbf{C}) = \sum_{i=1}^{p}\sum_{j=1}^{n_i} \|\mathbf{X}^i_j(\boldsymbol{\Phi}^i_j\mathbf{c}^i + \bar{\mathbf{s}}^i_j) - \mathbf{u}^i_j\|_F^2 = \sum_{i=1}^{p} \left\| \begin{bmatrix} \mathbf{X}^i_1 & & \\ & \ddots & \\ & & \mathbf{X}^i_{n_i} \end{bmatrix} \begin{bmatrix} \hat{\mathbf{s}}^i_1 \\ \vdots \\ \hat{\mathbf{s}}^i_{n_i} \end{bmatrix} - \begin{bmatrix} \mathbf{u}^i_1 \\ \vdots \\ \mathbf{u}^i_{n_i} \end{bmatrix} \right\|^2, \tag{11}$$

where $\hat{\mathbf{s}}^i_j = \boldsymbol{\Phi}^i_j\mathbf{c}^i + \bar{\mathbf{s}}^i_j$ denotes the jth model vertex of the ith part. We can see that the above equation is not in the standard linear form $\mathbf{A}\mathbf{X} - \mathbf{B} = \mathbf{0}$. To differentiate, we need to swap the positions of the unknown $\mathbf{X}$ and $\hat{\mathbf{S}} = [\hat{\mathbf{s}}_1, \dots, \hat{\mathbf{s}}_n]^T$. Therefore, we obtain the following form:

$$E_d(\mathbf{X}) = \sum_{i=1}^{p} \|\mathbf{D}^i\mathbf{X}^i - \mathbf{U}^i\|^2, \tag{12}$$

where the term $\mathbf{D}^i = \mathrm{diag}(\hat{\mathbf{s}}^{iT}_1, \hat{\mathbf{s}}^{iT}_2, \dots, \hat{\mathbf{s}}^{iT}_{n_i})$, and the set of closest points is $\mathbf{U}^i = [\mathbf{u}^i_1, \mathbf{u}^i_2, \dots, \mathbf{u}^i_{n_i}]^T$.

Stiffness term The stiffness term penalizes the difference between the transformation matrices of neighboring vertices. Similar to [2], it is defined as:

$$E_s(\mathbf{X}) = \sum_{i=1}^{p} \|(\mathbf{M}^i \otimes \mathbf{G}^i)\mathbf{X}^i\|^2. \tag{13}$$

Here, for the ith body part, $\mathbf{G}^i = \mathrm{diag}(1, 1, 1, \gamma^i)$, where $\gamma^i$ is used to balance the scale of the rotational and skew factors against the translational factor. It depends on the units of the data and the deformation type to be expressed. $\mathbf{M}^i$ is the node-arc incidence matrix of the template mesh topology [2] (not to be confused with the covariance matrix of Sect. 3.1).

Complete cost function We combine Eqs. 11 and 13 to obtain the complete cost function:

$$E(\mathbf{X}, \mathbf{C}) = E_d(\mathbf{X}, \mathbf{C}) + E_s(\mathbf{X}) = \sum_{i=1}^{p} \|\mathbf{A}^i\mathbf{X}^i - \mathbf{B}^i\|^2, \tag{14}$$

where $\mathbf{A}^i$ and $\mathbf{B}^i$ collect the stacked stiffness and distance blocks of the ith part. Equation 14 is not a quadratic function, and it is difficult to obtain the optimal local affine transformations $\mathbf{X}$ and non-rigid parameters $\mathbf{C}$ simultaneously. In [10], an alternating optimization scheme is employed to solve this problem. In this paper, we use the same optimization method to find the optimal set of parameters. For details of the solution, please refer to [11].

3.4.2 Hands and feet registration

Although the main body parts are roughly aligned after the first level registration, the distance between the source hands/feet and the corresponding target is large in most cases. In this situation, ICP-based methods easily get trapped in local minima [19]. To address this problem, we perform a PCA-based fitting for each individual hand/foot part. Given one particular hand/foot part model with eigenbasis $\boldsymbol{\Phi}^*$ and mean shape $\bar{\mathbf{s}}^*$, we define an objective function that consists of a distance term and a regularization term, and obtain the optimal non-rigid parameters $\mathbf{c}^*$ by minimizing it.

Distance term It is defined similarly to Eq. 11, but without the affine transformation matrix:

$$E_d(\mathbf{c}^*) = \|(\boldsymbol{\Phi}^*\mathbf{c}^* + \bar{\mathbf{s}}^*) - \mathbf{u}^*\|^2. \tag{15}$$

Boundary smoothness term To stitch the hand/foot smoothly with its neighboring part, we define a boundary smoothness term as follows:

$$E_b(\mathbf{c}^*) = \|\mathbf{P}^*(\boldsymbol{\Phi}^*\mathbf{c}^* + \bar{\mathbf{s}}^*) - \mathbf{b}^*\|^2, \tag{16}$$

where $\mathbf{P}^*$ is the selection matrix of the hand/foot part that picks out the boundary points, and $\mathbf{b}^*$ contains the boundary points of the neighboring part. By enforcing the boundary constraints between two parts, we can regulate the part fitting process to avoid erroneous results caused by outliers.

Complete cost function The fitting objective function can be formulated as:

$$E(\mathbf{c}^*) = \lambda E_d(\mathbf{c}^*) + (1 - \lambda)E_b(\mathbf{c}^*) = \left\| \begin{bmatrix} \lambda\mathbf{I} \\ (1 - \lambda)\mathbf{P}^* \end{bmatrix} (\boldsymbol{\Phi}^*\mathbf{c}^* + \bar{\mathbf{s}}^*) - \begin{bmatrix} \lambda\mathbf{u}^* \\ (1 - \lambda)\mathbf{b}^* \end{bmatrix} \right\|^2 = \|\mathbf{W}^*(\boldsymbol{\Phi}^*\mathbf{c}^* + \bar{\mathbf{s}}^*) - \mathbf{y}^*\|^2, \tag{17}$$

where $\lambda$ is the weighting factor between the two terms, $\mathbf{W}^* = [\lambda\mathbf{I}, (1 - \lambda)\mathbf{P}^*]^T$ and $\mathbf{y}^* = [\lambda\mathbf{u}^*, (1 - \lambda)\mathbf{b}^*]^T$. Strictly speaking, the stacked form weights the two squared terms by $\lambda^2$ and $(1 - \lambda)^2$; it is the stacked form that is minimized below. This is a well-known linear least squares problem. The minimum occurs where the gradient vanishes, that is, $\partial E / \partial\mathbf{c}^* = \mathbf{0}$. Thus, Eq. 17 has the closed-form solution:

$$\mathbf{c}^* = -[(\mathbf{W}^*\boldsymbol{\Phi}^*)^T(\mathbf{W}^*\boldsymbol{\Phi}^*)]^{-1}(\mathbf{W}^*\boldsymbol{\Phi}^*)^T(\mathbf{W}^*\bar{\mathbf{s}}^* - \mathbf{y}^*). \tag{18}$$

The proposed fitting method for hands and feet has a nice convergence property; we show one example of the residual error curve over all fitting iterations in Fig. 11.

4 Kinect scanning platform

In this part, we introduce the 3D scanning platform, which uses a single Microsoft Kinect for Xbox 360. The setup is shown in Fig. 4. The platform is built upon the ReconstructMe application (footnote 3), which is based on KinectFusion [30].

Fig. 4 The top view of spatial arrangement of offline 3D capturing platform

To get the best mesh, we keep the Kinect still at three different heights in turn while the subject stands on a turntable rotating at a constant speed (30 s per round). After scanning one round at the first height, we adjust the Kinect to the second height and scan a second round around the subject. Self-occluded parts such as the armpit and crotch are rescanned if the Kinect does not see them the first time. From our experience, it takes about 90 s to build each mesh with this platform. During data capturing, we require the participants to wear tight clothes. Each person is captured in 5 poses, which include a natural pose and 4 other poses (pose examples are shown in Fig. 13). The capturing process is displayed in Fig. 5.

Fig. 5 The screenshot of our capturing process

Although some occluded parts can be rescanned, holes still exist on the top of the head and the soles of the feet, which the Kinect cannot see. To the best of our knowledge, there is no public Kinect-based human body mesh dataset. Therefore, we utilise the platform above to build a low-quality mesh dataset, named the Kinect-based 3D Human Body (K3D-Hub) Dataset. So far, our K3D-Hub dataset contains 50 different identities and 5 poses for each person. We show examples of our dataset in Fig. 13.

3 http://reconstructme.net/

5 Performance evaluation

To evaluate the performance of our MABR method, we conducted experiments on both high- and low-quality meshes, and show the shape root mean square (RMS) error curve as well as some fitting results for visualization.

Fig. 6 The comparison of 3D shape RMS error of ANICP, NICP, PCA and our MABR

Fig. 7 The front view of fitted results of ANICP, NICP and MABR in the case that the shape of general template differs a lot from the target mesh

5.1 High-quality mesh evaluation

For the evaluation on high-quality data, we use the SPRING [48] dataset, which contains 3038 meshes with various human body shapes. These good-quality meshes are complete and their points are evenly distributed. Furthermore, the meshes are in point-to-point correspondence with each other, which means the dataset can be used as our ground truth for quantitative analysis. Also, the SPRING dataset is divided into male and female subsets. To train models whose muscle and tissue properties are specific to each gender, we train separate male and female shape models.
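As a rough illustration of what training such a shape model and fitting it in closed form involve (Eqs. 5 and 9), the following NumPy sketch may help. It rests on several assumptions that are not from the paper: the function names are hypothetical, each vertex contributes only its three coordinates (the homogeneous coordinate is dropped), PCA is computed with an SVD, and the training data are synthetic rather than SPRING meshes.

```python
import numpy as np

def train_shape_model(shapes, k):
    """PCA shape model from an (m, d) array of corresponded, flattened meshes.

    Returns the mean shape (d,) and an orthonormal basis Phi of shape (d, k).
    """
    shapes = np.asarray(shapes, dtype=float)
    mean = shapes.mean(axis=0)
    # Rows of vt are the principal directions of the centered training set
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:k].T

def fit_coefficients(phi, mean, target):
    """Least-squares c minimizing ||(phi @ c + mean) - target||^2 (cf. Eqs. 7-9)."""
    c, *_ = np.linalg.lstsq(phi, target - mean, rcond=None)
    return c

# Synthetic stand-in for a corresponded training set (10 vertices x 3 coords)
rng = np.random.default_rng(1)
true_basis = np.linalg.qr(rng.normal(size=(30, 3)))[0]
true_mean = rng.normal(size=30)
train = true_mean + rng.normal(size=(50, 3)) @ true_basis.T

mean, phi = train_shape_model(train, k=3)
target = true_mean + true_basis @ np.array([0.5, -1.0, 2.0])
c = fit_coefficients(phi, mean, target)
reconstruction = phi @ c + mean
```

Because the synthetic target lies in the span of the training variation, the reconstruction matches it up to numerical precision. With real scans the target points would instead come from a nearest-neighbor search, and the fit would only approximate the scan.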
For each gender, 200 meshes from the SPRING dataset are used as the training set and the remaining meshes are used as the testing set. To show the performance of our method, we compare the MABR method with NICP [2], ANICP [11], and PCA deformation on the SPRING dataset. We compute the 3D shape root mean square error (RMS error) with Eq. 19 to measure the accuracy of the four methods:

$$\mathrm{RMSError} = \sqrt{\frac{\sum_{i=1}^{n} \|p_i - \hat{p}_i\|^2}{n}}, \tag{19}$$

where points $p_i$ and $\hat{p}_i$ are corresponding points of the ground truth and the fitted result, and n is the number of points in the 3D template mesh. As shown in Fig. 6, the accuracy of our method is comparable with ANICP and much higher than NICP and PCA. In the PCA method, the whole model is controlled only by the trained orthogonal basis, which cannot cover all shape variations. Consequently, the accuracy of PCA is the lowest.

Moreover, when the body shape of the template is very different from the shape of the target, MABR is able to present better fitting results, showing more complete and meaningful limb parts. When the shapes of the template and target scans vary a lot, ICP-based algorithms more easily find incorrect nearest neighbors, and more meaningful vertices are regarded as outliers. We show the front view of some fitted results in Fig. 7, which illustrates the robustness of MABR to outliers. Due to the poor accuracy of the PCA method, we do not show the PCA results here. In Fig. 7, we compare the fitting results of two different body shapes from a general template. The first target mesh differs a lot from the template in the arm part, and we can see that neither ANICP nor NICP can obtain a complete arm for the given target, while the proposed MABR method not only gets a meaningful arm but also makes the result similar to the target mesh (which is reflected by the face). What is more, the contour of the hands from MABR is clearer than from ANICP and NICP, as shown by the fact that the hole of the fist is visible in the MABR results. In the second row, the left arm of the target mesh is bent. Due to the low degree of overlap, the limb parts are regarded as outliers in ANICP and NICP. As a result, the fitting results of NICP and ANICP in the hand and foot parts are distorted and erroneous, while our MABR method successfully fits the target mesh and keeps complete arm shapes at the same time.

We also compare the details of the fitted results from the above four methods in Fig. 8. We can see that for the hand and elbow parts, the PCA method and the proposed MABR method are much better than the other two approaches. In Fig. 8, compared with PCA and MABR, the hand parts of ANICP are severely distorted and the fitted hand of NICP is obscure. As for the elbow, the results of ANICP and NICP are broken, while PCA and MABR are able to preserve the continuity of the fitted mesh. Although PCA can get meaningful results, MABR outperforms it in terms of accuracy, which is reflected by the fitted results in the face parts: the face from MABR is much more similar to the raw scan than the face from PCA. Basically, MABR successfully recovers the shape of the target mesh. In the elbow part, the curvature of the mesh from MABR is much closer to the target than PCA's result.

Fig. 8 The detail comparison of fitting results on SPRING dataset. The side view of the raw scan and the fitted results of ANICP (column 2), NICP (column 3), PCA (column 4) and MABR (column 5). Besides the comparison of the full body, the details of the face, hand and elbow from each method are also compared subsequently

5.2 Low-quality scans evaluation

We evaluate the proposed method on low-quality scans captured by a Microsoft Kinect for Xbox 360. The Kinect is used to scan a person standing on a rotating turntable from three different heights. The scans are preprocessed to remove the background. We compare our proposed MABR method with NICP [2] and ANICP [11]. Fitting results of these three methods are shown in Fig. 9. The proposed MABR method is the only one that models the hand and foot parts completely while keeping high accuracy of the fitting results. We can see that the raw scans have a lot of noise close to the surface. Large holes exist on top of the head. All these challenges require the registration method to be robust to noise, outliers and holes at the same time. From the results, we can see that neither ANICP nor NICP is robust enough to obtain a complete and accurate registered mesh. The hand parts of ANICP and NICP tend to be distorted and incomplete, while the MABR method produces meaningful and complete hands.

Fig. 9 Fitting results from Kinect scans. Column 1 shows the raw body scans, the second to the last columns illustrate the shapes from ANICP, NICP, PCA and the proposed MABR method, respectively

The reason for the failure of NICP and ANICP is that, in real scans, the human pose is hard to control, so the limbs usually do not completely overlap with the template. Therefore, the closest points of the limbs cannot keep the limb shape of the scans, resulting in unexpected fitted shapes. Since our fitting procedures are active, the limb parts of the template can be stretched along the directions of the PCA basis before performing non-rigid ICP, roughly recovering the size of the hand and foot. In this way, our MABR method is not only able to keep a good shape of the scan but is also robust to noise.

Hierarchical noise is common in Kinect scans. On the one hand, some subtle movements are inevitable when the subjects are trying to keep a certain pose for a few minutes. In this case, hierarchical noise may appear around the arms. On the other hand, the subjects are standing on a rotating turntable. The resulting body movements may cause hierarchical noise around the body surface. In Fig. 10, we compare the fitting results of NICP, ANICP and the proposed MABR in the case of hierarchical noise. We can see that hierarchical noise is distributed on the face, hand, and chest in the raw scans. The fitting results of NICP are ambiguous and do not present the shape of the hands, while the results of ANICP and MABR keep the hand shapes. MABR also shows more similar fitting results in the faces and hands.

Fig. 10 The comparison of NICP, ANICP and MABR in the case of hierarchical noises

We also show the robustness of MABR to holes in Fig. 12. Even though there are big holes on top of the head in the raw scan, MABR and ANICP can fill the hole smoothly, which benefits from the trained prior knowledge. NICP merely relies on finding the nearest points on the target, which is sensitive to holes. Therefore, as illustrated in Fig. 12, the fitted result of NICP is uneven.

Fig. 12 The comparison of hole tolerance

In addition, our MABR method has very nice convergence properties. In Fig. 11, we show one example of how the residual error changes as the fitting of the left hand progresses. As can be seen, the residual error decreases monotonically and gradually converges to a minimum value.

Fig. 11 Example of residual error changes as the fitting of left hand in the second level progresses

All the fitting results in the above experiments are deformed from the same template mesh, the mean shape of the training set. As shown in Figs. 9 and 10, the lifting angles of the arms in the test data are not the same, and in Fig. 8 the arms of the test data are bent, while the arms of the template are straight. Arms in these scans tend to be regarded as outliers by the NICP and ANICP methods, but the proposed method is able to keep a meaningful shape during registration, which shows that our method can be applied to scans with different poses to some degree. However, when the target presents a posture different from that of the template, like columns 2–6 shown in Fig. 13, it is still challenging for our method to obtain a reasonable fitting result from the natural standing template, because the searched nearest points cannot keep the shape of the hands during fitting. The resulting fits will be erroneous. Hence, we reckon that fitting on various poses could be one of our future works.

Fig. 13 Examples of K3D-Hub human body scans dataset. We invited both male and female subjects. The ages of the subjects range from 18 to 40. The subjects are mainly from Asia and Europe. Each subject performs 5 different poses

6 Conclusions

In this paper, we propose a multilevel active registration method which combines non-rigid ICP with the statistical shape model to automatically fit the body template model to the target point clouds. Since the PCA shape model is trained with 200 registered meshes, the combination with PCA makes our method robust to noise, outliers and holes. We have shown that the performance of the proposed algorithm is comparable with state-of-the-art non-rigid registration methods and outperforms them when it comes to the alignment of hand/foot parts. Experiments verify that our approach is robust on both noisy Kinect scans and high-quality meshes. Besides the robust MABR method, a Kinect-based human body dataset, named K3D-Hub, is collected, which is the first publicly available low-quality human body scan dataset.

Limitations Our registration algorithm manages to register a high-quality template mesh to noisy Kinect scans with similar poses. However, when the initial poses of the template and target scans differ much, it is still challenging for our method to generate a reasonable fitting result, particularly in the limb parts. We believe that fitting on various poses will be part of our future work. What is more, our targets are scans captured from subjects wearing tight clothes and hats. It is still challenging to deform the template to meshes with loose clothes such as dresses or skirts. This is because the deformation of loose clothes and hair does not follow the deformation of human body muscle. Unexpected results will appear if we apply our trained morphable model to loose clothes and hair. Moreover, in the future, we plan to speed up the fitting algorithm to support real-time applications.

Acknowledgements Funding was provided by China Scholarship Council (Grant No. 201406070079).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Allen, B., Curless, B., Popović, Z.: The space of human body shapes: reconstruction and parameterization from range scans. In: ACM Transactions on Graphics (TOG), vol. 22, pp. 587–594. ACM (2003)
2. Amberg, B., Romdhani, S., Vetter, T.: Optimal step nonrigid icp algorithms for surface registration. In: Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1–8. IEEE (2007)
3. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: shape completion and animation of people. In: ACM Transactions on Graphics (TOG), vol. 24, pp. 408–416. ACM (2005)
4. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–194. ACM Press/Addison-Wesley Publishing Co., New York (1999)
5. Bogo, F., Romero, J., Loper, M., Black, M.: Faust: Dataset and evaluation for 3d mesh registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3794–3801 (2014)
6. Bondi, E., Pala, P., Berretti, S., Del Bimbo, A.: Reconstructing high-resolution face models from kinect depth sequences. IEEE Trans. Inf. Forens. Secur. 11(12), 2843 (2016)
7. Booth, J., Roussos, A., Zafeiriou, S., Ponniah, A., Dunaway, D.: A 3D morphable model learnt from 10,000 faces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5543–5552 (2016)
8. Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Numerical geometry of non-rigid shapes. Springer, New York (2008)
9. Chen, Y., Liu, Z., Zhang, Z.: Tensor-based human body modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 105–112 (2013)
10. Cheng, S., Marras, I., Zafeiriou, S., Pantic, M.: Active nonrigid icp algorithm. In: Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1, pp. 1–8. IEEE (2015)
11. Cheng, S., Marras, I., Zafeiriou, S., Pantic, M.: Statistical non-rigid ICP algorithm and its application to 3D face alignment. Image Vis. Comput. 58, 3–12 (2016)
12. Cui, Y., Chang, W., Nöll, T., Stricker, D.: Kinectavatar: Fully automatic body capture using a single kinect. In: ACCV Workshops (2), pp. 133–147. Citeseer (2012)
13. Darom, T., Keller, Y.: Scale-invariant features for 3-d mesh models. IEEE Trans. Image Process. 21(5), 2758–2769 (2012)
14. Dey, T.K., Fu, B., Wang, H., Wang, L.: Automatic posing of a meshed human model using point clouds. Comput. Gr. 46, 14–24 (2015)
15. Elad, A., Kimmel, R.: On bending invariant signatures for surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1285–1295 (2003)
25. Li, W., Li, X., Goldberg, M., Zhu, Z.: Face recognition by 3d registration for the visually impaired using a rgb-d sensor. In: European Conference on Computer Vision, pp. 763–777. Springer (2014)
26. Liu, Z., Huang, J., Bu, S., Han, J., Tang, X., Li, X.: Template deformation-based 3-d reconstruction of full human body scans from low-cost depth cameras. IEEE Trans. Cybern. 47(3), 695–708 (2016)
27. Liu, Z., Huang, J., Han, J., Bu, S., Lv, J.: Human motion tracking by multiple RGBD cameras. IEEE Trans. Circuits Syst. Video Technol. (2016). doi:10.1109/TCSVT.2016.2564878
28. Liu, Z., Qin, H., Bu, S., Yan, M., Huang, J., Tang, X., Han, J.: 3D real human reconstruction via multiple low-cost depth cameras. Signal Process. 112, 162–179 (2015)
29. Mateus, D., Horaud, R., Knossow, D., Cuzzolin, F., Boyer, E.: Articulated shape matching using laplacian eigenfunctions and unsupervised point registration. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008. pp. 1–8. IEEE (2008)
30. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 127–136. IEEE (2011)
31. Palasek, P., Yang, H., Xu, Z., Hajimirza, N., Izquierdo, E., Patras, I.: A flexible calibration method of multiple kinects for 3d human reconstruction. In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–4. IEEE (2015)
32. Pishchulin, L., Wuhrer, S., Helten, T., Theobalt, C., Schiele, B.: Building statistical shape spaces for 3d human modeling. arXiv:1503.05860 (2015)
33. Pratikakis, I., Spagnuolo, M., Theoharis, T., Veltkamp, R.: A robust 3d interest points detector based on harris operator. In:
16.
Fechteler, P., Hilsmann, A., Eisert, P.: Kinematic icp for articu- Eurographics Workshop on 3D Object Retrieval, vol. 5. Citeseer lated template fitting. In: Proceedings of International Workshop (2010) on Vision, Modeling and Visualization, pp. 12–14 (2012) 34. Robinette, K.M., Daanen, H., Paquet, E.: The caesar project: a 17. Haehnel, D., Thrun, S., Burgard, W.: An extension of the icp 3-d surface anthropometry survey. In: Second International Con- algorithm for modeling nonrigid objects with mobile robots. ference on 3-D Digital Imaging and Modeling, 1999. Proceed- IJCAI 3, 915–920 (2003) ings. pp. 380–386 (1999). doi:10.1109/IM.1999.805368 18. Hasler, N., Stoll, C., Sunkel, M., Rosenhahn, B., Seidel, H.P.: A 35. Sahillioğlu, Y., Yemez, Y.: 3d shape correspondence by isom- statistical model of human pose and body shape. In: Computer etry-driven greedy optimization. In: 2010 IEEE Conference on Graphics Forum, vol.  28, pp. 337–346. Wiley Online Library Computer Vision and Pattern Recognition (CVPR), pp. 453–458. (2009) IEEE (2010) 19. Huang, Q.X., Adams, B., Wicke, M., Guibas, L.J.: Non-rigid 36. Sahillioglu, Y., Yemez, Y.: Minimum-distortion isometric shape registration under isometric deformations. In: Computer Graph- correspondence using em algorithm. IEEE Trans. Pattern Anal. ics Forum, vol. 27, pp. 1449–1457. Wiley Online Library (2008) Mach. Intell. 34(11), 2203–2215 (2012) 20. Jain, V., Zhang, H.: Robust 3d shape correspondence in the 37. Sandbach, G., Zafeiriou, S., Pantic, M.: Local normal binary spectral domain. In: Shape Modeling and Applications, 2006. patterns for 3d facial action unit detection. In: 2012 19th IEEE SMI 2006. IEEE International Conference on, pp. 19–19. IEEE International Conference on Image Processing (ICIP), pp. 1813– (2006) 1816. IEEE (2012) 21. Jourabloo, A., Liu, X.: Pose-invariant 3d face alignment. In: 38. 
Smeets, D., Keustermans, J., Vandermeulen, D., Suetens, P.: Proceedings of the IEEE International Conference on Computer meshsift: local surface features for 3d face recognition under Vision, pp. 3694–3702 (2015) expression variations and partial data. Comput. Vis. Image 22. Kou, Q., Yang, Y., Du, S., Luo, S., Cai, D.: A modified non- Underst. 117(2), 158–169 (2013) rigid icp algorithm for registration of chromosome images. In: 39. Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably International Conference on Intelligent Computing, pp. 503–513. informative multi-scale signature based on heat diffusion. In: Springer (2016) Computer graphics forum, vol. 28, pp. 1383–1392. Wiley Online 23. Li, B.Y., Mian, A.S., Liu, W., Krishna, A.: Using kinect for face Library (2009) recognition under varying poses, expressions, illumination and 40. Tang, S., Wang, X., Lv, X., Han, T.X., Keller, J., He, Z., Skubic, disguise. In: 2013 IEEE Workshop on Applications of Computer M., Lao, S.: Histogram of oriented normal vectors for object rec- Vision (WACV), pp. 186–192. IEEE (2013) ognition with a depth sensor. In: Computer Vision–ACCV 2012, 24. Li, H., Sumner, R.W., Pauly, M.: Global correspondence opti- pp. 525–538. Springer (2012) mization for non-rigid registration of depth scans. In: Computer 41. Tena, J.R., De la Torre, F., Matthews, I.: Interactive region-based linear 3d face models. In: ACM SIGGRAPH, pp. 76:1–76:10. graphics forum, vol.  27, pp. 1421–1430. Wiley Online Library New York, NY, USA (2011). doi: 10.1145/1964921.1964971 (2008) 1 3 270 Z. Xu et al. 42. Thies, J., Zollhöfer, M., Nießner, M., Valgaerts, L., Stamminger, International Conference and Workshops on Automatic Face and M., Theobalt, C.: Real-time expression transfer for facial reen- Gesture Recognition (FG), pp. 1–6. IEEE (2013) actment. ACM Trans. Gr. (TOG) 34(6), 183 (2015) 47. Yang, H., Patras, I.: Sieving regression forest votes for facial fea- 43. 
Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, ture detection in the wild. In: Proceedings of the IEEE Interna- M.: Face2face: Real-time face capture and reenactment of rgb tional Conference on Computer Vision, pp. 1936–1943 (2013) videos. In: Proc. Computer Vision and Pattern Recognition 48. Yang, Y., Yu, Y., Zhou, Y., Du, S., Davis, J., Yang, R.: Seman- (CVPR), IEEE 1 (2016) tic parametric reshaping of human body models. In: 2014 2nd 44. Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H.: Scanning 3d full International Conference on 3D Vision (3DV), vol. 2, pp. 41–48. human bodies using kinects. IEEE Trans. Vis. Comput. Gr. IEEE (2014) 18(4), 643–650 (2012) 49. Zuffi, S., Black, M.J.: The stitched puppet: A graphical model of 45. Yang, H., He, X., Jia, X., Patras, I.: Robust face alignment under 3d human shape and pose. In: 2015 IEEE Conference on Com- occlusion via regional predictive power estimation. IEEE Trans. puter Vision and Pattern Recognition (CVPR), pp. 3537–3546. Image Process. 24(8), 2393–2403 (2015) IEEE (2015) 46. Yang, H., Patras, I.: Privileged information-based conditional regression forest for facial feature detection. In: 2013 10th IEEE 1 3

Journal

Multimedia Systems, Springer Journals

Published: Mar 10, 2017
