Structure from motion using dense CNN features with keypoint relocalization

Abstract

Structure from motion (SfM) using imagery that involves extreme appearance changes remains a challenging task due to the loss of feature repeatability. Using feature correspondences obtained by matching densely extracted convolutional neural network (CNN) features significantly improves the SfM reconstruction capability. However, the reconstruction accuracy is limited by the spatial resolution of the extracted CNN features, which does not reach pixel-level accuracy in the existing approach. Providing dense feature matches with precise keypoint positions is not trivial because of the memory limitation and computational burden of dense features. To achieve accurate SfM reconstruction with highly repeatable dense features, we propose an SfM pipeline that uses dense CNN features with relocalization of keypoint positions, which efficiently and accurately provides pixel-level feature correspondences. We then demonstrate on the Aachen Day-Night dataset that the proposed SfM using dense CNN features with keypoint relocalization outperforms a state-of-the-art SfM (COLMAP using RootSIFT) by a large margin.

Keywords: Structure from Motion, Feature detection and description, Feature matching, 3D reconstruction

1 Introduction

Structure from motion (SfM) is getting ready for 3D reconstruction using only images, thanks to off-the-shelf software [1-3] and open-source libraries [4-10]. They provide impressive 3D models, especially when targets are captured from many viewpoints with large overlaps. The state-of-the-art SfM pipelines, in general, start by extracting local features [11-17] and matching them across images, followed by pose estimation, triangulation, and bundle adjustment [18-20]. The performance of local features and their matching is therefore crucial for 3D reconstruction by SfM.

In this decade, the performance of local features, namely SIFT [11] and its variants [16, 21-24], has been validated on 3D reconstruction as well as many other tasks [25-27]. The local features give promising matches for well-textured surfaces/objects but significantly drop in performance when matching weakly textured objects [28], repeated patterns [29], extreme changes of viewpoint [21, 30, 31], and illumination changes [32, 33], because of the degradation in repeatability of feature point (keypoint) extraction [21, 31]. This problem can be mitigated by using densely detected features on a regular grid [34, 35], but their merit has only been demonstrated in image retrieval [32, 36] or image classification tasks [26, 34] that use the features for global image representation and do not require one-to-one feature correspondences as in SfM.

Only recently has SfM with densely detected features been presented in [37]. DenseSfM [37] uses convolutional neural network (CNN) features as densely detected features, i.e., it extracts convolutional layers of a deep neural network [38] and converts them into feature descriptors of keypoints on a grid pattern (Section 3.1). As the main focus of [37] is camera localization, neither the SfM architecture, including dense CNN feature description and matching, nor its 3D reconstruction performance is studied in detail.

1.1 Contribution

In this work, we first review the details of the SfM pipeline with dense CNN feature extraction and matching. We then propose a keypoint relocalization that uses the structure of convolutional layers (Section 3.3) to overcome keypoint inaccuracy on the grid resolution and the computational burden of dense feature matching. Finally, the performance of SfM with dense CNN features using the proposed keypoint relocalization is evaluated on the Aachen Day-Night [37] dataset and additionally on the Strecha [39] dataset.

*Correspondence: widya.a.aa@m.titech.ac.jp, Tokyo Institute of Technology, O-okayama, Meguro-ku, 152-8550 Tokyo, Japan
2 Related work

2.1 SfM and VisualSLAM

The state-of-the-art SfM is divided into a few mainstream pipelines: incremental (or sequential) [4, 6, 40], global [8, 9, 41], and hybrid [10, 42].

VisualSLAM approaches, namely LSD-SLAM [43] and DTAM [44], repeat camera pose estimation based on selected keyframes and (semi-)dense reconstruction using pixel-level correspondences in real time. These methods are particularly designed to work with video streams, i.e., short-baseline camera motion, but not with general wide-baseline camera motion.

Recently, Sattler et al. [37] introduced the CNN-based DenseSfM that adopts densely detected and described features. However, their SfM uses fixed poses and intrinsic parameters of reference images when evaluating the performance of query image localization. They also do not address the keypoint inaccuracy of CNN features. Therefore, it remains an open challenge.

2.2 Feature points

The de facto standard local feature, SIFT [11], is capable of matching images under viewpoint and illumination changes thanks to scale- and rotation-invariant keypoint patches described by histograms of the oriented gradient. ASIFT [21] and its variants [30, 31] explicitly generate synthesized views in order to improve the repeatability of keypoint detection and description under extreme viewpoint changes.

An alternative approach to improve feature matching between images across extreme appearance changes is to use densely sampled features. Densely detected features are often used in multi-view stereo [45] with DAISY [46], or in image retrieval and classification [35, 47] with Dense SIFT [34]. However, dense features are not spotlighted in the task of one-to-one feature correspondence search under unknown camera poses, due to their loss of scale and rotation invariance, the inaccuracy of localized keypoints, and the computational burden.

2.3 CNN features

Fischer et al. [48] reported that, given feature positions, descriptors extracted from CNN layers have better matchability compared to SIFT [11]. More recently, Schonberger et al. [49] also showed that CNN-based learned local features such as LIFT [17], Deep-Desc [50], and ConvOpt [51] have higher recall compared to SIFT [11] but still cannot outperform its variants, e.g., DSP-SIFT [16] and SIFT-PCA [52].

Those studies motivate us to adopt a CNN architecture for extracting features from images and matching them for SfM, as it efficiently outputs multi-resolution features and has the potential to be improved by better training or architectures.

3 The pipeline: SfM using dense CNN features with keypoint relocalization

Our SfM using densely detected features mimics the state-of-the-art incremental SfM pipeline that consists of feature extraction (Section 3.1), feature matching (Sections 3.2 to 3.4), and incremental reconstruction (Section 3.5). Figure 1 overviews the pipeline. In this section, we describe each component while stating the differences to sparse keypoint-based approaches.

Fig. 1 Pipeline of the proposed SfM using dense CNN features with keypoint relocalization. Our SfM starts from dense feature extraction (Section 3.1), feature matching (Section 3.2), the proposed keypoint relocalization (Section 3.3), and feature verification using RANSAC with multiple homographies (Section 3.4), followed by 3D reconstruction (Section 3.5)
3.1 Dense feature extraction

Firstly, our method densely extracts feature descriptors and their locations from the input image. In the same spirit as [53, 54], we feed images into a modern CNN architecture [38, 55, 56] and use the convolutional layers as densely detected keypoints on a regular grid, i.e., cropping out the fully connected and softmax layers. In the following, we choose VGG-16 [38] as the base network architecture and focus on the description tailored to it, but it can be replaced with other networks with marginal modification.

As illustrated in Fig. 2, VGG-16 [38] is composed of five max-pooling layers and 16 weight layers. We extract the max-pooling layers as dense features. As can be seen in Fig. 2, the conv1 max-pooling layer does not yet have the same resolution as the input image. We therefore also extract conv1_2, one layer before the conv1 max-pooling layer, which has pixel-level accuracy.

Fig. 2 Features extracted using CNN. The figure summarizes blocks of convolutional layers of VGG-16 as an example of CNN architecture. Our SfM uses the layers colored in red as features. For example, given an input image of 1600 × 1200 pixels, we extract 256-dimensional features of 200 × 150 spatial resolution from the conv3 max-pooling layer
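Our implementation uses MATLAB with MatConvNet (Section 4); the following is only a minimal PyTorch/torchvision sketch of the same idea, collecting the conv1_2 activation and every max-pooling output of VGG-16 as multi-resolution dense descriptor maps. The function name and the use of torchvision's pretrained VGG-16 are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (not the authors' MATLAB/MatConvNet implementation):
# run an image through VGG-16 and keep conv1_2 plus every max-pooling
# output as dense descriptor maps on progressively coarser grids.
import torch
import torch.nn as nn
import torchvision.models as models

def extract_dense_features(image):  # image: (1, 3, H, W) float tensor
    vgg = models.vgg16(pretrained=True).features.eval()
    maps = {}
    x = image
    pool_idx = 0
    with torch.no_grad():
        for layer in vgg:
            prev = x
            x = layer(x)
            if isinstance(layer, nn.MaxPool2d):
                pool_idx += 1
                if pool_idx == 1:
                    # activation right before the first pooling ~ conv1_2
                    maps["conv1_2"] = prev
                maps[f"pool{pool_idx}"] = x   # e.g., pool3 is 1/8 resolution
    return maps  # dict of (1, C, h, w) descriptor maps

# Each cell (i, j) of maps["pool3"] is a 256-D descriptor; its keypoint
# location in the input image is the center of the corresponding grid cell.
```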
3.2 Tentative matching

Given multi-level feature point locations and descriptors, tentative matching uses an upper max-pooling layer (lower spatial resolution) to establish initial correspondences. This is motivated by the fact that the upper max-pooling layers have a larger receptive field and encode more semantic information [48, 57, 58], which potentially gives high matchability across appearance changes. Having a lower spatial resolution is also advantageous in terms of computational efficiency.

For a pair of images, CNN descriptors are tentatively matched by searching their nearest neighbors (L2 distances) and refined by taking mutually nearest neighbors. Note that the standard ratio test [11] removes too many feature matches, as neighboring features on a regularly sampled grid tend to be similar to each other.

We perform feature descriptor matching for all pairs of images, or for images shortlisted by image retrieval, e.g., NetVLAD [53].
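A compact NumPy sketch of this tentative matching, assuming the dense descriptors of the two images have already been flattened into rows: mutual nearest neighbors are taken under L2 distance and, as discussed above, no ratio test is applied. Function and variable names are illustrative only.

```python
# Mutual nearest-neighbor matching of dense CNN descriptors (sketch).
# desc1: (N1, D), desc2: (N2, D) descriptor arrays.
import numpy as np

def mutual_nn_matches(desc1, desc2):
    # Pairwise squared L2 distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (np.sum(desc1**2, axis=1)[:, None]
          + np.sum(desc2**2, axis=1)[None, :]
          - 2.0 * desc1 @ desc2.T)
    nn12 = np.argmin(d2, axis=1)        # best match in image 2 for each feature of image 1
    nn21 = np.argmin(d2, axis=0)        # best match in image 1 for each feature of image 2
    idx1 = np.arange(desc1.shape[0])
    keep = nn21[nn12] == idx1           # mutual consistency check
    return np.stack([idx1[keep], nn12[keep]], axis=1)   # (M, 2) index pairs
```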
3.3 Keypoint relocalization

Tentative matching using the upper max-pooling layers, e.g., conv5, generates distinctive correspondences, but the accuracy of the keypoint positions is limited by their spatial resolution. This keypoint inaccuracy can be mitigated by a coarse-to-fine matching from the extracted max-pooling layer up to the conv1_2 layer, utilizing the intermediate max-pooling layers between them. For example, the matched keypoints found on the conv3 layer are transferred to conv2 (higher spatial resolution), and new correspondences are searched only in the area constrained by the transferred keypoints. This can be repeated until reaching the conv1_2 layer. However, this naive coarse-to-fine matching generates too many keypoints, which may lead to problems of computation and memory usage in the incremental SfM step, especially bundle adjustment.

To generate dense feature matches with pixel-level accuracy while preserving their quantity, we propose a keypoint relocalization method as follows. For each feature point at the current layer, we retrieve the descriptors on the lower layer (higher spatial resolution) in the corresponding K × K pixels. The feature point is relocalized at the pixel position that has the largest descriptor norm (L2 norm) in the K × K pixels. This relocalization is repeated until it reaches the conv1_2 layer, which has the same resolution as the input image (see also Fig. 3).

Fig. 3 Keypoint relocalization. a A keypoint on a sparser level is relocalized using a map computed from the descriptors' L2 norm on a lower level, which has a higher spatial resolution. It is reassigned to the position on the lower level that has the largest value in the corresponding K × K neighborhood. By repeating this, the relocalized keypoint position in conv1_2 has the accuracy of the input image pixels. b The green dots show the extracted conv3 feature points (top) and the result of our keypoint relocalization (bottom)
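The relocalization step can be sketched in NumPy as follows, assuming descriptor maps such as those produced by the extraction sketch above, ordered from coarse to fine and ending at conv1_2, with each map exactly twice the resolution of the previous one (as in VGG-16) and K = 2 as used in the experiments (see the Endnotes). Names are illustrative.

```python
# Keypoint relocalization sketch: walk a keypoint from a coarse layer down to
# conv1_2, at each step snapping it to the cell with the largest descriptor
# L2 norm inside the corresponding K x K window of the next finer layer.
import numpy as np

def relocalize_keypoint(yx, maps_coarse_to_fine, K=2):
    y, x = yx                                    # integer cell coordinates on the coarsest map
    for finer in maps_coarse_to_fine[1:]:        # each finer map has 2x the resolution
        norm_map = np.linalg.norm(finer, axis=0)     # (h, w) L2 norm of (C, h, w) descriptors
        y0, x0 = 2 * y, 2 * x                    # top-left corner of the corresponding window
        window = norm_map[y0:y0 + K, x0:x0 + K]
        dy, dx = np.unravel_index(np.argmax(window), window.shape)
        y, x = y0 + dy, x0 + dx
    return y, x                                  # pixel-accurate position on the finest map
```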
3.4 Feature verification using RANSAC with multiple homographies

Using all the relocalized feature points, we next remove outliers from the set of tentative matches by Homography-RANSAC. We use a vanilla RANSAC instead of the state-of-the-art spatial verification [59], taking into account the spatial density of feature correspondences. To detect inlier matches lying on several planes, Homography-RANSAC is repeated while excluding the inlier matches of the best hypothesis. The RANSAC inlier/outlier threshold is set to be loose to allow features off the planes.

3.5 3D reconstruction

Having all the relocalized keypoints filtered by RANSAC, we can export them to any available pipeline that performs pose estimation, point triangulation, and bundle adjustment.

Dense matching may produce many confusing feature matches in scenes with many repetitive structures, e.g., windows, doors, and pillars. In such cases, we keep only the N best matching image pairs for each image in the dataset, based on the number of inlier matches of multiple Homography-RANSAC.
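A minimal OpenCV-based sketch of this multi-homography verification: Homography-RANSAC is run repeatedly, and the inliers of each accepted hypothesis are removed before the next run. The loose 10-pixel threshold and the limit of five homographies follow the experimental settings in Section 4; the function name and the minimum-inlier cutoff are our own choices for illustration.

```python
# Repeated Homography-RANSAC over the tentative matches (sketch).
# pts1, pts2: (M, 2) float32 arrays of matched keypoint coordinates.
import numpy as np
import cv2

def multi_homography_inliers(pts1, pts2, max_planes=5, thresh=10.0, min_inliers=15):
    remaining = np.arange(len(pts1))
    all_inliers = []
    for _ in range(max_planes):
        if len(remaining) < 4:                  # at least 4 matches are needed for a homography
            break
        H, mask = cv2.findHomography(pts1[remaining], pts2[remaining],
                                     cv2.RANSAC, thresh)
        if H is None:
            break
        inl_mask = mask.ravel().astype(bool)
        inliers = remaining[inl_mask]
        if len(inliers) < min_inliers:          # stop when the next plane is too weak
            break
        all_inliers.append(inliers)
        remaining = remaining[~inl_mask]        # exclude inliers of the best hypothesis
    return np.concatenate(all_inliers) if all_inliers else np.empty(0, dtype=int)
```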
4 Experiments

We implement feature detection, description, and matching (Sections 3.1 to 3.4) in MATLAB with third-party libraries (MatConvNet [60] and the Yael library [61]). Dense CNN features are extracted using the VGG-16 network [38]. Using the conv4 and conv3 max-pooling layers, feature matches are computed by the coarse-to-fine matching, followed by multiple Homography-RANSAC that finds at most five homographies supported by an inlier threshold of 10 pixels. The best N pairs per image based on multiple Homography-RANSAC are imported to COLMAP [6] with the fixed intrinsic parameter option for scenes with many repetitive structures. Otherwise, we use all the image pairs.

In our preliminary experiments, we tested other layers having the same spatial resolution, e.g., the conv4_3 and conv3_3 layers, in the coarse-to-fine matching, but we observed no improvement in 3D reconstruction. As a max-pooling layer has half the depth dimension compared with the other layers at the same spatial resolution, we chose the max-pooling layers as the dense features for efficiency.

In the following, we evaluate the reconstruction performance on the Aachen Day-Night [37] and Strecha [39] datasets. We compare our SfM using dense CNN features with keypoint relocalization to the baseline COLMAP with DoG+RootSIFT features [6]. In addition, we also compare our SfM to SfM using dense CNN features without keypoint relocalization [37]. All experiments are run on a computer equipped with a 3.20-GHz Intel Core i7-6900K CPU with 16 threads and a 12-GB GeForce GTX 1080Ti.

4.1 Results on Aachen Day-Night dataset

The Aachen Day-Night dataset [37] is aimed at evaluating SfM and visual localization under large illumination changes such as day and night. It includes 98 subsets of images. Each subset consists of 20 day-time images and one night-time image, their reference camera poses, and 3D points.

For each subset, we run SfM and evaluate the estimated camera pose of the night image as follows. First, the reconstructed SfM model is registered to the reference camera poses by a similarity transform obtained from the camera positions of the day-time images. We then evaluate the estimated camera pose of the night image by measuring the positional error (L2 distance) and the angular error acos((trace(R_ref^T R_night) - 1) / 2).
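A small NumPy sketch of this evaluation, assuming camera centers and world-to-camera rotations are given: day-time camera centers are aligned with a least-squares similarity (Umeyama) transform, and the night pose is then scored with the positional and angular errors defined above. Function names and the exact registration recipe are our illustration under these assumptions, not the authors' evaluation code.

```python
# Pose-error evaluation sketch: similarity alignment on day-time camera centers,
# then positional / angular error of the night camera.
import numpy as np

def umeyama_similarity(src, dst):
    # Least-squares similarity transform (s, R, t) with dst ~ s * R @ src + t.
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((src - mu_s) ** 2).sum(axis=1).mean()
    t = mu_d - s * R @ mu_s
    return s, R, t

def night_pose_errors(day_centers_est, day_centers_ref, c_night_est, c_night_ref,
                      R_night_est, R_night_ref):
    s, R, t = umeyama_similarity(day_centers_est, day_centers_ref)
    c_aligned = s * R @ c_night_est + t              # night camera center after registration
    pos_err = np.linalg.norm(c_aligned - c_night_ref)
    R_aligned = R_night_est @ R.T                    # estimated orientation in the reference frame
    cos_a = (np.trace(R_night_ref.T @ R_aligned) - 1.0) / 2.0
    ang_err = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return pos_err, ang_err
```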
Table 1 shows the number of reconstructed cameras. The proposed SfM with keypoint relocalization (conv1_2) can reconstruct 96 night images, twice as many as the baseline method using COLMAP with DoG+RootSIFT [6]. This result validates the benefit of densely detected features, which can provide correspondences across large illumination changes as they suffer a smaller loss in keypoint detection repeatability than standard DoG. On the other hand, both methods, with sparse and dense features, work well for reconstructing day images. The difference between with and without keypoint relocalization can be seen more clearly in the next evaluation.

Table 1 Number of cameras reconstructed on the Aachen dataset

        DoG+RootSIFT [6]   DenseCNN w/o reloc   DenseCNN w/ reloc (Ours)
Night   48                 95                   96
Day     1910               1924                 1944

The proposed method has the largest number of reconstructed cameras for both day and night images.

Figure 4 shows the percentages of night images reconstructed (y-axis) within certain positional and angular error thresholds (x-axis). Similarly, Table 2 shows the reconstruction percentages of night images for varying distance error thresholds with the angular error threshold fixed at 10°. As can be seen from both evaluations, the proposed SfM using dense CNN features with keypoint relocalization outperforms the baseline DoG+RootSIFT [6] by a large margin. The improvement by the proposed keypoint relocalization is significant when the evaluation accounts for pose accuracy. Notice that the SfM using dense CNN features without keypoint relocalization [37] performs worse than the baseline DoG+RootSIFT [6] at small thresholds, e.g., below 3.5 m positional and 2° angular error. This indicates that the proposed keypoint relocalization gives features at more stable and accurate positions and provides better inlier matches for COLMAP, which results in a 3D reconstruction of higher quality.

Fig. 4 Quantitative evaluation on the Aachen Day-Night dataset. The poses of night images reconstructed by the baseline DoG+RootSIFT [6] (red), the DenseCNN without keypoint relocalization (green), and the proposed DenseCNN (blue) are evaluated using the reference poses. The graphs show the percentages of correctly reconstructed camera poses of night images (y-axis) at positional (a) and angular (b) error thresholds (x-axis)

Table 2 Evaluation of reconstructed camera poses (both position and orientation)

         DoG+RootSIFT [6]   DenseCNN w/o reloc   DenseCNN w/ reloc (Ours)
0.5 m    15.31              5.10                 18.37
1.0 m    25.61              14.29                33.67
5.0 m    36.73              45.92                69.39
10.0 m   35.71              61.22                81.63
20.0 m   39.80              69.39                82.65

The numbers show the percentage of reconstructed night images within given positional error thresholds and an angular error fixed at 10°.

Figure 5 illustrates a qualitative comparison between our method and the baseline DoG+RootSIFT [6].

Fig. 5 Example of 3D reconstruction in the Aachen dataset. These figures show qualitative examples of SfM using DoG+RootSIFT [6] (a) and our dense CNN with keypoint relocalization (b). Our method can reconstruct all the 21 images in the subset, whereas the baseline DoG+RootSIFT [6] fails to reconstruct it. As a nature of dense feature matching, our method reconstructs 42,402 3D points, which is 8.2 times more than the baseline method

4.2 Results on Strecha dataset

We additionally evaluate our SfM using dense CNN features with the proposed keypoint relocalization on all six subsets of the Strecha dataset [39], a standard benchmark for SfM and MVS. The positional and angular errors between the reconstructed cameras and the ground truth poses are evaluated. In our SfM, we take only feature matches from the best N = 5 image pairs for each image to suppress artifacts from confusing image pairs.

The mean positional and angular errors resulting from our SfM are 0.59 m and 2.27°. Although these errors are worse than those of the state-of-the-art COLMAP with DoG+RootSIFT [6], which are 0.17 m and 0.90°, the quantitative evaluation on the Strecha dataset demonstrates that our SfM does not overfit to specific challenging tasks but works reasonably well in standard (easy) situations.

5 Conclusion

We presented a new SfM using dense features extracted from a CNN with the proposed keypoint relocalization to improve the accuracy of feature positions sampled on a regular grid. The advantage of our SfM has been demonstrated on the Aachen Day-Night dataset, which includes images with large illumination changes. The result on the Strecha dataset also showed that our SfM works for standard datasets and does not overfit to a particular task, although it is less accurate than the state-of-the-art SfM with local features. We hope the proposed SfM becomes a milestone for 3D reconstruction in particularly challenging situations.

Endnotes

1 We use K = 2 throughout the experiments.
2 Although the poses are carefully obtained with manual verification, the poses are called "reference poses" rather than ground truth.

Acknowledgements

This work was partly supported by JSPS KAKENHI Grant Numbers 17H00744, 15H05313, 16KK0002, and the Indonesia Endowment Fund for Education.

Availability of data and materials

The code will be made publicly available on acceptance.

Authors' contributions

AR ran all the experiments and wrote the initial draft of the manuscript. AT revised the manuscript. Both AT and MXO provided supervision, many meaningful discussions, and guidance to AR in this research. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 14 March 2018. Accepted: 6 May 2018.
References

1. Pix4D - Professional drone mapping and photogrammetry software. https://pix4d.com/. Accessed 11 Feb 2018
2. Agisoft Photoscan. http://www.agisoft.com/. Accessed 11 Feb 2018
3. Discover Photogrammetry Software - Photomodeler. http://www.photomodeler.com/index.html. Accessed 11 Feb 2018
4. Fuhrmann S, Langguth F, Goesele M (2014) MVE - A multi-view reconstruction environment. In: GCH. Eurographics Association, Aire-la-Ville. pp 11-18
5. Sweeney C, Hollerer T, Turk M (2015) Theia: A fast and scalable structure-from-motion library. In: Proc. ACMM. ACM, New York. pp 693-696
6. Schonberger JL, Frahm JM (2016) Structure-from-motion revisited. In: Proc. CVPR. IEEE. pp 4104-4113
7. Schönberger JL, Zheng E, Frahm JM, Pollefeys M (2016) Pixelwise view selection for unstructured multi-view stereo. In: Proc. ECCV. Springer, Cham. pp 501-518
8. Wilson K, Snavely N (2014) Robust global translations with 1DSfM. In: Proc. ECCV. Springer, Cham. pp 61-75
9. Moulon P, Monasse P, Perrot R, Marlet R (2016) OpenMVG: Open multiple view geometry. In: International Workshop on Reproducible Research in Pattern Recognition. Springer, Cham. pp 60-74
10. Cui H, Gao X, Shen S, Hu Z (2017) HSfM: Hybrid structure-from-motion. In: Proc. CVPR. IEEE. pp 2393-2402
11. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91-110
12. Mikolajczyk K, Schmid C (2004) Scale & affine invariant interest point detectors. IJCV 60(1):63-86
13. Kadir T, Zisserman A, Brady M (2004) An affine invariant salient region detector. In: Proc. ECCV. Springer, Cham. pp 228-241
14. Tuytelaars T, Van Gool L (2004) Matching widely separated views based on affine invariant regions. IJCV 59(1):61-85
15. Arandjelović R, Zisserman A (2012) Three things everyone should know to improve object retrieval. In: Proc. CVPR. IEEE, Providence. pp 2911-2918
16. Dong J, Soatto S (2015) Domain-size pooling in local descriptors: DSP-SIFT. In: Proc. CVPR. IEEE, Boston. pp 5097-5106
17. Yi KM, Trulls E, Lepetit V, Fua P (2016) LIFT: Learned invariant feature transform. In: Proc. ECCV. Springer, Cham. pp 467-483
18. Snavely N, Seitz SM, Szeliski R (2008) Modeling the world from internet photo collections. IJCV 80(2):189-210
19. Agarwal S, Furukawa Y, Snavely N, Curless B, Seitz SM, Szeliski R (2010) Reconstructing Rome. Computer 43(6):40-47
20. Agarwal S, Furukawa Y, Snavely N, Simon I, Curless B, Seitz SM, Szeliski R (2011) Building Rome in a day. Commun ACM 54(10):105-112
21. Morel JM, Yu G (2009) ASIFT: A new framework for fully affine invariant image comparison. SIAM J Imaging Sci 2(2):438-469
22. Ke Y, Sukthankar R (2004) PCA-SIFT: A more distinctive representation for local image descriptors. In: Proc. CVPR, vol 2. IEEE, Washington
23. Abdel-Hakim AE, Farag AA (2006) CSIFT: A SIFT descriptor with color invariant characteristics. In: Proc. CVPR, vol 2. IEEE, New York. pp 1978-1983
24. Bay H, Tuytelaars T, Van Gool L (2006) SURF: Speeded up robust features. In: Proc. ECCV. Springer, Cham. pp 404-417
25. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Proc. ECCV, vol 1. Springer, Cham. pp 1-2
26. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc. CVPR, vol 2. IEEE, New York. pp 2169-2178
27. Chong W, Blei D, Li FF (2009) Simultaneous image classification and annotation. In: Proc. CVPR. IEEE, Miami. pp 1903-1910
28. Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P, Lepetit V (2012) Gradient response maps for real-time detection of textureless objects. IEEE PAMI 34(5):876-888
29. Torii A, Sivic J, Pajdla T, Okutomi M (2013) Visual place recognition with repetitive structures. In: Proc. CVPR. IEEE, Portland. pp 883-890
30. Mishkin D, Matas J, Perdoch M (2015) MODS: Fast and robust method for two-view matching. CVIU 141:81-93
31. Taira H, Torii A, Okutomi M (2016) Robust feature matching by learning descriptor covariance with viewpoint synthesis. In: Proc. ICPR. IEEE, Cancun. pp 1953-1958
32. Torii A, Arandjelović R, Sivic J, Okutomi M, Pajdla T (2015) 24/7 place recognition by view synthesis. In: Proc. CVPR. IEEE, Boston. pp 1808-1817
33. Radenović F, Schonberger JL, Ji D, Frahm JM, Chum O, Matas J (2016) From dusk till dawn: Modeling in the dark. In: Proc. CVPR. IEEE, Las Vegas. pp 5488-5496
34. Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: Proc. ICCV. IEEE, Rio de Janeiro. pp 1-8
35. Liu C, Yuen J, Torralba A (2016) SIFT flow: Dense correspondence across scenes and its applications. In: Dense Image Correspondences for Computer Vision. Springer, Cham. pp 15-49
36. Zhao WL, Jégou H, Gravier G (2013) Oriented pooling for dense and non-dense rotation-invariant features. In: Proc. BMVC. BMVA
37. Sattler T, Maddern W, Toft C, Torii A, Hammarstrand L, Stenborg E, Safari D, Sivic J, Pajdla T, Pollefeys M, Kahl F, Okutomi M (2017) Benchmarking 6DOF outdoor visual localization in changing conditions. arXiv preprint arXiv:1707.09092
38. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
39. Strecha C, Von Hansen W, Van Gool L, Fua P, Thoennessen U (2008) On benchmarking camera calibration and multi-view stereo for high resolution imagery. In: Proc. CVPR. IEEE, Anchorage. pp 1-8
40. Wu C (2013) Towards linear-time incremental structure from motion. In: Proc. 3DV. IEEE, Seattle. pp 127-134
41. Cui Z, Tan P (2015) Global structure-from-motion by similarity averaging. In: Proc. ICCV. IEEE, Santiago. pp 864-872
42. Magerand L, Del Bue A (2017) Practical projective structure from motion (P2SfM). In: Proc. ICCV. IEEE, Venice. pp 39-47
43. Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-scale direct monocular SLAM. In: Proc. ECCV. Springer, Cham. pp 834-849
44. Newcombe RA, Lovegrove SJ, Davison AJ (2011) DTAM: Dense tracking and mapping in real-time. In: Proc. ICCV. IEEE, Barcelona. pp 2320-2327
45. Furukawa Y, Hernández C (2015) Multi-view stereo: A tutorial. Found Trends Comput Graph Vis 9(1-2):1-148
46. Tola E, Lepetit V, Fua P (2010) DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE PAMI 32(5):815-830
47. Tuytelaars T (2010) Dense interest points. In: Proc. CVPR. IEEE, San Francisco. pp 2281-2288
48. Fischer P, Dosovitskiy A, Brox T (2014) Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769
49. Schonberger JL, Hardmeier H, Sattler T, Pollefeys M (2017) Comparative evaluation of hand-crafted and learned local features. In: Proc. CVPR. IEEE, Honolulu. pp 6959-6968
50. Simo-Serra E, Trulls E, Ferraz L, Kokkinos I, Fua P, Moreno-Noguer F (2015) Discriminative learning of deep convolutional feature point descriptors. In: Proc. ICCV. IEEE, Santiago. pp 118-126
51. Simonyan K, Vedaldi A, Zisserman A (2014) Learning local feature descriptors using convex optimisation. IEEE PAMI 36(8):1573-1585
52. Bursuc A, Tolias G, Jégou H (2015) Kernel local descriptors with implicit rotation matching. In: Proc. ACMM. ACM, New York. pp 595-598
53. Arandjelović R, Gronat P, Torii A, Pajdla T, Sivic J (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In: Proc. CVPR. IEEE, Las Vegas. pp 5297-5307
54. Radenović F, Tolias G, Chum O (2016) CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In: Proc. ECCV. Springer, Cham
55. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proc. CVPR. IEEE, Boston
56. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proc. CVPR. IEEE, Las Vegas. pp 770-778
57. Berkes P, Wiskott L (2006) On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Comput 18(8):1868-1895
58. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proc. ECCV. Springer, Cham. pp 818-833
59. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: Proc. CVPR. IEEE, Minneapolis
60. Vedaldi A, Lenc K (2015) MatConvNet - Convolutional neural networks for MATLAB. In: Proc. ACMM. ACM, New York
61. Douze M, Jégou H (2014) The Yael library. In: Proc. ACMM. ACM, New York. pp 687-690. https://doi.org/10.1145/2647868.2654892
