How Robots Make Maps— An Intro to SLAM (Simultaneous Localisation and Mapping)

The Startup - Medium

A big picture of odometry, localisation, mapping, feature matching and loop closure for non-roboticists

Photo by Dominik Scythe on Unsplash

Perfection or efficiency? — SLAM, SfM or VO

Even before we get started with different approaches in SLAM, let’s talk about the alternatives to SLAM: Structure from Motion (SfM) [4] and Visual Odometry (VO) [1]. All these fields address the same problem —for a collection of images, estimate where the camera was when each image was taken, so that the images tell a consistent story when they are reverse-projected back into 3D world space. They share many procedural steps, such as feature detection, feature matching, and optimising the estimated pose based on a consistency metric. However, the constraints they assume are slightly different.

Why do we have two eyes? — Stereo, Monocular, or Depth Camera

Even if we were able to localise the camera, it would be a struggle to reconstruct the surroundings without knowing the depth of each pixel of the image i.e. how far the surface is from the camera. Humans and most animals have two eyes, because that allows us to figure out where objects are relative to us using triangulation. Similarly, the depth can be inferred using stereo cameras, with some extra work of finding correspondences between the two images by feature matching. There also exist depth cameras, which provides the depth information in addition to RGB, either by measuring the time of flight or the pattern density of the Infra Red light emitted from the camera.

How do we find correspondences and consistency between frames? — Feature-based Method vs Direct Method

As we explore, we identify salient features such as landmarks and objects, which we use to match pieces of observations together and construct a consistent mental map. We now need to invent a convenient definition of “salient features” that our system can detect and match.

A limitation to the feature-based approach is that, once the features are extracted, all other information contained in the image is disregarded. This could be problematic when feature matching alone cannot offer robust reconstruction, e.g. in environments with too many or very few salient features, or with repetitive textures.

An alternative approach to feature-based matching is minimising the photometric (light intensity) error, which is called the direct method. Well-known examples are LSD-SLAM [9] and Direct Sparse Odometry [10]. These methods by-pass the computationally expensive process of feature extraction and matching, and can benefit from the rich information in the environment. However, it is sensitive to changes in lighting or large motions.

How do we adjust camera poses? — RANSAC, Iterative Closest Point

Now that we have identified some correspondences between successive frames, it is time to find the camera motion that best explain those matches. While something as simple as least squares would work well if all our matches are correct, we have to bare in mind that our correspondences are noisy and inaccurate. RANSAC [11] is a method widely used to estimate a transform from one set of data points to another. It randomly samples the minimum number of paired coordinates (Xᵢ, Yᵢ) required to estimate a transformation from set X to set Y. For every estimated transformation, the transformation is applied to all remaining data points in set X to see if the transformed points are close enough to their corresponding matches in set Y (these are called inliers). This process is repeated and the best transform that yielded a good number of inliers is kept as a final estimate.

Once we apply RANSAC, we have a fairly good estimate of the camera motion between two frames X and Y. Note though, that this was estimated only from the feature matches that we have, which doesn’t utilise the entire information obtainable from the images. We can further refine our estimate by reverse-projecting images into 3D for each frame as point clouds, and trying to align these point clouds together.

Iterative Closest Point (ICP) is a common technique in computer vision for aligning 3D points obtained by different frames. Using the estimate given by RANSAC as an initial transform estimate, ICP aligns the closest points in two point clouds, using a fast nearest neighbour search, and re-estimates the transformation to minimise the distances between the matches. This process is performed iteratively until convergence.

How do we correct the map? — Filter Methods vs Bundle Adjustment

As the robot moves around, we may want to adjust our previous estimates of the camera poses, especially when loop closing happens, i.e. when the robot revisits somewhere we’ve seen previously. To make the matter complicated, adjusting the camera pose will also mean amending the map we have built using that camera pose. Again, this is why localisation and mapping is a chicken and egg problem.

(a) Bundle Adjustment over all camera poses and features (b) Filter method (c) Key frame Bundle Adjustment. Image taken from [12]. T: camera trajectory, x: features

There are two main approaches to correcting the camera pose and map estimates — filter methods and bundle adjustment (BA) methods. The major difference is that, whereas bundle adjustment at its most basic form re-evaluates all the camera poses and feature poses in the map from scratch at every time step as a maximum likelihood estimate, filter methods represent its estimate of the camera and feature poses as a probability density function, allowing those estimates to be incrementally updated by taking new observations into account. Extended Karman Filters (EKF), which gives a Bayesian solution to the state estimate update given new observations, is typically used. A multivariate Gaussian is a common choice to model probabilities for the state estimate and the observation noise. Particle Filters, which do not assume the shape of the probability density function, is another option within the family of filter methods.

While filter methods seem to have an advantage over BA methods in terms of reusing previous estimates, there is a hidden cost. Because filter methods only retain the most recent camera pose estimate together with feature nodes in the map, when it comes to updating the map based on accumulated error, we have to compute at how every feature node affects every other feature node. This is captured by the covariance matrix of the multivariate Gaussian, which grows as more and more features get added to the map.

A more computationally efficient approach to BA would be to evaluate the optimal solution on a subset of past camera poses and features. Typically, key frames which have distinct features are chosen at regular intervals, and only these will be used to compute the optimal poses in order to reduce computational complexity. More techniques have been developed to reduce the computational complexity of real-time BA, such as Pose Graph Optimisation. For a more in-depth comparison, see [12].

When do we close the loop? — Place Recognition

As we observed at the beginning of this article, loop closure is key to ensure a global consistency of the map that we construct. This aspect also characterises SLAM from other methods such as Visual Odometry.

There are three ways to find out whether you came back to the same place. You can either compare the current image frame to previous frames (image-to-image), compare the current frame against the map (image-to-map), or compare the current location in the map to other regions in the map (map-to-map).

For image-to-image comparison, Bag of Words (BoW) is an effective way to quickly narrow down candidates for corresponding frames. Bag of Words is a rather lazy way of describing images — basically it only cares about what kind of local features appears in the image, but throws away all the high level spatial context. However, this is effective for quickly looking up images that may have been taken from the same place.

Challenges in SLAM

We have barely scraped the surface of research in SLAM. We have yet to discuss some of the typical challenges in SLAM and how some systems overcome them. The ultimate goal is to make SLAM handle environments with too many or very few salient features, large scale environments, dynamic environments (i.e. with lots of moving objects or people), non-smooth movements of the camera, occlusions, transparency and reflection, while meeting computation time, memory and sensory device constraints.

[1] K. Yousif et al. An Overview to Visual Odometry and Visual SLAM: Applications to Mobile Robotics (2015), Intelligent Industrial Systems

[2] J. Fuentes-Pacheco et al. Visual Simultaneous Localization and Mapping: A Survey (2015), Artificial Intelligence Review

[3] M. R. U. Saputra et al. Visual SLAM and Structure from Motion in Dynamic Environments: A Survey (2018), ACM Computing Surveys

[4] O. Ozyesil et al. A Survey of Structure from Motion (2017), Acta Numerica

[5] D. G. Lowe Distinctive Image Featuresfrom Scale-Invariant Keypoints (2014), International Journal of Computer Vision

[6] H. Bay et al. SURF: Speeded Up Robust Features (2006), ECCV

[7] E. Rublee et al. ORB: an efficient alternative to SIFT or SURF (2011), ICCV

[8] R. Mur-Artal et al. ORB-SLAM: A Versatile and Accurate Monocular SLAM System (2015), IEEE ROBIO

[9] J. Engel et al. LSD-SLAM: Large-ScaleDirect Monocular SLAM (2014), ECCV

[10] J. Engel et al. Direct Sparse Odometry (2018), IEEE Transactions on Pattern Analysis and Machine Intelligence

[11] M. Fischler et al. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography (1981), Communications with the ACM

[12] H. Strasdat et al. Visual SLAM: Why Filter? (2012), Image and Vision Computing

Related Articles