A big picture of odometry, localisation, mapping, feature matching and loop closure for non-roboticists
Remember the last time you’ve visited somewhere new? If you are reading this in 2020, you might have to rack your brain for those happy days. It may have been a holiday trip to an exotic place. You are excited to explore the area, check out all the nice restaurants and attractions. What would you do?
Yes, you read a map! (…or so it used to be)
Life has become too simplified nowadays with the intelligent little devices in our pockets, but let’s just imagine we are back in the stone ages. What if your phone has died and your only hope is a tourist guide in your backpack?
The first challenge is to find where you are. If you don’t have a GPS, you can’t just follow the blue blinking dot. You will probably look for street names or distinct landmarks nearby, and eventually, locate yourself. In robotics terms, this is called localisation. Once you know where you are and you know where you are going, you’re all sorted.
Well, what if you don’t have a map to begin with?
In theory, we can always explore the area to figure out where everything is relative to each other and memorise it, constructing a mental map. Even if we don’t have a ruler and a protractor, we roughly know how long our strides are and what a 90 degrees’ turn looks like. As we move, we know how far we moved by identifying landmarks we have previously observed. Furthermore, as soon as we see new landmarks and scenes, we can add those to our mental map to expand our territory of known places.
Multiple things are happening here. We used both (i) the estimate of our motion dynamics, and (ii) triangulation (broadly speaking) from known landmarks. (i) tells us the change in our position i.e. odometry. If we have a good idea of where we were at previously, we can use the estimated odometry to figure out where we are now. Most robots operate on wheels so it is fairly easy to measure how much they’ve moved. Some robots even have sensors called inertial measurement unit (IMU) that measure their body motion. Even so, we cannot just rely on odometry alone to know where you are. That is like walking with your eyes closed. At some point, you have to open your eyes and correct for the accumulated error. That is what (ii) does. For (ii), you need a half-built map already to correct for your localisation. This is why localisation and mapping has to happen simultaneously.
Even if we take carefully measured strides and construct the map carefully, it is impossible to rule out all kinds of noise. After a couple of turns, the error accumulates and we are no longer certain of our location. How do we make a recovery then?
This is where loop closure comes into play. Assuming that buildings and trees don’t move, we know when we do a full loop and come back to the same place we have been before. We can then correct our mental map and adjust for the noise that had accumulated.
And now, you have explored the area and have constructed a mental map. If you ever need to visit the same location again, you’d be able to do that, using your mental map. You don’t even have to take the same route. You can take shortcuts because you already know the layout. This is what robots do too.
This is the core idea behind Simultaneous Localisation and Mapping, which is used very widely in any kind of robotics applications that require the robot to move around a new environment. It doesn’t have to be just robots — it can be used for smartphones and other devices with a camera too. Augmented Reality (AR), for instance, requires the virtual world to be mapped onto the real scenery with precision, so some advanced use cases would require SLAM as a part of the pipeline.
Unlike humans, different robots have different sensors and use cases, and there are many approaches to solving the same problem. That is why there are so many variations of SLAM systems out there. Here, I will try to present a high level picture of some of the major approaches and how they differ from each other. For a more detailed, comprehensive and accurate description, please refer to some of the survey papers [1, 2, 3].
Perfection or efficiency? — SLAM, SfM or VO
Even before we get started with different approaches in SLAM, let’s talk about the alternatives to SLAM: Structure from Motion (SfM)  and Visual Odometry (VO) . All these fields address the same problem —for a collection of images, estimate where the camera was when each image was taken, so that the images tell a consistent story when they are reverse-projected back into 3D world space. They share many procedural steps, such as feature detection, feature matching, and optimising the estimated pose based on a consistency metric. However, the constraints they assume are slightly different.
Structure from Motion is a problem of reconstructing the 3D structure of the environment from a collection of 2D images. The images could be ordered or unordered, hence it cannot rely on odometry. SLAM poses a more specific problem in which the images are captured sequentially by a camera that is moving through an environment, following some kind of trajectory. Some SfM algorithms focus more on accuracy and global consistency rather than time constraints, and can work at a very large scale if need be. For instance, SfM can be applied to reconstruct a subset of the globe in 3D from Google Street View images. Obviously, this takes a huge amount of time and resources, so such algorithms aren’t expected to run in real time. Hence, these are called offline methods. In contrast, the main purpose of SLAM is to aid the robot in its exploration and navigation, so real time execution is crucial. Hence, this falls under the category of online methods.
Visual Odometry simplifies the SLAM problem further, focussing more on computational and resource efficiency. Rather than ensuring global consistency of the map it creates like SLAM does, VO’s main concern is about local consistency. Its goal is to estimate the odometry (differences in camera poses between two images taken one after another) without having to rely on sensors such as IMU or ground truth odometry measurements. If we don’t care about global consistency, the loop closure step can be taken away, saving us some computation and making the system simpler.
Why do we have two eyes? — Stereo, Monocular, or Depth Camera
Even if we were able to localise the camera, it would be a struggle to reconstruct the surroundings without knowing the depth of each pixel of the image i.e. how far the surface is from the camera. Humans and most animals have two eyes, because that allows us to figure out where objects are relative to us using triangulation. Similarly, the depth can be inferred using stereo cameras, with some extra work of finding correspondences between the two images by feature matching. There also exist depth cameras, which provides the depth information in addition to RGB, either by measuring the time of flight or the pattern density of the Infra Red light emitted from the camera.
Even with just a single camera, the depth can be inferred by comparing frames taken at different camera positions. In this case, the scale of the reconstructed map is unknown unless the ground truth odometry is available.
Some other SLAM systems are developed for LASER or LiDAR scanners.
How do we find correspondences and consistency between frames? — Feature-based Method vs Direct Method
As we explore, we identify salient features such as landmarks and objects, which we use to match pieces of observations together and construct a consistent mental map. We now need to invent a convenient definition of “salient features” that our system can detect and match.
While landmarks such as buildings and sign posts work well for humans, it is much easier for machines to identify and match low level features such as corners, edges and blobs. More sophisticated feature definitions, together with detection algorithms and descriptors (a distinct feature representation) have been invented, such as Scale-invariant Feature Transform (SIFT) , Speeded Up Robust Features (SURF)  and Oriented FAST and Rotated BRIEF (ORB) [7, 8]. These features are designed to be robust to translation, rotation, variations in scale, view-point, lighting, etc.
A limitation to the feature-based approach is that, once the features are extracted, all other information contained in the image is disregarded. This could be problematic when feature matching alone cannot offer robust reconstruction, e.g. in environments with too many or very few salient features, or with repetitive textures.
An alternative approach to feature-based matching is minimising the photometric (light intensity) error, which is called the direct method. Well-known examples are LSD-SLAM  and Direct Sparse Odometry . These methods by-pass the computationally expensive process of feature extraction and matching, and can benefit from the rich information in the environment. However, it is sensitive to changes in lighting or large motions.
How do we adjust camera poses? — RANSAC, Iterative Closest Point
Now that we have identified some correspondences between successive frames, it is time to find the camera motion that best explain those matches. While something as simple as least squares would work well if all our matches are correct, we have to bare in mind that our correspondences are noisy and inaccurate. RANSAC  is a method widely used to estimate a transform from one set of data points to another. It randomly samples the minimum number of paired coordinates (Xᵢ, Yᵢ) required to estimate a transformation from set X to set Y. For every estimated transformation, the transformation is applied to all remaining data points in set X to see if the transformed points are close enough to their corresponding matches in set Y (these are called inliers). This process is repeated and the best transform that yielded a good number of inliers is kept as a final estimate.
Once we apply RANSAC, we have a fairly good estimate of the camera motion between two frames X and Y. Note though, that this was estimated only from the feature matches that we have, which doesn’t utilise the entire information obtainable from the images. We can further refine our estimate by reverse-projecting images into 3D for each frame as point clouds, and trying to align these point clouds together.
Iterative Closest Point (ICP) is a common technique in computer vision for aligning 3D points obtained by different frames. Using the estimate given by RANSAC as an initial transform estimate, ICP aligns the closest points in two point clouds, using a fast nearest neighbour search, and re-estimates the transformation to minimise the distances between the matches. This process is performed iteratively until convergence.
How do we correct the map? — Filter Methods vs Bundle Adjustment
As the robot moves around, we may want to adjust our previous estimates of the camera poses, especially when loop closing happens, i.e. when the robot revisits somewhere we’ve seen previously. To make the matter complicated, adjusting the camera pose will also mean amending the map we have built using that camera pose. Again, this is why localisation and mapping is a chicken and egg problem.
There are two main approaches to correcting the camera pose and map estimates — filter methods and bundle adjustment (BA) methods. The major difference is that, whereas bundle adjustment at its most basic form re-evaluates all the camera poses and feature poses in the map from scratch at every time step as a maximum likelihood estimate, filter methods represent its estimate of the camera and feature poses as a probability density function, allowing those estimates to be incrementally updated by taking new observations into account. Extended Karman Filters (EKF), which gives a Bayesian solution to the state estimate update given new observations, is typically used. A multivariate Gaussian is a common choice to model probabilities for the state estimate and the observation noise. Particle Filters, which do not assume the shape of the probability density function, is another option within the family of filter methods.
While filter methods seem to have an advantage over BA methods in terms of reusing previous estimates, there is a hidden cost. Because filter methods only retain the most recent camera pose estimate together with feature nodes in the map, when it comes to updating the map based on accumulated error, we have to compute at how every feature node affects every other feature node. This is captured by the covariance matrix of the multivariate Gaussian, which grows as more and more features get added to the map.
A more computationally efficient approach to BA would be to evaluate the optimal solution on a subset of past camera poses and features. Typically, key frames which have distinct features are chosen at regular intervals, and only these will be used to compute the optimal poses in order to reduce computational complexity. More techniques have been developed to reduce the computational complexity of real-time BA, such as Pose Graph Optimisation. For a more in-depth comparison, see .
When do we close the loop? — Place Recognition
As we observed at the beginning of this article, loop closure is key to ensure a global consistency of the map that we construct. This aspect also characterises SLAM from other methods such as Visual Odometry.
There are three ways to find out whether you came back to the same place. You can either compare the current image frame to previous frames (image-to-image), compare the current frame against the map (image-to-map), or compare the current location in the map to other regions in the map (map-to-map).
For image-to-image comparison, Bag of Words (BoW) is an effective way to quickly narrow down candidates for corresponding frames. Bag of Words is a rather lazy way of describing images — basically it only cares about what kind of local features appears in the image, but throws away all the high level spatial context. However, this is effective for quickly looking up images that may have been taken from the same place.
Challenges in SLAM
We have barely scraped the surface of research in SLAM. We have yet to discuss some of the typical challenges in SLAM and how some systems overcome them. The ultimate goal is to make SLAM handle environments with too many or very few salient features, large scale environments, dynamic environments (i.e. with lots of moving objects or people), non-smooth movements of the camera, occlusions, transparency and reflection, while meeting computation time, memory and sensory device constraints.
 K. Yousif et al. An Overview to Visual Odometry and Visual SLAM: Applications to Mobile Robotics (2015), Intelligent Industrial Systems
 J. Fuentes-Pacheco et al. Visual Simultaneous Localization and Mapping: A Survey (2015), Artificial Intelligence Review
 M. R. U. Saputra et al. Visual SLAM and Structure from Motion in Dynamic Environments: A Survey (2018), ACM Computing Surveys
 O. Ozyesil et al. A Survey of Structure from Motion (2017), Acta Numerica
 D. G. Lowe Distinctive Image Featuresfrom Scale-Invariant Keypoints (2014), International Journal of Computer Vision
 H. Bay et al. SURF: Speeded Up Robust Features (2006), ECCV
 E. Rublee et al. ORB: an efficient alternative to SIFT or SURF (2011), ICCV
 R. Mur-Artal et al. ORB-SLAM: A Versatile and Accurate Monocular SLAM System (2015), IEEE ROBIO
 J. Engel et al. LSD-SLAM: Large-ScaleDirect Monocular SLAM (2014), ECCV
 J. Engel et al. Direct Sparse Odometry (2018), IEEE Transactions on Pattern Analysis and Machine Intelligence
 M. Fischler et al. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography (1981), Communications with the ACM
 H. Strasdat et al. Visual SLAM: Why Filter? (2012), Image and Vision Computing