
Structure from motion

Nov 19, 2023

Huge thank you to Chahat Deep Singh’s article on SfM here. This post is basically a rehash of his article, as well as chapter 7 of Computer Vision: Algorithms and Applications by Richard Szeliski.

Introduction

Structure from motion, or SfM, is the process of simultaneously estimating a series of camera poses and the 3D structure of a scene using a sequence of 2D images. SfM has the following steps, each of which I’ll cover here:

  1. Feature matching and outlier rejection using RANSAC
  2. Fundamental matrix estimation
  3. Essential matrix estimation using the result from (2)
  4. Camera pose estimation using the result from (3)
  5. Using triangulation to check for chirality
  6. Projecting points into 3D

1. Feature matching, the fundamental matrix, and RANSAC

1.1. The fundamental matrix and epipolar geometry

The fundamental matrix $F \in \mathbb{R}^{3 \times 3}$ is a rank-2 matrix that relates the corresponding set of points between two images with different viewpoints. We use epipolar geometry, the math behind stereo imagery, to derive the fundamental matrix. Epipolar geometry is the intrinsic projective geometry between two views. It depends only on the camera parameters (the $K$ matrix) and the relative pose between the two views. In other words, epipolar geometry and $F$ are independent of the scene being captured.

1.2. Epipolar geometry

Consider a 3D point $X$ viewed as a 2D point $x$ in one image and as $x'$ in a second. Let the respective camera centers be $C, C'$.

The points $x, x', X$ (and also $C, C', X$) form a plane that we label $\pi$, i.e. $\vec{Cx} \cdot (\vec{CC'} \times \vec{C'x'}) = 0$.

[Figure: The epipolar plane $\pi$ (left) and the epipolar line $l'$ (right).]

Say we only know $x$, not $x'$. In other words, we only know that the 3D point $X$ corresponds to some image point $x$, but we don’t know its matching point in the second image. Since the correspondences are constrained along the epipolar plane $\pi$, the possible locations for $x'$ form an epipolar line $l'$. Knowing $l'$ is extremely useful, as it reduces the search space for $x'$ from the entire second camera image to a single line.

Some terminology:

  • A camera’s epipole is the point of intersection of the camera’s image plane and the line connecting the two cameras’ centers. See $e, e'$ above.
  • We call the line connecting the camera centers, $\vec{CC'}$, the baseline.
  • The epipolar plane is the plane containing both the baseline and our 3D point $X$.
  • The epipolar line for a given camera is the intersection of the epipolar plane with the camera’s image plane. All epipolar lines intersect at the epipole.
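To make the epipolar line concrete, here’s a minimal numpy sketch (the function names are mine, not from any standard API): given an estimate of $F$ (we’ll compute it in the next section) and a point $x$ in homogeneous pixel coordinates, the epipolar line in the second image is simply $l' = Fx$, and the epipoles are the null vectors of $F$ and $F^T$.

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F x in image 2 for a homogeneous point x in image 1.

    Returns the line coefficients (a, b, c) of a*u + b*v + c = 0, scaled so
    that |(a, b)| = 1 (point-to-line distances are then in pixels)."""
    l = F @ x
    return l / np.linalg.norm(l[:2])

def epipoles(F):
    """The epipoles are the null vectors of F and F^T: F e = 0, F^T e' = 0."""
    _, _, Vt = np.linalg.svd(F)
    e = Vt[-1]                       # epipole in image 1
    _, _, Vt2 = np.linalg.svd(F.T)
    e_prime = Vt2[-1]                # epipole in image 2
    return e / e[2], e_prime / e_prime[2]
```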

2. Computing the fundamental matrix

We can solve for $F$ by first applying the epipolar constraint:

$$x_i'^T F x_i = 0$$

for each correspondence $i \in \{1, \cdots, m\}$. We also call this the Longuet-Higgins equation or simply the correspondence condition. Since $F$ is $3 \times 3$, we can solve for it via a homogeneous linear system in nine unknowns (though $F$ has only seven degrees of freedom once the scale ambiguity and rank-2 constraint are accounted for). Expanding this yields:

$$\begin{bmatrix} x_i' & y_i' & 1 \end{bmatrix} \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} = 0$$

Given $m$ correspondences, we can rework the above into:

$$\begin{bmatrix} x_1x_1' & x_1y_1' & x_1 & y_1x_1' & y_1y_1' & y_1 & x_1' & y_1' & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_mx_m' & x_my_m' & x_m & y_mx_m' & y_my_m' & y_m & x_m' & y_m' & 1 \end{bmatrix} \begin{bmatrix} f_{11} \\ f_{21} \\ f_{31} \\ f_{12} \\ f_{22} \\ f_{32} \\ f_{13} \\ f_{23} \\ f_{33} \end{bmatrix} = 0$$

Since the above linear system has nine unknowns but $F$ is only defined up to scale, we require eight constraints, or point correspondences, to solve it. We therefore refer to this as the eight-point algorithm. We can solve the system using singular value decomposition, or SVD.

One last consideration: when solving in real-world systems, we’ll arrive at an approximated $F$ that includes a bit of noise, which results in a rank of 3, not 2 as in the perfect case. For the epipolar geometry to work properly, we need rank 2. To enforce this, we set the smallest singular value from the SVD to zero. This forces our null space to be non-empty, and therefore causes all of our epipolar lines to cross at a single point (our desired behavior). See below:

[Figure: If we don't properly constrain our rank to equal 2, then the epipolar lines fail to meet at a single point.]
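Here’s a rough numpy sketch of the eight-point algorithm as described above, including the rank-2 enforcement. The function name and array conventions are my own; real pipelines also normalize the pixel coordinates first (Hartley normalization) for numerical stability.

```python
import numpy as np

def estimate_fundamental_matrix(pts1, pts2):
    """Eight-point algorithm. pts1, pts2: (m, 2) arrays of corresponding
    pixel coordinates, m >= 8. Returns a rank-2 F with x2^T F x1 ~= 0."""
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    # Each correspondence contributes one row of A f = 0, with f ordered
    # (f11, f21, f31, f12, f22, f32, f13, f23, f33) as in the equation above.
    A = np.column_stack([x1 * x2, x1 * y2, x1,
                         y1 * x2, y1 * y2, y1,
                         x2, y2, np.ones_like(x1)])
    # The least-squares solution is the right singular vector of A with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3).T   # transpose: f was stacked column-major
    # Enforce rank 2 by zeroing the smallest singular value of F.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0
    return U @ np.diag(S) @ Vt
```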

2.1. Rejecting outliers with RANSAC

We don’t consider the process by which point correspondences between two images are found. We assume that they’re provided to us by some upstream algorithm, such as SIFT matching, that is imperfect and therefore includes a few bad results: outliers. We need to remove these outliers before we can properly reconstruct our scene in 3D.

Enter our hero, RANSAC. Given the full set of correspondences, RANSAC randomly selects 8 of them and estimates $F$ from that sample using SVD as outlined above. It then checks the “fit” of every correspondence against this $F$ via the epipolar constraint; correspondences that fit within some threshold are inliers. RANSAC repeats this sample-and-score process many times and keeps the $F$ (and its inlier set) with the greatest number of inliers, after which the outliers can be discarded and $F$ re-estimated from all inliers.
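A minimal sketch of that RANSAC loop, reusing the `estimate_fundamental_matrix` helper from the previous section. The algebraic residual and threshold here are simplistic placeholders; production code typically scores fit with a geometric error such as the Sampson distance.

```python
import numpy as np

def ransac_fundamental(pts1, pts2, n_iters=1000, threshold=0.01, rng=None):
    """Fit F with RANSAC: repeatedly fit F to a random sample of 8
    correspondences and keep the model that the most correspondences agree with."""
    if rng is None:
        rng = np.random.default_rng()
    m = len(pts1)
    hom1 = np.column_stack([pts1, np.ones(m)])   # homogeneous coordinates
    hom2 = np.column_stack([pts2, np.ones(m)])
    best_F, best_inliers = None, np.zeros(m, dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(m, size=8, replace=False)
        F = estimate_fundamental_matrix(pts1[sample], pts2[sample])
        # Algebraic residual |x'^T F x| for every correspondence.
        residuals = np.abs(np.einsum('ij,jk,ik->i', hom2, F, hom1))
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    # Refit on all inliers for a final, less noisy estimate (assumes >= 8 inliers).
    return estimate_fundamental_matrix(pts1[best_inliers], pts2[best_inliers]), best_inliers
```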

3. The essential matrix

We’ve estimated the fundamental matrix $F$, which relates each point $x$ in one image with its corresponding point $x'$ in a second image. But $F$ does not provide any “metric” context. In other words, it does not leverage the camera’s intrinsic parameters to get a true, physical measurement of the scene or the relative pose between the two images.

That’s the job of the essential matrix, which combines $F$ with our camera’s intrinsic matrix $K$:

$$E = K^T F K$$

As is the case with estimating $F$, our initial $E$ estimate may include some noise that causes its singular values to be unequal to $(1, 1, 0)$. We can correct this by applying:

$$E = U \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} V^T$$

where $U$ and $V^T$ come from the SVD of the initial estimate of $E$ (see Szeliski 7.2).
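In code, this is just a matrix product followed by an SVD projection. A minimal sketch, assuming both images share the same intrinsic matrix $K$:

```python
import numpy as np

def essential_from_fundamental(F, K):
    """Compute E = K^T F K, then force its singular values to (1, 1, 0)
    so that E is a valid essential matrix."""
    E = K.T @ F @ K
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```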

4. Estimating camera pose

When estimating the relative pose between two cameras from $E$, it turns out that there are actually four possible camera pose configurations (not just the two you might expect from the sign ambiguity in translation): $(C_n, R_n),\ n \in \{1, \dots, 4\}$, where $C$ is the camera center and $R \in SO(3)$ is the rotation matrix.

Let our essential matrix found earlier also be $E = UDV^T$, and let $W = R_{z,\pi/2}$, or:

$$W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Then the four possible configurations are:

  1. $C_1 = U(:,3),\ R_1 = UWV^T$
  2. $C_2 = -U(:,3),\ R_2 = UWV^T$
  3. $C_3 = U(:,3),\ R_3 = UW^TV^T$
  4. $C_4 = -U(:,3),\ R_4 = UW^TV^T$

Note that if $\det(R) = -1$, then we must correct the pose by setting $C = -C,\ R = -R$. In other words, $\det(R)$ should always equal 1.
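A sketch of the decomposition, following the four configurations and the $\det(R) = -1$ correction above (the function name is mine):

```python
import numpy as np

def poses_from_essential(E):
    """Return the four candidate (C, R) pairs encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R_a, R_b = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                     # third column of U: translation direction, up to sign
    candidates = []
    for C, R in [(t, R_a), (-t, R_a), (t, R_b), (-t, R_b)]:
        if np.linalg.det(R) < 0:    # a reflection is not a valid rotation; flip both
            C, R = -C, -R
        candidates.append((C, R))
    return candidates
```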

5. Checking for chirality using triangulation

As noted above, the essential matrix yields ambiguous camera poses: four candidates rather than a single unique pose. We resolve this ambiguity by checking the chirality condition for each pose.

Given a candidate pair of camera poses (the fixed first camera plus one of the four candidates for the second), triangulate all of the 3D points. Then check the sign of the depth $z$ for each point. If the depth is negative, that means that the points are being reconstructed behind the cameras, and so the two poses form an invalid combination.

Specifically, a 3D point $X$ is behind the camera iff $r_3(X - C) < 0$, where $r_3$ is the third row of the rotation matrix (the z axis of the camera). Why bother triangulating all points $X$ for a given pair of camera poses and correspondences, instead of just checking a condition for a single point and assuming it holds for the rest? Because correspondence noise often causes the chirality condition to fail for some points even in good pose pairs. So we say that the best camera configuration $(C, R, X)$ is the one that has the greatest number of points placed ahead of the camera.
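Here’s a rough sketch of that pose-disambiguation loop. It assumes a `triangulate()` helper like the one sketched in the next section, builds each camera’s projection matrix as $P = K[R \mid -RC]$, and fixes the first camera at the origin; all names are my own.

```python
import numpy as np

def count_in_front(C, R, X):
    """Count 3D points X (n, 3) with r3 . (X - C) > 0, i.e. lying in front
    of the camera with center C and rotation R."""
    r3 = R[2, :]                   # third row of R: the camera's z axis
    return int(np.sum((X - C) @ r3 > 0))

def pick_best_pose(candidates, pts1, pts2, K):
    """Pick the candidate (C, R) placing the most triangulated points in
    front of both cameras. Assumes a triangulate(pts1, pts2, P1, P2) -> (n, 3)
    helper, e.g. the linear-triangulation sketch in the next section."""
    C0, R0 = np.zeros(3), np.eye(3)                    # first camera at the origin
    P1 = K @ np.hstack([R0, -(R0 @ C0).reshape(3, 1)]) # P = K [R | -RC]
    best, best_count = None, -1
    for C, R in candidates:
        P2 = K @ np.hstack([R, -(R @ C).reshape(3, 1)])
        X = triangulate(pts1, pts2, P1, P2)            # (n, 3) 3D points
        count = count_in_front(C0, R0, X) + count_in_front(C, R, X)
        if count > best_count:
            best, best_count = (C, R), count
    return best
```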

6. Projecting points into 3D

Having settled on a camera pose, we’re nearly there: we now triangulate every correspondence into a 3D point. For each pixel coordinate pair from the two cameras, we convert the pixels into homogeneous coordinates $(u, v, 1)$ and $(u', v', 1)$. Then, supplied with projection matrices for both cameras, $P_1, P_2$ (whose rows we denote $p_i^\intercal$ and $p_i'^\intercal$ respectively), we form a matrix $A$ like so:

$$A = \begin{bmatrix} v p_3^\intercal - p_2^\intercal \\ p_1^\intercal - u p_3^\intercal \\ v' p_3'^\intercal - p_2'^\intercal \\ p_1'^\intercal - u' p_3'^\intercal \end{bmatrix}$$

To find the corresponding 3D point given the coordinate pair, we form the equation $AX = 0$ and solve for the homogeneous point $X$ using singular value decomposition. The SVD yields a matrix $V_h$ (i.e. $V^T$); our 3D point lies in its final row, divided by the last element to dehomogenize.

We repeat this, forming $A$ and solving for $X$ using SVD, for every pixel coordinate pair between the two images. Keep in mind that the projection matrices are both slightly noisy, and so the 3D points will have error accordingly.
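A minimal numpy sketch of this linear triangulation, building $A$ per correspondence and taking the last row of $V^T$ (what numpy calls `vh`); the function name and conventions are my own:

```python
import numpy as np

def triangulate(pts1, pts2, P1, P2):
    """Linear (DLT) triangulation. pts1, pts2: (n, 2) pixel coordinates,
    P1, P2: (3, 4) projection matrices. Returns (n, 3) 3D points."""
    points = []
    for (u, v), (u2, v2) in zip(pts1, pts2):
        # Rows of A follow the equation above; P[i] is the row p_{i+1}^T.
        A = np.vstack([v * P1[2] - P1[1],
                       P1[0] - u * P1[2],
                       v2 * P2[2] - P2[1],
                       P2[0] - u2 * P2[2]])
        # The homogeneous 3D point is the right singular vector with the
        # smallest singular value, i.e. the last row of V^T.
        _, _, Vt = np.linalg.svd(A)
        X_h = Vt[-1]
        points.append(X_h[:3] / X_h[3])   # dehomogenize
    return np.array(points)
```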

Recap

Structure from motion is the simultaneous estimation of:

  1. The relative transform of the camera at two (or more) distinct times, and
  2. The 3D structure of a scene captured by two (or more) images corresponding to each time.

Given a set of correspondences between the images, which we calculate beforehand with some other algorithm, we use a special kind of stereo math called epipolar geometry to constrain our search space and reject any noisy correspondences using RANSAC.

RANSAC repeatedly samples eight correspondences, fits a candidate fundamental matrix to each sample with the eight-point algorithm, and keeps the candidate that agrees with the most correspondences. The resulting fundamental matrix $F$ combines with our camera’s intrinsic parameters to form an essential matrix $E$. Finally, we use $E$ to recover the relative camera pose and triangulate our correspondences into 3D points.

© 2024 Will Heitman