Course Note: Tübingen-CVML-Image formation 1

#computer_vision

Image Formation 1: Primitives and Transformations

Credits: Tübingen Machine Learning | Computer Vision - Andreas Geiger
Computer Vision - Lecture 2.1 (Image Formation: Primitives and Transformations) - YouTube

1.1 Primitives and Transformations

Geometric primitives are the basic building blocks used to describe 3D shapes
Introduction of points, lines and planes
Introduction of the most basic transformations

2D points

2D points can be written in inhomogeneous coordinates as

x = (\begin{matrix} x \\ y \end{matrix}) \in R^{2}

or in homogeneous coordinates as

\tilde{x} = (\begin{matrix} \tilde{x} \\ \tilde{y} \\ \tilde{w} \end{matrix}) \in P^{2}

where $P^{2} = R^{3} \setminus \{(0, 0, 0)\}$ is called projective space.

An inhomogeneous vector $x$ is converted to a homogeneous vector $\tilde{x}$ as follows:

\tilde{x} = (\begin{matrix} \tilde{x} \\ \tilde{y} \\ \tilde{w} \end{matrix}) = (\begin{matrix} x \\ y \\ 1 \end{matrix}) = (\begin{matrix} x \\ 1 \end{matrix}) = \bar{x}

with augmented vector $\bar{x}$ . To convert in the opposite direction, we divide by $\tilde{w}$ .

\bar{x} = (\begin{matrix} x \\ 1 \end{matrix}) = (\begin{matrix} x \\ y \\ 1 \end{matrix}) = \frac{1}{\tilde{w}} \tilde{x} = \frac{1}{\tilde{w}} (\begin{matrix} \tilde{x} \\ \tilde{y} \\ \tilde{w} \end{matrix}) = (\begin{matrix} \tilde{x} / \tilde{w} \\ \tilde{y} / \tilde{w} \\ 1 \end{matrix})

Homogeneous points whose last element is $\tilde{w} = 0$ are called ideal points or points at infinity. These points can't be represented with inhomogeneous coordinates.
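The two conversions above can be sketched in NumPy as follows. This is a minimal illustration; the helper names `to_homogeneous` and `from_homogeneous` and the tolerance `eps` are assumptions made for this note, not part of the lecture.

```python
import numpy as np

def to_homogeneous(x):
    """Append a 1 to an inhomogeneous point to obtain the augmented vector."""
    x = np.asarray(x, dtype=float)
    return np.append(x, 1.0)

def from_homogeneous(x_tilde, eps=1e-12):
    """Divide by the last element w; ideal points (w = 0) have no inhomogeneous form."""
    x_tilde = np.asarray(x_tilde, dtype=float)
    w = x_tilde[-1]
    if abs(w) < eps:
        raise ValueError("ideal point (w = 0) cannot be converted")
    return x_tilde[:-1] / w

x_bar = to_homogeneous([2.0, 3.0])        # augmented vector (2, 3, 1)
x_back = from_homogeneous(4.0 * x_bar)    # any non-zero scaling maps back to (2, 3)
```

Scaling a homogeneous vector by any non-zero factor leaves the represented point unchanged, which is why `x_back` equals the original point.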

2D lines

2D lines can also be expressed using homogeneous coordinates $\tilde{l} = (a, b, c)^{⊤}$ :

\{\bar{x} | {\tilde{l}}^{⊤} \bar{x} = 0\} \Leftrightarrow \{x, y | a x + b y + c = 0\}

We can normalize $\tilde{l}$ so that $\tilde{l} = (n_{x}, n_{y}, d)^{⊤} = (n, d)^{⊤}$ with $∥ n ∥_{2} = 1$ . In this case, $n$ is the normal vector perpendicular to the line and $d$ is its distance to the origin.

An exception is the line at infinity ${\tilde{l}}_{\infty} = (0, 0, 1)^{⊤}$ which passes through all ideal points.

Cross Product

Cross product expressed as the product of a skew-symmetric matrix and a vector:

a \times b = [a]_{\times} b = [\begin{matrix} 0 & - a_{3} & a_{2} \\ a_{3} & 0 & - a_{1} \\ - a_{2} & a_{1} & 0 \end{matrix}] (\begin{matrix} b_{1} \\ b_{2} \\ b_{3} \end{matrix}) = (\begin{matrix} a_{2} b_{3} - a_{3} b_{2} \\ a_{3} b_{1} - a_{1} b_{3} \\ a_{1} b_{2} - a_{2} b_{1} \end{matrix})

Remark: Square brackets denote matrices.
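As a quick sanity check, the skew-symmetric form can be compared against NumPy's built-in cross product. The helper name `skew` is an assumption for this sketch; its third row is $(-a_{2}, a_{1}, 0)$.

```python
import numpy as np

def skew(a):
    """Return the skew-symmetric matrix [a]_x such that skew(a) @ b == a x b."""
    a1, a2, a3 = a
    return np.array([[0.0, -a3,  a2],
                     [ a3, 0.0, -a1],
                     [-a2,  a1, 0.0]])

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

via_matrix = skew(a) @ b     # matrix-vector form
via_numpy = np.cross(a, b)   # reference result, (-3, 6, -3)
```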

2D Line Arithmetic

In homogeneous coordinates, the intersection of two lines is given by:

\tilde{x} = {\tilde{l}}_{1} \times {\tilde{l}}_{2}

Similarly, the line joining two points can be compactly written as:

\tilde{l} = {\bar{x}}_{1} \times {\bar{x}}_{2}

The symbol $\times$ denotes the cross product.
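A small numeric example of both operations, assuming lines are stored as NumPy arrays $(a, b, c)$ (the variable names are illustrative only):

```python
import numpy as np

# Lines as homogeneous vectors (a, b, c) with a*x + b*y + c = 0
l1 = np.array([1.0, 0.0, -1.0])   # x = 1
l2 = np.array([0.0, 1.0, -2.0])   # y = 2
l3 = np.array([1.0, 0.0, -3.0])   # x = 3, parallel to l1

# Intersection of two lines: cross product of the line vectors
x_tilde = np.cross(l1, l2)        # (1, 2, 1), i.e. the point (1, 2)

# Line joining two points: cross product of the augmented points
p1 = np.array([0.0, 0.0, 1.0])
p2 = np.array([1.0, 1.0, 1.0])
l_join = np.cross(p1, p2)         # (-1, 1, 0), i.e. the line y = x

# Parallel lines meet in an ideal point (last element is 0)
ideal = np.cross(l1, l3)          # (0, 2, 0)
```

The last case shows why homogeneous coordinates are convenient here: parallel lines need no special handling, they simply intersect in a point at infinity.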

2D Conics

More complex algebraic objects can be represented using polynomial homogeneous equations. For example, conic sections (arising as the intersection of a plane and a 3D cone) can be written using quadric equations:

\{\bar{x} | {\bar{x}}^{⊤} Q \bar{x} = 0\}
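For instance, the unit circle $x^{2} + y^{2} - 1 = 0$ corresponds to $Q = \text{diag}(1, 1, -1)$. The following sketch evaluates the quadric form for a few points; the function name `conic_residual` is an assumption for illustration.

```python
import numpy as np

# Unit circle x^2 + y^2 - 1 = 0 written as a conic matrix
Q = np.diag([1.0, 1.0, -1.0])

def conic_residual(Q, x):
    """Evaluate x_bar^T Q x_bar for an inhomogeneous 2D point x."""
    x_bar = np.array([x[0], x[1], 1.0])
    return x_bar @ Q @ x_bar

on_circle = conic_residual(Q, (0.6, 0.8))   # approximately 0: the point lies on the conic
at_origin = conic_residual(Q, (0.0, 0.0))   # -1: the point does not lie on the conic
```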

3D Points

3D points can be written in inhomogeneous coordinates as

x = (\begin{matrix} x \\ y \\ z \end{matrix}) \in R^{3}

or in homogeneous coordinates as

\tilde{x} = (\begin{matrix} \tilde{x} \\ \tilde{y} \\ \tilde{z} \\ \tilde{w} \end{matrix}) \in P^{3}

with projective space $P^{3} = R^{4} \setminus \{(0, 0, 0, 0)\}$ .

3D Planes

3D planes can also be represented as homogeneous coordinates $\tilde{m} = (a, b, c, d)^{⊤}$ :

\{\bar{x} | {\tilde{m}}^{⊤} \bar{x} = 0\} \Leftrightarrow \{x, y, z | a x + b y + c z + d = 0\}

Again, we can normalize $\tilde{m}$ so that $\tilde{m} = (n_{x}, n_{y}, n_{z}, d)^{⊤} = (n, d)^{⊤}$ with $∥ n ∥_{2} = 1$ . In this case, $n$ is the normal perpendicular to the plane and $d$ is its distance to the origin.

An exception is the plane at infinity ${\tilde{m}}_{\infty} = (0, 0, 0, 1)^{⊤}$ which passes through all ideal points (=points at infinity) for which $\tilde{w} = 0$ .

3D lines

3D lines are less elegant than either 2D lines or 3D planes. One possible representation is to express points on a line as a linear combination of two points $p$ and $q$ on the line:

\{x | x = (1 - λ) p + λ q \land λ \in R\}

However, this representation uses 6 parameters for 4 degrees of freedom.
Alternative minimal representations are the two-plane parameterization or Pluecker coordinates. See Szeliski, Chapter 2.1.

3D Quadrics

The 3D analog of 2D conics is a quadric surface:

\{\bar{x} | {\bar{x}}^{⊤} Q \bar{x} = 0\}

Quadrics are useful in the study of multi-view geometry and also serve as modeling primitives (spheres, ellipsoids, cylinders).

2D Transformations

Translation: (2D Translation of the Input, 2 DF)

x^{'} = x + t \Leftrightarrow {\bar{x}}^{'} = [\begin{matrix} I & t \\ 0^{⊤} & 1 \end{matrix}] \bar{x}

Using homogeneous representations allows us to chain and invert transformations
Augmented vectors $\bar{x}$ can always be replaced by general homogeneous ones $\tilde{x}$

Euclidean: (2D Translation + 2D Rotation, 3 DF)

x^{'} = R x + t \Leftrightarrow {\bar{x}}^{'} = [\begin{matrix} R & t \\ 0^{⊤} & 1 \end{matrix}] \bar{x}

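$R \in S O (2)$ is an orthonormal rotation matrix with $R R^{⊤} = I$ and $\det (R) = 1$
Euclidean transformations preserve Euclidean distances

Similarity: (2D Translation + Scaled 2D Rotation, 4 DF)

x^{'} = s R x + t \Leftrightarrow {\bar{x}}^{'} = [\begin{matrix} s R & t \\ 0^{⊤} & 1 \end{matrix}] \bar{x}

$R \in S O (2)$ is a rotation matrix and $s$ is an arbitrary scale factor
The similarity transform preserves angles between lines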

Affine: (2D Linear Transformation, 6 DF)

x^{'} = A x + t \Leftrightarrow {\bar{x}}^{'} = [\begin{matrix} A & t \\ 0^{⊤} & 1 \end{matrix}] \bar{x}

$A \in R^{2 \times 2}$ is an arbitrary $2 \times 2$ matrix
Parallel lines remain parallel under affine transformations

Perspective: (Homography, 8 DF)

{\tilde{x}}^{'} = \tilde{H} \tilde{x} \quad (\bar{x} = \frac{1}{\tilde{w}} \tilde{x})

$\tilde{H} \in R^{3 \times 3}$ is an arbitrary homogeneous $3 \times 3$ matrix (specified up to scale)
Perspective transformations preserve straight lines
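The following sketch builds a Euclidean transform as a $3 \times 3$ matrix, applies it to a point, inverts it, and applies a simple projective transform. The helper names `euclidean_2d` and `apply_homography` are assumptions for this note.

```python
import numpy as np

def euclidean_2d(theta, t):
    """3x3 homogeneous matrix [R t; 0^T 1] for a rotation by theta and translation t."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(3)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 2] = t
    return T

def apply_homography(H, x):
    """Apply a 3x3 homogeneous transform to an inhomogeneous 2D point."""
    x_tilde = H @ np.array([x[0], x[1], 1.0])
    return x_tilde[:2] / x_tilde[2]   # normalize by w

# Rotate (1, 0) by 90 degrees to (0, 1), then translate by (1, 0) to reach (1, 1)
T = euclidean_2d(np.pi / 2, [1.0, 0.0])
p = apply_homography(T, (1.0, 0.0))

# Inverting the matrix inverts the transformation
p_back = apply_homography(np.linalg.inv(T), p)

# A projective example: the last row makes w depend on x
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]])
q = apply_homography(H, (2.0, 1.0))   # w = 0.5 * 2 + 1 = 2, so q = (1, 0.5)
```

Because all transformations share the same $3 \times 3$ form, chaining them is a matrix product and inversion is a matrix inverse.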

2D Transformations on Co-Vectors

Considering any perspective 2D transformation

{\tilde{x}}^{'} = \tilde{H} \tilde{x}

the transformed 2D line equation is given by:

{\tilde{l}}^{' ⊤} {\tilde{x}}^{'} = {\tilde{l}}^{' ⊤} \tilde{H} \tilde{x} = ({\tilde{H}}^{⊤} {\tilde{l}}^{'})^{⊤} \tilde{x} = {\tilde{l}}^{⊤} \tilde{x} = 0

Therefore, we have:

{\tilde{l}}^{'} = {\tilde{H}}^{- ⊤} \tilde{l}

Thus, the action of a projective transformation on a co-vector such as a 2D line or 3D normal can be represented by the transposed inverse of the matrix.
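A short numeric check of this rule, assuming an arbitrary invertible example matrix `H` (chosen here for illustration only): a point on a line stays on the transformed line when the line is mapped with the transposed inverse.

```python
import numpy as np

# Example transform (affine here, but any invertible 3x3 matrix works)
H = np.array([[2.0, 0.0,  1.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0,  1.0]])

l = np.array([1.0, -1.0, 0.0])   # line y = x
x = np.array([3.0, 3.0, 1.0])    # point (3, 3) on that line

x_prime = H @ x                       # points transform with H
l_prime = np.linalg.inv(H).T @ l      # lines transform with H^{-T}

incidence_before = l @ x              # 0: the point lies on the line
incidence_after = l_prime @ x_prime   # still 0 after the transformation
```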

Overview of 2D Transformation

Transformation	Matrix	DF	Preserves
translation	${[\begin{matrix} I & t \end{matrix}]}_{2 \times 3}$	2	orientation
rigid	${[\begin{matrix} R & t \end{matrix}]}_{2 \times 3}$	3	lengths
similarity	${[\begin{matrix} s R & t \end{matrix}]}_{2 \times 3}$	4	angles
affine	${[\begin{matrix} A \end{matrix}]}_{2 \times 3}$	6	parallelism
projective	${[\begin{matrix} \tilde{H} \end{matrix}]}_{3 \times 3}$	8	straight lines

Transformations form a nested set of groups
Interpret as restricted $3 \times 3$ matrices operating on 2D homogeneous coordinates
Each transformation also preserves the properties listed in the rows below it

Overview of 3D Transformation

Transformation	Matrix	DF	Preserves
translation	${[\begin{matrix} I & t \end{matrix}]}_{3 \times 4}$	3	orientation
rigid	${[\begin{matrix} R & t \end{matrix}]}_{3 \times 4}$	6	lengths
similarity	${[\begin{matrix} s R & t \end{matrix}]}_{3 \times 4}$	7	angles
affine	${[\begin{matrix} A \end{matrix}]}_{3 \times 4}$	12	parallelism
projective	${[\begin{matrix} \tilde{H} \end{matrix}]}_{4 \times 4}$	15	straight lines

3D transformations are defined analogously to 2D transformations
$3 \times 4$ matrices are extended with a fourth $[\begin{matrix} 0^{⊤} & 1 \end{matrix}]$ row for homogeneous transforms
Each transformation also preserves the properties listed in the rows below it

Direct Linear Transform for Homography Estimation

Q: How can we estimate a homography from a set of 2D correspondences?
Let $X = \{{\tilde{x}}_{i}, {\tilde{x}}_{i}^{'}\}_{i = 1}^{N}$ denote a set of $N$ 2D-to-2D correspondences related by ${\tilde{x}}_{i}^{'} = \tilde{H} {\tilde{x}}_{i}$ . As the correspondence vectors are homogeneous, ${\tilde{x}}_{i}^{'}$ and $\tilde{H} {\tilde{x}}_{i}$ have the same direction but may differ in magnitude. Thus, the equation above can be expressed as ${\tilde{x}}_{i}^{'} \times \tilde{H} {\tilde{x}}_{i} = 0$ .
Using ${\tilde{h}}_{k}^{⊤}$ to denote the $k$-th row of $\tilde{H}$ , this can be rewritten as a linear equation in $\tilde{h}$ :

\underbrace{[\begin{matrix} 0^{⊤} & - {\tilde{w}}_{i}^{'} {\tilde{x}}_{i}^{⊤} & {\tilde{y}}_{i}^{'} {\tilde{x}}_{i}^{⊤} \\ {\tilde{w}}_{i}^{'} {\tilde{x}}_{i}^{⊤} & 0^{⊤} & - {\tilde{x}}_{i}^{'} {\tilde{x}}_{i}^{⊤} \\ - {\tilde{y}}_{i}^{'} {\tilde{x}}_{i}^{⊤} & {\tilde{x}}_{i}^{'} {\tilde{x}}_{i}^{⊤} & 0^{⊤} \end{matrix}]}_{A_{i}} \underbrace{[\begin{matrix} {\tilde{h}}_{1} \\ {\tilde{h}}_{2} \\ {\tilde{h}}_{3} \end{matrix}]}_{\tilde{h}} = 0

Only two of the three rows of $A_{i}$ are linearly independent, so each point correspondence yields two equations. Stacking all equations into a $2 N \times 9$ dimensional matrix $A$ leads to the following constrained least squares problem:

{\tilde{h}}^{*} = \operatorname{argmin}_{\tilde{h}} ∥ A \tilde{h} ∥_{2}^{2} + λ (∥ \tilde{h} ∥_{2}^{2} - 1) = \operatorname{argmin}_{\tilde{h}} {\tilde{h}}^{⊤} A^{⊤} A \tilde{h} + λ ({\tilde{h}}^{⊤} \tilde{h} - 1)

where we have fixed $∥ \tilde{h} ∥_{2}^{2} = 1$ as $\tilde{H}$ is homogeneous (i.e., defined only up to scale) and the trivial solution $\tilde{h} = 0$ is not of interest. The solution to the above optimization problem is the right singular vector corresponding to the smallest singular value of $A$ (i.e., the last column of $V$ when decomposing $A = U D V^{⊤}$ , see also Deep Learning lecture 11.2). The resulting algorithm is called Direct Linear Transformation.
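A minimal NumPy sketch of the algorithm follows. It assumes input points are given in inhomogeneous form (so ${\tilde{w}}_{i}^{'} = 1$), uses the first two rows of each $A_{i}$, and omits the coordinate normalization that is usually recommended for noisy data. The function names and the test homography `H_true` are illustrative assumptions.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (up to scale) from N >= 4 correspondences src[i] -> dst[i], each (x, y)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    zero = np.zeros(3)
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        x_t = np.array([x, y, 1.0])   # homogeneous source point
        wp = 1.0                      # w' of the augmented target point
        # First two rows of A_i (the third is linearly dependent)
        rows.append(np.concatenate([zero, -wp * x_t, yp * x_t]))
        rows.append(np.concatenate([wp * x_t, zero, -xp * x_t]))
    A = np.stack(rows)                # shape (2N, 9)
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]                        # right singular vector of the smallest singular value
    return h.reshape(3, 3)            # rows of H are h_1, h_2, h_3

def apply_h(H, pts):
    """Apply H to an (N, 2) array of inhomogeneous points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Synthetic check with a known homography and noise-free correspondences
H_true = np.array([[1.2, 0.1, 3.0],
                   [-0.2, 0.9, 1.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.2]])
dst = apply_h(H_true, src)

H_est = dlt_homography(src, dst)
H_est = H_est / H_est[2, 2]           # fix the free scale for comparison
```

Since the estimate is only defined up to scale, it is rescaled so that its bottom-right entry is 1 before comparing it with `H_true`.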

1.2 Geometric Image Formation

Origins of the Pinhole Camera

In a physical pinhole camera the image is projected upside down onto the image plane, which is located behind the focal point
When modeling perspective projection, we assume the image plane in front
Both models are equivalent, with appropriate change of image coordinates

Projection Models

Orthographic Projection
Perspective Projection

Orthographic Projection

Orthographic projection of a 3D point $x_{c} \in R^{3}$ to pixel coordinates $x_{s} \in R^{2}$

The x and y axes of the camera and image coordinate systems are shared
Light rays are parallel to the z-axis of the camera coordinate system
During projection, the z-coordinate is dropped, x and y remain the same

An orthographic projection simply drops the z component of the 3D point in camera coordinates $x_{c}$ to obtain the corresponding 2D point on the image plane (=screen) $x_{s}$ .

x_{s} = [\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{matrix}] x_{c} \Leftrightarrow {\bar{x}}_{s} = [\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}] {\bar{x}}_{c}

Orthography is exact for telecentric lenses and an approximation for telephoto lenses. After projection, the distance of the 3D point from the image can't be recovered.

In practice, we usually use scaled orthography:

x_{s} = [\begin{matrix} s & 0 & 0 \\ 0 & s & 0 \end{matrix}] x_{c} \Leftrightarrow {\bar{x}}_{s} = [\begin{matrix} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}] {\bar{x}}_{c}

Here, the unit for $s$ is px/m or px/mm to convert metric 3D points into pixels.
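As a small illustration, scaled orthography is a single $2 \times 3$ matrix product. The function name `project_orthographic` and the scale of 100 px/m are assumptions for this sketch.

```python
import numpy as np

def project_orthographic(x_c, s=1.0):
    """Scaled orthographic projection: drop z and scale x, y by s (e.g. in px/m)."""
    P = np.array([[s, 0.0, 0.0],
                  [0.0, s, 0.0]])
    return P @ np.asarray(x_c, dtype=float)

near = project_orthographic([1.0, 2.0, 5.0], s=100.0)    # (100, 200)
far = project_orthographic([1.0, 2.0, 50.0], s=100.0)    # same pixel, depth is lost
```

Both points land on the same pixel, which illustrates that depth cannot be recovered after the projection.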

Perspective Projection

Perspective projection of a 3D point $x_{c} \in R^{3}$ to pixel coordinates $x_{s} \in R^{2}$

The light ray passes through the camera center, the pixel $x_{s}$ and the point $x_{c}$
Convention: the principal axis (orthogonal to image plane) aligns with the z-axis

In perspective projection, 3D points in camera coordinates are mapped to the image plane by dividing by their z component and multiplying with the focal length:

(\begin{matrix} x_{s} \\ y_{s} \end{matrix}) = (\begin{matrix} f x_{c} / z_{c} \\ f y_{c} / z_{c} \end{matrix}) \Leftrightarrow {\tilde{x}}_{s} = [\begin{matrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{matrix}] {\bar{x}}_{c}

Note that this projection is linear when using homogeneous coordinates. After the projection, it is not possible to recover the distance of the 3D point from the image.

To ensure positive pixel coordinates, a principal point offset c is usually added
This moves the image coordinate system to the corner of the image plane

The complete perspective projection model is given by:

(\begin{matrix} x_{s} \\ y_{s} \end{matrix}) = (\begin{matrix} f_{x} x_{c} / z_{c} + s y_{c} / z_{c} + c_{x} \\ f_{y} y_{c} / z_{c} + c_{y} \end{matrix}) \Leftrightarrow {\tilde{x}}_{s} = [\begin{matrix} f_{x} & s & c_{x} & 0 \\ 0 & f_{y} & c_{y} & 0 \\ 0 & 0 & 1 & 0 \end{matrix}] {\bar{x}}_{c}

The left $3 \times 3$ submatrix of the projection matrix is called calibration matrix $K$
The parameters of $K$ are called camera intrinsics (as opposed to extrinsic pose)
Here, $f_{x}$ and $f_{y}$ are independent, allowing for different pixel aspect ratios
The skew $s$ arises due to the sensor not being mounted perpendicular to the optical axis
In practice, we often set $f_{x} = f_{y}$ and $s = 0$ , but model $c = (c_{x}, c_{y})^{⊤}$
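A minimal sketch of the complete model with $f_{x} = f_{y}$ and $s = 0$. The intrinsics (focal length 500 px, principal point (320, 240)) are made-up example values, and `project_perspective` is a hypothetical helper name.

```python
import numpy as np

# Example calibration matrix (values are illustrative assumptions)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project_perspective(K, x_c):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x_tilde = K @ np.asarray(x_c, dtype=float)   # homogeneous pixel (linear step)
    return x_tilde[:2] / x_tilde[2]              # divide by depth z_c

x_s = project_perspective(K, [0.2, -0.1, 2.0])       # (370, 215)
x_s_far = project_perspective(K, [0.4, -0.2, 4.0])   # same ray, same pixel
```

The second point lies on the same ray through the camera center, so it projects to the same pixel.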

Chaining Transformations

Let $K$ be the calibration matrix (intrinsics) and $[R | t]$ the camera pose (extrinsics).
We chain both transformations to project a point in world coordinates to the image:

{\tilde{x}}_{s} = [\begin{matrix} K & 0 \end{matrix}] {\bar{x}}_{c} = [\begin{matrix} K & 0 \end{matrix}] [\begin{matrix} R & t \\ 0^{⊤} & 1 \end{matrix}] {\bar{x}}_{w} = K [\begin{matrix} R & t \end{matrix}] {\bar{x}}_{w} = P {\bar{x}}_{w}
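The chain can be checked numerically by comparing the two-step projection (world to camera, then camera to image) against the precomputed $3 \times 4$ matrix $P = K [R \; t]$. The intrinsics and pose below are arbitrary example values.

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])      # example intrinsics (assumed values)
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])           # 90 degree rotation about the z-axis
t = np.array([0.0, 0.0, 4.0])              # example translation

P = K @ np.hstack([R, t[:, None]])         # 3x4 projection matrix K [R | t]

x_w = np.array([1.0, 0.0, 0.0])            # point in world coordinates
x_w_bar = np.append(x_w, 1.0)              # augmented vector

# One step: apply the precomputed P, then normalize
x_s_tilde = P @ x_w_bar
x_s = x_s_tilde[:2] / x_s_tilde[2]         # (320, 365)

# Two steps: extrinsics first, then intrinsics
x_c = R @ x_w + t                          # (0, 1, 4) in camera coordinates
x_s_two_step = (K @ x_c)[:2] / x_c[2]
```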

Full Rank Representation

It is sometimes preferable to use a full rank $4 \times 4$ projection matrix:

{\tilde{x}}_{s} = [\begin{matrix} K & 0 \\ 0^{⊤} & 1 \end{matrix}] [\begin{matrix} R & t \\ 0^{⊤} & 1 \end{matrix}] {\bar{x}}_{w} = \tilde{P} {\bar{x}}_{w}

Now, the homogeneous vector ${\tilde{x}}_{s}$ is a 4D vector and must be normalized w.r.t. its 3rd entry to obtain inhomogeneous image pixels:

{\bar{x}}_{s} = {\tilde{x}}_{s} / z_{s} = (x_{s} / z_{s}, y_{s} / z_{s}, 1, 1 / z_{s})^{⊤}

Note that the 4th component of the inhomogeneous 4D vector is the inverse depth. If the inverse depth is known, a 3D point can be retrieved from its pixel coordinates via ${\tilde{x}}_{w} = {\tilde{P}}^{- 1} {\bar{x}}_{s}$ and subsequent normalization of ${\tilde{x}}_{w}$ w.r.t. its 4th entry.
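The round trip (project with the $4 \times 4$ matrix, then back-project using the inverse applied to the normalized 4D pixel vector) can be sketched as follows. The intrinsics and pose are arbitrary example values.

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])   # example intrinsics (assumed values)
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])        # example rotation
t = np.array([0.0, 0.0, 4.0])           # example translation

# Full rank 4x4 matrices [K 0; 0^T 1] and [R t; 0^T 1]
K4 = np.eye(4)
K4[:3, :3] = K
T4 = np.eye(4)
T4[:3, :3] = R
T4[:3, 3] = t
P_tilde = K4 @ T4

x_w_bar = np.array([1.0, 0.0, 0.0, 1.0])    # augmented world point

# Forward projection and normalization w.r.t. the 3rd entry
x_s_tilde = P_tilde @ x_w_bar               # (1280, 1460, 4, 1)
x_s_bar = x_s_tilde / x_s_tilde[2]          # (320, 365, 1, 1/4): pixel, 1, inverse depth

# Back-projection: invert P_tilde, then normalize w.r.t. the 4th entry
x_w_tilde = np.linalg.inv(P_tilde) @ x_s_bar
x_w_rec = x_w_tilde[:3] / x_w_tilde[3]      # recovers (1, 0, 0)
```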

Lens Distortion

The assumption of linear projection is violated in practice due to the properties of the camera lens, which introduces distortions. Both radial and tangential distortion effects can be modeled relatively easily: Let $x = x_{c} / z_{c}, y = y_{c} / z_{c}$ and $r^{2} = x^{2} + y^{2}$ . The distorted point is obtained as:

x^{'} = \underbrace{(1 + k_{1} r^{2} + k_{2} r^{4})}_{\text{Radial Distortion}} (\begin{matrix} x \\ y \end{matrix}) + \underbrace{(\begin{matrix} 2 k_{3} x y + k_{4} (r^{2} + 2 x^{2}) \\ 2 k_{4} x y + k_{3} (r^{2} + 2 y^{2}) \end{matrix})}_{\text{Tangential Distortion}}

x_{s} = (\begin{matrix} f_{x} x^{'} + c_{x} \\ f_{y} y^{'} + c_{y} \end{matrix})

Images can be undistorted such that the perspective projection model applies. More complex distortion models must be used for wide-angle lenses (e.g., fisheye).
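The distortion model above can be sketched directly in NumPy. The function name `project_distorted`, the intrinsics, and the coefficient values are assumptions chosen for illustration; the coefficient order $(k_{1}, k_{2}, k_{3}, k_{4})$ follows the equation in this note and may differ from the ordering used by specific libraries.

```python
import numpy as np

def project_distorted(x_c, k, fx, fy, cx, cy):
    """Project a camera-space point using radial (k1, k2) and tangential (k3, k4) distortion."""
    k1, k2, k3, k4 = k
    x = x_c[0] / x_c[2]                 # normalized image coordinates
    y = x_c[1] / x_c[2]
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2
    xd = radial * x + 2.0 * k3 * x * y + k4 * (r2 + 2.0 * x * x)
    yd = radial * y + 2.0 * k4 * x * y + k3 * (r2 + 2.0 * y * y)
    return np.array([fx * xd + cx, fy * yd + cy])

point = np.array([0.2, -0.1, 2.0])

# With all coefficients zero the model reduces to the pinhole projection
ideal = project_distorted(point, (0.0, 0.0, 0.0, 0.0), 500.0, 500.0, 320.0, 240.0)

# A negative k1 pulls the point towards the principal point (barrel distortion)
barrel = project_distorted(point, (-0.2, 0.0, 0.0, 0.0), 500.0, 500.0, 320.0, 240.0)
```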