**1. Introduction**

A typical GPU rendering pipeline is the process of creating a 2D image from 3D models and showing it on the computer monitor. The perspective projection transformation is fundamental to that process (3D → 2D). Usually a 4x4 matrix is used to perform the transformation. The mainstream 3D APIs (OpenGL/D3D) provide functions to produce this matrix, yet, strangely enough, very little information about it can be found in the function specifications or in textbooks. It can seem mysterious to newcomers stepping into the real-time rendering field: why does the projection matrix look like that? How is it derived? This blog tries to answer these questions and give you a comprehensive guide to the perspective projection transformation.

Figure 1: Project M on the image plane (at m).

**2. Essential Background Knowledge**

Basically, the principle of perspective projection is rather simple. In Fig. 1, you can find the point *m* from the properties of similar triangles; that part is easy. The story, however, does not stop here. Our real purpose is to encode this projection process into a matrix, so that projecting a point onto the image plane can be carried out by a basic matrix multiplication. Before investigating the projection transformation, we first review homogeneous coordinates briefly.

Figure 2: Point and Vector in coordinate system

In geometric algebra, subtracting two points (P − O) yields a vector, as depicted in Figure 2. The vector is called a position vector when the point O is the origin. Algebraically, we cannot distinguish between a point and a vector if we are only given a triplet of three components such as (2, 3, 5). With homogeneous coordinates, points and vectors can be represented separately and clearly.

Conversion rules between ordinary coordinates and homogeneous coordinates:

(1). From ordinary coordinates to homogeneous coordinates:

If (x, y, z) is a point, it becomes (x, y, z, 1); so the point P is (2, 3, 5, 1).

If (x, y, z) is a vector, it becomes (x, y, z, 0); so the vector P − O is (2, 3, 5, 0).

A homogeneous coordinate (x, y, z, w) is equivalent to (x/w, y/w, z/w, w/w), where w is not 0.
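The conversion rules above can be sketched in a few lines of plain Python (the function names are illustrative, not from any particular graphics API):

```python
# Conversion between ordinary and homogeneous coordinates, as described above.

def point_to_homogeneous(p):
    """A point (x, y, z) becomes (x, y, z, 1)."""
    x, y, z = p
    return (x, y, z, 1.0)

def vector_to_homogeneous(v):
    """A vector (x, y, z) becomes (x, y, z, 0)."""
    x, y, z = v
    return (x, y, z, 0.0)

def from_homogeneous(h):
    """(x, y, z, w) is equivalent to (x/w, y/w, z/w, 1), for w != 0."""
    x, y, z, w = h
    return (x / w, y / w, z / w, 1.0)

print(point_to_homogeneous((2, 3, 5)))   # (2, 3, 5, 1.0)
print(from_homogeneous((4, 6, 10, 2)))   # (2.0, 3.0, 5.0, 1.0)
```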

On the other hand, homogeneous coordinates make it easy to combine translation with the other affine transforms (rotation, reflection, shear, scale, …) into a single matrix multiplication.

“Homogeneous coordinates are an important tool of computer graphics: they can be used both to distinguish clearly between vectors and points, and to make affine (linear) geometric transformations easier.” – F.S. Hill, Jr.

Certainly, the projection transformation is convenient to express as a matrix multiplication in homogeneous coordinates:

[x' y' z' w'] = [x y z w] * M_p, where M_p is the projection matrix.
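The row-vector convention used throughout this blog can be sketched in plain Python (no particular graphics API assumed):

```python
# [x' y' z' w'] = [x y z w] * M: a 1x4 row vector times a 4x4 matrix.

def row_vec_mul(v, m):
    """Multiply a 1x4 row vector by a 4x4 matrix (list of 4 rows)."""
    return tuple(sum(v[i] * m[i][j] for i in range(4)) for j in range(4))

# The identity matrix leaves the vertex unchanged.
identity = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]
print(row_vec_mul((2, 3, 5, 1), identity))  # (2, 3, 5, 1)
```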
Next, let's review the GPU's general transformation pipeline and its interpolation. Some notation:

OCS: Object Coordinate System

WCS: World Coordinate System

VCS: View Coordinate System

CCS: Clip Coordinate System

NDCS: Normalized Device Coordinate System

CVV: Canonical View Volume

Figure 3: Transformation

Figure 4: Interpolation

In order to eliminate redundant work, the GPU has to clip objects against the six planes of the view frustum, but clipping against arbitrary 3D planes requires considerable computation. For fast clipping, the GPU transforms the viewing volume into a CVV (Canonical View Volume), against which clipping is easy. The coordinate system of the CVV is referred to as the normalized device coordinate system, or NDCS. In D3D the CVV is a cuboid whose x- and y-coordinates are within the range [-1:1] and whose z-coordinate is within the range [0:1]. Points whose projected coordinates are not contained in these ranges are invisible and are therefore not drawn. The projection transformation of the GPU is actually a composite of two steps:

(1). Transform the vertices into CVV from view frustum

(2). Perspective divide

*Note:**If you have a vertex shader, you can carry out the transformation however you like, but remember that you must not implement the perspective divide (the division by w); the perspective divide is eventually carried out by the GPU. Consequently, the vertex (x, y, z, w) produced by the vertex shader is in the CCS (Clip Coordinate System), and it is inside the view volume when its x- and y-coordinates are within the range [-w:w] and its z-coordinate is within the range [0:w].*
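The two steps can be sketched with the D3D clip-space ranges just described (a minimal Python illustration, not GPU code):

```python
# A clip-space vertex (x, y, z, w) is inside the view frustum when
# -w <= x <= w, -w <= y <= w, 0 <= z <= w (D3D convention); the GPU then
# divides by w to reach the CVV.

def inside_clip_space(x, y, z, w):
    """D3D-style containment test in clip coordinates (before the divide)."""
    return -w <= x <= w and -w <= y <= w and 0.0 <= z <= w

def perspective_divide(x, y, z, w):
    """The divide the GPU performs after the vertex shader."""
    return (x / w, y / w, z / w)

v = (1.0, -2.0, 3.0, 4.0)       # a clip-space vertex
print(inside_clip_space(*v))     # True
print(perspective_divide(*v))    # (0.25, -0.5, 0.75) -- inside the CVV
```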

Figure 5: CVV

Another GPU operation to mention is interpolation. As we know, the pixel shader is invoked per pixel with the attributes output by the vertex shader, and the GPU interpolates each pixel's attributes from the vertices' attributes. This is generally the job of the rasterizer, which performs linear interpolation of the attributes for the pixels that contribute to the final image. As depicted in Figure 4, the attributes include position, texture coordinates, color, tangent and so on. All of these attributes are consumed by the pixel shader for data processing such as arithmetic calculation and texture sampling. Since the rasterizer runs after the geometric transformations, you can imagine that it performs interpolation on the image plane. As depicted in Fig. 6, after the projection transformation of the spatial line AB, we get the line AsBs on the image plane (the red line in the picture). We already know the screen coordinates of As and Bs and all of their attributes. What we want is the attributes at any position (pixel) on the line, derived from the screen line equation and the vertex attributes. Obviously, the interpolation should be done along the spatial line AB, not along the screen line AsBs, because the projection transformation is not linear. This is the essential reason the GPU uses the perspective divide and perspective correction. From the similar triangles in Fig. 6, we get X/Z = Xs/d, where d is a constant. That is to say, Xs is linear in X/Z; in other words, Xs*Z is linear in X. The attributes are linear in X, so Xs is linear in attributes/Z. Consequently, if the vertex attributes are divided by the real depth value (view-space Z), we can interpolate the divided attributes linearly in the screen coordinate Xs; interpolating 1/Z the same way and finally multiplying back by the real depth value (view-space Z), which is called perspective correction, yields the correct result.

Figure 6: Projection Interpolation

Let’s conclude the derivation:

Condition (1): X/Z = Xs/d, d is a constant → Xs is linear in X/Z

Condition (2): attributes are linear in X (basic rule)

Condition (1) + (2) → attributes/Z is linear in Xs.
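The conclusion can be checked numerically (a minimal Python sketch; the endpoint values are made up for illustration):

```python
# Perspective-correct interpolation: attr/Z and 1/Z interpolate linearly
# in screen space; dividing them recovers the attribute.

def perspective_correct(attr_a, z_a, attr_b, z_b, t):
    """Interpolate an attribute between two projected vertices.

    t is the linear parameter along the *screen-space* segment.
    """
    over_z = (attr_a / z_a) * (1 - t) + (attr_b / z_b) * t   # attr/Z
    inv_z = (1.0 / z_a) * (1 - t) + (1.0 / z_b) * t          # 1/Z
    return over_z / inv_z                                     # the correction

# At the screen-space midpoint the result is 2.0, not the naive linear
# midpoint 5.0: it is biased toward the nearer vertex (Z = 1), as
# perspective demands.
print(perspective_correct(0.0, 1.0, 10.0, 4.0, 0.5))  # 2.0
```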

**3. Perspective Projection Matrix Derivation**

Assume we put the image plane on the near plane; let N denote the distance of the near plane and F the distance of the far plane. After the projection transformation we have the point Ps(NX/Z, NY/Z, N). The z-coordinate N is useless after the projection transformation, so we can store some useful value in the projected z-coordinate. Remembering the property of homogeneous coordinates, we could substitute 1/Z for the z-coordinate N, making the position Ps (NX/Z, NY/Z, 1/Z). As you might guess, it isn't quite so simple. Like the attributes, anything of the form (aZ + b)/Z interpolates linearly alongside 1/Z, so we set the z-coordinate to (aZ + b)/Z. The reasons for choosing this expression are as follows:
(1). Generally, the GPU employs a z-buffer to compare the relative locations of objects; the value (aZ + b)/Z can be stored in the z-buffer to implement the comparison. For z-buffering purposes the final perspective correction need not even be computed, because (aZ + b)/Z is monotonic in Z, so comparing the stored values preserves the depth ordering.

(2). Easy to represent by homogeneous coordinates and matrix multiplication.

(3). The CVV's z-coordinate is within the range [0:1], so we can find suitable values (a, b) such that (aZ + b)/Z = 0 when Z = N and (aZ + b)/Z = 1 when Z = F.

Now our projection point Ps is (NX/Z, NY/Z, (aZ + b)/Z), and the corresponding homogeneous coordinate (NX/Z, NY/Z, (aZ + b)/Z, 1.0) simplifies to (NX, NY, aZ + b, Z). The perspective divide just translates (NX, NY, aZ + b, Z) back to (NX/Z, NY/Z, (aZ + b)/Z, 1.0) by dividing by Z.

*Note: (1). Actually, the GPU interpolates (aZ + b)/Z directly with the screen coordinate Xs and stores it into the z-buffer. (2). Perspective divide and perspective correction are different in this blog: the first denotes the division by Wc (view-space Z), the second denotes the multiplication by 1/Ws, where Ws = 1/Wc.*
Therefore, solving (aN + b)/N = 0 and (aF + b)/F = 1 gives

a = F/(F − N), b = −NF/(F − N)

Putting it all together, the first perspective projection transformation is:

[x' y' z' w'] = [X Y Z 1] *
[ N 0 0 0 ]
[ 0 N 0 0 ]
[ 0 0 a 1 ]
[ 0 0 b 0 ]
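As a quick sanity check (a minimal Python sketch, not GPU code): with a = F/(F − N) and b = −NF/(F − N), the expression (aZ + b)/Z maps Z = N to 0 and Z = F to 1, matching D3D's [0:1] range:

```python
# Verify the z-mapping (aZ + b)/Z with a = F/(F - N), b = -N*F/(F - N).

def z_to_cvv(z, n, f):
    a = f / (f - n)
    b = -n * f / (f - n)
    return (a * z + b) / z

N, F = 1.0, 100.0
print(z_to_cvv(N, N, F))   # 0.0 -- the near plane
print(z_to_cvv(F, N, F))   # 1.0 -- the far plane
```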

Similar to the z-coordinate derivation, the x- and y-coordinates are mapped by linear interpolation into the CVV range [-1:1]:

x' = 2(NX/Z − left)/(right − left) − 1, y' = 2(NY/Z − bottom)/(top − bottom) − 1

which can be rearranged into

x' = (2N/(right − left)) * X/Z − (right + left)/(right − left)
y' = (2N/(top − bottom)) * Y/Z − (top + bottom)/(top − bottom)

Figure 7: CVV transformation

Two conditions need to be taken into account:

(1). The center of the image plane is exactly the center of the x-y plane

(2). The center is offset

For condition (1), the image plane is centered, so right = −left and top = −bottom, and the offset terms (right + left)/(right − left) and (top + bottom)/(top − bottom) vanish.

So the perspective transformation matrix can be derived as follows (writing r, l, t, b for right, left, top, bottom):

M_p =
[ 2N/(r−l) 0        0 0 ]
[ 0        2N/(t−b) 0 0 ]
[ 0        0        a 1 ]
[ 0        0        b 0 ]

where a = F/(F − N) and b = −NF/(F − N).

Moreover, by basic trigonometry, these parameters (top, bottom, left, right) can be replaced with the FOV (Field of View): top = N·tan(fovY/2), so 2N/(top − bottom) = cot(fovY/2). The relation is depicted in Fig. 8.

Figure 8: FOV

For condition (2), the center of the image plane is offset, so the offset terms remain.

Consequently, the perspective transformation matrix is (writing r, l, t, b for right, left, top, bottom):

M_p =
[ 2N/(r−l)     0            0 0 ]
[ 0            2N/(t−b)     0 0 ]
[ −(r+l)/(r−l) −(t+b)/(t−b) a 1 ]
[ 0            0            b 0 ]

where a = F/(F − N) and b = −NF/(F − N).

The D3D API generates these matrices with the functions D3DXMatrixPerspectiveLH, D3DXMatrixPerspectiveRH, D3DXMatrixPerspectiveFovLH, D3DXMatrixPerspectiveFovRH, D3DXMatrixPerspectiveOffCenterLH and D3DXMatrixPerspectiveOffCenterRH.
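To make the derivation concrete, here is a plain-Python sketch of a matrix equivalent to what D3DXMatrixPerspectiveFovLH produces (row-vector convention, z mapped to [0:1]); the function names here are illustrative, not D3D's:

```python
import math

def perspective_fov_lh(fov_y, aspect, n, f):
    """Build a D3D-style perspective matrix from a vertical FOV."""
    y_scale = 1.0 / math.tan(fov_y / 2.0)   # cot(fovY/2), as derived above
    x_scale = y_scale / aspect
    a = f / (f - n)
    b = -n * f / (f - n)
    return [[x_scale, 0.0,     0.0, 0.0],
            [0.0,     y_scale, 0.0, 0.0],
            [0.0,     0.0,     a,   1.0],
            [0.0,     0.0,     b,   0.0]]

def project(p, m):
    """[x y z w] * M, then the perspective divide."""
    x, y, z, w = (sum(p[i] * m[i][j] for i in range(4)) for j in range(4))
    return (x / w, y / w, z / w)

m = perspective_fov_lh(math.pi / 2, 1.0, 1.0, 100.0)
print(project((0.0, 0.0, 1.0, 1.0), m))     # near plane: z' = 0.0
print(project((0.0, 0.0, 100.0, 1.0), m))   # far plane:  z' = 1.0
```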

**Appendix**

Some differences for OpenGL:

(1). The z-coordinate is within range [-1:1]

(2). The view coordinate system is right-handed (the camera looks down the negative z-axis)

So the perspective projection transformation (in OpenGL's column-vector convention, M * [x y z w]^T; writing r, l, t, b for right, left, top, bottom) is

M =
[ 2N/(r−l) 0        (r+l)/(r−l)  0          ]
[ 0        2N/(t−b) (t+b)/(t−b)  0          ]
[ 0        0        −(F+N)/(F−N) −2FN/(F−N) ]
[ 0        0        −1           0          ]

where N and F are again the near- and far-plane distances. This is the matrix produced by glFrustum.
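For comparison, here is a plain-Python sketch of the glFrustum-style matrix (column-vector convention, right-handed view space, z mapped to [-1:1]); the function names are illustrative, not OpenGL's:

```python
# OpenGL-style frustum matrix: the camera looks down -z, and the near/far
# planes map to z' = -1 and z' = +1 respectively.

def frustum_gl(l, r, b, t, n, f):
    return [[2*n/(r-l), 0.0,       (r+l)/(r-l),  0.0],
            [0.0,       2*n/(t-b), (t+b)/(t-b),  0.0],
            [0.0,       0.0,       -(f+n)/(f-n), -2*f*n/(f-n)],
            [0.0,       0.0,       -1.0,         0.0]]

def project_gl(m, p):
    """M * [x y z w]^T (column-vector convention), then perspective divide."""
    x, y, z, w = (sum(m[i][j] * p[j] for j in range(4)) for i in range(4))
    return (x / w, y / w, z / w)

m = frustum_gl(-1.0, 1.0, -1.0, 1.0, 1.0, 100.0)
print(project_gl(m, (0.0, 0.0, -1.0, 1.0)))     # near plane: z' = -1.0
print(project_gl(m, (0.0, 0.0, -100.0, 1.0)))   # far plane:  z' = 1.0
```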

**REFERENCES**

[1] D3D SDK Doc

[2] Some pictures come from internet

[3] Peter Shirley, “Fundamentals of Computer Graphics”

[4] Steve Baker, “Learning to Love your Z-buffer”, http://www.sjbaker.org/steve/omniv/love_your_z_buffer.html

[5] OpenGPU Forum
