The eigenvalues and eigenvectors of a matrix are essential in many applications across the sciences. Despite their utility, students often leave their linear algebra courses with very little intuition for eigenvectors. In this post we describe several surprisingly simple algorithms for computing the eigenvalues and eigenvectors of a matrix, while attempting to convey as much geometric intuition as possible.
Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix. Since $A$ is symmetric, all of the eigenvalues of $A$ are real and $A$ has a full set of $n$ orthogonal eigenvectors. Let $\lambda_1 > \lambda_2 > \dots > \lambda_n$ denote the eigenvalues of $A$ and let $v_1, v_2, \dots, v_n$ denote their corresponding eigenvectors. The fact that $A$ is positive definite means that $\lambda_i > 0$ for all $i$. This condition isn't strictly necessary for the algorithms described below; I'm assuming it so that I can refer to the largest eigenvalue, as opposed to the largest-in-magnitude eigenvalue.
All of my intuition for positive definite matrices comes from the geometry of the quadratic form $x^T A x$. Figure 1 plots $x^T A x$ in $\mathbb{R}^2$ for several matrices. When $A$ is positive definite, the quadratic form is shaped like a bowl. More rigorously, it has positive curvature in every direction, and the curvature at the origin in the direction of each eigenvector is proportional to that eigenvector's eigenvalue. In $\mathbb{R}^2$, the two eigenvectors give the directions of maximum and minimum curvature at the origin. These are also known as principal directions in differential geometry, and the curvatures in these directions are known as principal curvatures. I often shorten this intuition by simply stating that positive definite matrices are bowls, because this is always the picture I have in my head when discussing them.
Now suppose we want to compute a single eigenvector of $A$. This problem comes up more often than you'd think, and it's a crime that undergraduate linear algebra courses don't often make this clear. The first algorithm that one generally learns, and the only algorithm in this post that I knew as an undergraduate, is an incredibly simple algorithm called Power Iteration. Starting from a random unit vector $v^{(0)}$ we simply compute $v^{(t+1)} = \frac{A v^{(t)}}{\|A v^{(t)}\|}$ iteratively. For sufficiently large $t$, $v^{(t)}$ converges to $v_1$, the eigenvector corresponding to the largest eigenvalue of $A$, hereafter referred to as the "top eigenvector".
```python
import numpy as np

def PowerIteration(A, max_iter):
    # generate a uniformly random unit vector
    v = np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    for t in range(max_iter):
        v = np.dot(A, v)  # compute Av
        v /= np.linalg.norm(v)
    return v
```
To see why Power Iteration converges to the top eigenvector of $A$, it helps to write $v^{(0)}$ in the eigenbasis of $A$ as $v^{(0)} = \sum_{i=1}^n c_i v_i$ for some coefficients $c_i$. Then we have that

$$A^t v^{(0)} = \sum_{i=1}^n c_i \lambda_i^t v_i = c_1 \lambda_1^t \left( v_1 + \sum_{i=2}^n \frac{c_i}{c_1} \left( \frac{\lambda_i}{\lambda_1} \right)^t v_i \right).$$
Since $\lambda_1$ is the largest eigenvalue, the fractions $\left( \frac{\lambda_i}{\lambda_1} \right)^t$ go to 0 as $t \to \infty$, for all $i \neq 1$. Thus the only component of $A^t v^{(0)}$ that retains any weight is that of $v_1$. How quickly each of those terms goes to 0 depends on the ratio $\frac{\lambda_2}{\lambda_1}$. If this ratio is close to 1 then it may take many iterations to disambiguate between the top two (or more) eigenvectors. We say that the Power Iteration algorithm converges at a rate of $O\left( \left( \frac{\lambda_2}{\lambda_1} \right)^t \right)$, which for some unfortunate historical reason is referred to as "linear convergence".
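As a quick numerical illustration (the matrix below is a made-up example, not from the text), the error of Power Iteration shrinks by roughly a factor of $\frac{\lambda_2}{\lambda_1}$ per iteration:

```python
import numpy as np

np.random.seed(0)
# hypothetical diagonal test matrix with eigenvalues 4, 2, 1, so lambda2/lambda1 = 0.5
A = np.diag([4.0, 2.0, 1.0])
v1 = np.array([1.0, 0.0, 0.0])  # the top eigenvector

v = np.random.randn(3)
v /= np.linalg.norm(v)
errors = []
for t in range(16):
    v = A @ v
    v /= np.linalg.norm(v)
    # distance to the top eigenvector (up to sign)
    errors.append(min(np.linalg.norm(v - v1), np.linalg.norm(v + v1)))

# successive error ratios approach lambda2/lambda1 = 0.5
ratios = [errors[t + 1] / errors[t] for t in range(10, 15)]
```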
Power Iteration gives us an estimate of the top eigenvector $v_1$, but what about the other extreme? What if instead we wanted to compute $v_n$, the eigenvector corresponding to the smallest eigenvalue $\lambda_n$? It turns out there is a simple modification to the standard Power Iteration algorithm that computes $v_n$: instead of multiplying by $A$ at each iteration, multiply by $A^{-1}$. This works because the eigenvalues of $A^{-1}$ are $\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_n}$, and thus the smallest eigenvalue of $A$, $\lambda_n$, corresponds to the largest eigenvalue of $A^{-1}$, $\frac{1}{\lambda_n}$. Furthermore, the eigenvectors of $A^{-1}$ are the same as the eigenvectors of $A$. This slight modification is called Inverse Iteration, and by the same analysis it exhibits the same convergence rate as Power Iteration.
```python
import numpy as np
import scipy.linalg

def InverseIteration(A, max_iter):
    # generate a uniformly random unit vector
    v = np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    # compute the LU factorization of A once, up front
    lu, piv = scipy.linalg.lu_factor(A)
    for t in range(max_iter):
        v = scipy.linalg.lu_solve((lu, piv), v)  # compute A^(-1) v
        v /= np.linalg.norm(v)
    return v
```
Note that we don't actually compute $A^{-1}$ explicitly. Instead we compute an LU factorization of $A$ once and, at each iteration, solve the system $A v^{(t+1)} = v^{(t)}$. Because the matrix we're multiplying by does not change across iterations, the factorization can be reused: each product $A^{-1} v^{(t)}$ costs only a cheap forward and backward substitution.
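In isolation, the factor-once, solve-many pattern looks like this (a small hypothetical system, purely for illustration):

```python
import numpy as np
import scipy.linalg

np.random.seed(0)
# a hypothetical well-conditioned matrix
A = np.random.randn(4, 4) + 4 * np.identity(4)

# O(n^3) factorization, performed once
lu, piv = scipy.linalg.lu_factor(A)

# each subsequent solve reuses the factors and costs only O(n^2)
b1 = np.random.randn(4)
b2 = np.random.randn(4)
x1 = scipy.linalg.lu_solve((lu, piv), b1)
x2 = scipy.linalg.lu_solve((lu, piv), b2)
```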
Power Iteration and Inverse Iteration find the eigenvectors at the extremes of the spectrum of $A$, but sometimes we want the eigenvector corresponding to a specific eigenvalue. Suppose that we have an estimate $\mu$ of an eigenvalue. We can find the eigenvector corresponding to the eigenvalue of $A$ closest to $\mu$ by a simple modification to Inverse Iteration: instead of multiplying by $A^{-1}$ at each iteration, multiply by $(\mu I - A)^{-1}$, where $I$ is the identity matrix. The eigenvalues of $(\mu I - A)^{-1}$ are $\frac{1}{\mu - \lambda_i}$. Thus the largest-in-magnitude eigenvalue of $(\mu I - A)^{-1}$ corresponds to the eigenvalue of $A$ whose value is closest to $\mu$. By the same analysis as Power Iteration, Shifted Inverse Iteration also exhibits linear convergence. However, the better the estimate $\mu$, the larger $\left| \frac{1}{\mu - \lambda_i} \right|$ is relative to the other eigenvalues, and consequently the faster the convergence.
```python
import numpy as np
import scipy.linalg

def ShiftedInverseIteration(A, mu, max_iter):
    I = np.identity(A.shape[0])
    # generate a uniformly random unit vector
    v = np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    # compute the LU factorization of (mu*I - A) once, up front
    lu, piv = scipy.linalg.lu_factor(mu * I - A)
    for t in range(max_iter):
        v = scipy.linalg.lu_solve((lu, piv), v)  # compute (mu*I - A)^(-1) v
        v /= np.linalg.norm(v)
    return v
```
Shifted Inverse Iteration converges quickly if a good estimate $\mu$ of the target eigenvalue is available. However, if $\mu$ is a poor approximation of the desired eigenvalue, Shifted Inverse Iteration may take a long time to converge. In fact, all of the algorithms we've presented so far have exactly the same convergence rate: they all converge linearly. If instead we could improve the eigenvalue estimate at each iteration, we could potentially develop an algorithm with a faster convergence rate. This is the main idea behind Rayleigh Quotient Iteration.
The Rayleigh quotient is defined as $r(x) = \frac{x^T A x}{x^T x}$ for any nonzero vector $x$. There are many different ways in which we can understand the Rayleigh quotient. Some intuition that is often given is that the Rayleigh quotient is the scalar value $\mu$ that behaves most like an "eigenvalue" for $x$, even though $x$ may not be an eigenvector. What is meant is that the Rayleigh quotient is the minimizer of the optimization problem $\min_{\mu} \|Ax - \mu x\|_2$. This intuition is hardly satisfying.
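We can sanity-check this variational characterization on a tiny made-up example: the Rayleigh quotient attains a smaller residual $\|Ax - \mu x\|_2$ than nearby values of $\mu$.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # a small arbitrary symmetric matrix
x = np.array([1.0, 2.0])       # an arbitrary (non-eigen)vector

r = (x @ A @ x) / (x @ x)      # Rayleigh quotient: 18 / 5 = 3.6

def residual(mu):
    return np.linalg.norm(A @ x - mu * x)

# r minimizes ||Ax - mu*x|| over all scalars mu
worse = [residual(r + d) for d in (-0.5, -0.1, 0.1, 0.5)]
```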
Let's return to the geometry of the quadratic forms $x^T A x$ and $x^T x$ which comprise the Rayleigh quotient, drawn in orange and blue respectively in Figure 5. Without loss of generality we can assume that $A$ is a diagonal matrix. (This is without loss of generality because we're merely rotating the surface so that the eigenvectors align with the $x$ and $y$ axes, which does not affect the geometry of the surface. This is a common trick in the numerical algorithms literature.) In this coordinate system, the quadratic form is $x^T A x = \lambda_1 x^2 + \lambda_2 y^2$, where $\lambda_1$ and $\lambda_2$ are the diagonal entries, and thus the eigenvalues, of $A$.
Consider any vector $w$ and let $P$ be the plane spanned by $w$ and the vertical axis. The intersection of $P$ with the quadratic forms $x^T A x$ and $x^T x$ comprises two parabolas, also shown in Figure 5. If $w$ is aligned with the $x$-axis then, within the coordinate system defined by $P$, $x^T A x$ can be parameterized by $\lambda_1 t^2$ and $x^T x$ can be parameterized by $t^2$. (Note that here $t$ refers to a local coordinate within $P$ and is distinct from the vector $x$ used in $x^T A x$.) Similarly, if $w$ is aligned with the $y$-axis, then $x^T A x$ can be parameterized by $\lambda_2 t^2$. (If $w$ is any other vector then $x^T A x$ can be parameterized by $c_w t^2$ for some $\lambda_2 \le c_w \le \lambda_1$ dependent upon $w$.) The Rayleigh quotient at $w$ is then $\frac{c_w t^2}{t^2} = c_w$, and the curvature of the parabola $c_w t^2$ at the origin is $2 c_w$. Thus the Rayleigh quotient is proportional to the curvature of $x^T A x$ in the direction $w$!
From this intuition it is clear that the value of the Rayleigh quotient is identical along any ray starting at, but not including, the origin. The length of $w$ corresponds to the value of $t$ in the coordinate system defined by $P$, which does not affect the Rayleigh quotient. We can also see this algebraically, by choosing a unit vector $u$ and parameterizing a ray in the direction $u$ as $x = tu$ for $t > 0$. Then we have that

$$r(tu) = \frac{(tu)^T A (tu)}{(tu)^T (tu)} = \frac{t^2 \, u^T A u}{t^2 \, u^T u} = \frac{u^T A u}{u^T u} = r(u).$$
Thus it is sufficient to consider the values of the Rayleigh quotient on the unit sphere.
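A direct numeric check of this scale invariance (reusing a small arbitrary symmetric matrix as a stand-in):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # arbitrary symmetric example

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

u = np.array([0.6, 0.8])  # a unit vector
# r(t*u) is the same for every t > 0
values = [rayleigh(t * u) for t in (0.5, 1.0, 7.3)]
```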
For a unit vector $x = \sum_i c_i v_i$, the value of the Rayleigh quotient can be written in the eigenbasis as $r(x) = \sum_i c_i^2 \lambda_i$, where $\sum_i c_i^2 = 1$. Thus the Rayleigh quotient is a convex combination of the eigenvalues of $A$, and so its value is bounded by the minimum and maximum eigenvalues for all $x$. This fact is also easily seen from the geometric picture above, as the curvature at the origin is bounded by twice the minimum and maximum eigenvalues. It can be readily seen, by either direct calculation or the coefficients of the convex combination, that if $x$ is an eigenvector of $A$, then $r(x)$ is the corresponding eigenvalue.
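This decomposition is easy to verify with `numpy.linalg.eigh` (the matrix below is random test data, not from the text):

```python
import numpy as np

np.random.seed(2)
B = np.random.randn(4, 4)
A = B @ B.T                # a random symmetric PSD test matrix
w, V = np.linalg.eigh(A)   # eigenvalues w (ascending), eigenvectors in columns of V

x = np.random.randn(4)
x /= np.linalg.norm(x)     # a random unit vector

c = V.T @ x                # coefficients of x in the eigenbasis
r = x @ A @ x              # Rayleigh quotient of a unit vector
```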
Recall that a critical point of a function is a point where the derivative is equal to 0. It should come as no surprise that the eigenvalues are the critical values of the Rayleigh quotient and the eigenvectors are the critical points. What is less obvious is the special geometric structure of the critical points.
The gradient of the Rayleigh quotient is $\nabla r(x) = \frac{2}{x^T x} \left( Ax - r(x) x \right)$, from which it is easy to see that every eigenvector is a critical point of $r$. The type of each critical point is determined by the Hessian matrix, which at the critical point $v_i$ (taken to be a unit vector) is $2(A - \lambda_i I)$. The eigenvalues of the Hessian are $2(\lambda_j - \lambda_i)$ for $j = 1, \dots, n$. Assuming for a moment that the eigenvalues of $A$ are all distinct, the Hessian at $v_i$ has $i - 1$ eigenvalues that are positive, one eigenvalue that is 0, and $n - i$ eigenvalues that are negative. The 0 eigenvalue represents the fact that the value of the Rayleigh quotient is unchanged along the ray $t v_i$. The other eigenvalues represent the fact that at $v_i$, along the unit sphere, there are $i - 1$ directions in which we can walk to increase the value of the Rayleigh quotient, and $n - i$ directions that decrease it. Thus each eigenvector gives rise to a different type of saddle, and there are exactly two critical points of each type on the unit sphere ($\pm v_i$).
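The sign pattern of the Hessian's eigenvalues can be checked numerically, taking $A$ diagonal so that the eigenvectors are the coordinate axes (the eigenvalues below are made up, and the Hessian formula $2(A - \lambda_i I)$ is the one stated above):

```python
import numpy as np

lam = np.array([5.0, 3.0, 1.0])  # hypothetical distinct eigenvalues, descending
A = np.diag(lam)
n = len(lam)

signatures = []
for i in range(n):  # i is 0-indexed here; the text's index is i + 1
    H = 2 * (A - lam[i] * np.identity(n))  # Hessian of r at eigenvector v_{i+1}
    h = np.linalg.eigvalsh(H)
    pos = int(np.sum(h > 1e-12))
    zero = int(np.sum(np.abs(h) <= 1e-12))
    neg = int(np.sum(h < -1e-12))
    signatures.append((pos, zero, neg))
```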
Finally we come to the crown jewel of the algorithms in this post. The Rayleigh Quotient Iteration algorithm simply updates the eigenvalue estimate $\mu^{(t)}$ at each iteration with the Rayleigh quotient of the current iterate. Other than this slight modification, the algorithm is exactly like Shifted Inverse Iteration.
```python
import numpy as np

def RayleighQuotientIteration(A, max_iter):
    I = np.identity(A.shape[0])
    # generate a uniformly random unit vector
    v = np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    mu = np.dot(v, np.dot(A, v))  # Rayleigh quotient of the initial vector
    for t in range(max_iter):
        # the shift changes every iteration, so a new system must be solved each time
        v = np.linalg.solve(mu * I - A, v)  # compute (mu*I - A)^(-1) v
        v /= np.linalg.norm(v)
        mu = np.dot(v, np.dot(A, v))  # update the estimate with the Rayleigh quotient
    return (v, mu)
```
This slight modification drastically improves the convergence rate. Unlike the other algorithms in this post, which converge linearly, Rayleigh Quotient Iteration exhibits local cubic convergence! This means that, assuming $\|v^{(t)} - v_i\| \le \epsilon$ for some eigenvector $v_i$, on the next iteration we will have $\|v^{(t+1)} - v_i\| = O(\epsilon^3)$. In practice this means that you should expect to triple the number of correct digits at each iteration. It's hard to overstate how crazy fast cubic convergence is, and, to the best of the author's knowledge, algorithms that exhibit cubic convergence are rare in the numerical algorithms literature.
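To watch the residual collapse in just a handful of steps, here is a minimal run on a random SPD matrix (test data made up for illustration; the loop also stops if the shifted matrix becomes exactly singular, which only happens once $\mu$ has hit an eigenvalue):

```python
import numpy as np

np.random.seed(3)
B = np.random.randn(6, 6)
A = B @ B.T + 6 * np.identity(6)  # a random SPD test matrix
I = np.identity(6)

v = np.random.randn(6)
v /= np.linalg.norm(v)
mu = v @ A @ v
residual = np.linalg.norm(A @ v - mu * v)
for t in range(20):
    if residual < 1e-10:
        break                      # converged
    try:
        w = np.linalg.solve(mu * I - A, v)
    except np.linalg.LinAlgError:
        break                      # mu*I - A is exactly singular: mu is an eigenvalue
    v = w / np.linalg.norm(w)
    mu = v @ A @ v
    residual = np.linalg.norm(A @ v - mu * v)
```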
Intuitively, the reason that Rayleigh Quotient Iteration exhibits cubic convergence is that, while the Shifted Inverse Iteration step converges linearly, the Rayleigh quotient is a quadratically good estimate of an eigenvalue near an eigenvector. To see this, consider the Taylor series expansion of $r(x)$ near an eigenvector $v_i$:

$$r(x) = r(v_i) + (x - v_i)^T \nabla r(v_i) + O(\|x - v_i\|^2) = \lambda_i + O(\|x - v_i\|^2).$$

The second step follows from the fact that $v_i$ is a critical point of $r$, and so $\nabla r(v_i) = 0$.
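This quadratic accuracy is easy to observe directly: perturb an eigenvector by $\epsilon$ and the eigenvalue error shrinks like $\epsilon^2$ (the diagonal matrix below is made up for illustration):

```python
import numpy as np

A = np.diag([5.0, 3.0, 1.0])    # hypothetical matrix with eigenvalues 5, 3, 1
v1 = np.array([1.0, 0.0, 0.0])  # eigenvector for eigenvalue 5
d = np.array([0.0, 1.0, 0.0])   # a perturbation direction

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

errs = []
for eps in (1e-2, 1e-3, 1e-4):
    # for this example r(v1 + eps*d) = (5 + 3 eps^2) / (1 + eps^2) ~ 5 - 2 eps^2
    errs.append(abs(rayleigh(v1 + eps * d) - 5.0))
```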
While Rayleigh Quotient Iteration exhibits very fast convergence, it's not without its drawbacks. First, notice that the matrix $\mu^{(t)} I - A$ changes at each iteration. Thus we cannot precompute a single factorization and quickly solve each iterate's system by forward and backward substitution, as we did in the Shifted Inverse Iteration algorithm; we must solve a different linear system at each iteration, which is much more expensive. Second, Rayleigh Quotient Iteration gives no control over which eigenvector it converges to: that depends on which basin of attraction the initial random vector falls into. This balance between an improved convergence rate and solving a different linear system at each iteration feels like mathematical poetic justice. The price to pay for cubic convergence is steep.