Lost in Spacetime (https://marckhoury.github.io)
Numerical Algorithms for Computing Eigenvectors
Marc Khoury (khoury@eecs.berkeley.edu), 2019-02-17
<script type="text/javascript" async="" src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>The eigenvalues and eigenvectors of a matrix are essential in many applications across the sciences. Despite their utility, students often leave their linear algebra courses with very little intuition for eigenvectors. In this post we describe several surprisingly simple algorithms for computing the eigenvalues and eigenvectors of a matrix, while attempting to convey as much geometric intuition as possible.</p>
<p>Let <script type="math/tex">A</script> be a symmetric positive definite matrix. Since <script type="math/tex">A</script> is symmetric, all of the eigenvalues of <script type="math/tex">A</script> are real and <script type="math/tex">A</script> has a full set of orthogonal eigenvectors. Let <script type="math/tex">\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n > 0</script> denote the eigenvalues of <script type="math/tex">A</script> and let <script type="math/tex">u_{1}, \ldots, u_{n}</script> denote their corresponding eigenvectors. The fact that <script type="math/tex">A</script> is positive definite means that <script type="math/tex">\lambda_i > 0</script> for all <script type="math/tex">i</script>. This condition isn’t strictly necessary for the algorithms described below; I’m assuming it so that I can refer to the largest eigenvalue as opposed to the largest in magnitude eigenvalue.</p>
<p>All of my intuition for positive definite matrices comes from the geometry of the quadratic form <script type="math/tex">x^{\top}Ax</script>. Figure 1 plots <script type="math/tex">x^{\top}Ax</script> in <script type="math/tex">\mathbb{R}^3</script> for several <script type="math/tex">2 \times 2</script> matrices. When <script type="math/tex">A</script> is positive definite, the quadratic form <script type="math/tex">x^{\top}Ax</script> is shaped like a bowl. More rigorously it has positive curvature in every direction and the curvature at the origin in the direction of each eigenvector is proportional to the eigenvalue of that eigenvector. In <script type="math/tex">\mathbb{R}^3</script>, the two eigenvectors give the directions of the maximum and minimum curvature at the origin. These are also known as principal directions in differential geometry, and the curvatures in these directions are known as principal curvatures. I often shorten this intuition by simply stating that positive definite matrices <em>are</em> bowls, because this is always the picture I have in my head when discussing them.</p>
<figure align="middle">
<img src="/assets/images/post3/Figure1v2.png" width="400" />
<figcaption>
<b>Figure 1:</b> The geometry of the quadratic form \(x^{\top}Ax\) for, from left to right, a positive definite matrix, a positive semi-definite matrix, an indefinite matrix, and a negative definite matrix. When \(A\) is positive definite it has positive curvature in every direction and is shaped like a bowl. The curvature at the origin in the direction of an eigenvector is proportional to the eigenvalue. A positive semi-definite matrix may have one or more eigenvalues equal to 0. This creates a flat (zero curvature) subspace of dimension equal to the number of eigenvalues with value equal to 0. An indefinite matrix has both positive and negative eigenvalues, and so has some directions with positive curvature and some with negative curvature, creating a saddle. A negative definite matrix has all negative eigenvalues and so the curvature in every direction is negative at every point.
</figcaption>
</figure>
<p>Now suppose we wanted to compute a single eigenvector of <script type="math/tex">A</script>. This problem comes up more often than you’d think and it’s a crime that undergraduate linear algebra courses don’t often make this clear. The first algorithm that one generally learns, and the only algorithm in this post that I knew as an undergraduate, is an incredibly simple algorithm called Power Iteration. Starting from a random unit vector <script type="math/tex">v</script> we simply compute <script type="math/tex">A^{t}v</script> iteratively. For sufficiently large <script type="math/tex">t</script>, <script type="math/tex">A^{t}v</script> converges to the eigenvector corresponding to the largest eigenvalue of <script type="math/tex">A</script>, hereafter referred to as the “top eigenvector”.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">PowerIteration</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">max_iter</span><span class="p">):</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c">#generate a uniformly random unit vector</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iter</span><span class="p">):</span>
        <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span> <span class="c">#compute Av</span>
        <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">v</span>
</code></pre></div></div>
<p>To see why Power Iteration converges to the top eigenvector of <script type="math/tex">A</script> it helps to write <script type="math/tex">v</script> in the eigenbasis of <script type="math/tex">A</script> as <script type="math/tex">v = \sum_{i=1}^n\beta_{i}u_{i}</script> for some coefficients <script type="math/tex">\beta_i</script>. Then we have that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A^{t}v &= A^{t}(\sum_{i= 1}^{n}\beta_{i}u_{i})\\
&= \sum_{i=1}^{n}\beta_{i}A^{t}u_{i}\\
&= \sum_{i=1}^{n}\beta_{i}\lambda_{i}^{t}u_{i}\\
&= \lambda_{1}^t \sum_{i=1}^{n}\beta_{i}\left(\frac{\lambda_{i}}{\lambda_{1}}\right)^t u_{i}\\
&= \lambda_{1}^{t} \left( \beta_1 u_1 + \sum_{i=2}^{n}\beta_{i}\left(\frac{\lambda_{i}}{\lambda_{1}}\right)^t u_{i} \right).
\end{align*} %]]></script>
<p>Since <script type="math/tex">\lambda_1</script> is the largest eigenvalue, the fractions <script type="math/tex">\left(\frac{\lambda_i}{\lambda_1}\right)^t</script> go to 0 as <script type="math/tex">t \rightarrow \infty</script>, for all <script type="math/tex">i \neq 1</script>. Thus the only component of <script type="math/tex">A^{t}v</script> that has any weight is that of <script type="math/tex">u_1</script>. The slowest of those terms to vanish is governed by the ratio <script type="math/tex">\frac{\lambda_{2}}{\lambda_{1}}</script>. If this ratio is close to 1 then it may take many iterations to disambiguate between the top two (or more) eigenvectors. We say that the Power Iteration algorithm converges at a rate of <script type="math/tex">O\left(\left(\frac{\lambda_{2}}{\lambda_{1}}\right)^t\right)</script>, which for some unfortunate historical reason is referred to as “linear convergence”.</p>
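<p>This analysis is easy to check numerically. The sketch below (the spectrum, dimension, and random seed are arbitrary choices for illustration) builds a symmetric positive definite matrix with a known eigenbasis and confirms that Power Iteration recovers the top eigenvector up to sign:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Build A = Q diag(lams) Q^T with a known, well-separated spectrum.
lams = np.array([4.0, 2.0, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(lams) @ Q.T

v = rng.standard_normal(4)
v /= np.linalg.norm(v)
for _ in range(100):  # (lambda_2/lambda_1)^100 = 2^-100, far below machine precision
    v = A @ v
    v /= np.linalg.norm(v)

u1 = Q[:, 0]  # eigenvector of the largest eigenvalue, 4.0
err = min(np.linalg.norm(v - u1), np.linalg.norm(v + u1))  # eigenvectors are defined up to sign
assert err < 1e-8
```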
<figure align="middle">
<img src="/assets/images/post3/Figure2.gif" width="400" />
<figcaption><b>Figure 2:</b> An illustration of the Power Iteration algorithm. The \(i\)th bar represents the component of the current iterate on the \(i\)th eigenvector, in order of decreasing eigenvalue. Notice that the components corresponding to the smallest eigenvalues decrease most rapidly, whereas the components on the largest eigenvalues take longer to converge. This animation represents 50 iterations of Power Iteration.</figcaption>
</figure>
<p>Power Iteration will give us an estimate of the top eigenvector <script type="math/tex">u_1</script>, but what about the other extreme? What if instead we wanted to compute <script type="math/tex">u_n</script>, the eigenvector corresponding to the smallest eigenvalue? It turns out there is a simple modification to the standard Power Iteration algorithm that computes <script type="math/tex">u_n</script>. Instead of multiplying by <script type="math/tex">A</script> at each iteration, multiply by <script type="math/tex">A^{-1}</script>. This works because the eigenvalues of <script type="math/tex">A^{-1}</script> are <script type="math/tex">\frac{1}{\lambda_i}</script>, and thus the smallest eigenvalue of <script type="math/tex">A</script>, <script type="math/tex">\lambda_n</script>, corresponds to the largest eigenvalue of <script type="math/tex">A^{-1}</script>, <script type="math/tex">\frac{1}{\lambda_{n}}</script>. Furthermore, the eigenvectors of <script type="math/tex">A^{-1}</script> are the same as those of <script type="math/tex">A</script>. This slight modification is called Inverse Iteration, and, by the same analysis, it exhibits the same linear convergence rate as Power Iteration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.linalg</span>

<span class="k">def</span> <span class="nf">InverseIteration</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">max_iter</span><span class="p">):</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c">#generate a uniformly random unit vector</span>
    <span class="n">lu</span><span class="p">,</span> <span class="n">piv</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">lu_factor</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="c"># compute LU factorization of A</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iter</span><span class="p">):</span>
        <span class="n">v</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">lu_solve</span><span class="p">((</span><span class="n">lu</span><span class="p">,</span> <span class="n">piv</span><span class="p">),</span> <span class="n">v</span><span class="p">)</span> <span class="c">#compute A^(-1)v</span>
        <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">v</span>
</code></pre></div></div>
<p>Note that we don’t actually compute <script type="math/tex">A^{-1}</script> explicitly. Instead we compute an LU factorization of <script type="math/tex">A</script> and solve the system <script type="math/tex">LUv_{t+1} = v_{t}</script>. The matrix that we’re multiplying by does not change at each iteration, so we can compute the LU factorization once and quickly solve a linear system to compute <script type="math/tex">A^{-1}v</script> at each iteration.</p>
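<p>The factor-once, solve-many pattern is worth seeing in isolation. In the sketch below (the matrix and right-hand sides are arbitrary test values), <code>lu_factor</code> does the <script type="math/tex">O(n^3)</script> work a single time and each subsequent <code>lu_solve</code> is only <script type="math/tex">O(n^2)</script>:</p>

```python
import numpy as np
import scipy.linalg

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])  # symmetric positive definite
lu, piv = scipy.linalg.lu_factor(A)  # factor once: PA = LU

# Solve against many right-hand sides without refactoring.
for b in [np.array([1.0, 2.0]), np.array([-1.0, 0.5])]:
    x = scipy.linalg.lu_solve((lu, piv), b)  # computes A^(-1) b
    assert np.allclose(A @ x, b)                      # x solves Ax = b
    assert np.allclose(x, np.linalg.solve(A, b))      # agrees with a direct solve
```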
<figure align="middle">
<img src="/assets/images/post3/Figure3.gif" width="400" />
<figcaption><b>Figure 3:</b> The Inverse Iteration algorithm. Notice that in this case the algorithm converges to the eigenvector corresponding to the smallest eigenvalue. </figcaption>
</figure>
<p>Power Iteration and Inverse Iteration find the eigenvectors at the extremes of the spectrum of <script type="math/tex">A</script>, but sometimes we may want to compute a specific eigenvector corresponding to a specific eigenvalue. Suppose that we have an estimate <script type="math/tex">\mu</script> of an eigenvalue. We can find the eigenvector corresponding to the eigenvalue of <script type="math/tex">A</script> closest to <script type="math/tex">\mu</script> by a simple modification to Inverse Iteration. Instead of multiplying by <script type="math/tex">A^{-1}</script> at each iteration, multiply by <script type="math/tex">(\mu I_{n} - A)^{-1}</script> where <script type="math/tex">I_{n}</script> is the identity matrix. The eigenvalues of <script type="math/tex">(\mu I_{n} - A)^{-1}</script> are <script type="math/tex">\frac{1}{\mu - \lambda_{i}}</script>. Thus the largest in magnitude eigenvalue of <script type="math/tex">(\mu I_{n} - A)^{-1}</script> corresponds to the eigenvalue of <script type="math/tex">A</script> whose value is closest to <script type="math/tex">\mu</script>. By the same analysis as Power Iteration, Shifted Inverse Iteration also exhibits linear convergence. However, the better the estimate <script type="math/tex">\mu</script>, the larger <script type="math/tex">\frac{1}{|\mu - \lambda_{i}|}</script> is for the eigenvalue nearest <script type="math/tex">\mu</script> relative to the rest of the spectrum and, consequently, the faster the convergence.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.linalg</span>

<span class="k">def</span> <span class="nf">ShiftedInverseIteration</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">max_iter</span><span class="p">):</span>
    <span class="n">I</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">identity</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c">#generate a uniformly random unit vector</span>
    <span class="n">lu</span><span class="p">,</span> <span class="n">piv</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">lu_factor</span><span class="p">(</span><span class="n">mu</span><span class="o">*</span><span class="n">I</span> <span class="o">-</span> <span class="n">A</span><span class="p">)</span> <span class="c"># compute LU factorization of (mu*I - A)</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iter</span><span class="p">):</span>
        <span class="n">v</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">lu_solve</span><span class="p">((</span><span class="n">lu</span><span class="p">,</span> <span class="n">piv</span><span class="p">),</span> <span class="n">v</span><span class="p">)</span> <span class="c">#compute (mu*I - A)^(-1)v</span>
        <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">v</span>
</code></pre></div></div>
<figure align="middle">
<img src="/assets/images/post3/Figure4.gif" width="400" />
<figcaption><b>Figure 4:</b> The Shifted Inverse Iteration algorithm. In this case we converge to the eigenvector corresponding to the eigenvalue nearest \(\mu\). </figcaption>
</figure>
<p>Shifted Inverse Iteration converges quickly if a good estimate of the target eigenvalue is available. However if <script type="math/tex">\mu</script> is a poor approximation of the desired eigenvalue, Shifted Inverse Iteration may take a long time to converge. In fact all of the algorithms we’ve presented so far have exactly the same convergence rate; they all converge linearly. If instead we could improve on the eigenvalue estimate at each iteration we could potentially develop an algorithm with a faster convergence rate. This is the main idea behind Rayleigh Quotient Iteration.</p>
<p>The Rayleigh quotient is defined as
<script type="math/tex">\begin{equation*}
\lambda_{R}(v) = \frac{v^{\top}Av}{v^{\top}v}
\end{equation*}</script>
for any vector <script type="math/tex">v</script>. There are many different ways in which we can understand the Rayleigh quotient. Some intuition that is often given is that the Rayleigh quotient is the scalar value that behaves most like an “eigenvalue” for <script type="math/tex">v</script>, even though <script type="math/tex">v</script> may not be an eigenvector. What is meant is that the Rayleigh quotient is the minimizer of the optimization problem
<script type="math/tex">\begin{equation*}
\min_{\lambda \in \mathbb{R}} ||Av - \lambda v||^2
\end{equation*}</script>.
This intuition is hardly satisfying.</p>
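<p>Unsatisfying or not, the claim is at least easy to verify. The residual <script type="math/tex">||Av - \lambda v||^2</script> is a quadratic in <script type="math/tex">\lambda</script>, and setting its derivative to zero gives exactly the Rayleigh quotient. A quick check (the matrix and vector are arbitrary):</p>

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 2.0])

rq = (v @ A @ v) / (v @ v)  # Rayleigh quotient of v

def residual(lam):
    return np.linalg.norm(A @ v - lam * v) ** 2

# d/d(lambda) ||Av - lambda*v||^2 = 0 at lambda = rq, and the quadratic
# opens upward, so rq is the global minimizer of the residual.
assert residual(rq) < residual(rq + 0.1)
assert residual(rq) < residual(rq - 0.1)
```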
<p>Let’s return to the geometry of the quadratic forms <script type="math/tex">x^{\top}Ax</script> and <script type="math/tex">x^{\top}x</script> which comprise the Rayleigh quotient, drawn in orange and blue respectively in Figure 5. Without loss of generality we can assume that <script type="math/tex">A</script> is a diagonal matrix. (This is without loss of generality because we’re merely rotating the surface so that the eigenvectors align with the <script type="math/tex">x</script> and <script type="math/tex">y</script> axes, which does not affect the geometry of the surface. This is a common trick in the numerical algorithms literature.) In this coordinate system, the quadratic form <script type="math/tex">x^{\top}Ax = \lambda_1x_1^2 + \lambda_2 x_2^2</script>, where <script type="math/tex">\lambda_1</script> and <script type="math/tex">\lambda_2</script> are the diagonal entries, and thus the eigenvalues, of <script type="math/tex">A</script>.</p>
<p>Consider any vector <script type="math/tex">v</script> and let <script type="math/tex">h = \operatorname{span}\{v, (0,0,1)\}</script> be the plane spanned by <script type="math/tex">v</script> and the vector <script type="math/tex">(0,0,1)</script>. The intersection of <script type="math/tex">h</script> with the quadratic forms <script type="math/tex">x^{\top}Ax</script> and <script type="math/tex">x^{\top}x</script> is comprised of two parabolas, also shown in Figure 5. (This is a common trick in the geometric algorithms literature.) If <script type="math/tex">v</script> is aligned with the <script type="math/tex">x</script>-axis, then, within the coordinate system defined by <script type="math/tex">h</script>, <script type="math/tex">x^{\top}Ax</script> can be parameterized by <script type="math/tex">y = \lambda_1 x^2</script> and <script type="math/tex">x^{\top}x</script> can be parameterized by <script type="math/tex">y = x^2</script>. (Note that here <script type="math/tex">y</script> and <script type="math/tex">x</script> refer to local coordinates within <script type="math/tex">h</script> and are distinct from the vector <script type="math/tex">x</script> used in <script type="math/tex">x^{\top}Ax</script>.) Similarly if <script type="math/tex">v</script> is aligned with the <script type="math/tex">y</script>-axis, then <script type="math/tex">x^{\top}Ax</script> can be parameterized by <script type="math/tex">y = \lambda_2 x^2</script>. (If <script type="math/tex">v</script> is any other vector then <script type="math/tex">x^{\top}Ax</script> can be parameterized by <script type="math/tex">y = \kappa x^2</script> for some <script type="math/tex">\kappa</script> dependent upon <script type="math/tex">v</script>.) The Rayleigh quotient at <script type="math/tex">v</script> is <script type="math/tex">\lambda_{R}(v) = \frac{\lambda_1 x^2}{x^2} = \lambda_1</script>. 
The curvature of the parabola <script type="math/tex">y = \lambda_1 x^2</script> at the origin is <script type="math/tex">2\lambda_1</script>. Thus the Rayleigh quotient is proportional to the curvature of <script type="math/tex">x^{\top}Ax</script> in the direction <script type="math/tex">v</script>!</p>
<figure align="middle">
<img src="/assets/images/post3/Figure5.png" width="400" />
<figcaption><b>Figure 5:</b> The quadratic form \(x^{\top}Ax\) is shown in orange and \(x^{\top} x\) is shown in blue. Intersecting both surfaces with a plane \(h\) gives two parabolas. Within the plane \(h\) we can define a local coordinate system and parameterize both parabolas as \(\kappa x^2\) and \(x^2\). The Rayleigh quotient is equal to the ratio of the heights of the parabolas at any point, which is always equal to \(\kappa\). </figcaption>
</figure>
<p>From this intuition it is clear that the value of the Rayleigh quotient is identical along any ray starting at, but not including, the origin. The length of <script type="math/tex">v</script> corresponds to the value of <script type="math/tex">x</script> in the coordinate system defined by <script type="math/tex">h</script>, which does not affect the Rayleigh quotient. We can also see this algebraically, by choosing a unit vector <script type="math/tex">v</script> and parameterizing a ray in the direction <script type="math/tex">v</script> as <script type="math/tex">\alpha v</script> for <script type="math/tex">\alpha \in \mathbb{R}</script> and <script type="math/tex">\alpha > 0</script>. Then we have that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\lambda_{R}(\alpha v) &= \frac{(\alpha v^{\top})A(\alpha v)} {\alpha^2 v^{\top}v}\\
&= \frac{v^{\top}Av} {v^{\top}v}\\
&= v^{\top}Av.
\end{align*} %]]></script>
<p>Thus it is sufficient to consider the values of the Rayleigh quotient on the unit sphere.</p>
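<p>This scale invariance is also trivial to confirm numerically (the matrix and vector below are arbitrary):</p>

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
v = np.array([0.6, 0.8])  # a unit vector

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

# The Rayleigh quotient is constant along the ray alpha*v for alpha > 0.
for alpha in [0.5, 1.0, 2.0, 10.0]:
    assert np.isclose(rayleigh(alpha * v), rayleigh(v))
```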
<p>For a unit vector <script type="math/tex">v</script> the value of the Rayleigh quotient can be written in the eigenbasis as
<script type="math/tex">\begin{align*}
v^{\top}Av = \sum_{i=1}^{n} \lambda_{i}\langle v, u_i\rangle^2
\end{align*}</script>
where <script type="math/tex">\sum_{i=1}^{n} \langle v, u_i\rangle^2 = 1</script>. Thus the Rayleigh quotient is a convex combination of the eigenvalues of <script type="math/tex">A</script> and so its value is bounded by the minimum and maximum eigenvalues, <script type="math/tex">\lambda_{n} \leq \lambda_{R}(v) \leq \lambda_{1}</script> for all <script type="math/tex">v</script>. This fact is also easily seen from the geometric picture above, as the curvature at the origin is bounded by twice the minimum and maximum eigenvalues. It can be readily seen, either by direct calculation or from the coefficients of the convex combination, that if <script type="math/tex">v</script> is an eigenvector, then <script type="math/tex">\lambda_{R}(v)</script> is the corresponding eigenvalue.</p>
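<p>Both facts, the eigenvalue bounds and the value at an eigenvector, can be checked directly (the random positive definite matrix below is an arbitrary test case):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)  # symmetric positive definite

lams, U = np.linalg.eigh(A)  # eigenvalues in ascending order, orthonormal eigenvectors

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

# lambda_min <= rayleigh(v) <= lambda_max for every nonzero v
for _ in range(100):
    v = rng.standard_normal(5)
    assert lams[0] - 1e-10 <= rayleigh(v) <= lams[-1] + 1e-10

# At an eigenvector, the Rayleigh quotient is the corresponding eigenvalue.
assert np.isclose(rayleigh(U[:, 2]), lams[2])
```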
<p>Recall that a critical point of a function is a point where the derivative is equal to 0. It should come as no surprise that the eigenvalues are the critical values of the Rayleigh quotient and the eigenvectors are the critical points. What is less obvious is the special geometric structure of the critical points.</p>
<p>The gradient of the Rayleigh quotient is <script type="math/tex">\frac{2}{v^{\top}v}(Av - \lambda_{R}(v)v)</script>, from which it is easy to see that every eigenvector is a critical point of <script type="math/tex">\lambda_{R}</script>. The type of critical point is determined by the Hessian matrix, which at the critical point <script type="math/tex">u_i</script> is <script type="math/tex">2(A - \lambda_{i}I)</script>. The eigenvalues of the Hessian are <script type="math/tex">2(\lambda_j - \lambda_i)</script> for <script type="math/tex">j \in [1,n]</script>. Assuming for a moment that the eigenvalues are all distinct, the matrix <script type="math/tex">2(A - \lambda_{i}I)</script> has <script type="math/tex">i-1</script> eigenvalues that are positive, one eigenvalue that is 0, and <script type="math/tex">n - i</script> eigenvalues that are negative. The 0 eigenvalue represents the fact that the value of the Rayleigh quotient is unchanged along the ray <script type="math/tex">\alpha u_i</script>. The other eigenvalues represent the fact that at <script type="math/tex">u_i</script>, along the unit sphere, there are <script type="math/tex">i-1</script> directions in which we can walk to increase the value of the Rayleigh quotient, and <script type="math/tex">n-i</script> directions that decrease the Rayleigh quotient. Thus each eigenvector gives rise to a different type of saddle, and there are exactly two critical points of each type on the unit sphere.</p>
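<p>Both the gradient formula and the critical-point claim can be verified numerically: below we check the gradient against central finite differences at an arbitrary point, and confirm it vanishes at an eigenvector (the test matrix is arbitrary):</p>

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

def grad(x):
    # Gradient of the Rayleigh quotient: (2 / x^T x) (Ax - rayleigh(x) x)
    return (2.0 / (x @ x)) * (A @ x - rayleigh(x) * x)

# Check the gradient formula against central finite differences.
x = np.array([1.0, -0.5])
eps = 1e-6
fd = np.array([(rayleigh(x + eps * e) - rayleigh(x - eps * e)) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(grad(x), fd, atol=1e-5)

# Every eigenvector is a critical point: the gradient vanishes there.
lams, U = np.linalg.eigh(A)
assert np.allclose(grad(U[:, 0]), 0.0, atol=1e-10)
```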
<figure align="middle">
<img src="/assets/images/post3/Figure6.png" style="width: 400px; margin:auto;" />
<figcaption><b>Figure 6:</b> Contours of the Rayleigh quotient on the unit sphere and the gradient of the Rayleigh quotient at each point. We clearly see one minimum in blue corresponding to the minimum eigenvalue, one saddle point, and one maximum in bright yellow corresponding to the maximum eigenvalue. </figcaption>
</figure>
<p>Finally we come to the crown jewel of the algorithms in this post. The Rayleigh Quotient Iteration algorithm simply updates the estimate <script type="math/tex">\mu</script> at each iteration with the Rayleigh quotient. Other than this slight modification, the algorithm is exactly like Shifted Inverse Iteration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">RayleighQuotientIteration</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">max_iter</span><span class="p">):</span>
    <span class="n">I</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">identity</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c">#generate a uniformly random unit vector</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">v</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iter</span><span class="p">):</span>
        <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">solve</span><span class="p">(</span><span class="n">mu</span> <span class="o">*</span> <span class="n">I</span> <span class="o">-</span> <span class="n">A</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span> <span class="c">#compute (mu*I - A)^(-1)v</span>
        <span class="n">v</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
        <span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">v</span><span class="p">))</span> <span class="c">#compute Rayleigh quotient</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">mu</span><span class="p">)</span>
</code></pre></div></div>
<p>This slight modification drastically improves the convergence rate. Unlike the other algorithms in this post which converge linearly, Rayleigh quotient iteration exhibits local cubic convergence! This means that, assuming <script type="math/tex">\| v_{t} - u_i\| \leq \epsilon</script> for some <script type="math/tex">u_i</script>, on the next iteration we will have that <script type="math/tex">\| v_{t+1} - u_{i} \| = O(\epsilon^3)</script>. In practice this means that you should expect roughly triple the number of correct digits at each iteration. It’s hard to overstate how crazy fast cubic convergence is, and, to the best of the author’s knowledge, algorithms that exhibit cubic convergence are rare in the numerical algorithms literature.</p>
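<p>A rough numerical illustration of this speed (the random positive definite matrix and seed are arbitrary, and which eigenpair the iteration converges to depends on the starting vector): track how far the Rayleigh quotient estimate is from the nearest true eigenvalue after each iteration.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6.0 * np.eye(6)  # symmetric positive definite
true_lams = np.linalg.eigvalsh(A)

I = np.eye(6)
v = rng.standard_normal(6)
v /= np.linalg.norm(v)
mu = v @ A @ v

errors = []
for _ in range(5):
    v = np.linalg.solve(mu * I - A, v)  # a fresh linear solve every iteration
    v /= np.linalg.norm(v)
    mu = v @ A @ v
    errors.append(np.min(np.abs(true_lams - mu)))  # distance to nearest eigenvalue

# A handful of iterations already drives the estimate near machine precision.
assert errors[-1] < 1e-6
```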
<figure align="middle">
<img src="/assets/images/post3/Figure7v2.gif" width="400" />
<figcaption><b>Figure 7:</b> The Rayleigh Quotient Iteration algorithm. After only 6 iterations the eigenvalue estimate \(\mu_t\) is so accurate that the resulting matrix \((\mu_t I_{n} - A)\) is singular up to machine precision and we can no longer solve the linear system. Note that every other figure in this post shows 50 iterations.</figcaption>
</figure>
<p>Intuitively, the reason that Rayleigh Quotient Iteration exhibits cubic convergence is because, while the Shifted Inverse Iteration step converges linearly, the Rayleigh quotient is a quadratically good estimate of an eigenvalue near an eigenvector. To see this consider the Taylor series expansion of <script type="math/tex">\lambda_{R}</script> near an eigenvector <script type="math/tex">u_i</script>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\lambda_{R}(v) &= \lambda_{R}(u_i) + (v - u_{i})^{\top} \nabla \lambda_{R}(u_i) + O(||v - u_i||^2)\\
&= \lambda_{R}(u_i) + O(||v - u_i||^2)\\
\lambda_{R}(v) - \lambda_{R}(u_i) &= O(||v - u_i||^2)
\end{align*} %]]></script>
<p>The second step follows from the fact that <script type="math/tex">u_i</script> is a critical point of <script type="math/tex">\lambda_{R}</script> and so <script type="math/tex">\nabla \lambda_{R}(u_i) = 0</script>.</p>
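<p>This quadratic accuracy is easy to check numerically. The sketch below (my own illustration, not part of the original post; the diagonal test matrix is chosen purely so the eigenvectors are obvious) perturbs an eigenvector by a small amount and measures the resulting eigenvalue error:</p>

```python
import numpy as np

# Sketch: verify that the Rayleigh quotient error shrinks quadratically in the
# distance from an eigenvector. A is diagonal, so its eigenvectors are the
# standard basis vectors and its eigenvalues are 3, 2, 1.
A = np.diag([3.0, 2.0, 1.0])
u = np.array([1.0, 0.0, 0.0])  # eigenvector for the eigenvalue 3

def rayleigh_error(eps):
    v = u + eps * np.array([0.0, 1.0, 0.0])  # perturb away from u
    v = v / np.linalg.norm(v)
    return abs(v @ A @ v - 3.0)  # |lambda_R(v) - lambda_R(u)|

errors = [rayleigh_error(eps) for eps in (1e-1, 1e-2, 1e-3)]
# Each tenfold decrease in eps decreases the error roughly a hundredfold.
```

<p>Shrinking the perturbation by a factor of 10 shrinks the eigenvalue error by a factor of about 100, exactly the quadratic behavior the Taylor expansion predicts.</p>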
<p>While Rayleigh Quotient Iteration exhibits very fast convergence, it’s not without its drawbacks. First, notice that the matrix <script type="math/tex">(\mu_{t}I - A)</script> changes at each iteration. Thus we cannot precompute a factorization of this matrix and quickly solve each system using forward and backward substitution, like we did in the Shifted Inverse Iteration algorithm. We need to solve a different linear system at each iteration, which is much more expensive. Second, Rayleigh Quotient Iteration gives no control over which eigenvector it converges to: that depends on which basin of attraction the initial random vector <script type="math/tex">v_{0}</script> falls into. This trade-off between an improved convergence rate and solving a different linear system at each iteration feels like mathematical poetic justice. The price to pay for cubic convergence is steep.</p>
</script>
<p>Our geometric intuition developed in our three-dimensional world often fails us in higher dimensions. Many properties of even simple objects, such as higher dimensional analogs of cubes and spheres, are very counterintuitive. Below we discuss just a few of these properties in an attempt to convey some of the weirdness of high dimensional space.</p>
<p>You may be used to using the word “circle” in two dimensions and “sphere” in three dimensions. However, in higher dimensions we generally just use the word sphere, or <script type="math/tex">d</script>-sphere when the dimension of the sphere is not clear from context. With this terminology, a circle is also called a 1-sphere, for a 1-dimensional sphere. A standard sphere in three dimensions is called a 2-sphere, and so on. This sometimes causes confusion, because a <script type="math/tex">d</script>-sphere is usually thought of as existing in <script type="math/tex">(d+1)</script>-dimensional space. When we say <script type="math/tex">d</script>-sphere, the value of <script type="math/tex">d</script> refers to the dimension of the sphere locally on the object, not the dimension in which it lives. Similarly we’ll often use the word cube for a square, a standard cube, and its higher dimensional analogues.</p>
<h3 id="escaping-spheres">Escaping Spheres</h3>
<p>Consider a square with side length 1. At each corner of the square place a circle of radius 1/2, so that the circles cover the edges of the square. Then consider the circle centered at the center of the square that is just large enough to touch the circles at the corners of the square. In two dimensions it’s clear that the inner circle is entirely contained in the square.</p>
<div align="middle">
<img src="/assets/images/Figure1.png" width="300" />
<figcaption><b>Figure 1:</b> At each corner of the square we place a circle of radius 1/2. The inner circle is just large enough to touch the circles at the corners.
</figcaption>
</div>
<p>We can do the same thing in three dimensions. At each corner of the unit cube place a sphere of radius 1/2, again covering the edges of the cube. The sphere centered at the center of the cube and tangent to spheres at the corners of the cube is shown in red in Figure 2. Again we see that, in three dimensions, the inner sphere is entirely contained in the cube.</p>
<figure align="middle">
<img src="/assets/images/Figure2.png" />
<figcaption><b>Figure 2:</b> In three dimensions we place a sphere at the each of the eight corners of a cube.</figcaption>
</figure>
<p>To understand what happens in higher dimensions we need to compute the radius of the inner sphere in terms of the dimension. The radius of the inner sphere is equal to the distance from the center of the cube to one of its corners minus the radius of the spheres at the corners. See Figure 3. The latter value is always 1/2, regardless of the dimension. We can compute the former distance as</p>
<div align="middle">
$$
\begin{align*}
d((\frac{1}{2}, \frac{1}{2}, \ldots, \frac{1}{2}), (1,1, \ldots, 1)) &= \sqrt{\sum_{i = 1}^{d} (1 - 1/2)^2}\\
&= \sqrt{d}/2
\end{align*}
$$
</div>
<p>Thus the radius of the inner sphere is <script type="math/tex">\sqrt{d}/2 - 1/2</script>. Notice that the radius of the inner sphere is increasing with the dimension!</p>
<figure align="middle">
<img src="/assets/images/Figure3.png" />
<figcaption><b>Figure 3:</b> The size of the radius of the inner sphere is growing as the dimension increases because the distance to the corner increases while the radius of the corner sphere remains constant. </figcaption>
</figure>
<p>In dimensions two and three, the sphere is strictly inside the cube, as we’ve seen in the figures above. However in four dimensions something very interesting happens. The radius of the inner sphere is exactly 1/2, which is just large enough for the inner sphere to touch the sides of the cube! In five dimensions, the radius of the inner sphere is <script type="math/tex">0.618034</script>, and the sphere starts poking outside of the cube! By ten dimensions, the radius is <script type="math/tex">1.08114</script> and the sphere is poking very far outside of the cube!</p>
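<p>The radius formula is simple enough to tabulate directly; the following sketch (mine, not from the original post) reproduces the numbers quoted above:</p>

```python
from math import sqrt

# Radius of the inner sphere in d dimensions: distance from the center of the
# unit cube to a corner, minus the radius 1/2 of the corner spheres.
def inner_radius(d):
    return sqrt(d) / 2 - 1 / 2

# In 4 dimensions the inner sphere exactly touches the sides of the cube
# (radius 1/2); in 5 and more dimensions it pokes outside.
radii = {d: inner_radius(d) for d in (2, 3, 4, 5, 10)}
```

<p>Here <code>radii[4]</code> is exactly 0.5, <code>radii[5]</code> is about 0.618034, and <code>radii[10]</code> is about 1.08114, matching the values in the text.</p>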
<h3 id="volume-in-high-dimensions">Volume in High Dimensions</h3>
<p>The area of a circle is <script type="math/tex">A(r) = \pi r^2</script>, where <script type="math/tex">r</script> is the radius. Given the equation for the area of a circle, we can compute the volume of a sphere by considering cross sections of the sphere. That is, we intersect the sphere with a plane at some height <script type="math/tex">h</script> above the center of the sphere.</p>
<div align="middle">
<img src="/assets/images/Figure4.png" width="300" />
<figcaption><b>Figure 4:</b> Intersecting the sphere with a plane gives a circle. </figcaption>
</div>
<p>The intersection between a sphere and a plane is a circle. If we look at the sphere from a side view, as shown in Figure 5, we see that the radius can be computed using the Pythagorean theorem (<script type="math/tex">a^2 + b^2 = c^2</script>). The radius of the circle is <script type="math/tex">\sqrt{r^2 - h^2}</script>.</p>
<div align="middle">
<img src="/assets/images/Figure5.png" width="300" />
<figcaption><b>Figure 5:</b> A side view of Figure 4. The radius of the circle defined by the intersection can be found using the Pythagorean theorem.</figcaption>
</div>
<p>Summing up the area of each cross section from the bottom of the sphere to the top of the sphere gives the volume</p>
<div align="middle">
$$
\begin{align*}
V(r) &= \int_{-r}^{r} A(\sqrt{r^2 - h^2})\; dh\\
&= \int_{-r}^{r} \pi \left(\sqrt{r^2 - h^2}\right)^2 \; dh\\
&= \frac{4}{3}\pi r^3.
\end{align*}
$$
</div>
<p>Now that we know the volume of the <script type="math/tex">2</script>-sphere, we can compute the volume of the <script type="math/tex">3</script>-sphere in a similar way. The only difference is where before we used the equation for the area of a circle, we instead use the equation for the volume of the <script type="math/tex">2</script>-sphere. The general formula for the volume of a <script type="math/tex">d</script>-sphere is approximately</p>
<div align="middle">
$$
\begin{equation*}
\frac{\pi^{d/2}}{(d/2+1)!}r^d.
\end{equation*}
$$
</div>
<p>(Approximately because the denominator should be the <a href="https://en.wikipedia.org/wiki/Gamma_function">Gamma function</a>, but that’s not important for understanding the intuition.)</p>
<p>Set <script type="math/tex">r = 1</script> and consider the volume of the unit <script type="math/tex">d</script>-sphere as <script type="math/tex">d</script> increases. The plot of the volume is shown in Figure 6.</p>
<figure align="middle">
<img src="/assets/images/Figure6.png" />
<figcaption><b>Figure 6:</b> The volume of the unit d-sphere goes to 0 as d increases! </figcaption>
</figure>
<p>The volume of the unit <script type="math/tex">d</script>-sphere goes to 0 as <script type="math/tex">d</script> grows! A high dimensional unit sphere encloses almost no volume! The volume increases from dimensions one to five, but begins decreasing rapidly toward 0 after dimension six.</p>
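<p>You can reproduce the shape of Figure 6 in a few lines. This sketch (my own, using Python's <code>math.gamma</code> for the factorial in the formula above) computes the volume of the unit sphere in each dimension:</p>

```python
from math import pi, gamma

# Volume of the unit d-sphere from the formula above, with the factorial
# written properly as Gamma(d/2 + 1).
def unit_sphere_volume(d):
    return pi ** (d / 2) / gamma(d / 2 + 1)

volumes = [unit_sphere_volume(d) for d in range(1, 21)]
# The volume peaks in dimension 5 and then decays rapidly toward 0.
```

<p>Plugging in <code>d = 3</code> recovers the familiar \(\frac{4}{3}\pi\), and by dimension 20 the volume has already fallen below 0.03.</p>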
<h3 id="more-accurate-pictures">More Accurate Pictures</h3>
<p>Given the rather unexpected properties of high dimensional cubes and spheres, I hope that you’ll agree that the following are somewhat more accurate pictorial representations.</p>
<div align="middle">
<img src="/assets/images/Figure7.png" height="200" />
<figcaption><b>Figure 7:</b> More accurate pictorial representations of high dimensional cubes (left) and spheres (right).</figcaption>
</div>
<p>Notice that the corners of the cube are much further away from the center than are the sides. The <script type="math/tex">d</script>-sphere is drawn so that it contains almost no volume but still has radius 1. This image also suggests the next counterintuitive property of high dimensional spheres.</p>
<h3 id="concentration-of-measure">Concentration of Measure</h3>
<p>Suppose that you wanted to place a band around the equator of the unit sphere so that, say, 99% of the surface area of the sphere falls within that band. See Figure 8. How large do you think that band would have to be?</p>
<div align="middle">
<img src="/assets/images/Figure8.png" height="250" />
<figcaption><b>Figure 8:</b> In two dimensions the width of a band around the equator must be very large to contain 99% of the perimeter. </figcaption>
</div>
<p>In two dimensions the width of the band needs to be pretty large, indeed nearly 2, to capture 99% of the perimeter of the unit circle. However as the dimension increases the width of the band needed to capture 99% of the surface area gets smaller. In very high dimensional space nearly all of the surface area of the sphere lies a very small distance away from the equator!</p>
<figure align="middle">
<img src="/assets/images/Figure9.png" />
<figcaption><b>Figure 9:</b> As the dimension increases the width of the band necessary to capture 99% of the surface area decreases rapidly. Nearly all of the surface area of a high dimensional sphere lies near the equator. </figcaption>
</figure>
<p>To provide some intuition consider the situation in two dimensions, as shown in Figure 10. For a point on the circle to be close to the equator, its <script type="math/tex">y</script>-coordinate must be small.</p>
<figure align="middle">
<img src="/assets/images/Figure10.png" />
<figcaption><b>Figure 10:</b> Points near the equator have small y coordinate.</figcaption>
</figure>
<p>What happens to the values of the coordinates as the dimension increases? Figure 11 is a plot of 20000 random points sampled uniformly from a <script type="math/tex">d</script>-sphere. As <script type="math/tex">d</script> increases the values become more and more concentrated around 0.</p>
<figure align="middle">
<img src="/assets/images/Figure11.gif" />
<figcaption><b>Figure 11:</b> As the dimension increases the coordinates become increasingly concentrated around 0. </figcaption>
</figure>
<p>Recall that every point on a <script type="math/tex">d</script>-sphere must satisfy the equation <script type="math/tex">x_1^2 + x_2^2 + \ldots + x_{d+1}^2 = 1</script>. Intuitively, as <script type="math/tex">d</script> increases the number of terms in the sum increases, and each coordinate gets a smaller share of the single unit on average.</p>
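<p>This concentration is easy to observe experimentally. The sketch below (my own illustration; normalizing a vector of Gaussian samples is a standard way to draw uniform points on a sphere) measures the fraction of random points that land near the equator:</p>

```python
import math
import random

# Fraction of uniform random points on the sphere in d dimensions whose first
# coordinate (the "height above the equator") is at most `width` in magnitude.
def fraction_near_equator(d, width=0.2, n=20000, seed=0):
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(d)]  # a normalized Gaussian vector
        norm = math.sqrt(sum(c * c for c in x))  # is uniform on the sphere
        count += abs(x[0] / norm) <= width
    return count / n

# In 3 dimensions only about 20% of the sphere lies in this narrow band;
# in 100 dimensions roughly 95% of it does.
```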
<p>The really weird thing is that any choice of “equator” works! It must, since the sphere is, well, spherically symmetrical. We could have just as easily have chosen any of the options shown in Figure 12.</p>
<figure align="middle">
<img src="/assets/images/Figure12.png" />
<figcaption><b>Figure 12:</b> Any choice of equator works equally well! </figcaption>
</figure>
<h3 id="kissing-numbers">Kissing Numbers</h3>
<p>Consider a unit circle in the plane, shown in Figure 13 in red. The blue circle is said to <em>kiss</em> the red circle if it just barely touches the red circle. (Leave it to mathematicians to think that barely touching is a desirable property of a kiss.) The <em>kissing number</em> is the maximum number of non-overlapping blue circles that can simultaneously kiss the red circle.</p>
<div align="middle">
<img src="/assets/images/Figure13.png" width="300" />
<figcaption><b>Figure 13:</b> The kissing number is six in two dimensions. </figcaption>
</div>
<p>In two dimensions it’s easy to see that the kissing number is 6. The entire proof is shown in Figure 14. The proof is by contradiction. Assume that more than six non-overlapping blue circles can simultaneously kiss the red circle. We draw the edges from the center of the red circle to the centers of the blue circles, as shown in Figure 14. The angles between these edges must sum to exactly <script type="math/tex">360^{\circ}</script>. Since there are more than six angles, at least one must be less than <script type="math/tex">60^{\circ}</script>. The resulting triangle, shown in Figure 14, is an isosceles triangle. The side opposite the angle that is less than <script type="math/tex">60^{\circ}</script> must be strictly shorter than the other two sides, which are <script type="math/tex">2r</script> in length. Thus the centers of the two circles must be closer than <script type="math/tex">2r</script> and the circles must overlap, which is a contradiction.</p>
<div align="middle">
<img src="/assets/images/Figure14.png" />
<figcaption><b>Figure 14:</b> A proof that the kissing number is six in two dimensions. If more than six blue circles can kiss the red, then one of the angles must be less than 60 degrees. It follows that the two blue circles that form that angle must overlap, which is a contradiction. </figcaption>
</div>
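<p>The hexagonal configuration itself can be verified with a few lines of arithmetic (a sketch of mine; all circles here have radius 1):</p>

```python
from math import cos, sin, pi, dist

# Six unit circles kissing a central unit circle: their centers sit at
# distance 2 from the origin, spaced 60 degrees apart.
centers = [(2 * cos(k * pi / 3), 2 * sin(k * pi / 3)) for k in range(6)]

# Every blue circle is tangent to the red one (center distance exactly 2),
# and neighboring blue circles are tangent to each other (also distance 2),
# so there is no room left to squeeze in a seventh.
gaps = [dist(centers[k], centers[(k + 1) % 6]) for k in range(6)]
```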
<p>It is more difficult to see that in three dimensions the kissing number is 12. Indeed this was famously disputed between Isaac Newton, who correctly thought the kissing number was 12, and David Gregory, who thought it was 13. (Never bet against Newton.) Looking at the optimal configuration, it’s easy to see why Gregory thought it might be possible to fit a 13th sphere in the space between the other 12. As the dimension increases there is suddenly even more space between neighboring spheres and the problem becomes even more difficult.</p>
<div align="middle">
<img src="/assets/images/Figure15.png" width="300" />
<figcaption><b>Figure 15:</b> The kissing number is 12 in three dimensions. </figcaption>
</div>
<p>In fact, there are very few dimensions where we know the kissing number exactly. In most dimensions we only have an upper and lower bound on the kissing number, and these bounds can vary by as much as several thousand spheres!</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th style="text-align: center">Lower Bound</th>
<th style="text-align: right">Upper Bound</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1</strong></td>
<td style="text-align: center"><strong>2</strong></td>
<td style="text-align: right"><strong>2</strong></td>
</tr>
<tr>
<td><strong>2</strong></td>
<td style="text-align: center"><strong>6</strong></td>
<td style="text-align: right"><strong>6</strong></td>
</tr>
<tr>
<td><strong>3</strong></td>
<td style="text-align: center"><strong>12</strong></td>
<td style="text-align: right"><strong>12</strong></td>
</tr>
<tr>
<td><strong>4</strong></td>
<td style="text-align: center"><strong>24</strong></td>
<td style="text-align: right"><strong>24</strong></td>
</tr>
<tr>
<td>5</td>
<td style="text-align: center">40</td>
<td style="text-align: right">44</td>
</tr>
<tr>
<td>6</td>
<td style="text-align: center">72</td>
<td style="text-align: right">78</td>
</tr>
<tr>
<td>7</td>
<td style="text-align: center">126</td>
<td style="text-align: right">134</td>
</tr>
<tr>
<td><strong>8</strong></td>
<td style="text-align: center"><strong>240</strong></td>
<td style="text-align: right"><strong>240</strong></td>
</tr>
<tr>
<td>9</td>
<td style="text-align: center">306</td>
<td style="text-align: right">364</td>
</tr>
<tr>
<td>10</td>
<td style="text-align: center">500</td>
<td style="text-align: right">554</td>
</tr>
<tr>
<td>11</td>
<td style="text-align: center">582</td>
<td style="text-align: right">870</td>
</tr>
<tr>
<td>12</td>
<td style="text-align: center">840</td>
<td style="text-align: right">1357</td>
</tr>
<tr>
<td>13</td>
<td style="text-align: center">1154</td>
<td style="text-align: right">2069</td>
</tr>
<tr>
<td>14</td>
<td style="text-align: center">1606</td>
<td style="text-align: right">3183</td>
</tr>
<tr>
<td>15</td>
<td style="text-align: center">2564</td>
<td style="text-align: right">4866</td>
</tr>
<tr>
<td>16</td>
<td style="text-align: center">4320</td>
<td style="text-align: right">7355</td>
</tr>
<tr>
<td>17</td>
<td style="text-align: center">5346</td>
<td style="text-align: right">11072</td>
</tr>
<tr>
<td>18</td>
<td style="text-align: center">7398</td>
<td style="text-align: right">16572</td>
</tr>
<tr>
<td>19</td>
<td style="text-align: center">10668</td>
<td style="text-align: right">24812</td>
</tr>
<tr>
<td>20</td>
<td style="text-align: center">17400</td>
<td style="text-align: right">36764</td>
</tr>
<tr>
<td>21</td>
<td style="text-align: center">27720</td>
<td style="text-align: right">54584</td>
</tr>
<tr>
<td>22</td>
<td style="text-align: center">49896</td>
<td style="text-align: right">82340</td>
</tr>
<tr>
<td>23</td>
<td style="text-align: center">93150</td>
<td style="text-align: right">124416</td>
</tr>
<tr>
<td><strong>24</strong></td>
<td style="text-align: center"><strong>196560</strong></td>
<td style="text-align: right"><strong>196560</strong></td>
</tr>
</tbody>
</table>
<p>As shown in the table, we only know the kissing number exactly in dimensions one through four, eight, and twenty-four. The eight and twenty-four dimensional cases follow from special lattice structures that are known to give optimal packings. In eight dimensions the kissing number is 240, given by the <a href="https://en.wikipedia.org/wiki/E8_lattice"><script type="math/tex">E_{8}</script> lattice</a>. In twenty-four dimensions the kissing number is 196560, given by the <a href="https://en.wikipedia.org/wiki/Leech_lattice">Leech lattice</a>. And not a single sphere more.</p>
<aside class="notice">
This post accompanies a talk given to high school students through Berkeley Splash. Thus intuition is prioritized over mathematical rigor, language is abused, and details are laboriously spelled out. If you're interested in more rigorous treatments of the presented material, please feel free to contact me. Slides from the talk are available <a href="https://people.eecs.berkeley.edu/~khoury/talks/BerkeleySplash.pdf">here</a>.
</aside>Marc Khourykhoury@eecs.berkeley.eduOn Computable Functions2018-02-25T00:00:00+00:002018-02-25T00:00:00+00:00https://marckhoury.github.io/on-computable-functions<script type="text/javascript" async="" src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>What does it mean to be computable? A function is computable if for a given input its output can be calculated by a finite mechanical procedure. But can we pin this idea down with rigorous mathematics?</p>
<p>In 1928, David Hilbert (see [4]) proposed his famous Entscheidungsproblem, which asks if there is a general procedure for showing that a statement is provable from a given set of axioms. To solve this problem mathematicians first needed to define what it meant to be computable. The first attempt was through primitive recursive functions and was a combined effort by many researchers, including Kurt Gödel, Alonzo Church, Stephen Kleene, Wilhelm Ackermann, John Rosser, and Rózsa Péter.</p>
<h3 id="recursive-functions">Recursive Functions</h3>
<p>Primitive recursive functions are defined as a recursive type, starting with a few functions that we assume are computable, called founders, and operators that construct new functions from the founders, called constructors. The founders are the following three functions:</p>
<ul>
<li><strong>The constant zero function</strong>: a function that always returns zero</li>
<li><strong>The successor function</strong>: <script type="math/tex">S(n) = n+1</script></li>
<li><strong>The projection function</strong>: <script type="math/tex">\text{proj}_{n}^m</script> is an <script type="math/tex">m</script>-ary function that returns the <script type="math/tex">n</script>th argument</li>
</ul>
<p>Computability theory wasn’t going to get very far if these functions weren’t computable. Next, we have two operations for constructing new functions from old: composition and primitive recursion.</p>
<ul>
<li><strong>Composition</strong>: Given a primitive recursive <script type="math/tex">m</script>-ary function <script type="math/tex">h</script> and <script type="math/tex">m</script> <script type="math/tex">n</script>-ary functions <script type="math/tex">g_1,\ldots, g_m</script>, the function <script type="math/tex">f(\textbf{x}) = h(g_1(\textbf{x}),\ldots, g_m(\textbf{x}))</script> is primitive recursive.</li>
<li><strong>Primitive Recursion</strong>: Given primitive recursive functions <script type="math/tex">g,h</script> the function <script type="math/tex">% <![CDATA[
\begin{align*}
\nonumber
f(\textbf{x},0) &= g(\textbf{x})\\
f(\textbf{x}, y+1) &= h(\textbf{x},y,f(\textbf{x},y))
\end{align*} %]]></script> is primitive recursive.</li>
</ul>
<p>The set of primitive recursive functions is the set of functions constructed from our three initial functions and closed under composition and primitive recursion. Many familiar functions are primitive recursive: addition, multiplication, exponentiation, primes, max, min, and the logarithm function all fit the bill.</p>
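<p>The primitive recursion constructor translates directly into code. The following sketch (my own illustration, using a single argument <code>x</code> in place of the vector <b>x</b> for simplicity) builds addition and multiplication exactly as the scheme prescribes:</p>

```python
# Build f from g and h following the primitive recursion scheme:
# f(x, 0) = g(x) and f(x, y+1) = h(x, y, f(x, y)).
def primitive_recursion(g, h):
    def f(x, y):
        acc = g(x)
        for i in range(y):  # unfold f(x, 0), f(x, 1), ..., f(x, y)
            acc = h(x, i, acc)
        return acc
    return f

def successor(n):
    return n + 1

# add(x, 0) = x and add(x, y+1) = S(add(x, y))
add = primitive_recursion(lambda x: x, lambda x, y, r: successor(r))

# mul(x, 0) = 0 and mul(x, y+1) = add(x, mul(x, y))
mul = primitive_recursion(lambda x: 0, lambda x, y, r: add(x, r))
```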
<p>So are we done? Is every computable function also primitive recursive? Sadly, no: the Ackermann function would be proven in 1928 to be a counterexample.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\nonumber
A(m,n) =
\begin{cases}
n+1 & \text{if } m = 0 \\
A(m-1,1) & \text{if } m > 0 \text{ and } n = 0\\
A(m-1,A(m,n-1)) & \text{if } m > 0 \text{ and } n > 0
\end{cases}
\end{equation} %]]></script>
<p>The Ackermann function is a total (defined for all inputs) function that is clearly computable but not primitive recursive. Indeed, in 1928 Ackermann (see [1]) showed that his function bounds every primitive recursive function: it grows too fast to be primitive recursive.</p>
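<p>The definition transcribes directly into code, and even tiny inputs make the explosive growth apparent (a sketch of mine; the memoization only delays the inevitable):</p>

```python
from functools import lru_cache
import sys

sys.setrecursionlimit(100000)  # the recursion is deep even for small inputs

# The Ackermann function, exactly as defined above.
@lru_cache(maxsize=None)
def ackermann(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

# ackermann(3, 3) is already 61, and ackermann(4, 2) has 19729 digits.
```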
<p>Something was clearly wrong, but early computability theorists didn’t want to abandon primitive recursive functions entirely. What came next was a rather surprising idea at the time: perhaps computable functions need not be total! This was the key that unlocked computability theory: focusing on partial functions, those that may not be defined on all possible inputs.</p>
<p>The reason for focusing on partial functions is to allow an unbounded search operator. That is, we want to be able to search for the least input value that satisfies a condition and simply be undefined if no such input value exists. This operation is captured by Kleene’s <script type="math/tex">\mu</script>-operator.</p>
<ul>
<li><script type="math/tex">\mu</script>-<strong>recursion</strong>: <script type="math/tex">f(x) = (\mu y)(g(x,y) = 0)</script> returns the least <script type="math/tex">y</script> such that <script type="math/tex">g(x,y) = 0</script> and is undefined if no such <script type="math/tex">y</script> exists. The function <script type="math/tex">g(x,y')</script> must be defined for all <script type="math/tex">% <![CDATA[
y' < y %]]></script>.</li>
</ul>
<p>Taking the closure of the <script type="math/tex">\mu</script>-operator with all primitive recursive functions gives the class of <script type="math/tex">\mu</script>-recursive functions. In 1943, Kleene (see [5]) used his <script type="math/tex">\mu</script>-operator to provide an alternative, but equivalent, definition of general recursive functions. The original definition was given by Gödel in 1934 (see [3]), based on an observation by Jacques Herbrand. It would later be shown that the <script type="math/tex">\mu</script>-recursive functions are exactly the class of functions defined by the two competing approaches (see [6]).</p>
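<p>In code, the <code>mu</code>-operator is just an unbounded search loop, and partiality shows up as non-termination (a sketch of mine, not from the original essay):</p>

```python
# Kleene's mu-operator: return the least y with g(x, y) == 0.
# If no such y exists, the loop never terminates -- f is undefined at x,
# which is exactly why mu-recursive functions are partial.
def mu(g):
    def f(x):
        y = 0
        while g(x, y) != 0:
            y += 1
        return y
    return f

# Example: the integer square root as an unbounded search,
# isqrt(x) = least y such that (y + 1)^2 > x.
isqrt = mu(lambda x, y: 0 if (y + 1) ** 2 > x else 1)
```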
<h3 id="lambda-calculus"><script type="math/tex">\lambda</script>-Calculus</h3>
<p>Simultaneously, from 1931-1934, Church and Kleene were developing <script type="math/tex">\lambda</script>-calculus as an approach to computable functions. The syntax of <script type="math/tex">\lambda</script>-calculus defines certain expressions as valid statements, which are called <script type="math/tex">\lambda</script>-terms. A <script type="math/tex">\lambda</script>-term is built up from a collection of variables and two operators: abstraction and application.</p>
<p>Let’s start with a collection of variables <script type="math/tex">x,y,z,\ldots</script> and suppose <script type="math/tex">M, N</script> are valid <script type="math/tex">\lambda</script>-terms. The abstraction operator creates the term <script type="math/tex">\lambda x. M</script>, which is a function taking an argument <script type="math/tex">x</script> and returning <script type="math/tex">M</script> with each occurrence of <script type="math/tex">x</script> replaced with the argument. The application operator creates the term <script type="math/tex">M N</script>, which represents the application of a function <script type="math/tex">M</script> on input <script type="math/tex">N</script>.</p>
<p>The <script type="math/tex">\lambda</script>-term <script type="math/tex">\lambda x.M</script> represents a function <script type="math/tex">f(x) = M</script> and - like recursive functions - many familiar functions are <script type="math/tex">\lambda</script>-definable. The <script type="math/tex">\alpha</script>-conversion and <script type="math/tex">\beta</script>-reduction are classic examples of <em>reductions</em>, which describe how <script type="math/tex">\lambda</script>-terms are evaluated. An <script type="math/tex">\alpha</script>-conversion captures the notion that the name of an argument is usually immaterial. For instance <script type="math/tex">\lambda x.x</script> and <script type="math/tex">\lambda y.y</script> both represent the identity function and are <script type="math/tex">\alpha</script>-equivalent. A <script type="math/tex">\beta</script>-reduction applies a function to its arguments. Take, as an example, the <script type="math/tex">\lambda</script>-term <script type="math/tex">(\lambda x.x)y</script>, which represents the identity function <script type="math/tex">(\lambda x.x)</script> applied to the input <script type="math/tex">y</script>. Substituting the argument <script type="math/tex">y</script> for the parameter <script type="math/tex">x</script>, the result of the function is <script type="math/tex">y</script>. So we say <script type="math/tex">(\lambda x.x)y</script> <script type="math/tex">\beta</script>-reduces to <script type="math/tex">y</script>.</p>
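<p>Python's own lambdas are enough to play with these ideas. The sketch below (an illustration of mine, not part of the original essay) encodes the Church numerals using only abstraction and application:</p>

```python
# Church numerals: the number n is encoded as the function that applies
# its argument f to x a total of n times.
ZERO = lambda f: lambda x: x
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))

def to_int(n):
    # Read a numeral back by applying "add one" to 0.
    return n(lambda k: k + 1)(0)

TWO = SUCC(SUCC(ZERO))
# Applying TWO is beta-reduction: TWO(f)(x) evaluates to f(f(x)).
```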
<p>In 1934 Church proposed that the term “effectively calculable” be identified with <script type="math/tex">\lambda</script>-definable. While Church’s formalization of computability would later be shown to be equivalent to Turing’s, Gödel was dissatisfied with Church’s work. In fairness, Gödel was also dissatisfied with his own work! Church would go on to advocate that “effectively calculable” should be identified with general recursive functions (which Gödel still rejected). In 1936 Church (see [2]) published his work proving that the Entscheidungsproblem was undecidable: there is no general procedure for determining if a statement is provable from a given set of axioms.</p>
<h3 id="turing-machines">Turing Machines</h3>
<p>Meanwhile, after hearing about Hilbert’s Entscheidungsproblem, a 22-year-old Cambridge student named Alan Turing began working on his own solution to the problem. Turing was unaware of Church’s work at the time, so his approach wasn’t influenced by <script type="math/tex">\lambda</script>-expressions (this wasn’t the first time Turing failed to perform a literature review). Instead, he envisioned an idealized human agent performing a computation, which he called a “computer”. To avoid confusion with the modern definition of computer, we’ll adopt the terminology of Robin Gandy and Wilfried Sieg and use the term “computor” to refer to an idealized human agent. The computor had infinite available memory in the form of a tape, essentially an infinite strip of paper, that was divided into squares. The computor could read and write to a square, as well as move from one square to another.</p>
<p>Turing put several conditions on the computation that the computor could perform. The computor could only have finitely many states (of mind) and the tape could only hold symbols from a finite alphabet. Only a finite number of squares could be observed at a time and the computor could only move to a new square that was at most some finite distance away from an observed square. He also required that any operation must depend only on the current state and the observed symbols, and that there was at most one operation that could be performed per action (his machines were deterministic).</p>
<p>From this, Turing would go on to define his automatic machines - which would later come to be known as Turing machines - and show the equivalence of the two formalizations. He’d then show that “effectively calculable” implied computable by his idealized human agent, which in turn implied computable by such a machine. Turing then went on to show that the Entscheidungsproblem was undecidable. Shortly before publishing his work, he learned that Church had already shown that the Entscheidungsproblem was undecidable using <script type="math/tex">\lambda</script>-calculus. Turing quickly submitted his work in 1936 (see [7]) - six months after Church - along with a proof demonstrating the equivalence between his machines and <script type="math/tex">\lambda</script>-calculus.</p>
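<p>A machine in Turing's sense needs surprisingly little code. The sketch below (my own minimal rendering, not Turing's original formulation) runs a transition table on a tape, and the example machine computes the successor of a number written in unary:</p>

```python
# A minimal deterministic Turing machine: the transition table maps
# (state, symbol) -> (symbol to write, move L or R, next state).
def run(tape, transitions, state="start", blank="_"):
    cells = dict(enumerate(tape))  # sparse tape, unbounded in both directions
    pos = 0
    while state != "halt":
        symbol = cells.get(pos, blank)
        write, move, state = transitions[(state, symbol)]
        cells[pos] = write
        pos += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# Successor in unary: scan right past the 1s, append one more 1, then halt.
successor = {
    ("start", "1"): ("1", "R", "start"),
    ("start", "_"): ("1", "R", "halt"),
}
```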
<p>After reading Turing’s seminal paper, Gödel was finally convinced that the correct notion of computability had been determined. It would later be shown that all three formalizations - Turing machines, <script type="math/tex">\mu</script>-recursion, and <script type="math/tex">\lambda</script>-calculus - actually define the same class of functions. That these three approaches all yielded the same class of functions suggested that mathematicians had captured the correct notion of computation, and supported what would come to be known as the Church-Turing Thesis.</p>
<p>Three years later, in 1939, Turing completed his Ph.D. at Princeton under the supervision of Church. In his thesis he’d state the following (see [8]): “We shall use the expression ‘computable function’ to mean a function calculable by a machine, and let ‘effectively calculable’ refer to the intuitive idea without particular identification with any one of these definitions.”</p>
<blockquote>
<p><strong>Church-Turing Thesis</strong>: Every effectively calculable function is a computable function.</p>
</blockquote>
<p>Church intended for his original thesis to be taken as a definition of what is computable. Likewise, even though he never stated it, Turing had the same intention. In fact, the term “Church’s Thesis” was coined by Kleene many years after Church had published his work. These days, many people take the Church-Turing Thesis as a definition of what is computable; less formally stating that a function is computable if and only if it can be computed by a Turing machine.</p>
<p>It’s important to stress that the Church-Turing Thesis is not a definition as many believe. It does not refer to any particular formalization that we’ve discussed and is not a statement that can be formally proven. It is a statement about the nature of computation. Everything that is “effectively calculable”, in the vague and intuitive sense, is a computable function.</p>
<h3 id="references">References</h3>
<ol>
<li>Wilhelm Ackermann; 1928; Zum hilbertschen aufbau der reellen zahlen; Mathematische Annalen, 99(1): 118–133.</li>
<li>Alonzo Church; 1936; An unsolvable problem of elementary number theory; American Journal of Mathematics; 58(2): 345–363.</li>
<li>Kurt Gödel; 1934; On Undecidable Propositions of Formal Mathematics Systems; Institute for Advanced Study.</li>
<li>David Hilbert; 1900; Mathematical problems; International Congress of Mathematicians.</li>
<li>Stephen C. Kleene; 1943; Recursive predicates and quantifiers; AMS; 53(1): 41-73; http://www.jstor.org/stable/1990131.</li>
<li>Stephen C. Kleene; 1952; Introduction to metamathematics; North-Holland Publishing Company.</li>
<li>Alan M. Turing; 1936; On computable numbers, with an application to the Entscheidungsproblem; Proceedings of the London Mathematical Society; 2(42).</li>
<li>Alan M. Turing; 1939; Systems of logic based on ordinals; Proceedings of the London Mathematical Society; 2(1): 161–228.</li>
</ol>
<aside class="notice">
This post was published in Eureka, a journal published annually by The Archimedeans, the Mathematical Society of Cambridge University. The published version can be found <a href="http://www.cs.berkeley.edu/~khoury/computable.pdf">here</a>.
</aside>Marc Khourykhoury@eecs.berkeley.edu