Iterative Linear Solvers

$ % macros % MathJax \newcommand{\bigtimes}{\mathop{\vcenter{\hbox{$\Huge\times\normalsize$}}}} % prefix with \! and postfix with \!\! % sized grouping symbols \renewcommand {\aa} [1] {\left\langle {#1} \right\rangle} % <> angle brackets \newcommand {\bb} [1] {\left[ {#1} \right]} % [] brackets \newcommand {\cc} [1] {\left\{ {#1} \right\}} % {} curly braces \newcommand {\mm} [1] {\lVert {#1} \rVert} % || || double norm bars \newcommand {\nn} [1] {\lvert {#1} \rvert} % || norm bars \newcommand {\pp} [1] {\left( {#1} \right)} % () parentheses % unit \newcommand {\unit} [1] {\bb{\mathrm{#1}}} % measurement unit % math \newcommand {\fn} [1] {\mathrm{#1}} % function name % sets \newcommand {\setZ} {\mathbb{Z}} \newcommand {\setQ} {\mathbb{Q}} \newcommand {\setR} {\mathbb{R}} \newcommand {\setC} {\mathbb{C}} % arithmetic \newcommand {\q} [2] {\frac{#1}{#2}} % quotient, because fuck \frac % trig \newcommand {\acos} {\mathrm{acos}} % \mathrm{???}^{-1} is misleading \newcommand {\asin} {\mathrm{asin}} \newcommand {\atan} {\mathrm{atan}} \newcommand {\atantwo} {\mathrm{atan2}} % at angle = atan2(y, x) \newcommand {\asec} {\mathrm{asec}} \newcommand {\acsc} {\mathrm{acsc}} \newcommand {\acot} {\mathrm{acot}} % complex numbers \newcommand {\z} [1] {\tilde{#1}} \newcommand {\conj} [1] {{#1}^\ast} \renewcommand {\Re} {\mathfrak{Re}} % real part \renewcommand {\Im} {\mathrm{I}\mathfrak{m}} % imaginary part % quaternions \newcommand {\quat} [1] {\tilde{\mathbf{#1}}} % quaternion symbol \newcommand {\uquat} [1] {\check{\mathbf{#1}}} % versor symbol \newcommand {\gquat} [1] {\tilde{\boldsymbol{#1}}} % greek quaternion symbol \newcommand {\guquat}[1] {\check{\boldsymbol{#1}}} % greek versor symbol % vectors \renewcommand {\vec} [1] {\mathbf{#1}} % vector symbol \newcommand {\uvec} [1] {\hat{\mathbf{#1}}} % unit vector symbol \newcommand {\gvec} [1] {\boldsymbol{#1}} % greek vector symbol \newcommand {\guvec} [1] {\hat{\boldsymbol{#1}}} % greek unit vector symbol % special math vectors \renewcommand {\r} {\vec{r}} % r vector [m] \newcommand {\R} {\vec{R}} % R = r - r' difference vector [m] \newcommand {\ur} {\uvec{r}} % r unit vector [#] \newcommand {\uR} {\uvec{R}} % R unit vector [#] \newcommand {\ux} {\uvec{x}} % x unit vector [#] \newcommand {\uy} {\uvec{y}} % y unit vector [#] \newcommand {\uz} {\uvec{z}} % z unit vector [#] \newcommand {\urho} {\guvec{\rho}} % rho unit vector [#] \newcommand {\utheta} {\guvec{\theta}} % theta unit vector [#] \newcommand {\uphi} {\guvec{\phi}} % phi unit vector [#] \newcommand {\un} {\uvec{n}} % unit normal vector [#] % vector operations \newcommand {\inner} [2] {\left\langle {#1} , {#2} \right\rangle} % <,> \newcommand {\outer} [2] {{#1} \otimes {#2}} \newcommand {\norm} [1] {\mm{#1}} \renewcommand {\dot} {\cdot} % dot product \newcommand {\cross} {\times} % cross product % matrices \newcommand {\mat} [1] {\mathbf{#1}} % matrix symbol \newcommand {\gmat} [1] {\boldsymbol{#1}} % greek matrix symbol % ordinary derivatives \newcommand {\od} [2] {\q{d #1}{d #2}} % ordinary derivative \newcommand {\odn} [3] {\q{d^{#3}{#1}}{d{#2}^{#3}}} % nth order od \newcommand {\odt} [1] {\q{d{#1}}{dt}} % time od % partial derivatives \newcommand {\de} {\partial} % partial symbol \newcommand {\pd} [2] {\q{\de{#1}}{\de{#2}}} % partial derivative \newcommand {\pdn} [3] {\q{\de^{#3}{#1}}{\de{#2}^{#3}}} % nth order pd \newcommand {\pdd} [3] {\q{\de^2{#1}}{\de{#2}\de{#3}}} % 2nd order mixed pd \newcommand {\pdt} [1] {\q{\de{#1}}{\de{t}}} % time pd % vector derivatives \newcommand {\del} {\nabla} % del \newcommand {\grad} {\del} % gradient \renewcommand {\div} {\del\dot} % divergence \newcommand {\curl} {\del\cross} % curl % differential vectors \newcommand {\dL} {d\vec{L}} % differential vector length [m] \newcommand {\dS} {d\vec{S}} % differential vector surface area [m^2] % special functions \newcommand {\Hn} [2] {H^{(#1)}_{#2}} % nth order Hankel function \newcommand {\hn} [2] {H^{(#1)}_{#2}} % nth order spherical Hankel function % transforms \newcommand {\FT} {\mathcal{F}} % fourier transform \newcommand {\IFT} {\FT^{-1}} % inverse fourier transform % signal processing \newcommand {\conv} [2] {{#1}\ast{#2}} % convolution \newcommand {\corr} [2] {{#1}\star{#2}} % correlation % abstract algebra \newcommand {\lie} [1] {\mathfrak{#1}} % lie algebra % other \renewcommand {\d} {\delta} % optimization %\DeclareMathOperator* {\argmin} {arg\,min} %\DeclareMathOperator* {\argmax} {arg\,max} \newcommand {\argmin} {\fn{arg\,min}} \newcommand {\argmax} {\fn{arg\,max}} % waves \renewcommand {\l} {\lambda} % wavelength [m] \renewcommand {\k} {\vec{k}} % wavevector [rad/m] \newcommand {\uk} {\uvec{k}} % unit wavevector [#] \newcommand {\w} {\omega} % angular frequency [rad/s] \renewcommand {\TH} {e^{j \w t}} % engineering time-harmonic function [#] % classical mechanics \newcommand {\F} {\vec{F}} % force [N] \newcommand {\p} {\vec{p}} % momentum [kg m/s] % \r % position [m], aliased \renewcommand {\v} {\vec{v}} % velocity vector [m/s] \renewcommand {\a} {\vec{a}} % acceleration [m/s^2] \newcommand {\vGamma} {\gvec{\Gamma}} % torque [N m] \renewcommand {\L} {\vec{L}} % angular momentum [kg m^2 / s] \newcommand {\mI} {\mat{I}} % moment of inertia tensor [kg m^2/rad] \newcommand {\vw} {\gvec{\omega}} % angular velocity vector [rad/s] \newcommand {\valpha} {\gvec{\alpha}} % angular acceleration vector [rad/s^2] % electromagnetics % fields \newcommand {\E} {\vec{E}} % electric field intensity [V/m] \renewcommand {\H} {\vec{H}} % magnetic field intensity [A/m] \newcommand {\D} {\vec{D}} % electric flux density [C/m^2] \newcommand {\B} {\vec{B}} % magnetic flux density [Wb/m^2] % potentials \newcommand {\A} {\vec{A}} % vector potential [Wb/m], [C/m] % \F % electric vector potential [C/m], aliased % sources \newcommand {\I} {\vec{I}} % line current density [A] , [V] \newcommand {\J} {\vec{J}} % surface current density [A/m] , [V/m] \newcommand {\K} {\vec{K}} % volume current density [A/m^2], [V/m^2] % \M % magnetic current [V/m^2], aliased, obsolete % materials \newcommand {\ep} {\epsilon} % permittivity [F/m] % \mu % permeability [H/m], aliased \renewcommand {\P} {\vec{P}} % polarization [C/m^2], [Wb/m^2] % \p % electric dipole moment [C m], aliased \newcommand {\M} {\vec{M}} % magnetization [A/m], [V/m] \newcommand {\m} {\vec{m}} % magnetic dipole moment [A m^2] % power \renewcommand {\S} {\vec{S}} % poynting vector [W/m^2] \newcommand {\Sa} {\aa{\vec{S}}_t} % time averaged poynting vector [W/m^2] % quantum mechanics \newcommand {\bra} [1] {\left\langle {#1} \right|} % <| \newcommand {\ket} [1] {\left| {#1} \right\rangle} % |> \newcommand {\braket} [2] {\left\langle {#1} \middle| {#2} \right\rangle} $

Table of Contents

Iterative Methods

The previous section demonstrated how to cast an ill-posed inverse problem into a well-posed inverse problem using complex least squares with regularization. However no attempt was made to solve the modified problem. Gauss-Jordon elimination could be used in principle, but has a time complexity of $O_t(N^3)$[], which is computationally prohibitive for very large sensing matrices. Iterative solvers become the only practical way to solve very large linear systems. Three iterative solvers are developed in this section; gradient descent on quadratic forms, conjugate gradient, and GMRES.

A large number of symbols are defined in this section. A summary is found in the following table:

Iterative Linear Solver Symbols
Symbol Name
$f$ cost function
$q$ quadratic form, special case of $f$
$\vec{x}$, $\vec{v}$, $\vec{w}$ arbitrary vectors
$\vec{x}_s$ solution vector
$\vec{x}_i$ iterate vector
$\vec{x}_0$ initial guess vector
$\vec{e}_i$ error vector
$\vec{d}_i$ search direction vector
$\vec{D}_i$ search direction vector augmented matrix
$\alpha_i$ search scale
$\mat{A}$ system matrix
$\vec{b}$ data vector
$\vec{r}_i$ residual vector

Iterative Optimization

Consider the general problem of optimizing a continuous real-valued function of complex variables $f : \setC^n \rightarrow \setR$. A solution vector $\vec{x}_s \in \setC^n$ at an extreme value of $f$ will necessarily be a critical point with zero CR gradient $$ \grad f(\vec{x}_s) = \vec{0} . $$ In general, iterative optimization generates a sequence of approximate solutions called iterate vectors $\vec{x}_i \in \setC^n$, starting with an initial guess vector $\vec{x}_0 \in \setC^n$. Each iteration the iterate vector $\vec{x}_i$ is updated by adding a search direction vector $\vec{d}_i \in \setC^n$ multiplied by a search scale $\alpha_i \in \setC$ $$ \label{eqn:iterate} \boxed{ \vec{x}_{i+1} = \vec{x}_i + \vec{d}_i \alpha_i } . $$ Algorithms have different strategies for choosing $\vec{d}_i$ and $\alpha_i$, which affects convergence. The recurrence relation can be fully expand from the initial guess as the summation $$ \label{eqn:iterate_sum} \vec{x}_i = \vec{x}_0 + \sum_{j=1}^{i-1} \vec{d}_j \alpha_j . $$

A statement can already be made about the nature of iterative optimization without supplying more details. Suppose on the $i$-th iteration a search direction vector $\vec{d}_i$ is chosen by an unspecified algorithm. A general condition between $\vec{d}_i$ and the gradient at the next iterate $\vec{x}_{i+1}$ is found by optimizing $f$ with respect to $\alpha_i$ $$ \label{eqn:optimum_condition} \pd{}{\alpha_i} f(\vec{x}_{i+1}) = \pd{f(\vec{x}_{i+1})}{\vec{x}_{i+1}} \pd{\vec{x}_{i+1}}{\alpha_i} + \pd{f(\vec{x}_{i+1})}{\conj{\vec{x}_{i+1}}} \pd{\conj{\vec{x}_{i+1}}}{\alpha_i} = \boxed{ \nabla f(\vec{x}_{i+1})^H \vec{d}_i = 0 } $$ where the CR chain rule has been applied and the second term eliminated because $\partial / \partial \alpha_i [\conj{\vec{x}}_i + \conj{\vec{d}}_i \conj{\alpha_i}] = \vec{0}$. Geometrically this equation says follow a search direction vector until the gradient is orthogonal when updating an iterate. At such a point $f$ is not changing along the search direction vector, and so a new search direction should be chosen to further optimize $f$.

Iterate vectors are approximate solutions, so an error vector $\vec{e}_i \in \setC^n$ can be defined as $$ \label{eqn:error} \boxed{\vec{e}_i \equiv \vec{x}_i - \vec{x}_s} $$ which is like the residual but for the domain of $f$. Error is typically unknown because it is defined in terms of a solution vector $\vec{x}_s$. Despite this limitation the concept is theoretically useful. The definition for the error vector can be used to rewrite the iterate recurrence relation Eq. $\eqref{eqn:iterate}$ as an error recurrence relation $$ \vec{e}_{i+1} = \vec{e}_i + \vec{d}_i \alpha_i $$ and the iterate summation Eq. $\eqref{eqn:iterate_sum}$ as an error summation $$ \label{eqn:error_sum} \vec{e}_i = \vec{e}_0 + \sum_{j=1}^{i-1} \vec{d}_j \alpha_j . $$

Quadratic Form

Application of complex least squares with $L^2$ regularization results in a linear system with a conjugate-symmetric positive-definite matrix. These properties can be used to design efficient gradient-based iterative minimization algorithms by considering the complex least squares cost function for the modified problem. Therefore consider the linear system $$ \mat{A} \vec{x} = \vec{b} $$ where the matrix $\mat{A} \in \setC^{n \times n}$ is conjugate-symmetric $$ \mat{A} = \mat{A}^H $$ and positive-definite $$ \aa{\vec{x}, \mat{A} \vec{x}} \geq 0 \quad\text{and}\quad \aa{\vec{x}, \mat{A} \vec{x}} = 0 \Leftrightarrow \vec{x} = \vec{0} $$ and thus invertible. The complex least squares cost function for invertible systems is $$ \begin{split} f(\vec{x}) &= \norm{\vec{b} - \mat{A} \vec{x}}^2 \\ &= \aa{\vec{b} - \mat{A} \vec{x}, \vec{b} - \mat{A} \vec{x}} \\ &= \aa{\vec{b}, \vec{b}} - \aa{\vec{b}, \mat{A} \vec{x}} - \aa{\mat{A} \vec{x}, \vec{b}} + \aa{\mat{A} \vec{x}, \mat{A} \vec{x}} \\ &= \norm{\vec{b}}^2 - 2 \Re(\vec{b}^H \mat{A} \vec{x}) + \vec{x}^H \mat{A}^H \mat{A} \vec{x} \\ &= c' - 2 \Re(\vec{b}^{\prime H} \vec{x}) + \vec{x}^H \mat{A}' \vec{x} \end{split} $$ where $$ c' = \norm{\vec{b}}^2 ,\quad \vec{b}' = \mat{A}^H \vec{b} ,\quad \mat{A}' = \mat{A}^H \mat{A} . $$ Note the cost function does not need regularization because matrix $\mat{A}$ is invertible. Matrix $\mat{A}'$ is both conjugate-symmetric $$ \mat{A}^{\prime H} = [\mat{A}^H \mat{A}]^H = \mat{A}^H \mat{A} = \mat{A}' $$ and positive-definite $$ \aa{\vec{x}, \mat{A}' \vec{x}} = \aa{\vec{x}, \mat{A}^H \mat{A} \vec{x}} = \aa{\mat{A} \vec{x}, \mat{A} \vec{x}} = \norm{\mat{A} \vec{x}}^2 \geq 0 . $$ There is an opportunity to be clever when $\mat{A}$ shares these properties. Define the expanded form of cost function $f$ as the quadratic form $q : \setC^n \rightarrow \setR$ $$ q(\vec{x}) \equiv \vec{x}^T \mat{A} \vec{x} - 2 \Re(\vec{b}^H \vec{x}) + c $$ where the primes have been suppressed for reasons that will become apparent in a moment. This is the most general way to write a real-valued quadratic equation of complex variables. Take the CR gradient of $q$ $$ \grad q(\vec{x}) = \grad \bb{\vec{x}^T \mat{A} \vec{x} - 2 \Re(\vec{b}^H \vec{x}) + c} $$ convert to summation notation to evaluate $$ \begin{split} \conj{\bb{\pd{q}{x_k}}} &= \conj{\bb{\pd{}{x_k} \bb{\sum_j \conj{x}_j \sum_i A_{ji} x_i - \sum_j \conj{b}_j x_j - \sum_j \conj{x}_j b_j + c}}} \\ &= \conj{\bb{\sum_j \sum_i A_{ji} \pd{}{x_k} \bb{\conj{x}_j x_i} - \sum_j \conj{b}_j \pd{x_j}{x_k} - \sum_j \pd{\conj{x}_j}{x_k} b_j + \pd{c}{x_k}}} \\ &= \conj{\bb{\sum_j \sum_i A_{ji} \bb{\pd{\conj{x}_j}{x_k} x_i + \conj{x}_j \pd{x_i}{x_k}} - \sum_j \conj{b}_j \delta_{jk} - \sum_j 0 b_j + 0}} \\ &= \conj{\bb{\sum_j \sum_i A_{ji} \bb{0 \,x_i + \conj{x}_j \delta_{ik}} - \conj{b}_k}} \\ &= \conj{\bb{\sum_j \bb{\sum_i A_{ji} \delta_{ik}} \conj{x}_j - \conj{b}_k}} \\ &= \conj{\bb{\sum_j A_{jk} \conj{x}_j - \conj{b}_k}} \\ &= \sum_j \conj{A}_{jk} x_j - b_k \\ &= \sum_j A_{kj} x_j - b_k \quad \text{(conjugate-symmetric)} \end{split} $$ and convert back again to matrix notation to be concise $$ \grad q(\vec{x}) = \mat{A} \vec{x} - \vec{b} . $$ Optimize $q$ by setting the gradient to zero $$ \grad q(\vec{x}_s) = \mat{A} \vec{x}_s - \vec{b} = \vec{0} $$ to derive the original system to be solved $$ \mat{A} \vec{x}_s = \vec{b} . $$ This is a remarkable result. Evidently the gradient of the quadratic form is the (negative) residual of the linear system to be solved, and when $q$ is optimized the system is solved. Since every conjugate-symmetric positive-definite matrix has a unique Cholesky decomposition $\mat{A} = \mat{L} \mat{L}^H$, where $\mat{L} \in \setC^{n \times n}$ is a lower-triangular conjugate-symmetric positive-definite matrix [], the complex least squares problem can be replaced with the equivalent problem of minimizing the quadratic form using matrix $\mat{A}$, data vector $\vec{b}$, and constant $c \in \setR$. It is now shown $q$ has a unique global minimum at $\vec{x}_s$. Consider evaluating $q$ at some solution vector $\vec{x}_s$ offset by non-zero error vector $\vec{e} \in \setC^n - \cc{0}$ $$ \begin{split} &= q(\vec{x}_s + \vec{e}) \\ &= [\vec{x}_s + \vec{e}]^H \mat{A} [\vec{x}_s + \vec{e}] - 2 \Re(\vec{b}^H [\vec{x}_s + \vec{e}]) + c \\ &= \vec{x}_s^H \mat{A} \vec{x}_s + \vec{x}_s^H \mat{A} \vec{e} + \vec{e}^H \mat{A} \vec{x}_s + \vec{e}^H \mat{A} \vec{e} - 2 \Re(\vec{b}^H \vec{x}_s) - 2\Re(\vec{b}^H \vec{e}) + c \\ &= [\vec{x}_s^H \mat{A} \vec{x}_s - 2 \Re(\vec{b}^H \vec{x}_s) + c] + \vec{e}^H \mat{A} \vec{e} + [\vec{x}_s^H \mat{A} \vec{e} + \vec{e}^H \mat{A} \vec{x}_s] - 2\Re(\vec{b}^H \vec{e}) \\ &= q(\vec{x}_s) + \vec{e}^H \mat{A} \vec{e} + 2 \Re(\vec{x}_s^H \mat{A} \vec{e}) - 2 \Re(\vec{b}^H \vec{e}) \\ &= q(\vec{x}_s) + \vec{e}^H \mat{A} \vec{e} \\ &> q(\vec{x}_s) \quad\text{for}\quad \vec{e} \neq \vec{0} \quad\text{(positive-definite)} \end{split} $$ therefore $q(\vec{x}_s)$ is a global minimum.

Due to the close connection between the gradient of a quadratic form and the residual of the associated linear system, it is helpful to use the definition of residual Eq. $\eqref{eqn:residual}$ to rewrite the iterate recurrence relation Eq. $\eqref{eqn:iterate}$ as a residual recurrence relation $$ \label{eqn:residual_recurrence} \boxed{ \vec{r}_{i+1} = \vec{r}_i - \mat{A} \vec{d}_i \alpha_i } $$ and the iterate summation Eq. \eqref{eqn:iterate_sum} as a residual summation $$ \label{eqn:residual_sum} \vec{r}_i = \vec{r}_0 - \sum_{j=1}^{i-1} \mat{A} \vec{d}_j \alpha_j . $$

Summarizing the connections between various quantities, it is found $$ \boxed{ \vec{r}_i = \vec{b} - \mat{A} \vec{x}_i = - \mat{A} \vec{e}_i = - \nabla q(\vec{x}_i) } . $$ With an analytic gradient for $q$, iterative optimization condition Eq. $\eqref{eqn:optimum_condition}$ can be evaluated $$ \pd{}{\alpha_i} q(\vec{x}_{i+1}) = \grad q(\vec{x}_{i+1})^H \vec{d}_i = -\vec{r}_{i+1}^H \vec{d}_i = -[\vec{r}_i - \mat{A} \vec{d}_i \alpha_i]^H \vec{d}_i = 0 $$ and the optimal search scale $\alpha_i$ is found to be $$ \label{eqn:search_scale} \boxed{ \alpha_i = \q{\vec{d}_i^H \vec{r}_i}{\vec{d}_i^H \mat{A} \vec{d}_i} } . $$ The only unaddressed detail for iterative optimization of a quadratic form is the choice of search direction $\vec{d}_i$. Two algorithms will be discussed; gradient descent, and the superior conjugate gradient method.

Gradient Descent

Gradient descent is a general and intuitive iterative minimization algorithm. The gradient of a function points in the direction of greatest increase, so choose as the $i$-th search direction vector $\vec{d}_i$ the negative gradient at the current iterate $\vec{x}_i$ $$ \boxed{ \vec{d}_i = - \nabla f(\vec{x}_i) } . $$ In the case of a quadratic form, the gradient is available as an easy to compute residual. Adding to the list of equivalences, it is found $$ \label{eqn:gd_relations} \vec{d}_i = - \nabla q (\vec{x}_i) = \vec{r}_i = \vec{b} - \mat{A} \vec{x}_i = - \mat{A} \vec{e}_i . $$

Gradient descent on a quadratic form is outlined in Algorithm $\ref{alg:gd}$. Given are a conjugate-symmetric positive-definite matrix $\mat{A} \in \setC^{n \times n}$, data vector $\vec{b} \in \setC^n$, initial guess vector $\vec{x}_0 \in \setC^n$, and tolerance $\epsilon \in \setR^+$. First compute the the initial residual $\vec{r}_0$. Then iterate until a maximum number of iterations $i_{max} \in \setZ^{+}$ is reached, or the residual norm is less than or equal to the tolerance. Each iteration the search scale is computed with Eq. $\eqref{eqn:search_scale}$, then the iterate is updated with Eq. $\eqref{eqn:iterate}$, and then the residual is updated with Eq. $\eqref{eqn:residual_recurrence}$.

// Gradient Descent on Quadratic Form
gd($$mat{A}$, $\vec{x_0}$ = $\vec{0}$) {
	$\vec{x}_1 = \vec{x_0}$
	$\vec{r}_1 = \vec{b} - \mat{A} \vec{x}_1$
	for ($i \in [1, i_{max}]$) {
		// tolerance test
		if ($\mm{\vec{r}_{i}}^2 \leq \epsilon^2$) {
		// compute search direction vector
		$\vec{d}_i = \vec{r}_i$
		// update
		$\alpha_i = \vec{d}_i^H \vec{r}_i / [\vec{d}_i^H \mat{A} \vec{d}_i]$
		$\vec{x}_{i+1} = \vec{x}_i + \vec{d}_i \alpha_i$
		$\vec{r}_{i+1} = \vec{r}_i - \mat{A} \vec{d}_i \alpha_i$
	return $\vec{x}_i$

The computational complexity of this algorithm can be analyzed as a function of vectorspace dimension $N \in \setZ^+$. Assume a single thread of execution. The space complexity is $O(N)$ per run, as only a small fixed number of vectors must be stored (the input is taken for granted). The time complexity can be analyzed per iteration and run. Each iteration is dominated by the matrix-vector multiplication $\mat{A} \vec{d}_i$ and so has time complexity $O(N^2)$. The time complexity per run is harder to assess[].

Gradient descent is easy to understand but should not be used for minimizing a quadratic form. Its main failing is poor convergence. Iterates tend to oscillate between subspaces given enough iterations. More advanced iterative solvers address this issue directly and to great effect by constructing solutions in orthogonal bases.

Conjugate Gradient

Continuing the discussion on the quadratic form $q$, the optimal search scale Eq. $\eqref{eqn:search_scale}$ can be written in terms of error using equation $\eqref{eqn:gd_relations}$ $$ \label{eqn:cg_search_scale} \alpha_i %= \q{\vec{d}_i^H \vec{r}_i}{\vec{d}_i^H \mat{A} \vec{d}_i} = -\q{\vec{d}_i^H \mat{A} \vec{e}_i}{\vec{d}_i^H \mat{A} \vec{d}_i} $$ and inserted into the error recurrence relation Eq. $\eqref{eqn:error_sum}$ $$ \vec{e}_i = \vec{e}_0 - \sum_{j=1}^{i-1} \vec{d}_j \q{\vec{d}_j^H \mat{A} \vec{e}_j}{\vec{d}_j^H \mat{A} \vec{d}_j} . $$ Observe that the summands are the negative vector-valued $\mat{A}$-projection of $\vec{e}_j$ on $\vec{d}_j$.

To elaborate, $\mat{A}$-projection indicates the underlying inner product is the $\mat{A}$ inner product defined to be $$ \aa{\vec{u}, \vec{v}}_{\mat{A}} \equiv \vec{u}^H \mat{A} \vec{v} $$ which is verified to satisfy the inner product axioms; the definition is positive-definite because $\mat{A}$ is positive-definite by assumption, the definition is conjugate-symmetric because $\mat{A}$ is conjugate-symmetric by assumption $$ \aa{\vec{u}, \vec{v}}_{\mat{A}} = \vec{u}^H \mat{A} \vec{v} = [\vec{v}^H \mat{A} \vec{u}]^H = \conj{\aa{\vec{v}, \vec{u}}}_{\mat{A}} , $$ and the definition is linear in one of its arguments $$ \aa{\vec{u}, \alpha \vec{v} + \beta \vec{w}}_{\mat{A}} = \vec{u}^H \mat{A} [\alpha \vec{v} + \beta \vec{w}] = \alpha \vec{u}^H \mat{A} \vec{v} + \beta \vec{u}^H \mat{A} \vec{w} = \alpha \aa{\vec{u}, \vec{v}}_{\mat{A}} + \beta \aa{\vec{u}, \vec{w}}_{\mat{A}} . $$ The vector-valued $\mat{A}$-projection of $\vec{v}$ on $\vec{u}$ is then $$ %\vec{u} \cdot \fn{proj}_{\mat{A}}(\vec{u}, \vec{v}) %\uvec{u} \aa{\uvec{u}, \vec{v}}_{\mat{A}} %= \q{\vec{u}}{\norm{\vec{u}}_{\mat{A}}} \aa{\q{\vec{u}}{\norm{\vec{u}}_{\mat{A}}}, \vec{v}}_{\mat{A}} %= \vec{u} \q{\aa{\vec{u}, \vec{v}}_{\mat{A}}}{\norm{\vec{u}}_{\mat{A}}^2} \vec{u} \q{\aa{\vec{u}, \vec{v}}_{\mat{A}}}{\aa{\vec{u}, \vec{u}}_{\mat{A}}} % = \q{\vec{u}}{\norm{\vec{u}}_{\mat{A}}}\aa{\q{\vec{u}}{\norm{\vec{u}}_{\mat{A}}}, \vec{v}}_{\mat{A}} = \vec{u} \q{\vec{u}^H \mat{A} \vec{v}}{\vec{u}^H \mat{A} \vec{u}} . $$

Equation $\eqref{eqn:cg_search_scale}$ suggests that the search direction vectors $\vec{d}_i$ should be chosen to be $\mat{A}$-orthogonal so each iteration can be interpreted as deconstructing an initial error vector in a basis one component per iteration. This is accomplished by taking the residual $\vec{r}_i$ as a seed search direction vector, since the negative gradient is a good guess otherwise, and making it $\mat{A}$-orthogonal to previous search directions by subtracting off its $\mat{A}$-projections $$ \label{eqn:cg_d} \vec{d}_i = \vec{r}_i - \sum_{j=1}^{i-1} \vec{d}_j \beta_{ji} $$ where the coefficients $\beta_{ij}$ are $$ \beta_{ij} = \frac{\vec{d}_j^H \mat{A} \vec{r}_i}{\vec{d}_j^H \mat{A} \vec{d}_j} $$ leading to the $\mat{A}$-orthogonality relation $$ \boxed{ \vec{d}_i^H \mat{A} \vec{d}_j = 0 \quad\text{when}\quad i \neq j } . $$ Iteration on quadratic forms with this choice of search direction vectors results in the conjugate gradient method. The word "conjugate" in this context refers to $\mat{A}$-orthogonality of the search direction vectors, and not to complex conjugation.

The naive conjugate gradient algorithm is presented as Algorithm $\ref{alg:ncg}$. It is a modified gradient descent. The only difference is that instead of assigning the search direction vector to be the residual, it is assigned to be the part of the residual that is $\mat{A}$-orthogonal to the previous search direction vectors.

$$ \begin{algorithm}[H] \SetAlgoLined \KwResult{$\vec{x}_i$} $\vec{x}_1 = \vec{x}_0$ \\ $\vec{r}_1 = \vec{b} - \mat{A} \vec{x}_1$ \\ \For{$i \in [1, i_{max}]$}{ // tolerance test \\ \If{$\mm{\vec{r}_i}^2 \leq \mathrm{tol}^2$}{ break \\ } // compute search direction vector \\ $\vec{d}_i = \vec{r}_i - \sum_{j=1}^{i-1} \vec{d}_j [\vec{d}_j^H \mat{A} \vec{r}_i] / [\vec{d}_j^H \mat{A} \vec{d}_j]$ \\ // update \\ $\alpha_i = \vec{d}_i^H \vec{r}_i / [\vec{d}_i^H \mat{A} \vec{d}_i]$ \\ $\vec{x}_{i+1} = \vec{x}_i + \alpha_i \vec{d}_i$ \\ $\vec{r}_{i+1} = \vec{r}_i - \alpha_i \mat{A} \vec{d}_i$ \\ } \caption{\label{alg:ncg} Naive Conjugate Gradient} \end{algorithm} $$

Algorithm $\ref{alg:ncg}$ is conceptually complete, but substantial simplifications can be made with further analysis and clever substitutions. The set of search directions $\vec{d}_i$ forms an $\mat{A}$-orthogonal basis. Consider the initial error vector expanded in this basis $$ \label{eqn:cg_e0} \vec{e}_0 = \sum_{j=1}^{n} \vec{d}_j \delta_j $$ with $\mat{A}$-projection expansion coefficients $\delta_j$ $$ \label{eqn:cg_delta} \delta_j = \q{\vec{d}_j^H \mat{A} \vec{e}_0}{\vec{d}_j^H \mat{A} \vec{d}_j} . $$ Insert the initial error expansion Eq. $\eqref{eqn:cg_e0}$ into the error summation Eq. $\eqref{eqn:error_sum}$ $$ \vec{e}_i = \sum_{j=1}^{n} \vec{d}_j \delta_j + \sum_{j=1}^{i-1} \vec{d}_j \alpha_j $$ and form the $\mat{A}$ inner product with direction vector $\vec{d}_i$ $$ \vec{d}_i^H \mat{A} \vec{e}_i = \sum_{j=1}^{n} \vec{d}_i^H \mat{A} \vec{d}_j \delta_j + \sum_{j=1}^{i-1} \vec{d}_i^H \mat{A} \vec{d}_j \alpha_j = \vec{d}_i^H \mat{A} \vec{d}_i \delta_i . $$ Solving for $\delta_i$ and comparing to Eq. $\eqref{eqn:cg_delta}$ and Eq. $\eqref{eqn:search_scale}$ implies the following relation holds for the first $i - 1$ coefficients of the error summation $$ \delta_j = \frac{\vec{d}_j^H \mat{A} \vec{e}_0}{\vec{d}_j^H \mat{A} \vec{d}_j} = \frac{\vec{d}_j^H \mat{A} \vec{e}_j}{\vec{d}_j^H \mat{A} \vec{d}_j} = -\alpha_j $$ and thus the error summation simplifies to $$ \vec{e}_i = \sum_{j=i}^{n} \vec{d}_j \delta_j . $$ This means the error vector converges after $n$ iterations. Furthermore the error vectors are $\mat{A}$-orthogonal to all previous search direction vectors, meaning residuals are orthogonal to all previous search direction vectors $$ \label{eqn:cg_ortho1} -\vec{d}_i^H \mat{A} \vec{e}_j = \vec{d}_i^H \vec{r}_j = 0 \quad\text{for}\quad i < j . $$

Additional orthogonality relationships and useful identities can be derived from Eq. $\eqref{eqn:cg_d}$ by taking an inner product with residual $\vec{r}_k$ $$ \vec{d}_i^H \vec{r}_k = \vec{r}_i^H \vec{r}_k - \sum_{j=1}^{i-1} \vec{d}_j^H \vec{r}_k \conj{\beta}_{ij} $$ and considering the various cases. Using Eq. $\eqref{eqn:cg_ortho1}$ it is seen $$ \vec{r}_i^H \vec{r}_k = 0 \quad\text{for}\quad i < k \quad\Rightarrow\quad \vec{r}_i^H \vec{r}_k = 0 \quad\text{for}\quad i \neq k $$ and $$ \vec{d}_i^H \vec{r}_i = \vec{r}_i^H \vec{r}_i \quad\text{for}\quad i = k . $$ When $k = i - 1$, the previous results further imply $$ \vec{d}_i^H \vec{r}_{i-1} = -\vec{d}_{i-1}^H \vec{r}_{i-1} \conj{\beta}_{i,i-1} . $$ These seemingly random equations are now used to make drastic simplifications.

Consider the inner product between two residuals, and expand one residual in terms of the residual summation Eq. $\eqref{eqn:residual_sum}$ $$ \vec{r}_i^H \vec{r}_j = \bb{\vec{r}_0 - \sum_{k=1}^{i-1} \mat{A} \vec{d}_k \alpha_k }^H \vec{r}_j = \vec{r}_0^H \vec{r}_j - \sum_{k=1}^{i-1} \vec{d}_k^H \mat{A} \vec{r}_j \conj{\alpha}_k . $$ This sum equals $0$ when $i \neq j$ because of the orthogonality relations. Let $j > 3$. Consider $i = 2$ and assume $\alpha_i \neq 0$ (otherwise the iteration has converged), then $\vec{d}_1^H \mat{A} \vec{r}_j = 0$. Next consider $i = 3$ and take the previous result, then $\vec{d}_2^H \mat{A} \vec{r}_j = 0$. This pattern repeats through $i = j - 1$. When $i = j$, the inner product is no longer $0$, so $\vec{d}_{i-1}^H \mat{A} \vec{r}_i \neq 0$. Thus $$ \vec{d}_{j}^H \mat{A} \vec{r}_i = 0 \quad\text{for}\quad j < i - 1 . $$ This is a remarkable result because $\beta_{ij} = 0$ except when $j = i - 1$. In other words, residual vectors are automatically $\mat{A}$-orthogonal to all previous search directions except the previous one. This means only storage for a single search direction vector is ever needed in the algorithm, and making the search directions $\mat{A}$-orthogonal is inexpensive. Thus redefine $\beta_{i-1} \equiv \beta_{i,i-1}$.

Clever substitutions help reduce the computational cost of computing quantities each iteration. The $\alpha_i$ coefficients can be rewritten as $$ \alpha_i = \q{\vec{d}_i^H \vec{r}_i}{\vec{d}_i^H A \vec{d}_i} = \q{\vec{r}_i^H \vec{r}_i}{\vec{d}_i^H A \vec{d}_i} $$ and the $\beta_{i-1}$ is rewritten as $$ \beta_{i-1} = - \q{\vec{d}_i^H \vec{r}_{i-1}}{\vec{d}_{i-1}^H \vec{r}_{i-1}} = - \q{\vec{d}_i^H \bb{\vec{r}_{i-1} - \alpha_{i-1} \mat{A} \vec{d}_{i-1}}}{\vec{d}_{i-1}^H \vec{r}_{i-1}} = - \q{\vec{d}_i^H \vec{r}_i}{\vec{d}_{i-1}^H \vec{r}_{i-1}} = - \q{\vec{r}_i^H \vec{r}_i}{\vec{r}_{i-1}^H \vec{r}_{i-1}} $$ which reuses the residual norm squared computed as a byproduct of the tolerance test.

Putting the above results together, an efficient version of the conjugate gradient algorithm is presented in Algorithm $\ref{alg:cg}$. Comparing to Algorithm $\ref{alg:ncg}$, the simplifications in the computation of the search direction vectors is obvious. Care has been taken use temporary variables wherever results can be reused.

$$ \begin{algorithm}[H] \SetAlgoLined \KwResult{$\vec{x}_i$} $i_{max} = \min(i_{max}, n)$ \\ $\vec{x}_1 = \vec{x}_0$ \\ $\vec{r}_1 = \vec{b} - \mat{A} \vec{x}_1$ \\ \For{$i \in [1, i_{max}]$}{ // tolerance test \\ rr = $\vec{r}_i^H \vec{r}_i$ \\ \If{$\mathrm{rr} \leq \mathrm{tol}^2$}{ break } // compute search direction vector \\ \eIf{$i == 1$}{ $\vec{d} = \vec{r}_1$ \\ }{ $\beta =$ rr / rrold \\ $\vec{d} = \vec{r}_i + \beta \vec{d}$ \\ } // update \\ $\vec{t} = \mat{A} \vec{d}$ \\ $\alpha_i = $ rr / $\vec{d}^H \vec{t}$ \\ $\vec{x}_{i+1} = \vec{x}_i + \alpha_i \vec{d}$ \\ $\vec{r}_{i+1} = \vec{r}_i - \alpha_i \vec{t}$ \\ rrold = rr \\ } \caption{\label{alg:cg} Conjugate Gradient} \end{algorithm} $$

A final piece of sophistication is revisiting the preconditioner. While a method to include left preconditioning was discussed in the complex least squares problem, it is better to include preconditioners directly into the algorithm to take advantage of analytical simplifications. The goal of preconditioning is not to change the solution of the problem, but to make the linear system have a better condition number. This makes the cost function more spherical, which aids in convergence[]. Let $\mat{E} \in \setC^{n \times n}$ be an invertible matrix, and consider the equivalent problem of solving $$ \mat{E}^{-1} \mat{A} \mat{E}^{-H} \vec{x}' = \mat{E}^{-1} \vec{b} $$ where $$ \vec{x}' = \mat{E}^H \vec{x} $$ It is easy to show the matrix $\mat{E}^{-1} \mat{A} \mat{E}^{-H}$ is conjugate-symmetric positive-definite, so the underlying assumptions about the problem have not been altered. Incorporating these changes into Algorithm $\ref{alg:cg}$ results in the following sequence of equations $$ \begin{cases} \vec{r}_1' = \mat{E}^{-1} \vec{b} - \mat{E}^{-1} \mat{A} \mat{E}^{-H} \vec{x}_0' \\ \vec{d}_1' = \vec{r}_1' \\ \beta_{i-1} = - \q{\vec{r}_i^{\prime H} \vec{r}_i'}{\vec{r}_{i-1}^{\prime H} \vec{r}_{i-1}'} \\ \vec{d}_i' = \vec{r}_i' - \beta_{i-1} \vec{d}_{i-1}' \\ \alpha_i = \q{\vec{r}_i^{\prime H} \vec{r}_i'}{\vec{d}_i^{\prime H} \mat{E}^{-1} \mat{A} \mat{E}^{-H} \vec{d}_i'} \\ \vec{x}_{i+1}' = \vec{x}_i' + \alpha_i \vec{d}_i' \\ \vec{r}_{i+1}' = \vec{r}_i' - \alpha_i \mat{E}^{-1} \mat{A} \mat{E}^{-H} \vec{d}_i' \\ \end{cases} $$ which seems to add a great deal of complexity. However the following substitutions $$ \begin{cases} \vec{r}_i' = \mat{E}^{-1} \vec{r}_i \\ \vec{d}_i' = \mat{E}^H \vec{d}_i \\ \vec{x}_i' = \mat{E}^H \vec{x}_i \\ \mat{E}^{-H} \mat{E}^{-1} = \mat{P}^{-1} \\ \end{cases} $$ show only modest changes must be made to the algorithm $$ \begin{cases} \vec{r}_1 = \vec{b} - \mat{A} \vec{x}_0 \\ \vec{d}_1 = \mat{P}^{-1} \vec{r}_1 \\ \beta_{i-1} = - \q{\vec{r}_i^H \mat{P}^{-1} \vec{r}_i}{\vec{r}_{i-1}^H \mat{P}^{-1} \vec{r}_{i-1}} \\ \vec{d}_i = \mat{P}^{-1} \vec{r}_i - \beta_{i-1} \vec{d}_{i-1} \\ \alpha_i = \q{\vec{r}_i^H \mat{P}^{-1} \vec{r}_i}{\vec{d}_i^H \mat{A} \vec{d}_i} \\ \vec{x}_{i+1} = \vec{x}_i + \alpha_i \vec{d}_i \\ \vec{r}_{i+1} = \vec{r}_i - \alpha_i \mat{A} \vec{d}_i \\ \end{cases} $$ The resulting preconditioned conjugate gradient method is presented in Algorithm $\ref{alg:pcg}$. Note, the preconditioner is sometimes specified in its factored form $\mat{P}^{-1} = \mat{E}^{-H} \mat{E}^{-1}$.

$$ \begin{algorithm}[H] \SetAlgoLined \KwResult{$\vec{x}_i$} $i_{max} = \max(i_{min}, n)$ \\ $\vec{x}_1 = \vec{x}_0$ \\ $\vec{r}_1 = \vec{b} - \mat{A} \vec{x}_1$ \\ \For{$i \in [1, i_{max}]$}{ // tolerance test \\ \If{$\norm{\vec{r}_i}^2 \leq \mathrm{tol}^2$}{ break } // compute preconditioned search direction vector \\ $\vec{z}_i = \mat{P}^{-1} \vec{r}_i$ \\ rz = $\vec{r}_i^H \vec{z}_i$ \\ \eIf{$i == 1$}{ $\vec{d} = \vec{z}_1$ \\ }{ $\beta =$ rz / rzold \\ $\vec{d} = \vec{z}_i + \beta \vec{d}$ \\ } // update \\ $\vec{t} = \mat{A} \vec{d}$ \\ $\alpha_i = $ rz / $\vec{d}^H \vec{t}$ \\ $\vec{x}_{i+1} = \vec{x}_i + \alpha_i \vec{d}$ \\ $\vec{r}_{i+1} = \vec{r}_i - \alpha_i \vec{t}$ \\ rzold = rz \\ } \caption{\label{alg:pcg} Preconditioned Conjugate Gradient} \end{algorithm} $$

The space complexity of conjugate gradient is $O(N)$ rather than $O(N^2)$ because only the previous search direction vector is needed when making the new search direction $\mat{A}$-orthogonal to all the previous search directions. The time complexity each iteration is dominated by two matrix-vector products, and so is $O(N^2)$ per iteration. The algorithm is guaranteed to converge after $n$ iterations, so the time complexity per run is $O(N^3)$. This doesn't seem like an improvement over Gauss-Jordan elimination, however conjugate gradient gives a sequence of approximate answers. Convergence is shown to be quite fast, and so conjugate gradient is usually terminated long before $n$ iterations with good results.


An alternative to gradient-based iterative methods is General Minimal Residual (GMRES), which iteratively minimizes the complex least squares cost function directly. GMRES only assumes that matrix $\mat{A}$ is invertible, and is thus a more general solver than the gradient-based methods presented in the previous sections which assumed a conjugate-symmetric positive-definite matrix. However, reformulating an ill-posed problem as a well-posed complex least squares problem with regularization produces a conjugate-symmetric positive-definite matrix, so this isn't much of an advantage for image reconstruction.

Recall the complex least squares cost function is $$ \vec{x}_s = \argmin_{\vec{x} \in \setC^n} \norm{\vec{b} - \mat{A} \vec{x}}^2 . $$ Iterate summation Eq. $\eqref{eqn:iterate_sum}$ can be written in matrix notation as $$ \vec{x}_i = \vec{x}_0 + \mat{D}_i \vec{\alpha}_i $$ where $\mat{D}_i \in \setC^{n \times i}$ is a search direction augmented matrix and $\vec{\alpha}_i \in \setC^i$ is a vector of search scales. Combine these equations to yield $$ \label{eqn:gmres_cost1} \mm{\vec{r}_i}^2 = \mm{\vec{b} - \mat{A} \vec{x}_i}^2 = \mm{\vec{b} - \mat{A} [\vec{x}_0 + \mat{D}_i \vec{\alpha}_i]}^2 = \mm{\vec{r}_1 - \mat{A} \mat{D}_i \vec{\alpha}_i}^2 . $$ The search directions must be specified to proceed. As with conjugate gradient, a judicious choice of search direction vectors greatly simplifies the problem.

A deeper understanding of iterative methods can be achieved by thinking in terms of Krylov subspaces. The $i$-th Krylov subspace of matrix $\mat{A}$ and vector $\vec{u}$ is the vectorspace spanned by the sequence of vectors generated by repeatedly applying $\mat{A}$ to $\vec{u}$ up to $i$ times $$ K_i(\mat{A}, \vec{u}) \equiv \fn{span}\pp{\bigcup_{j=0}^i \cc{\mat{A}^j \vec{u}}} = \fn{span}\pp{\cc{\vec{u}, \mat{A} \vec{u}, \mat{A}^2 \vec{u}, \dots, \mat{A}^i \vec{u}}} . $$ Elements of a Krylov subspace are linearly independent up to some positive integer which can be smaller than or equal the dimension of the underlying vectorspace. For both gradient descent and conjugate gradient the search direction vector $\vec{d}_i$ is an element of the $i$-th Krylov subspace of the first search direction vector $\vec{d}_1 = \vec{r}_1$ $$ \label{eqn:krylov_d} \vec{d}_i \in K_i(\mat{A}, \vec{r}_1) . $$ This is called the Arnoldi iteration[]. Furthermore, conjugate gradient built an $\mat{A}$-orthogonal basis out of the search direction vectors which guaranteed the algorithm would converge after at most $n$ iterations.

GMRES takes a similar tack to conjugate gradient and builds an orthonormal set of Krylov subspace vectors from Eq. $\eqref{eqn:krylov_d}$ such that $$ \big\langle \uvec{d}_i, \uvec{d}_j \big\rangle = \delta_{ij} . $$ The seed search direction is taken to be the initial residual $\vec{r}_1$, although the residual can't be interpreted as a gradient in the case of GMRES. This is acceptable because the non-zero seed is arbitrary, and is a good choice for the subset of conjugate-symmetric positive-definite systems that GMRES can solve. In the case this Krylov subspace fails to span the underlying vectorspace, GMRES will converge in fewer than $N$ iterations. With this choice of search direction vectors, the search direction augmented matrix satisfies the recurrence relation $$ \label{eqn:gmres_addh} \mat{A} \mat{D}_i = \mat{D}_{i+1} \mat{H}_{i+1,i} $$ where $\mat{H}_{i+1, i} \in \setC^{[i+1] \times i}$ is a matrix of expansion coefficients. Because of the Arnoldi iteration, matrix $\mat{H}_{i+1,i}$ is zero below the first sub-diagonal (like an upper Hessenburg matrix), and is computed as a byproduct of taking projections when making the search directions orthonormal. Furthermore, the matrix $\mat{H}_{i+1, i}$ is related to the next matrix $\mat{H}_{i+2, i+1}$ by the recurrence relation $$ \label{eqn:gmres_h_reccurence} \mat{H}_{i+2, i+1} = \begin{bmatrix} \mat{H}_{i+1, i} & \vec{h}_{i+1} \\ \vec{0}^H & H_{i+2, i+1} \end{bmatrix} $$ where vector $\vec{h}_{i+2} \in \setC^{i + 2}$ is a new column. The base case is matrix $\mat{H}_{2,1} \in \setC^{2 \times 1}$.

Returning to the complex least squares cost function Eq. $\eqref{eqn:gmres_cost1}$, use Eq. $\eqref{eqn:gmres_addh}$ and the identity $\uvec{d}_1 = \vec{r}_1 / \norm{\vec{r}_1}$ to write $$ \label{eqn:gmres_cost2} \begin{split} \mm{\vec{r}_i}^2 &= \mm{\vec{r}_1 - \mat{A} \mat{D}_i \vec{\alpha}_i}^2 \\ &= \mm{\mm{\vec{r}_1} \uvec{d}_1 - \mat{D}_{i+1} \mat{H}_{i+1,i} \vec{\alpha}_i]}^2 \\ &= \mm{\mat{D}_{i+1}[\mm{\vec{r}_1} \uvec{e}_1 - \mat{H}_{i+1,i} \vec{\alpha}_i]}^2 \\ &= \mm{\mm{\vec{r}_1} \uvec{e}_1 - \mat{H}_{i+1,i} \vec{\alpha}_i}^2 \end{split} $$ where $\uvec{e} \in \setC^{i+1}$ is the first canonical basis unit vector. The final step follows because $\mat{D}_{i+1}$ is an augmented matrix of orthonormal vectors $$ \label{eqn:gmres_orthonorm} \norm{\vec{D}_i \vec{u}}^2 = \aa{\vec{D}_i \vec{u}, \vec{D}_i \vec{u}} = \vec{u}^H \vec{D}_i^H \vec{D}_i \vec{u} = \vec{u}^H \mat{I} \vec{u} = \vec{u}^H \vec{u} = \aa{\vec{u}, \vec{u}} = \norm{\vec{u}}^2 . $$ The problem of minimizing the residual norm with respect to $\vec{x}$ is replaced by the equivalent problem of minimizing the residual norm with respect to $\vec{\alpha}_i$ $$ \vec{\alpha}_i = \argmin_{\vec{\alpha}_i' \in \setC^i} \mm{\mm{\vec{r}_1} \uvec{e}_1 - \mat{H}_{i+1,i} \vec{\alpha}_i'}^2 . $$ This equation can be efficiently solved with some manipulation.

Every matrix has a QR decomposition that factors the matrix into a unitary matrix multiplying a right triangular matrix. The existence of such a decomposition is easy to establish by considering orthonormalization applied to the columns of a matrix. For the task at hand consider the QR decomposition of $\mat{H}_{i+1,i}$ $$ \mat{H}_{i+1,i} = \mat{Q}_{i+1} \mat{R}_{i+1,i} $$ for unitary matrix $\mat{Q}_{i+1} \in \setC^{[i+1]\times[i+1]}$ and right triangular matrix $\mat{R}_{i+1, i} \in \setC^{[i+1]\times i}$. Recall the defining property of a unitary matrix is $\mat{Q}^{-1} = \mat{Q}^H$, which implies the columns are orthonormal. Thus the right triangular matrix in the previous equation can be solved for $$ \label{eqn:gmres_qhr} \mat{Q}_{i+1}^H \mat{H}_{i+1,i} = \mat{R}_{i+1,i}= \begin{bmatrix} \mat{R}_i \\ \vec{0}^H \\ \end{bmatrix} . $$ Matrix $\mat{H}_{i+1,i}$ is transformed into matrix $\mat{R}_{i+1,i}$ when multiplied by matrix $\mat{Q}_{i+1}^H$. Right triangular matrices are trivial to solve by backsubstitution, so this is a highly desirable transformation. Rewrite Eq. \ref{eqn:gmres_cost2} using the proposed QR decomposition $$ \begin{split} \norm{\vec{r}_i}^2 &= \norm{\norm{\vec{r}_0} \uvec{e}_1 - \mat{H}_{i+1, i} \vec{\alpha}_i}^2 \\ &= \norm{\mat{Q}_{i+1}^H [\mm{\vec{r}_0} \uvec{e}_1 - \mat{H}_{i+1,1} \vec{\alpha}_i]}^2 \\ &= \norm{\norm{\vec{r}_0} \mat{Q}_{i+1}^H \uvec{e}_1 - \mat{R}_{i+1,i} \vec{\alpha}_i}^2 \\ &= \norm{ \begin{bmatrix} \vec{g}_i \\ g_{i+1} \end{bmatrix} - \begin{bmatrix} \mat{R}_i \vec{\alpha}_i \\ 0 \end{bmatrix} }^2 . \end{split} $$ The residual norm is minimized when $$ \label{eqn:gmres_arg} \vec{\alpha}_i = \vec{R}_i^{-1} \vec{g}_i $$ which implies $g_{i+1} = \norm{\vec{r}_i}$.

The QR decomposition is only useful if it is inexpensive to compute. This is shown to be the case. From recurrence relation Eq. $\eqref{eqn:gmres_h_reccurence}$ only the right-most column of $\mat{H}_{i+1,i}$ needs to be computed each iteration, which is done as a byproduct of making the search direction vector orthonormal to previous search direction vectors. Recursion is then used to compute $\mat{R}_i$. Consider the base case of Eq. \eqref{eqn:gmres_qhr} when $i=1$ $$ \begin{bmatrix} c & -s \\ \conj{s} & \conj{c} \end{bmatrix}^H \begin{bmatrix} H_{1,1} \\ H_{2,1} \end{bmatrix} = \begin{bmatrix} R_{1,1} \\ 0 \end{bmatrix} $$ where the $\mat{Q}_2$ takes the form of a (Givens) "rotation" $$ \mat{Q}_2 = \begin{bmatrix} c & -s \\ \conj{s} & \conj{c} \end{bmatrix}\quad c = \q{H_{1,1}}{R_{1,1}} \quad s = \q{\conj{H}_{2,1}}{R_{1,1}} \quad R_{1,1} = \sqrt{\nn{H_{1,1}}^2 + \nn{H_{2,1}}^2} . $$ To derive a recurrence relation for the QR decomposition, start with the recurrence relation Eq. $\eqref{eqn:gmres_h_reccurence}$ and incorporate the previous QR decomposition to get the partial result $$ \label{eqn:gmres_recurrence} \begin{bmatrix} \mat{Q}_{i+1}^H & \vec{0} \\ \vec{0}^H & 1 \end{bmatrix} \begin{bmatrix} \mat{H}_{i+1, i} & \vec{h}_{i+1} \\ \vec{0}^H & H_{i+2, i+1} \end{bmatrix} = \begin{bmatrix} \mat{R}_{i+1,i} & \mat{Q}_{i+1}^H \vec{h}_{i+1} \\ \vec{0}^H &p; H_{i+2, i+1} \end{bmatrix} = \begin{bmatrix} \mat{R}_{i} & \vec{r}_i \\ \vec{0}^H & \rho_i \\ \vec{0}^H & \sigma_i \end{bmatrix} . $$ The right hand side is almost right triangular. All that needs to be done is another rotation on the last two rows. Generalizing from the base case, construct the matrix $\mat{G}_{i+2} \in \setC^{[i+2]\times[i+2]}$ such that $$ \mat{G}_{i+2} = \begin{bmatrix} \mat{I}_{i} & \vec{0} & \vec{0} \\ \vec{0}^H & c_i & -s_i \\ \vec{0}^H & \conj{s}_i & \conj{c}_i \\ \end{bmatrix} \quad\text{where}\quad c_i = \q{\rho_i}{\tau_i} \quad s_i = \q{\conj{\sigma}_i}{\tau_i} \quad \tau_i = \sqrt{\nn{\rho_i}^2 + \nn{\sigma_i}^2} $$ and left multiply Eq. $\eqref{eqn:gmres_recurrence}$ to derive the QR right triangular matrix recurrence relation $$ \begin{bmatrix} \mat{I}_{i} &\vec{0} & \vec{0} \\ \vec{0}^H & \conj{c}_i & s_i \\ \vec{0}^H & -\conj{s}_i & c_i \end{bmatrix} \begin{bmatrix} \mat{R}_{i} & \vec{r}_i \\ \vec{0}^H & \rho_i \\ \vec{0}^H & \sigma_i \end{bmatrix} = \begin{bmatrix} \mat{R}_i & \vec{r}_i \\ \vec{0}^H & \tau_i \\ \vec{0}^H & 0 \end{bmatrix} = \mat{R}_{i+2,i+1} . $$ It is seen that only the right-most column of must be computed each iteration. Furthermore this implies the QR unitary matrix recurrence relation $$ \mat{Q}^H_{i+2} = \mat{G}_{i+2}^H \begin{bmatrix} \mat{Q}_{i+1}^H & \vec{0} \\ \vec{0}^H & 1 \end{bmatrix} . $$ Taken together, only the right-most column of $\mat{H}_{i+1,i}$ and $\mat{R}_{i+1,i}$ need to be computed each iteration, the column of $\mat{H}_{i+1,i}$ is computed during orthonormalization, and the column of $\H_{i+1,i}$ is transformed into the column of $\mat{R}_{i+1,i}$ by applying $\mat{Q}_{i+1}$ which is equivalent to a sequence of rotations between pairs of rows.

The GMRES algorithm is now presented in Algorithm $\ref{alg:gmres}$. Due to the solution being constructed in a basis, at most $N$ iterations are needed for convergence. Seed the Arnoldi process with the residual $\vec{r}_1$ and compute $\vec{g}_1$ from the equivalent cost function. For each iteration several steps must happen. First compute a new search direction based on the Arnoldi iteration (apply $\mat{A}$ to the previous search direction vector) and make it orthonormal to the previous search direction vectors using a stabilized algorithm, saving the expansion coefficients as the new column of $\mat{H}$. Next transform the $\mat{H}$ column by applying the sequence of rotations up to the last rotation. Then compute the current rotation and complete the transformation of the $\mat{H}$ column into an $\mat{R}$ column. Examining the equivalent cost function, vector $\mat{g}$ must be transformed by the rotations along with $\mat{H}$, but unlike the entirely new column of $\mat{H}$, only the most recent rotation needs to be applied to vector $\vec{g}$ to update it. Finally, check the tolerance for early convergence. After the loop the solution vector is constructed by solving Eq. \eqref{eqn:gmres_arg} for $\vec{\alpha}_i$ by backsubstitution, and then evaluating the matrix form of the iterate summation Eq. \eqref{eqn:gmres_recurrence}. Preconditioning can be incorporated if needed[].

The computational complexity of GMRES should be considered. Due to the need to store all previous search direction vectors to apply orthonormalization, the spatial complexity is $O(N^2)$. This is a major short-coming of GMRES. Many modifications to the basic algorithm try to circumvent this by periodically restarting the algorithm, however in this case convergence isn't guaranteed. The time complexity per iteration is dominated by a matrix-vector product, and is so $O(N^2)$. The time complexity per run if allowed to iterate a full $n$ times is $O(N^3)$. Like conjugate gradient, this seems to be the same as Gauss-Jordon elimination, however GMRES provides a sequence of approximate solution vectors that converge quickly, and can often be terminated long before a full $n$ iterations. Convergence is more difficult to assess[].

$$ \begin{algorithm}[H] \linespread{1.0}\selectfont \SetAlgoLined \KwResult{$\vec{x}_i$} $i_{max} = \min(i_{max}, n)$ \\ $\vec{r}_1 = \vec{b} - \mat{A} \vec{x}_0$ \\ $\vec{d}_1 = \vec{r}_1 / \norm{\vec{r}_1}$ \\ $\vec{g}_1 = \norm{\vec{r}_1} \uvec{e}_1$ \\ \For{$i \in [1, i_{max}]$}{ // stabilized arnoldi process \\ $\vec{d}_{i+1} = \mat{A} \uvec{d}_i$ \\ \For{$j \in [1. i]$}{ $H_{j,i} = \langle \uvec{d}_j, \vec{d}_{i+1} \rangle$ \\ $\vec{d}_{i+1} = \vec{d}_{i+1} - \uvec{d}_j H_{j,i}$ \\ } $H_{i+1,i} = \norm{\vec{d}_{i+1}}$ \\ $\uvec{d}_{i+1} = \vec{d}_{i+1} / \norm{\vec{d}_{i+1}}$ \\ // transform column of H to (almost) column of R \\ \For{$j \in [1, i - 1]$}{ $\begin{bmatrix} H_{j,i} \\ H_{j+1,i} \end{bmatrix} = \begin{bmatrix} \conj{c}_j & s_j \\ -\conj{s}_j & c_j \end{bmatrix} \begin{bmatrix} H_{j,i} \\ H_{j+1,i} \end{bmatrix}$ \\ } // compute rotation \\ $\tau = \sqrt{\nn{H_{i,i}}^2 + \nn{H_{i+1,i}}^2}$ \\ $c_i = H_{i,i} / \tau$ \\ $s_i = \conj{H}_{i+1,i} / \tau$ \\ // complete transform to R \\ $\begin{bmatrix}H_{i,i} \\ H_{i+1, i}\end{bmatrix} = \begin{bmatrix} \tau \\ 0 \end{bmatrix}$ \\ // transform g \\ $\begin{bmatrix} g_{i} \\ g_{i+1} \end{bmatrix} = \begin{bmatrix} \conj{c}_j & s_j \\ -\conj{s}_j & c_j \end{bmatrix} \begin{bmatrix} g_{i} \\ g_{i+1} \end{bmatrix}$ \\ // tolerance test \\ \If{$g_{i+1} \leq \mathrm{tol}$}{ break \\ } } $\begin{bmatrix} \mat{R}_i \\ \vec{0}^H \end{bmatrix} = \mat{H}_{i+1, i}$ \\ $\vec{\alpha}_i = \mat{R}_i^{-1} \vec{g}_i$ \\ $\vec{x}_i = \vec{x}_0 + \mat{D}_i \vec{\alpha}_i$ \\ \Return{$\vec{x}_i$} \\ \caption{\label{alg:gmres}General Minimal Residual} \end{algorithm} $$


This chapter presented different ways to invert the linear forward model to reconstruct an image. Algorithms and their complexities were derived.

The inverse problem is almost always ill-posed. Complex least squares with $L^2$ regularization was developed to form a well-posed linear inverse problem with a unique solution, which results in a conjugate-symmetric positive-definite matrix. While the well-posed problem can be solved directly with Gauss-Jordon elimination, this is computationally expensive for very large linear systems and potentially unstable.

Iterative methods become the only viable method for inverting very large linear systems. These algorithms generate a sequence of approximate solutions, and can be stopped short of an exact solution. The basic iterative algorithm framework was discussed, and applied to three algorithms. Gradient descent is conceptually simple, but has terrible convergence. Conjugate gradient dramatically improved upon gradient descent by carefully choosing a basis to expand a solution in with remarkable properties. These two algorithms are based on gradients and require the linear system to be conjugate-symmetric and positive-definite, which the complex least squares matrix is. GMRES was presented as an alternative to gradient-based algorithms, and can be applied as long as the linear system is invertible.