Iterative Shrinkage and Thresholding Algorithm

Our description is based on [Ela10, DDDM04, ZE10].

ISTA can be used to solve problems of the form:

(1)\[\widehat{x} = \text{arg} \min_{x} \frac{1}{2}\| b - A x \|_2^2 + \lambda \| x \|_1\]

and its variants (with different regularizations).

Here the objective function is:

(2)\[f(x) = \frac{1}{2}\| b - A x \|_2^2 + \lambda \| x \|_1\]

Derivation

When \(A\) is the identity operator, the problem reduces to:

(3)\[\widehat{x} = \text{arg} \min_{x} \frac{1}{2}\| b - x \|_2^2 + \lambda \| x \|_1\]

The objective function on the right hand side:

(4)\[f(x) = \frac{1}{2}\| b - x \|_2^2 + \lambda \| x \|_1 = \sum_{i=1}^n \left [\frac{1}{2} | b_i - x_i|^2 + \lambda |x_i| \right ]\]

is separable in the components of \(x\).

The scalar function

(5)\[g(\tau) = \frac{1}{2} | \gamma - \tau|^2 + \lambda |\tau|\]

has a minimizer given by the soft thresholding function:

(6)\[\begin{split}\tau^* = \mathbf{ST}_{\lambda}(\gamma) = \begin{cases} \gamma\left ( 1 - \frac{\lambda}{|\gamma|} \right ) & \text{for} & |\gamma| \gt \lambda\\ 0 & \text{for} & |\gamma| \le \lambda \end{cases}\end{split}\]

[ZE10] show that a similar closed-form solution emerges when \(A\) is unitary as well.
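
As a concrete illustration, here is a minimal NumPy sketch of the soft thresholding operator (6), applied elementwise; the function name soft_threshold is ours.

    import numpy as np

    def soft_threshold(gamma, lam):
        # Soft thresholding ST_lam from (6), applied elementwise:
        # shrink each entry towards zero by lam; entries with |gamma| <= lam become 0.
        return np.sign(gamma) * np.maximum(np.abs(gamma) - lam, 0.0)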

[DDDM04] introduced a surrogate term:

(7)\[\text{dist}(x, x_0) = \frac{c}{2} \| x - x_0 \|_2^2 - \frac{1}{2} \| A x - A x_0 \|_2^2\]

For this function to be strictly convex w.r.t. \(x\), its Hessian must be positive definite. This holds if:

(8)\[c > \| A^H A \|_2 = \lambda_{\max}(A^H A)\]

i.e., \(c\) must exceed the largest eigenvalue of \(A^H A\).
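
As a quick numerical check (the names rng and A below are purely illustrative), the bound in (8) can be evaluated as the squared spectral norm of \(A\), since \(\lambda_{\max}(A^H A) = \sigma_{\max}(A)^2\):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 100))

    # the largest eigenvalue of A^H A equals the squared largest singular value of A
    c = 1.01 * np.linalg.norm(A, 2) ** 2   # choose c slightly above the bound in (8)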

The new surrogate objective function is defined as:

(9)\[\widetilde{f}(x, x_0) = f(x) + \text{dist}(x, x_0) = \frac{1}{2}\| b - A x \|_2^2 + \lambda \| x \|_1 + \frac{c}{2} \| x - x_0 \|_2^2 - \frac{1}{2} \| A x - A x_0 \|_2^2\]

Introducing:

(10)\[\nu_0 = x_0 + \frac{1}{c} A^H (b - A x_0)\]

We can rewrite the surrogate objective as:

(11)\[\widetilde{f}(x, x_0) = \text{Const} + \lambda \| x \|_1 + \frac{c}{2} \| x - \nu_0 \|_2^2\]

As discussed above, this surrogate objective can be minimized using the soft thresholding operator:

(12)\[x^* = \mathbf{ST}_{\frac{\lambda}{c} }(\nu_0) = \mathbf{ST}_{\frac{\lambda}{c} }\left (x_0 + \frac{1}{c} A^H (b - A x_0) \right )\]

Starting with some \(x_0\), we can propose an iterative method that generates a sequence \(\{x_0, x_1, x_2, \dots \}\) as:

(13)\[x_{i+1} = \mathbf{ST}_{\frac{\lambda}{c} }\left (x_i + \frac{1}{c} A^H (b - A x_i)\right )\]

This is the IST algorithm.
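
The following is a minimal NumPy sketch of iteration (13), with \(c\) chosen as in (8); the function name ista and the parameters max_iter and tol are ours, not part of the derivation above.

    import numpy as np

    def soft_threshold(gamma, lam):
        return np.sign(gamma) * np.maximum(np.abs(gamma) - lam, 0.0)

    def ista(A, b, lam, max_iter=500, tol=1e-6):
        # Iteration (13): x <- ST_{lam/c}( x + (1/c) A^H (b - A x) )
        c = 1.01 * np.linalg.norm(A, 2) ** 2      # c > ||A^H A||_2, see (8)
        x = np.zeros(A.shape[1])
        for _ in range(max_iter):
            x_new = soft_threshold(x + (A.conj().T @ (b - A @ x)) / c, lam / c)
            if np.linalg.norm(x_new - x) <= tol:
                return x_new
            x = x_new
        return x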

By changing the regularization in (1), we can derive different IST algorithms with different thresholding functions. The version below considers a generalized thresholding function which depends on the regularizer.

(14)\[x_{i+1} = \mathbf{T}_{\frac{\lambda}{c} }\left (x_i + \frac{1}{c} A^H (b - A x_i)\right )\]
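
For instance, replacing the \(\ell_1\) penalty with an \(\ell_0\)-style penalty leads to a hard thresholding function. A minimal sketch of such an alternative \(\mathbf{T}\) (the name hard_threshold is ours):

    import numpy as np

    def hard_threshold(gamma, thr):
        # keep entries whose magnitude exceeds thr, zero out the rest
        return np.where(np.abs(gamma) > thr, gamma, 0.0)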

Sparsifying Basis

Often, the signal \(x\) (e.g. an image) is not itself sparse or compressible, but it has a sparse representation in some basis \(B\). We have

(15)\[\alpha = B^H x\]

as the representation of \(x\) in the basis \(B\).

The regularization is then applied to the representation \(\alpha\). (1) becomes:

(16)\[\widehat{x} = \text{arg} \min_{x} \frac{1}{2}\| b - A x \|_2^2 + \lambda \| B^H x \|_1\]

Assuming \(B\) is unitary, so that \(x = B \alpha\), we can rewrite this in terms of the representation:

(17)\[\widehat{\alpha} = \text{arg} \min_{\alpha} \frac{1}{2}\| b - A B \alpha \|_2^2 + \lambda \| \alpha \|_1\]

(14) changes to:

(18)\[\alpha_{i+1} = \mathbf{T}_{\frac{\lambda}{c} }\left (\alpha_i + \frac{1}{c} B^H A^H (b - A B \alpha_i)\right )\]

Substituting \(\alpha_i = B^H x_i\), we get:

(19)\[\alpha_{i+1} = \mathbf{T}_{\frac{\lambda}{c} }\left (B^H x_i + \frac{1}{c} B^H A^H (b - A x_i)\right )\]

Multiplying both sides by \(B\) (so that \(x_{i+1} = B \alpha_{i+1}\)):

(20)\[x_{i+1} = B \mathbf{T}_{\frac{\lambda}{c} }\left ( B^H \left (x_i + \frac{1}{c} A^H (b - A x_i)\right ) \right )\]

This is the version of IST algorithm with an operator \(A\) and a basis \(B\).
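
Below is a minimal NumPy sketch of iteration (20), taking \(B\) to be an orthonormal DCT basis so that \(B^H x\) and \(B \alpha\) can be applied with scipy.fft.dct and scipy.fft.idct (norm='ortho'); the function name ista_basis and its parameters are ours.

    import numpy as np
    from scipy.fft import dct, idct

    def soft_threshold(gamma, lam):
        return np.sign(gamma) * np.maximum(np.abs(gamma) - lam, 0.0)

    def ista_basis(A, b, lam, max_iter=500, tol=1e-6):
        # Iteration (20) with B^H x = dct(x, norm='ortho') and B alpha = idct(alpha, norm='ortho')
        c = 1.01 * np.linalg.norm(A, 2) ** 2
        x = np.zeros(A.shape[1])
        for _ in range(max_iter):
            v = x + (A.conj().T @ (b - A @ x)) / c                  # gradient step
            alpha = soft_threshold(dct(v, norm='ortho'), lam / c)   # threshold in the basis
            x_new = idct(alpha, norm='ortho')                       # back to the signal domain
            if np.linalg.norm(x_new - x) <= tol:
                return x_new
            x = x_new
        return x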

Implementation

We introduce the current residual:

(21)\[r = b - A x\]

and a step size parameter \(\alpha = \frac{1}{c}\) (not to be confused with the representation \(\alpha\) above). We also assume that the thresholding function \(\mathbf{T}\) is user defined.

This simplifies the iteration to:

(22)\[x_{i+1} = B \mathbf{T}\left ( B^H \left (x_i + \alpha A^H r_i \right ) \right )\]

Algorithm state

The current state of the algorithm is described by the following quantities (a code sketch of a state container follows the list):

  • x: current estimate of \(x\)

  • r: current residual \(r = b - A x\)

  • r_norm_sqr: squared norm of the residual, \(\| r \|_2^2\)

  • x_change_norm: norm of the change in \(x\), \(\|x_{i+1} - x_{i} \|_2\)

  • iterations: number of iterations of the algorithm run so far
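
One possible way to hold this state in code is a small container; the class name ISTAState below is ours.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class ISTAState:
        x: np.ndarray          # current estimate of x
        r: np.ndarray          # current residual r = b - A x
        r_norm_sqr: float      # squared norm of the residual ||r||_2^2
        x_change_norm: float   # norm of the change in x, ||x_{i+1} - x_i||_2
        iterations: int        # number of iterations run so far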

Algorithm initialization

We initialize the algorithm as follows (a code sketch is given after this list):

  • \(x \leftarrow x_0\), an initial estimate of the solution provided by the user; it can be 0.

  • \(r \leftarrow b - A x_0\)

  • Compute r_norm_sqr

  • Set x_change_norm to a very high value (since there is no \(x_{-1}\)).

  • iterations = 0
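
A sketch of this initialization, assuming the ISTAState container from the previous sketch; the function name ista_init is ours.

    import numpy as np

    def ista_init(A, b, x0=None):
        # build the initial state described above
        x = np.zeros(A.shape[1]) if x0 is None else x0
        r = b - A @ x
        return ISTAState(
            x=x,
            r=r,
            r_norm_sqr=float(np.vdot(r, r).real),   # ||r||_2^2
            x_change_norm=np.inf,                   # no x_{-1} yet, so use a very high value
            iterations=0,
        )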

Algorithm iteration

The following steps are performed in each iteration. They are directly based on (20), in the simplified form (22); a code sketch of a single iteration follows the list.

  • Compute the gradient step: \(g \leftarrow \alpha A^H r\)

  • Update estimate: \(x \leftarrow x + g\)

  • Transform \(\alpha \leftarrow B^H x\)

  • Threshold: \(\alpha \leftarrow \mathbf{T} (\alpha)\)

  • Inverse transform \(x \leftarrow B \alpha\)

  • Update residual: \(r \leftarrow b - A x\)
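
A sketch of one such iteration, again taking \(B\) to be the orthonormal DCT and reusing the ISTAState container and a user supplied thresholding function; the name ista_step is ours.

    import numpy as np
    from scipy.fft import dct, idct

    def ista_step(state, A, b, threshold, step_size):
        # One iteration of (22), with B^H = dct(., norm='ortho') and B = idct(., norm='ortho')
        x_prev = state.x
        v = x_prev + step_size * (A.conj().T @ state.r)   # gradient step
        alpha = threshold(dct(v, norm='ortho'))           # transform and threshold
        x = idct(alpha, norm='ortho')                     # inverse transform
        r = b - A @ x                                     # update the residual
        return ISTAState(
            x=x,
            r=r,
            r_norm_sqr=float(np.vdot(r, r).real),
            x_change_norm=float(np.linalg.norm(x - x_prev)),
            iterations=state.iterations + 1,
        )

For instance, threshold could be lambda a: soft_threshold(a, lam * step_size), with step_size = 1 / c as above.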