Iterative Shrinkage and Thresholding Algorithm

Our description is based on [DDDM04, Ela10, ZE10].

ISTA can be used to solve problems of the form:

(1)\[\widehat{\bx} = \text{arg} \min_{\bx} \frac{1}{2}\| \bb - \bA \bx \|_2^2 + \lambda \| \bx \|_1\]

and its variants (with different regularizations).

Here the objective function is:

(2)\[f(\bx) = \frac{1}{2}\| \bb - \bA \bx \|_2^2 + \lambda \| \bx \|_1\]
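As a concrete reference, here is a minimal NumPy sketch for evaluating this objective (the helper name `objective` is our own):

```python
import numpy as np

def objective(A, b, x, lmbda):
    """Objective (2): quadratic data-fit term plus the l1 penalty."""
    return 0.5 * np.linalg.norm(b - A @ x) ** 2 + lmbda * np.linalg.norm(x, 1)
```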

Derivation

When \(\bA\) is the identity matrix, the problem reduces to:

(3)\[\widehat{\bx} = \text{arg} \min_{\bx} \frac{1}{2}\| \bb - \bx \|_2^2 + \lambda \| \bx \|_1\]

The objective on the R.H.S.:

(4)\[f(\bx) = \frac{1}{2}\| \bb - \bx \|_2^2 + \lambda \| \bx \|_1 = \sum_{i=1}^n \left [\frac{1}{2} | b_i - x_i|^2 + \lambda |x_i| \right ]\]

is separable in the components of \(\bx\).

The scalar function

(5)\[g(\tau) = \frac{1}{2} | \gamma - \tau|^2 + \lambda |\tau|\]

has a minimizer given by the soft thresholding function:

(6)\[\begin{split}\tau^* = \mathbf{ST}_{\lambda}(\gamma) = \begin{cases} \gamma\left ( 1 - \frac{\lambda}{|\gamma|} \right ) & \text{for} & |\gamma| > \lambda\\ 0 & \text{for} & |\gamma| \le \lambda \end{cases}\end{split}\]
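A minimal NumPy sketch of this operator for real-valued inputs (the name `soft_threshold` is our own):

```python
import numpy as np

def soft_threshold(gamma, lmbda):
    """Soft thresholding ST_lambda(gamma) from (6), applied elementwise."""
    # gamma * (1 - lmbda/|gamma|) equals sign(gamma) * (|gamma| - lmbda) when |gamma| > lmbda
    return np.sign(gamma) * np.maximum(np.abs(gamma) - lmbda, 0.0)
```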

Zibulevsky and Elad in [ZE10] show that a similar solution emerges also when \(\bA\) is unitary.

Daubechies et al. in [DDDM04] introduced a surrogate term:

(7)\[\text{dist}(\bx, \bx_0) = \frac{c}{2} \| \bx - \bx_0 \|_2^2 - \frac{1}{2} \| \bA \bx - \bA \bx_0 \|_2^2\]

For this function to be strictly convex w.r.t. \(\bx\), its Hessian must be positive definite. This holds if:

(8)\[c > \| \bA^H \bA \|_2 = \lambda_{\max}(\bA^H \bA)\]

where \(\lambda_{\max}(\bA^H \bA)\) denotes the largest eigenvalue of \(\bA^H \bA\).
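Since \(\bA^H \bA\) is Hermitian, its spectral norm equals its largest eigenvalue, which is \(\sigma_{\max}^2(\bA)\). One simple way to pick a valid \(c\) is therefore via the largest singular value of \(\bA\); the helper below is a sketch (the name `choose_c` and the safety margin are our own choices):

```python
import numpy as np

def choose_c(A, margin=1.01):
    """Return a c satisfying (8): slightly above ||A^H A||_2 = sigma_max(A)^2."""
    sigma_max = np.linalg.norm(A, 2)   # largest singular value of A
    return margin * sigma_max ** 2
```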

The new surrogate objective function is defined as:

(9)\[\widetilde{f}(\bx, \bx_0) = f(\bx) + \text{dist}(\bx, \bx_0) = \frac{1}{2}\| \bb - \bA \bx \|_2^2 + \lambda \| \bx \|_1 + \frac{c}{2} \| \bx - \bx_0 \|_2^2 - \frac{1}{2} \| \bA \bx - \bA \bx_0 \|_2^2\]

Introducing:

(10)\[\nu_0 = \bx_0 + \frac{1}{c} \bA^H (\bb - \bA \bx_0)\]

expanding the quadratic terms in (9), completing the square in \(\bx\), and absorbing all terms independent of \(\bx\) into a constant, we can rewrite the surrogate objective as:

(11)\[\widetilde{f}(\bx, \bx_0) = \text{Const} + \lambda \| \bx \|_1 + \frac{c}{2} \| \bx - \nu_0 \|_2^2\]

As discussed above, this surrogate objective can be minimized using the soft thresholding operator:

(12)\[\bx^* = \mathbf{ST}_{\frac{\lambda}{c} }(\nu_0) = \mathbf{ST}_{\frac{\lambda}{c} }\left (\bx_0 + \frac{1}{c} \bA^H (\bb - \bA \bx_0) \right )\]

Starting with some \(\bx_0\), we can define an iterative method that produces a sequence \(\{ \bx_i \} = \{\bx_0, \bx_1, \bx_2, \dots \}\):

(13)\[\bx_{i+1} = \mathbf{ST}_{\frac{\lambda}{c} }\left (\bx_i + \frac{1}{c} \bA^H (\bb - \bA \bx_i)\right )\]

This is the IST algorithm.
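A compact sketch of this iteration for real-valued data, with the soft thresholding step written inline (the name `ista_l1`, the fixed iteration count, and the absence of a stopping rule are our own simplifications):

```python
import numpy as np

def ista_l1(A, b, lmbda, c, num_iters=100, x0=None):
    """Iteration (13) for the l1-regularized problem (1)."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    for _ in range(num_iters):
        # gradient step with step size 1/c
        v = x + A.conj().T @ (b - A @ x) / c
        # soft thresholding ST_{lambda/c}, applied elementwise
        x = np.sign(v) * np.maximum(np.abs(v) - lmbda / c, 0.0)
    return x
```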

By changing the regularization in (1), we can derive different IST algorithms with different thresholding functions. The version below uses a generalized thresholding function \(\mathbf{T}\) which depends on the regularizer.

(14)\[\bx_{i+1} = \mathbf{T}_{\frac{\lambda}{c} }\left (\bx_i + \frac{1}{c} \bA^H (\bb - \bA \bx_i)\right )\]

Sparsifying Basis

Often, the signal \(\bx\) (e.g. an image) may not itself be sparse or compressible, but it may have a sparse representation in some basis \(\bB\). We have

(15)\[\ba = \bB^H \bx\]

as the representation of \(\bx\) in the basis \(\bB\).

The regularization is then applied to the representation \(\ba\). (1) becomes:

(16)\[\widehat{\bx} = \text{arg} \min_{\bx} \frac{1}{2}\| \bb - \bA \bx \|_2^2 + \lambda \| \bB^H \bx \|_1\]

We can rewrite this as:

(17)\[\widehat{\ba} = \text{arg} \min_{\ba} \frac{1}{2}\| \bb - \bA \bB \ba \|_2^2 + \lambda \| \ba \|_1\]

(14) changes to:

(18)\[\ba_{i+1} = \mathbf{T}_{\frac{\lambda}{c} }\left (\ba_i + \frac{1}{c} \bB^H \bA^H (\bb - \bA \bB \ba_i)\right )\]

Substituting \(\ba_i = \bB^H \bx_i\) and \(\bx_i = \bB \ba_i\), we get:

(19)\[\ba_{i+1} = \mathbf{T}_{\frac{\lambda}{c} }\left (\bB^H \bx_i + \frac{1}{c} \bB^H \bA^H (\bb - \bA \bx_i)\right )\]

Multiplying both sides by \(\bB\), so that \(\bx_{i+1} = \bB \ba_{i+1}\), and simplifying:

(20)\[\bx_{i+1} = \bB \mathbf{T}_{\frac{\lambda}{c} } \left ( \bB^H \left (\bx_i + \frac{1}{c} \bA^H (\bb - \bA \bx_i)\right ) \right )\]

This is the version of the IST algorithm with a general operator \(\bA\) and a sparsifying basis \(\bB\).

Implementation

We introduce the current residual:

(21)\[\br = \bb - \bA \bx\]

and a step size parameter \(\alpha = \frac{1}{c}\). We also assume that the thresholding function \(\mathbf{T}\) is provided by the user.

This simplifies the iteration to:

(22)\[\bx_{i+1} = \bB \mathbf{T}\left ( \bB^H \left (\bx_i + \alpha \bA^H \br_i \right ) \right )\]

Algorithm state

The current state of the algorithm is described by the following quantities:

  • x : Current estimate of \(\bx\)

  • r : Current residual \(\br\)

  • r_norm_sqr : Squared norm of the residual \(\| \br \|_2^2\)

  • x_change_norm : Norm of the change in the estimate, \(\|\bx_{i+1} - \bx_{i} \|_2\)

  • iterations : Number of iterations of the algorithm run so far

Algorithm initialization

We initialize the algorithm with the following quantities (a sketch in code follows the list):

  • \(\bx \leftarrow \bx_0\), an initial estimate of the solution given by the user. It can be \(\bzero\).

  • \(\br \leftarrow \bb - \bA \bx_0\)

  • r_norm_sqr \(\leftarrow \| \br \|_2^2\)

  • Set x_change_norm to a very high value (since there is no \(\bx_{-1}\) to compare against).

  • iterations \(\leftarrow 0\)
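A minimal sketch of this initialization, assuming the same NumPy setting as the earlier sketches (the dictionary keys mirror the state table above; `np.inf` stands in for the "very high value"):

```python
import numpy as np

def ista_init(A, b, x0=None):
    """Build the initial state of the IST algorithm."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    r = b - A @ x                          # initial residual
    return {
        "x": x,
        "r": r,
        "r_norm_sqr": np.linalg.norm(r) ** 2,
        "x_change_norm": np.inf,           # no previous estimate to compare against
        "iterations": 0,
    }
```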

Algorithm iteration

The following steps are involved in each iteration. They directly implement (22); a sketch in code follows the list.

  • Gradient step: \(\bg \leftarrow \alpha \bA^H \br\)

  • Update estimate: \(\bx \leftarrow \bx + \bg\)

  • Transform: \(\ba \leftarrow \bB^H \bx\)

  • Threshold: \(\ba \leftarrow \mathbf{T} (\ba)\)

  • Inverse transform: \(\bx \leftarrow \bB \ba\)

  • Update residual: \(\br \leftarrow \bb - \bA \bx\)
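A sketch of one such iteration in the same setting, assuming \(\bB\) is an orthonormal basis stored as a matrix and \(\mathbf{T}\) is a user-supplied thresholding function; it consumes and returns the state dictionary from the initialization sketch above:

```python
import numpy as np

def ista_iteration(state, A, B, b, alpha, T):
    """One iteration of (22), following the steps listed above."""
    x, r = state["x"], state["r"]
    g = alpha * A.conj().T @ r             # gradient step
    x_new = x + g                          # update the estimate
    a = B.conj().T @ x_new                 # transform into the sparsifying basis
    a = T(a)                               # threshold the representation
    x_new = B @ a                          # inverse transform back to signal space
    r_new = b - A @ x_new                  # update the residual
    return {
        "x": x_new,
        "r": r_new,
        "r_norm_sqr": np.linalg.norm(r_new) ** 2,
        "x_change_norm": np.linalg.norm(x_new - x),
        "iterations": state["iterations"] + 1,
    }
```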