Introduction to Sparse Subspace Clustering

Consider a dataset of \(S\) points in the ambient data space \(\mathbb{R}^M\) which have been assembled into a matrix \(Y\) of shape \(M \times S\).

In many applications, if we group or segment the dataset \(Y\) into multiple disjoint subsets (clusters) \(Y = Y_1 \cup \dots \cup Y_K\), then each subset can be modeled sufficiently well by a low dimensional subspace of \(\mathbb{R}^M\) with dimension \(D \ll M\). Applications include motion segmentation [BB91, CK98, Gea98, Kan01, PK97, TK91, TK92], face clustering [BJ03, HYL+03, LHK05] and handwritten digit recognition [ZSWL12].

Subspace clustering is a clustering framework which assumes that the dataset can be segmented into clusters such that points in different clusters are drawn from different subspaces. Subspace clustering algorithms simultaneously segment the data into clusters corresponding to different subspaces and estimate the subspaces from the data itself. A comprehensive review of subspace clustering can be found in [Vid10]. Several state-of-the-art algorithms build subspace preserving representations of individual data points by treating the dataset itself as a (self-expressive) dictionary. To create subspace preserving representations, they rely on sparse coding algorithms developed in the sparse representations and compressive sensing literature.

Two common algorithms are Sparse Subspace Clustering using \(\ell_1\) regularization (SSC-\(\ell_1\)) [EV09, EV13] and Sparse Subspace Clustering using Orthogonal Matching Pursuit (SSC-OMP) [DSB13, YRV16, YV15]. While SSC-\(\ell_1\) is guaranteed to give correct clustering under broad conditions (arbitrary subspaces and corrupted data), it requires solving a large scale convex optimization problem. SSC-OMP, on the other hand, is computationally efficient, but its clustering accuracy degrades when the density of data points per subspace is low.

The dataset \(Y\) is modeled as being sampled from a collection or arrangement \(\mathcal{U}\) of linear (or affine) subspaces \(\mathcal{U}_k \subset \mathbb{R}^M\) : \(\mathcal{U} = \{ \mathcal{U}_1 , \dots , \mathcal{U}_K \}\). The union of the subspaces is denoted as \(Z_{\mathcal{U}} = \mathcal{U}_1 \cup \dots \cup \mathcal{U}_K\).

Let the dataset be \(\{ y_j \in \mathbb{R}^M \}_{j=1}^S\), drawn from the union of subspaces under consideration, where \(S\) is the total number of data points being analyzed simultaneously. We put the data points together in a data matrix as

(1)\[Y \triangleq \begin{bmatrix} y_1 & \dots & y_S \end{bmatrix}.\]

Let the vectors be drawn from a set of \(K\) (linear or affine) subspaces. The subspaces are indexed by a variable \(k\) with \(1 \leq k \leq K\), and the \(k\)-th subspace is denoted by \(\mathcal{U}_k\). Let the (linear or affine) dimension of the \(k\)-th subspace be \(\dim(\mathcal{U}_k) = D_k\) with \(D_k \leq D \ll M\).

The vectors in \(Y\) can be grouped (or segmented or clustered) as submatrices \(Y_1, Y_2, \dots, Y_K\) such that all vectors in \(Y_k\) are drawn from the subspace \(\mathcal{U}_k\). Thus, we can write

(2)\[Y^* = Y \Gamma = \begin{bmatrix} y_1 & \dots & y_S \end{bmatrix} \Gamma = \begin{bmatrix} Y_1 & \dots & Y_K \end{bmatrix}\]

where \(\Gamma\) is an unknown \(S \times S\) permutation matrix that groups together the vectors belonging to each subspace.
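For illustration, if the cluster labels were known, \(\Gamma\) could be realized by sorting the columns of \(Y\) by label. A minimal sketch in JAX, with made-up labels and placeholder data:

```python
import jax.numpy as jnp

# Hypothetical labels assigning each of S = 6 points to one of K = 2 subspaces
labels = jnp.array([1, 0, 1, 0, 0, 1])
order = jnp.argsort(labels)          # column order that groups points by subspace

S = labels.shape[0]
Gamma = jnp.eye(S)[:, order]         # permutation matrix: (Y @ Gamma)[:, j] = Y[:, order[j]]

Y = jnp.arange(18.0).reshape(3, S)   # placeholder 3 x 6 data matrix for the check
assert jnp.allclose(Y @ Gamma, Y[:, order])
```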

Let there be \(S_k\) vectors in \(Y_k\) with \(S = S_1 + \dots + S_K\). Let \(Q_k\) be a matrix whose columns form an orthonormal basis for the subspace \(\mathcal{U}_k\). Then, the subspaces can be described as

(3)\[\mathcal{U}_k = \{ y \in \mathbb{R}^M : y = \mu_k + Q_k \alpha, \; \alpha \in \mathbb{R}^{D_k} \}, \quad 1 \leq k \leq K\]

where \(\mu_k \in \mathbb{R}^M\) is an offset vector. For linear subspaces, \(\mu_k = 0\).
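To make the model concrete, the following sketch samples such a dataset from \(K\) random linear subspaces (so \(\mu_k = 0\)), following (1)-(3). All sizes here are illustrative choices, not library defaults:

```python
import jax
import jax.numpy as jnp

def sample_union_of_subspaces(key, M=50, K=3, D=4, S_k=100):
    """Sample S_k points from each of K random D-dimensional
    linear subspaces of R^M and assemble them into an M x S matrix."""
    blocks = []
    for k in range(K):
        key, k1, k2 = jax.random.split(key, 3)
        # Q_k: orthonormal basis via QR of a random Gaussian matrix
        Q_k, _ = jnp.linalg.qr(jax.random.normal(k1, (M, D)))
        # y = Q_k alpha with mu_k = 0 (linear subspace)
        alpha = jax.random.normal(k2, (D, S_k))
        blocks.append(Q_k @ alpha)
    Y = jnp.concatenate(blocks, axis=1)       # M x S data matrix, S = K * S_k
    labels = jnp.repeat(jnp.arange(K), S_k)   # ground-truth segmentation
    return Y, labels

Y, labels = sample_union_of_subspaces(jax.random.PRNGKey(0))
```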

A dataset where each point can be expressed as a linear combination of the other points in the dataset is said to satisfy the self-expressiveness property. The self-expressive representation of a point \(y_s\) in \(Y\) is given by

(4)\[y_s = Y c_s, \; c_{ss} = 0, \text{ or } Y = Y C, \quad \text{diag}(C) = 0\]

where \(C = \begin{bmatrix}c_1 & \dots & c_S \end{bmatrix} \in \mathbb{R}^{S \times S}\) is the matrix of representation coefficients.
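As a toy illustration of (4), one can compute a representation of a single point from the remaining columns. The least-squares solve below is only a dense stand-in for the sparse solvers that SSC actually uses:

```python
import jax.numpy as jnp

def self_expressive_lstsq(Y, s):
    """Represent column s of Y using the other columns, so that
    y_s = Y @ c with c[s] = 0 (dense least-squares stand-in)."""
    S = Y.shape[1]
    mask = jnp.arange(S) != s
    c_rest, *_ = jnp.linalg.lstsq(Y[:, mask], Y[:, s])
    return jnp.zeros(S).at[mask].set(c_rest)   # re-insert the zero at position s

# Sanity check: the residual vanishes when the other columns span y_s
# c = self_expressive_lstsq(Y, 0); print(jnp.linalg.norm(Y @ c - Y[:, 0]))
```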

Let \(y_s\) belong to the \(k\)-th subspace \(\mathcal{U}_k\). Let \(Y^{-s}\) denote the dataset \(Y\) excluding the point \(y_s\), and let \(Y_k^{-s}\) denote the set of points in \(Y_k\) excluding \(y_s\). If \(Y_k^{-s}\) spans the subspace \(\mathcal{U}_k\), then a representation of \(y_s\) can be constructed entirely from the points in \(Y_k^{-s}\). A representation is called subspace preserving if it uses only points from the same subspace.

If \(c_i\) is a subspace preserving representation of \(y_i\) and \(y_j\) belongs to a different subspace, then \(c_{ij} = 0\). Thus, if \(C\) consists entirely of subspace preserving representations, then \(C_{ij} = 0\) whenever \(y_i\) and \(y_j\) belong to different subspaces. In other words, if \(Y_{-k}\) denotes the set of points from all subspaces excluding the points \(Y_k\) drawn from the subspace \(\mathcal{U}_k\) containing \(y_i\), then the points in \(Y_{-k}\) do not participate in the representation \(c_i\).
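When ground-truth labels are available (e.g. on synthetic data), the degree to which \(C\) is subspace preserving can be quantified as the fraction of absolute coefficient mass that each representation places on its own subspace. This diagnostic helper is our own sketch, not part of the package:

```python
import jax.numpy as jnp

def subspace_preserving_rate(C, labels):
    """Average fraction of |C| mass that each column c_i places on
    points from its own subspace (1.0 means fully subspace preserving)."""
    same = labels[:, None] == labels[None, :]   # same[j, i]: y_j and y_i share a subspace
    abs_C = jnp.abs(C)
    within = jnp.sum(jnp.where(same, abs_C, 0.0), axis=0)
    total = jnp.sum(abs_C, axis=0) + 1e-12
    return jnp.mean(within / total)
```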

In the cr.sparse.cluster.ssc package, we provide a version of OMP which can be used to construct the sparse self-expressive representations \(C\) of \(Y\). Once the representation has been constructed, we compute an affinity matrix \(W = |C| + |C^T|\).
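The affinity step itself is a one-liner, sketched here densely (in the package the coefficients are kept in sparse form):

```python
import jax.numpy as jnp

def affinity_matrix(C):
    """Symmetric affinity W = |C| + |C^T| from the coefficient matrix."""
    A = jnp.abs(C)
    return A + A.T
```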

We then apply spectral clustering on \(W\) to complete SSC-OMP. For this, we have written a JAX version of spectral clustering in the cr.sparse.cluster.spectral package. In particular, it uses our own Lanczos Bidiagonalization with Partial Reorthogonalization (LANBPRO) algorithm to compute the \(K\) largest singular values of the normalized affinity matrix in as few iterations as possible. The intermediate variables \(C\) and \(W\) are maintained as sparse matrices in the experimental BCOO format. The LANBPRO algorithm also works on sparse matrices directly. Thus, even though \(C\) is of size \(S \times S\), it can be stored efficiently in \(O(DS)\) storage, since each sparse representation has on the order of \(D\) nonzero coefficients. This enables us to process hundreds of thousands of points efficiently.
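The exact entry points of cr.sparse.cluster.spectral are not reproduced here. As a rough dense sketch of this pipeline stage, the following uses jnp.linalg.svd in place of LANBPRO and a basic Lloyd's k-means on the embedding rows:

```python
import jax
import jax.numpy as jnp

def spectral_clustering_dense(key, W, K, n_iters=30):
    """Dense spectral clustering sketch: normalize W, embed points via the
    top-K singular vectors, then run a basic Lloyd's k-means on the rows."""
    # Symmetric normalization A = D^{-1/2} W D^{-1/2}
    d_isqrt = 1.0 / jnp.sqrt(jnp.sum(W, axis=1) + 1e-12)
    A = d_isqrt[:, None] * W * d_isqrt[None, :]

    # Top-K singular vectors (the package uses LANBPRO on sparse W instead)
    U, _, _ = jnp.linalg.svd(A)
    V = U[:, :K]
    V = V / (jnp.linalg.norm(V, axis=1, keepdims=True) + 1e-12)

    # Lloyd's k-means on the rows of V
    centers = V[jax.random.choice(key, V.shape[0], (K,), replace=False)]
    for _ in range(n_iters):
        dists = jnp.linalg.norm(V[:, None, :] - centers[None, :, :], axis=-1)
        assign = jnp.argmin(dists, axis=1)
        centers = jnp.stack([
            jnp.where((assign == k)[:, None], V, 0.0).sum(axis=0)
            / jnp.maximum((assign == k).sum(), 1)
            for k in range(K)
        ])
    return assign

# labels_pred = spectral_clustering_dense(jax.random.PRNGKey(1), affinity_matrix(C), K=3)
```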