Orthogonal Matching Pursuit
=============================


.. rubric:: Speed benchmarks for JAX implementation

Each row of the following table describes:

* problem type and configuration (M x N is dictionary size, K is sparsity level)
* Average time taken in CPU/GPU configurations
* Speed improvement ratios

.. rubric:: System used

* All benchmarks have been generated on Google Colab
* CPU and GPU configurations Google Colab have been used

.. list-table::
    :header-rows: 1

    * - M
      - N
      - K 
      - CPU
      - CPU + JIT
      - CPU / CPU + JIT
      - GPU 
      - GPU + JIT
      - GPU / GPU + JIT
      - CPU + JIT / GPU + JIT
    * - 256
      - 1024
      - 16
      - 148 ms
      - 8.27 ms
      - 17.9x
      - 139 ms
      - 1.28 ms
      - 108x
      - 6.46x

.. rubric:: Observations

* JIT (Just In Time) compilation seems to give significant performance improvements 
  in both CPU and GPU architectures
* Current implementation seems to be slower on GPU vs CPU with JIT. 
* GPU speed gain over CPU (with JIT on) is relatively meager. 
  On TensorFlow, people regularly report 30x improvements between CPU to GPU 
  for neural networks implemented using Keras. 


.. rubric:: Possible deficiencies

* There is opportunity to improve parallelization in the OMP implementation.
* Cholesky update based implements depends heavily on solving triangular systems.
* GPUs may not be great at solving triangular systems.