# Benchmarking - results (2/2)

This is the continuation of the post Benchmarking - motivation and
tools where we
argued why we chose `pytest-benchmark`

for the benchmarks. Here we will present some of
the results obtained. In particular, we will compare how the python packages NumPy,
SciPy, QuTiP and TensorFlow perform for basic linear algebra operations. Hence, we will
have 6 different objects to represent data:

- NumPy:
`ndarray`

- SciPy:
`sparse.csr_matrix`

- QuTip 5.0 (still in development):
`Dense`

(should be compared to ndarray) - QuTip 5.0 (still in development):
`CSR`

(should be compared to SciPy’s`sparse.csr_matrix`

) - TensorFlow:
`Tensor`

(operates using a GPU and should be compared to`Dense`

and`ndarray`

).

The operations will be performed in two different matrix types, dense and sparse, with
variable input shape (NxN) with up to $N=2^{10}$ that would represent a simulation with 10
qubits. For the dense matrices we will use a random Hermitian matrix whereas for the
sparse case we will use a tridiagonal Hermitian matrix. The idea here is that we expect
TensorFlow to beat QuTiP’s `Dense`

and NumPy’s `ndarray`

only for sufficiently large N
as `Tensor`

operations suffer from a memory transfer overhead. I include both dense and
sparse input matrices as both have applications in QuTiP (which is one of the reasons
why QuTiP 5.0 will include them) and I was curious to see how they compared with
`ndarray`

and `Dense`

in the sparse matrix case.

The hardware where these tests were run is:

- CPU: Intel Core i7-6700
- RAM: 16GB
- GPU: GTX 970
- Python 3.9.5
- Numpy 1.19.5
- TensorFlow 2.5.0
- SciPy 1.7.0
- QuTiP 5.0.0.dev (date: 06-07-2021)
- CUDA version: 11.3

## Element-wise addition.

The first operations that I will show is the addition operation. This is an
element-wise operation that should be easily parallelized in a GPU. In the figure below
we plot the time it takes to perform 100 additions of matrices of shape $N\times N$. We
see that for small matrices, NumPy has the best performance. The apparent difference in
operation time for small N is mainly due to the instantiation of the class that contains
the array, which is slower for QuTiP than for NumPy. One of the reasons for this is that
`Qobj`

, the object we operate with, is a wrapper around the class `Data`

that can be
either `Dense`

or `CSR`

whereas the `ndarray`

of NumPy is not wrapped by anything else.
However, it is quite surprising that the `CSR`

case performs noticeably faster that then
SciPy’s case. This is due to safety checks that `sparse.csr_matrix`

performs when
creating a new array. `CSR`

does not include some of these checks as it has very tight
control of the input and output data. To understand the TensorFlow case, we need to
mention that the benchmarked operation includes the time to move data from the GPU to
the CPU. This is why we benchmarked 100 addition operations together instead of only one
addition operation as we will usually perform multiple operations before retrieving the
data from the GPU. We see that TensorFlow performs worse than the rest for small matrix
size but for approximately $N>2^{8}$, it performs faster as memory overhead is no longer
the limitation in the operation.

For the case where we operate with a sparse (tridiagonal) matrix, both SciPy’s `CSR`

and
specially QuTiP’s `CSR`

data representations show a faster operation time than the rest.
This is because the number of operations with an sparse representation of the data
scales with the number of non-zero elements, which for a tridiagonal matrix means that
the `CSR`

representations require $\mathcal{O}(N)$ operations (compared to the
$\mathcal{O}(N^2)$ operations for the “dense” data representations employed in QuTiP’s
`Dense`

, TensorFlow and NumPy).

All this data suggests that there is some speed-up to benefit from when using the GPU for linear algebra operations. However, it should be noted that element-wise operations such addition represent the best case scenario. I will now show how the considered Python packages perform when considering more complex operations such as matrix multiplication, matrix exponentiation or eigenvalue solver.

## Matrix multiplication

When we benchmark the matrix multiplication operation we obtain the results shown below.
This operation is performed 20 times which reduces the impact of the memory transfer
overhead in GPU and represents a more realistic use-case of the matrix multiplication.
Contrary to what happens in the previous case for matrix addition, TensorFlow does not perform faster for
$N>2^{8}$. This quite a surprising result as, although matrix multiplication is a more
involved operation than element-wise addition, I would naively expect to still be
relatively easily parallelizable. Furthermore, unlike element-wise addition, matrix
multiplication is a compute-bound operation^{1}.

There are several articles where they address this operation using different hardware
^{2} ^{3}. In those articles they find that indeed, GPU performs faster than CPU matrix
multiplication. There are currently two reasons I think this may not happen with my
hardware. One is that there may be bug in the code. You can find the
code employed for these results here. The
other reason that I am considering is that the GPU I use is too old and does not include
the latest hardware improvements. In particular, it turns out that current RTX GPUs include tensor
cores which are supposed to bring a noticeable improvement in the matrix multiplication
performance.

## Matrix exponentiation and eigenvalue decomposition.

The last two operations that have been benchmarked are matrix exponentiation and
eigenvalue decomposition. These are run a single time (although they are repeated 5
times to obtain a statistical average) as they are already time consuming by themselves.
The results are shown below and they are similar to what I showed previously for matrix
multiplication. However, the results for matrix exponentiation are a little bit
surprising. It turns out that there is a range of matrix sizes for which TensorFlow is
*faster*. For any other value, greater or lower than this range, TensorFlow is slower. I
do not fully understand why this is the
case but I think is related with the CPU core utilization as the jump in performance for
the CPU implementations occurs when multiple cores are used instead of a single one.

## Conclusions.

The results presented in this blog are somewhat discouraging although not disastrous.
They show that, with *my* personal computer hardware it may not be straightforward to obtain a
significant speed-up in matrix operations. It is yet to be seen if using an RTX series
graphics card could provide a significant improvement due to its tensor cores.
Nevertheless, there are a few other features, such us auto differentiation and seamless
integration between TensorFlow and QuTiP that still motivate the development of
qutip-tensorflow. Furthermore, qutip-tensorflow would allow to use Google Colab’s GPUs
which I found to have similar performance to my CPU operations using NumPy.
This is specially interesting for researches/students that only have
access to a laptop and still want to run computation heavy simulations in QuTiP.

For both addition and matrix multiplication the memory requirements scale as $\mathcal{O}(N^2)$, with $N\times N$ the shape of the matrix. However, the number of basic (multiplication or addition of a complex number) operations scales a $\mathcal{O}(N^2)$ for addition and $\mathcal{O}(N^3)$ for matrix multiplication. This is why addition is considered memory-bound while matrix multiplication is considered compute-bound. ↩︎

The main page of CuPy shows a benchmark with the relative speedup of a GPU matrix multiplication vs a CPU matrix multiplication. The original blog post is this and unfortunately they do not provide the code for their benchmarks. ↩︎

Huang, Zhibin & Ma, Ning & Wang, Shaojun & Peng, Yu. (2019). GPU computing performance analysis on matrix multiplication. The Journal of Engineering. 2019. 10.1049/joe.2018.9178. ↩︎