Sparse _backward_dot is slow

The sparse _backward_dot operator is slow, and its runtime seems to depend on the feature dimension rather than on the number of non-zero entries.

Example script: https://gist.github.com/saswatac/d362ae03f3e23342217faa88698ee1fe

The number of non-zero entries is kept constant and the feature dimension is varied.
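A minimal sketch of this kind of benchmark (hypothetical and simplified; the linked gist may differ in setup and sizes, and the feature dimensions here are kept small enough for the dense weight to fit in memory):

```python
import time
import numpy as np
import mxnet as mx

batch_size = 128
nnz_per_row = 100            # total nnz = batch_size * nnz_per_row, kept constant
output_dim = 10

for num_features in (1000000, 10000000):     # feature dim grows, nnz does not
    # Random csr batch with a fixed number of non-zeros per row
    # (occasional duplicate column indices are ignored for this rough timing).
    indptr = np.arange(0, batch_size * nnz_per_row + 1, nnz_per_row)
    indices = np.sort(np.random.randint(0, num_features,
                                        size=(batch_size, nnz_per_row)), axis=1).ravel()
    values = np.random.rand(batch_size * nnz_per_row).astype('float32')
    x = mx.nd.sparse.csr_matrix((values, indices, indptr),
                                shape=(batch_size, num_features))
    w = mx.nd.random.normal(shape=(num_features, output_dim))   # dense weight

    start = time.time()
    y = mx.nd.sparse.dot(x, w)               # forward: dot(csr, dense) -> dense
    mx.nd.waitall()
    fwd = time.time() - start

    dy = mx.nd.ones_like(y)
    start = time.time()
    # Weight gradient: dot(csr.T, dense) -> row_sparse, the path _backward_dot exercises
    grad_w = mx.nd.sparse.dot(x, dy, transpose_a=True)
    mx.nd.waitall()
    bwd = time.time() - start

    print('num_features=%d  dot %.4fs  backward dot %.4fs' % (num_features, fwd, bwd))
```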

With num_features = 1,000,000 the results are:

[screenshot of benchmark timings]

With num_features = 100,000,000 the results are:

[screenshot of benchmark timings]

As the results show, there is no appreciable difference in the time for the forward dot, but the time for the backward dot and the Adam update increases.
The benchmarking was done on CPU on a Mac notebook.

Is there scope for improving the _backward_dot operator? I tried to follow the code in https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/dot-inl.h, but could not work out why the backward dot depends on the dimension size.

A one-line summary of the implementation: the operator first computes a prefix sum over all candidate row indices of the row_sparse output, then uses that prefix sum to generate the output's data and sorted indices. The CPU and GPU approaches are almost the same, except that the GPU version leverages kernels available in NVIDIA's cub (http://nvlabs.github.io/cub/).

Profiling shows that the prefix sum computation takes about 50% of the time. Let N denote the feature_dim of the csr input and K the number of non-zero elements in it. In the current implementation, the prefix sum and the generation of sorted indices both take O(N) time, which is why the operator slows down in the case you mention.
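To make the cost concrete, here is a rough numpy sketch of the prefix-sum approach (illustration only, not the actual C++ kernel in dot-inl.h; the function name is made up). Steps 2 and 3 scan the full feature dimension, which is where the O(N) term comes from:

```python
import numpy as np

def backward_dot_prefix_sum(csr_data, csr_indices, csr_indptr, dy, num_features):
    """Illustrative dot(csr.T, dense) -> row_sparse via a prefix sum."""
    # 1. Mark which output rows are non-zero: O(K)
    row_flag = np.zeros(num_features, dtype=np.int64)
    row_flag[csr_indices] = 1

    # 2. Prefix sum over the full feature dimension: O(N).
    #    prefix[r] - 1 is the compact position of output row r.
    prefix = np.cumsum(row_flag)
    nnr = prefix[-1]                       # number of non-zero output rows

    # 3. Sorted row indices of the row_sparse output: another O(N) scan
    row_idx = np.nonzero(row_flag)[0]

    # 4. Accumulate the output data: O(K * output_dim)
    out_data = np.zeros((nnr, dy.shape[1]), dtype=dy.dtype)
    for i in range(len(csr_indptr) - 1):   # each row of the csr batch
        for j in range(csr_indptr[i], csr_indptr[i + 1]):
            out_data[prefix[csr_indices[j]] - 1] += csr_data[j] * dy[i]
    return row_idx, out_data
```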

In an alternative implementation, instead of computing the prefix sum, we can sort csr.indices and compute the unique indices from the sorted result. The time complexity is O(K log K), so it depends only on the number of non-zeros in the input csr, not on the feature_dim. Unfortunately, in our benchmark this implementation was about 5x slower than the current one for feature_dim = 2M and batch_size = 1K; the problem is that sorting is much less cache-friendly than a prefix sum. The alternative approach would be faster than the baseline only when K is very small and N is very large. Faster sorting implementations such as bitonic sort might help reduce the cost of the sort.
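For comparison, a numpy sketch of this sort-based alternative (again illustrative only; an actual kernel would implement the sort and deduplication in C++/CUDA) could look like:

```python
import numpy as np

def backward_dot_sort_based(csr_data, csr_indices, csr_indptr, dy):
    """Illustrative dot(csr.T, dense) -> row_sparse whose cost depends on K, not N."""
    # 1. Sort and deduplicate the column indices: O(K log K).
    #    np.unique sorts internally; inverse[j] is the compact output row of
    #    non-zero j, so no O(N) scan over the feature dimension is needed.
    row_idx, inverse = np.unique(csr_indices, return_inverse=True)

    # 2. Accumulate the output data: O(K * output_dim)
    out_data = np.zeros((len(row_idx), dy.shape[1]), dtype=dy.dtype)
    for i in range(len(csr_indptr) - 1):   # each row of the csr batch
        for j in range(csr_indptr[i], csr_indptr[i + 1]):
            out_data[inverse[j]] += csr_data[j] * dy[i]
    return row_idx, out_data
```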