Fastest way to compute cosine similarities of ndarrays

Here is the solution in Python.

You can reproduce it in Scala using the MXNet Scala API.

import mxnet as mx
import time

tic = time.time()
first_term = mx.nd.random.uniform(shape=(10000,2048), ctx=mx.gpu())
second_term = mx.nd.random.uniform(shape=(10000,2048), ctx=mx.gpu())

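# Scale every row to unit L2 norm; keepdims keeps shape (10000, 1) so the division broadcasts
first_term_normalized = first_term / mx.nd.norm(first_term, axis=1, keepdims=True)
second_term_normalized = second_term / mx.nd.norm(second_term, axis=1, keepdims=True)

# Row-wise dot product: (10000, 1, 2048) x (10000, 2048, 1) -> (10000, 1, 1), squeezed to (10000,)
# Row i of the result is the cosine similarity between first_term[i] and second_term[i]
cosine_similarity = mx.nd.batch_dot(first_term_normalized.expand_dims(axis=1), second_term_normalized.expand_dims(axis=2)).squeeze()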
mx.nd.waitall()
print(time.time()-tic)
print(cosine_similarity)

(It takes about 10 ms on a GPU.)
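
To sanity-check the GPU result on a small input, the same row-wise computation can be sketched in NumPy. This is only an illustrative reference, not part of the MXNet solution: the function name rowwise_cosine_similarity and the eps guard against zero-length rows are assumptions added here.

```python
import numpy as np


def rowwise_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between a[i] and b[i] for every row i.

    Hypothetical helper for checking results; eps is an assumed guard
    that avoids dividing by zero when a row has zero norm.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit L2 norm, as in the MXNet version
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    # Dot product of matching rows
    return np.sum(a_unit * b_unit, axis=1)


a = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
b = np.array([[2.0, 0.0], [1.0, -1.0], [3.0, 4.0]])
sims = rowwise_cosine_similarity(a, b)
print(sims)  # parallel, orthogonal, and identical rows
```

For the three rows above, the values should be close to 1, 0 and 1. The MXNet output can be compared against this by copying it to the host with cosine_similarity.asnumpy().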