Puzzling performance issue

I compared MXNet performance between Mathematica and Python and observed a difference of more than an order of magnitude.

My NN is an MLP for regression with 3 float inputs, hidden layers of 8, 16, 24 and 8 neurons, and 2 float outputs. Sigmoid is used everywhere except on the input and output neurons. The optimizer used in Mathematica is Adam, so I used it in Python too, with the same parameters. The training dataset contains 4215 records mapping xyY colors to Munsell Hue and Chroma.
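
To give a sense of scale, the size of this network can be worked out by hand. A minimal sketch, assuming fully connected layers with one bias per neuron (the function name is only illustrative):

```python
# Layer widths from the description: 3 inputs, four hidden layers, 2 outputs.
layer_sizes = [3, 8, 16, 24, 8, 2]

def count_parameters(sizes):
    # Assumption: dense layers, each with a weight matrix and a bias vector.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

total_params = count_parameters(layer_sizes)
print(total_params)  # 802
```

With only a few hundred trainable parameters, the arithmetic per update is tiny.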

Mathematica is version 11.2, released in 2017; it uses MXNet under the hood for deep learning tasks. On the Python side, I use the latest release with mxnet-mkl, and I checked that MKL is enabled.

My Mathematica licence runs on an MS Surface Pro notebook with Windows 10 and an i7-7660U (2.5 GHz, 2 cores, 4 hyperthreads, AVX2). I ran Python on the same computer for comparison.

Here are the times for training loops of 32768 epochs with batch sizes of 128, 256, 512, 1024, 2048 and 4096:
Mathematica: 8m12s, 5m14s, 3m34s, 2m57s, 3m4s, 3m48s
PythonMxNet: 286m, 163m, 93m, 65m, 49m, 47m
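
The figures above can be turned into a slowdown ratio and a rough time per parameter update. A small sketch, assuming the last, partially filled batch of each epoch is kept (so the number of updates per epoch is rounded up):

```python
import math

DATASET_SIZE = 4215
EPOCHS = 32768
BATCH_SIZES = [128, 256, 512, 1024, 2048, 4096]

# Wall-clock times from the two rows above, converted to seconds.
MATHEMATICA_S = [8 * 60 + 12, 5 * 60 + 14, 3 * 60 + 34,
                 2 * 60 + 57, 3 * 60 + 4, 3 * 60 + 48]
PYTHON_S = [m * 60 for m in (286, 163, 93, 65, 49, 47)]

def batches_per_epoch(n_records, batch_size):
    # Assumption: the final partial batch is not dropped.
    return math.ceil(n_records / batch_size)

rows = []
for batch, t_wl, t_py in zip(BATCH_SIZES, MATHEMATICA_S, PYTHON_S):
    updates = EPOCHS * batches_per_epoch(DATASET_SIZE, batch)
    # (batch size, Python/Mathematica ratio, ms per update for each side)
    rows.append((batch, t_py / t_wl, 1000 * t_wl / updates, 1000 * t_py / updates))

for batch, slowdown, ms_wl, ms_py in rows:
    print(f"batch {batch:5d}: Python/Mathematica = {slowdown:5.1f}x, "
          f"ms per update: {ms_wl:.2f} vs {ms_py:.2f}")
```

The ratio shrinks from roughly 35x at batch size 128 to roughly 12x at 4096, which is what one would expect if a fixed cost per update weighs heavily on the Python side. That is only a hint, not a diagnosis.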

I tried the MXNet environment-variable tuning tricks suggested by Intel, but they only made training slower, by about 120%.
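
For reference, settings of this kind are normally put in place before mxnet is imported, so that the OpenMP runtime sees them at start-up. A minimal sketch, assuming OMP_NUM_THREADS and KMP_AFFINITY as the variables; the values and the helper name are illustrative assumptions, not a record of the exact configuration tested:

```python
import os

def apply_thread_settings(num_threads, affinity="granularity=fine,compact,1,0"):
    # Hypothetical helper: sets the OpenMP-related variables for this process.
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    os.environ["KMP_AFFINITY"] = affinity
    return {key: os.environ[key] for key in ("OMP_NUM_THREADS", "KMP_AFFINITY")}

# Assumption: one thread per physical core (2 on the i7-7660U).
settings = apply_thread_settings(2)
print(settings)

# The framework import would follow only after the variables are set:
# import mxnet as mx
```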

How should I interpret these results?
What could be causing such a performance difference?
Is there anything I can do to gain some performance under Python?
Or is this simply the unavoidable overhead imposed by the Python interpreter?