The distributed training mode would require the shared libraries (.so files) from OpenCV and OpenBLAS to be installed on all nodes and be available via LD_LIBRARY_PATH.
This is something I am trying to avoid, because it doesn’t look like a good solution to me. What would you guys suggest alternatively?
Would distributing the .so files using the --files option of spark-submit and loading them at runtime from the code be a good idea?
What do you suggest for running the distributed version across different nodes with YARN?
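For reference, the "load at runtime" idea from the question could look roughly like the sketch below. It is written in Python for brevity; a JVM job would do the equivalent with System.load. It assumes the .so files shipped with --files end up in the container's working directory, and that loading them with RTLD_GLOBAL before libmxnet.so lets the dynamic loader reuse them by soname. The helper name preload_shared_libs is hypothetical, and the libraries have to be listed in dependency order.

```python
import ctypes
import os


def preload_shared_libs(paths):
    """Load each shared library with RTLD_GLOBAL, in the order given.

    Returns the ctypes handles so the caller can keep them referenced.
    Raises FileNotFoundError if a path does not point to a file.
    """
    handles = []
    for path in paths:
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
        # RTLD_GLOBAL makes the library's symbols visible to libraries
        # that are loaded afterwards.
        handles.append(ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL))
    return handles


# Hypothetical usage on an executor (file names taken from the ldd output
# in this thread; the working-directory location is an assumption):
#
#   preload_shared_libs([
#       os.path.abspath("libopenblas.so.0"),
#       os.path.abspath("libopencv_core.so.2.4"),
#   ])
#   import mxnet  # loaded only after its dependencies are in the process
```

This avoids touching LD_LIBRARY_PATH on the nodes, but every transitive dependency still has to be shipped and loaded in the right order, which is why static linking tends to be less fragile.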
One solution would be to statically link these libraries in the MXNet library.
The BLAS library (Intel MKL), for example, comes with the pip installation of MXNet.
OpenCV comes statically linked into libmxnet.so as well.
Are you building from source?
ldd libmxnet.so
linux-vdso.so.1 => (0x00007ffe42173000)
libcudart.so.9.2 => /usr/local/cuda/lib64/libcudart.so.9.2 (0x00007fc781a0b000)
libcublas.so.9.2 => /usr/local/cuda/lib64/libcublas.so.9.2 (0x00007fc77df96000)
libcurand.so.9.2 => /usr/local/cuda/lib64/libcurand.so.9.2 (0x00007fc77a058000)
libcusolver.so.9.2 => /usr/local/cuda/lib64/libcusolver.so.9.2 (0x00007fc772bf2000)
libmklml_intel.so => /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/./libmklml_intel.so (0x00007fc76af8f000)
libiomp5.so => /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/./libiomp5.so (0x00007fc76abb3000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fc76a9ab000)
libmkldnn.so.0 => /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/./libmkldnn.so.0 (0x00007fc769f87000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc769d83000)
libgfortran.so.3 => /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/./libgfortran.so.3 (0x00007fc769a5c000)
libcufft.so.9.2 => /usr/local/cuda/lib64/libcufft.so.9.2 (0x00007fc764402000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc764080000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc763d77000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc763b61000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc763944000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc76357a000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc7acdc4000)
libquadmath.so.0 => /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/./libquadmath.so.0 (0x00007fc76333a000)
@ThomasDelteil
Yes, I am building from source.
I’ve used the following flags to try to build the static library:
CC='gcc -static-libstdc++'
ADD_LDFLAGS+= $(pkg-config --libs --static opencv) $(pkg-config --libs --static openblas)
ADD_CFLAGS+= -Wall $(pkg-config --cflags opencv) $(pkg-config --cflags openblas)
However, I don’t think the resulting libmxnet.so is statically linked.
Here’s the output of the ldd command on libmxnet.so:
ldd libmxnet.so
linux-vdso.so.1 => (0x00007ffcf9fbd000)
libopenblas.so.0 => /usr/local/lib/libopenblas.so.0 (0x00007f0e8db1d000)
librt.so.1 => /lib64/librt.so.1 (0x00007f0e8d90d000)
libopencv_calib3d.so.2.4 => /usr/local/lib/libopencv_calib3d.so.2.4 (0x00007f0e8d631000)
libopencv_contrib.so.2.4 => /usr/local/lib/libopencv_contrib.so.2.4 (0x00007f0e8d33f000)
libopencv_core.so.2.4 => /usr/local/lib/libopencv_core.so.2.4 (0x00007f0e8ce7f000)
libopencv_features2d.so.2.4 => /usr/local/lib/libopencv_features2d.so.2.4 (0x00007f0e8cbd0000)
libopencv_flann.so.2.4 => /usr/local/lib/libopencv_flann.so.2.4 (0x00007f0e8c959000)
libopencv_highgui.so.2.4 => /usr/local/lib/libopencv_highgui.so.2.4 (0x00007f0e8c4fe000)
libopencv_imgproc.so.2.4 => /usr/local/lib/libopencv_imgproc.so.2.4 (0x00007f0e8c00f000)
libopencv_legacy.so.2.4 => /usr/local/lib/libopencv_legacy.so.2.4 (0x00007f0e8bced000)
libopencv_ml.so.2.4 => /usr/local/lib/libopencv_ml.so.2.4 (0x00007f0e8ba66000)
libopencv_nonfree.so.2.4 => /usr/local/lib/libopencv_nonfree.so.2.4 (0x00007f0e8b829000)
libopencv_objdetect.so.2.4 => /usr/local/lib/libopencv_objdetect.so.2.4 (0x00007f0e8b5a6000)
libopencv_ocl.so.2.4 => /usr/local/lib/libopencv_ocl.so.2.4 (0x00007f0e8b1c2000)
libopencv_photo.so.2.4 => /usr/local/lib/libopencv_photo.so.2.4 (0x00007f0e8afa3000)
libopencv_stitching.so.2.4 => /usr/local/lib/libopencv_stitching.so.2.4 (0x00007f0e8ad39000)
libopencv_superres.so.2.4 => /usr/local/lib/libopencv_superres.so.2.4 (0x00007f0e8aafb000)
libopencv_video.so.2.4 => /usr/local/lib/libopencv_video.so.2.4 (0x00007f0e8a89e000)
libopencv_videostab.so.2.4 => /usr/local/lib/libopencv_videostab.so.2.4 (0x00007f0e8a65f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0e8a441000)
libm.so.6 => /lib64/libm.so.6 (0x00007f0e8a1bd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f0e89fb9000)
libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x00007f0e89caf000)
libgomp.so.1 => /usr/local/lib64/libgomp.so.1 (0x00007f0e89aa1000)
libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x00007f0e8988b000)
libc.so.6 => /lib64/libc.so.6 (0x00007f0e894f6000)
/lib64/ld-linux-x86-64.so.2 (0x000055f2f4112000)
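To see at a glance which of these dependencies would still have to exist on every node, the ldd output can be filtered for libraries resolved outside the standard system directories. The sketch below is only an illustration: the function name non_system_deps and the list of "system" prefixes are assumptions, and the prefixes should be adjusted to match the cluster's base image.

```python
import re

# Matches lines such as:
#   libfoo.so.1 => /usr/local/lib/libfoo.so.1 (0x00007f0e8db1d000)
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")

# Assumption: libraries under these prefixes are present on every node.
SYSTEM_PREFIXES = ("/lib/", "/lib64/", "/usr/lib/", "/usr/lib64/")


def non_system_deps(ldd_output, system_prefixes=SYSTEM_PREFIXES):
    """Return {soname: path} for dependencies resolved outside system dirs."""
    deps = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match is None:
            # Skips linux-vdso, the dynamic loader line, and "not found" lines.
            continue
        name, path = match.groups()
        if not path.startswith(system_prefixes):
            deps[name] = path
    return deps


if __name__ == "__main__":
    import subprocess
    import sys

    # Usage: python check_deps.py /path/to/libmxnet.so
    out = subprocess.run(
        ["ldd", sys.argv[1]], capture_output=True, text=True, check=True
    ).stdout
    for name, path in sorted(non_system_deps(out).items()):
        print(name, "->", path)
```

Run against the output above, this would list libopenblas.so.0 and all of the libopencv_*.so.2.4 entries, along with the libstdc++, libgomp and libgcc_s copies under /usr/local/lib64. A build that links OpenCV and OpenBLAS statically should no longer list the first two groups.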
Here’s the Dockerfile to show you the exact setup steps that I’m following.
What do I need to change to get the statically linked library?
@adwivedi Have a look at this script, which shows how to build the static library that we release on pip.
(Link: apache/incubator-mxnet on GitHub)