Horovod has arrived?

Hi,

I see no recent activity in Confluence or in this thread, and I see an MXNet section in Horovod’s GitHub… Shall I conclude that Horovod support for MXNet has arrived?

Hi @olivcruche,

Yes, Horovod support was added in MXNet v1.4 (with the exception of MKLDNN support, which has been added since). A blog post on the subject is coming out soon; you’re just ahead of the curve! :slight_smile:


omg this is HUGE news! can’t wait to benchmark this


Dear all,

Do we need to install MXNet from source to make Horovod run? I have installed the latest version (1.5.something) via pip, and I cannot make it run (yet) on my HPC environment.

All the best

Sorry, maybe I am asking a noob question, but what is “Horovod”?

Horovod is an open-source distributed deep learning framework created at Uber. It leverages efficient inter-GPU and inter-node communication methods such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI) to distribute and aggregate model parameters between workers. It optimizes the use of network bandwidth and scales very well with dense deep neural network models. It currently supports several mainstream deep learning frameworks, such as MXNet, TensorFlow, Keras, and PyTorch.
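To make that concrete, here is a minimal sketch of the usual Horovod + MXNet training pattern (the model and hyperparameters below are placeholders, not taken from any particular example):

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()                              # initialize Horovod (one process per GPU)
ctx = mx.gpu(hvd.local_rank())          # pin each process to its own GPU

net = mx.gluon.nn.Dense(10)             # placeholder model
net.initialize(ctx=ctx)

params = net.collect_params()
hvd.broadcast_parameters(params, root_rank=0)   # same initial weights on every worker

# DistributedTrainer averages gradients across all workers on each step
trainer = hvd.DistributedTrainer(params, 'sgd', {'learning_rate': 0.01})

Launched with e.g. mpirun -np 4 python train.py, each process drives one GPU and the gradient averaging happens over NCCL or MPI under the hood.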

You should be able to get things running with the packages from PyPI.

pip install mxnet
pip install horovod

And have you installed OpenMPI too? What errors are you seeing there?
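As a quick sanity check that the Horovod MXNet extension built and loads correctly, something like this one-liner (a minimal sketch; it should print a rank even in a single process) is worth running first:

python -c "import horovod.mxnet as hvd; hvd.init(); print('rank', hvd.rank(), 'of', hvd.size())"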


Hi @thomelane, thank you very much. I’ve been trying several things, which I will mention here. I just deleted everything and reinstalled from scratch, to start experimenting with possible solutions.

Environment: HPC cluster running Linux:

uname -r
4.4.140-94.42-default

MXNet pip installation (mxnet_cu92-1.5.0b20190407): I first installed MXNet, then Horovod. Prior to installing Horovod I loaded the following modules in my environment:

module load nccl/2.3.7-cuda92
module load openmpi/3.1.2-sharp
module load cudnn/v7.5.0-cuda92
module load gcc/8.3.0

This fixed the installation issues I had with Horovod.
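For reference, the full install sequence was roughly the following (the NCCL build flag is an assumption on my part, taken from Horovod’s install docs, not something I have verified on this cluster):

# after loading the modules above
pip install --pre mxnet-cu92                                     # CUDA 9.2 pre-release build of MXNet
HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod    # build Horovod against NCCL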

Attempt 1:
I used the following submit job file:

#!/bin/bash

#SBATCH --job-name="HVDTest"

#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:4
#SBATCH --mem=256gb

#SBATCH --mail-type=ALL
##SBATCH --mail-user=foivos.diakogiannis@data61.csiro.au

echo " "
echo " Nodelist = " $SLURM_JOB_NODELIST
echo " Number of nodes = " $SLURM_JOB_NUM_NODES
echo " "



#### Load modules 
module load nccl/2.3.7-cuda92
module load cudnn/v7.5.0-cuda92
module load openmpi/3.1.2-sharp
module load hpc-x

module list

####    Use MPI for communication with Horovod

export HOROVOD_GPU_ALLREDUCE=MPI
export HOROVOD_GPU_ALLGATHER=MPI
export HOROVOD_GPU_BROADCAST=MPI

####   Produce a timeline for debugging purposes
export HOROVOD_TIMELINE=./timeline.json
export NCCL_DEBUG=DEBUG


ulimit -s 20480 ####### Horovod recommends 10240 for this

echo "Running on multiple nodes and GPU devices"
echo ""
echo "Run started at:- "
date

##### Actual executable 
mpirun -np 8 -H server1:4,server2:4  -bind-to none -map-by slot -x HOROVOD_TIMELINE  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH  -mca pml ob1 -mca btl ^openib python ./mxnet_mnist.py

echo "Run completed at:- "
date

This is the output of my run:

 
 Nodelist =  b[003,006]
 Number of nodes =  2
 
Currently Loaded Modulefiles:
  1) SC                    5) intel-fc/16.0.4.258   9) nccl/2.3.7-cuda92
  2) slurm/17.11.8         6) cuda/9.2.88          10) cudnn/v7.5.0-cuda92
  3) cuda-driver/current   7) hpc-x/2.2.0          11) gcc/8.3.0
  4) intel-cc/16.0.4.258   8) openmpi/3.1.2-sharp
Running on multiple nodes and GPU devices

Run started at:- 
Mon Apr  8 17:54:58 AEST 2019
--------------------------------------------------------------------------
There are no allocated resources for the application:
  python
that match the requested mapping:
  -host: server1:4,server2:4

Verify that you have mapped the allocated resources properly for the
indicated specification.
--------------------------------------------------------------------------
Run completed at:- 
Mon Apr  8 17:54:58 AEST 2019

So it seems it doesn’t like the server1:4,server2:4 host mapping somehow.
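Presumably the placeholders server1/server2 have to be replaced with the hostnames SLURM actually allocated. Something like this inside the submit script should build the -H list (an untested sketch on my side):

# expand the SLURM nodelist (e.g. b[003,006]) into host:slots pairs -> b003:4,b006:4
HOSTS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | awk '{printf "%s%s:4", sep, $0; sep=","}')
mpirun -np 8 -H "$HOSTS" -bind-to none -map-by slot python ./mxnet_mnist.py   # other flags as before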

Attempt 2:
I removed the -H server1:4,server2:4 argument; now my run command is:

mpirun -np 8   -bind-to none -map-by slot -x HOROVOD_TIMELINE  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH  -x HOROVOD_MPI_THREADS_DISABLE=1 -mca pml ob1 -mca btl ^openib python ./mxnet_mnist.py

output:

Run started at:- 
Mon Apr  8 18:07:07 AEST 2019
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8 slots
that were requested by the application:
  python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

Attempt 3:

Now I added the following line to my submit job file:

#SBATCH --ntasks-per-node=4

and the run command is (everything else the same in my submit.job file):

mpirun -np 8   -bind-to none -map-by slot -x HOROVOD_TIMELINE  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH  -x HOROVOD_MPI_THREADS_DISABLE=1 -mca pml ob1 -mca btl ^openib python ./mxnet_mnist.py

With this change, this is my latest error:

Nodelist =  b[004,006]
 Number of nodes =  2

Currently Loaded Modulefiles:
  1) SC                    5) intel-fc/16.0.4.258   9) hpc-x/2.2.0
  2) slurm/17.11.8         6) cuda/9.2.88          10) openmpi/3.1.2-sharp
  3) cuda-driver/current   7) nccl/2.3.7-cuda92
  4) intel-cc/16.0.4.258   8) cudnn/v7.5.0-cuda92
Running on multiple nodes and GPU devices

Run started at:-
Mon Apr  8 18:35:26 AEST 2019
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
[b004:0:54084 - context.c:485] INFO job (ID: 4163633153) resource request quota: ( osts:32 user_data_per_ost:256 max_groups:4 max_qps:4 max_group_channels:1, num_trees:1)
[b004:0:54084 - context.c:628] INFO tree_info: tree idx:0 quota: ( osts:32 user_data_per_ost:256 max_groups:4 max_qps:4 max_group_channels:1)
[b006:8433][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b006:8432][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b006:8431][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b006:8434][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b004:54087][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b004:54084][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b004:54085][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b004:54088][bcol_basesmuma_component.c:501:hmca_bcol_basesmuma_init_query] BCOL-BASESMUMA Failed to create rcache for KNEM device
[b004:0:54084 - comm.c:417] INFO [group#:0] group id: 0 tree idx:0 rail_idx:0 group size:2 quota: ( osts:8 user_data_per_ost:256 ) mgid: ( subnet prefix: 0xff12a01bfe800000 interface id: 0xa70a00000000 ) mlid:c010
[b004:3:54088 unique id 1] WARN No available groups in sharp_alloc_groups_info.

[b004:3:54088 - comm.c:242] WARN sharp_alloc_groups_info failed: No available groups(-11)
[b004:54088:3][common_sharp.c:360:comm_sharp_coll_comm_init] SHArP: sharp group create failed:SHArP Group alloc error(-4)
[b004:54088:3][common_sharp.c:365:comm_sharp_coll_comm_init] SHArP: Fallback disabled, exiting..
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
[b004:54088] *** Process received signal ***
[b004:54088] Signal: Aborted (6)
[b004:54088] Signal code:  (-6)
[b006:8434:7][common_sharp.c:365:comm_sharp_coll_comm_init] SHArP: Fallback disabled, exiting..
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
[b004:54088] [ 0] [b006:08434] *** Process received signal ***
/lib64/libpthread.so.0(+0x10c10)[0x7ffff7bcec10]
[b004:54088] [ 1] [b006:08434] Signal: Aborted (6)
[b006:08434] Signal code:  (-6)
/lib64/libc.so.6(gsignal+0x37)[0x7ffff784df67]
[b004:54088] [ 2] /lib64/libc.so.6(abort+0x13a)[0x7ffff784f33a]
[b004:54088] [ 3] [b006:08434] [ 0] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7fffa9f393df]
[b004:54088] [ 4] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0x9cb16)[0x7fffa9f37b16]
[b004:54088] [ 5] /lib64/libpthread.so.0(+0x10c10)[0x7ffff7bcec10]
[b006:08434] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ffff784df67]
[b006:08434] [ 2] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0x9bf91)[0x7fffa9f36f91]
[b004:54088] [ 6] /lib64/libc.so.6(abort+0x13a)[0x7ffff784f33a]
[b006:08434] [ 3] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x33e)[0x7fffa9f3779d]
[b004:54088] [ 7] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7fffa9f393df]
[b006:08434] [ 4] /data/dia021/Software/anaconda3/bin/../lib/libgcc_s.so.1(+0xcf56)[0x7fffa9e6cf56]
[b004:54088] [ 8] /data/dia021/Software/anaconda3/bin/../lib/libgcc_s.so.1(_Unwind_RaiseException+0xe6)[0x7fffa9e6d244]
[b004:54088] [ 9] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(__cxa_throw+0x42)[0x7fffa9f37d1b]
[b004:54088] [10] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0x9cb16)[0x7fffa9f37b16]
[b006:08434] [ 5] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0x9bf91)[0x7fffa9f36f91]
[b006:08434] [ 6] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(_ZSt20__throw_system_errori+0x73)[0x7fffa9f533bd]
[b004:54088] [11] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x33e)[0x7fffa9f3779d]
[b006:08434] [ 7] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(_ZNSt6thread4joinEv+0x25)[0x7fffa9f53541]
[b004:54088] [12] /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0x640)[0x7fff734e5e30]
[b004:54088] [13] /data/dia021/Software/anaconda3/bin/../lib/libgcc_s.so.1(+0xcf56)[0x7fffa9e6cf56]
[b006:08434] [ 8] /data/dia021/Software/anaconda3/bin/../lib/libgcc_s.so.1(_Unwind_RaiseException+0xe6)[0x7fffa9e6d244]
[b006:08434] [ 9] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(__cxa_throw+0x42)[0x7fffa9f37d1b]
[b006:08434] [10] /lib64/libc.so.6(+0x37869)[0x7ffff7850869]
[b004:54088] [14] /lib64/libc.so.6(+0x378b5)[0x7ffff78508b5]
[b004:54088] [15] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(comm_sharp_coll_comm_init+0x446)[0x7fff72cc23d6]
[b004:54088] [16] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hmca_coll_ml_hierarchy_discovery+0x1138)[0x7fff72d51168]
[b004:54088] [17] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(_ZSt20__throw_system_errori+0x73)[0x7fffa9f533bd]
[b006:08434] [11] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(+0x2c326)[0x7fff72ccd326]
[b004:54088] [18] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hmca_coll_ml_comm_query+0x27d)[0x7fff72cd2f2d]
[b004:54088] [19] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hcoll_get_context_from_cache+0x7a9)[0x7fff72d4f8e9]
[b004:54088] [20] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hcoll_create_context+0xa5)[0x7fff72d4c415]
[b004:54088] [21] /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(_ZNSt6thread4joinEv+0x25)[0x7fffa9f53541]
[b006:08434] [12] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(mca_coll_hcoll_comm_query+0x13c)[0x7fff730916ec]
[b004:54088] [22] /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0x640)[0x7fff734e5e30]
[b006:08434] [13] /lib64/libc.so.6(+0x37869)[0x7ffff7850869]
[b006:08434] [14] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(mca_coll_base_comm_select+0x173)[0x7fff7307e003]
[b004:54088] [23] /lib64/libc.so.6(+0x378b5)[0x7ffff78508b5]
[b006:08434] [15] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(comm_sharp_coll_comm_init+0x446)[0x7fff72cc23d6]
[b006:08434] [16] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hmca_coll_ml_hierarchy_discovery+0x1138)[0x7fff72d51168]
[b006:08434] [17] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(+0x6696c)[0x7fff7302796c]
[b004:54088] [24] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(+0x2c326)[0x7fff72ccd326]
[b006:08434] [18] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hmca_coll_ml_comm_query+0x27d)[0x7fff72cd2f2d]
[b006:08434] [19] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(+0x688f6)[0x7fff730298f6]
[b004:54088] [25] /apps/openmpi/3.1.2-sharp/lib/libopen-pal.so.40(opal_progress+0x24)[0x7fff7065ebf4]
[b004:54088] [26] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hcoll_get_context_from_cache+0x7a9)[0x7fff72d4f8e9]
[b006:08434] [20] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(ompi_comm_activate+0xd1)[0x7fff730273f1]
[b004:54088] /apps/hpc-x/2.2.0/hcoll/lib/libhcoll.so.1(hcoll_create_context+0xa5)[0x7fff72d4c415]
[b006:08434] [21] [27] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(mca_coll_hcoll_comm_query+0x13c)[0x7fff730916ec]
[b006:08434] [22] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(ompi_comm_split+0xc2c)[0x7fff730244fc]
[b004:54088] [28] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(mca_coll_base_comm_select+0x173)[0x7fff7307e003]
[b006:08434] [23] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(+0x6696c)[0x7fff7302796c]
[b006:08434] [24] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(PMPI_Comm_split+0x11)[0x7fff7305b6c1]
[b004:54088] [29] /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x36962)[0x7fff734df962]
[b004:54088] *** End of error message ***
/apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(+0x688f6)[0x7fff730298f6]
[b006:08434] [25] /apps/openmpi/3.1.2-sharp/lib/libopen-pal.so.40(opal_progress+0x24)[0x7fff7065ebf4]
[b006:08434] [26] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(ompi_comm_activate+0xd1)[0x7fff730273f1]
[b006:08434] [27] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(ompi_comm_split+0xc2c)[0x7fff730244fc]
[b006:08434] [28] /apps/openmpi/3.1.2-sharp/lib/libmpi.so.40(PMPI_Comm_split+0x11)[0x7fff7305b6c1]
[b006:08434] [29] /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x36962)[0x7fff734df962]
[b006:08434] *** End of error message ***
[b004:0:54084 - comm.c:417] INFO [group#:0] group id: 2 tree idx:0 rail_idx:0 group size:2 quota: ( osts:8 user_data_per_ost:256 ) mgid: ( subnet prefix: 0xff12a01bfe800000 interface id: 0xa70a00000002 ) mlid:c010
[b004:1:54085 - comm.c:417] INFO [group#:0] group id: 1 tree idx:0 rail_idx:0 group size:2 quota: ( osts:8 user_data_per_ost:256 ) mgid: ( subnet prefix: 0xff12a01bfe800000 interface id: 0xa70a00000001 ) mlid:c010
[b004:2:54087 - comm.c:417] INFO [group#:0] group id: 3 tree idx:0 rail_idx:0 group size:2 quota: ( osts:8 user_data_per_ost:256 ) mgid: ( subnet prefix: 0xff12a01bfe800000 interface id: 0xa70a00000003 ) mlid:c010
Context of running::gpu(0)
Context of running::gpu(1)
Context of running::gpu(2)
Context of running::gpu(0)
Context of running::gpu(1)
Context of running::gpu(2)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 8434 on node b006 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Run completed at:-
Mon Apr  8 18:35:32 AEST 2019

It seems the problem is Open MPI related? I’ve noticed these lines:

 BCOL-BASESMUMA Failed to create rcache for KNEM device

I am still in the process of making it work; I want to do some tests with a large batch size. Any pointers are most welcome.
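One thing I plan to try, since the crash happens inside hcoll/SHArP rather than in Horovod itself, is switching those collectives off. Untested on my side; the MCA parameter and the HCOLL variable are taken from the Open MPI / Mellanox documentation:

mpirun -np 8 --mca coll_hcoll_enable 0 -x HCOLL_ENABLE_SHARP=0 -bind-to none -map-by slot python ./mxnet_mnist.py   # other flags as before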

All the best,
Foivos

Update: so I changed the modules loaded during the run to:

#!/bin/bash

#SBATCH --job-name="HVDTest"

#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=4 ##### This should be EQUAL to the number of GPUs per node for MPI; specifying gres=gpu:4 alone doesn't work
#SBATCH --gres=gpu:4
#SBATCH --mem=256gb

#SBATCH --mail-type=ALL
##SBATCH --mail-user=foivos.diakogiannis@data61.csiro.au

echo " "
echo " Nodelist = " $SLURM_JOB_NODELIST
echo " Number of nodes = " $SLURM_JOB_NUM_NODES
echo " "



#### Load modules 
module load cuda/9.2.88
module load cudnn/v7.5.0-cuda92
module load gcc/8.3.0
module load openmpi/4.0.0-simple-gcc
module load hpc-x

module list

####    Use MPI for communication with Horovod

export HOROVOD_GPU_ALLREDUCE=MPI
export HOROVOD_GPU_ALLGATHER=MPI
export HOROVOD_GPU_BROADCAST=MPI

####   Produce a timeline for debugging purposes
export HOROVOD_TIMELINE=./timeline.json
export NCCL_DEBUG=DEBUG


ulimit -s 20480 ####### Horovod recommends 10240 for this

echo "Running on multiple nodes and GPU devices"
echo ""
echo "Run started at:- "
date

##### Actual executable 
mpirun -np 8   -bind-to none -map-by slot -x HOROVOD_TIMELINE  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH  -x HOROVOD_MPI_THREADS_DISABLE=1 -mca pml ob1 -mca btl ^openib python ./mxnet_mnist.py

echo "Run completed at:- "
date

This time I got an MXNet bug, I think. This is my error file now (I am running the mxnet_mnist.py example provided by the official MXNet/Horovod repository):

  1) SC                         6) cuda/9.2.88
  2) slurm/17.11.8              7) cudnn/v7.5.0-cuda92
  3) cuda-driver/current        8) gcc/8.3.0
  4) intel-cc/16.0.4.258        9) openmpi/4.0.0-simple-gcc
  5) intel-fc/16.0.4.258       10) hpc-x/2.2.0
Running on multiple nodes and GPU devices

Run started at:-
Mon Apr  8 19:52:22 AEST 2019
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
/data/dia021/Software/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
INFO:root:Namespace(batch_size=256, dtype='float32', epochs=5, lr=0.01, momentum=0.9, no_cuda=False)
Context of running::gpu(3)
Context of running::gpu(2)
Context of running::gpu(1)
Context of running::gpu(0)
Context of running::gpu(1)
Context of running::gpu(2)
Context of running::gpu(3)
Context of running::gpu(0)
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-6/mnist.zip successfully
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-3/mnist.zip successfully
[19:52:36] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:36] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
[19:52:37] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:37] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-1/mnist.zip successfully
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-2/mnist.zip successfully
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-5/mnist.zip successfully
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-7/mnist.zip successfully
[19:52:39] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
[19:52:40] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
[b006:28759:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
[b008:49825:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x14580) [0x7fff70fd1580]
    1  /usr/lib64/libucs.so.0(+0x149d2) [0x7fff70fd19d2]
    2  /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x62607) [0x7fff7350b607]
    3  /data/dia021/Software/mxnet/libmxnet.so(+0x30c099e) [0x7fffc41aa99e]
    4  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4aba) [0x7fffc41aeaba]
    5  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4d2e) [0x7fffc41aed2e]
    6  /data/dia021/Software/mxnet/libmxnet.so(+0x30c104b) [0x7fffc41ab04b]
    7  /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fffa9f53678]
    8  /lib64/libpthread.so.0(+0x8724) [0x7ffff7bc6724]
    9  /lib64/libc.so.6(clone+0x6d) [0x7ffff7905e8d]
===================
[b008:49825:0] Process frozen...
[b008:49826:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x14580) [0x7fff70fd1580]
    1  /usr/lib64/libucs.so.0(+0x149d2) [0x7fff70fd19d2]
    2  /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x62607) [0x7fff7350b607]
    3  /data/dia021/Software/mxnet/libmxnet.so(+0x30c099e) [0x7fffc41aa99e]
    4  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4aba) [0x7fffc41aeaba]
    5  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4d2e) [0x7fffc41aed2e]
    6  /data/dia021/Software/mxnet/libmxnet.so(+0x30c104b) [0x7fffc41ab04b]
    7  /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fffa9f53678]
    8  /lib64/libpthread.so.0(+0x8724) [0x7ffff7bc6724]
    9  /lib64/libc.so.6(clone+0x6d) [0x7ffff7905e8d]
===================
[b008:49826:0] Process frozen...
[b008:49823:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x14580) [0x7fff70fd1580]
    1  /usr/lib64/libucs.so.0(+0x149d2) [0x7fff70fd19d2]
    2  /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x62607) [0x7fff7350b607]
    3  /data/dia021/Software/mxnet/libmxnet.so(+0x30c099e) [0x7fffc41aa99e]
    4  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4aba) [0x7fffc41aeaba]
    5  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4d2e) [0x7fffc41aed2e]
    6  /data/dia021/Software/mxnet/libmxnet.so(+0x30c104b) [0x7fffc41ab04b]
    7  /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fffa9f53678]
    8  /lib64/libpthread.so.0(+0x8724) [0x7ffff7bc6724]
    9  /lib64/libc.so.6(clone+0x6d) [0x7ffff7905e8d]
===================
[b006:28757:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x14580) [0x7fff70fd1580]
    1  /usr/lib64/libucs.so.0(+0x149d2) [0x7fff70fd19d2]
    2  /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x62607) [0x7fff7350b607]
    3  /data/dia021/Software/mxnet/libmxnet.so(+0x30c099e) [0x7fffc41aa99e]
    4  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4aba) [0x7fffc41aeaba]
    5  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4d2e) [0x7fffc41aed2e]
    6  /data/dia021/Software/mxnet/libmxnet.so(+0x30c104b) [0x7fffc41ab04b]
    7  /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fffa9f53678]
    8  /lib64/libpthread.so.0(+0x8724) [0x7ffff7bc6724]
    9  /lib64/libc.so.6(clone+0x6d) [0x7ffff7905e8d]
===================
[b006:28757:0] Process frozen...
[b006:28758:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x14580) [0x7fff70fd1580]
    1  /usr/lib64/libucs.so.0(+0x149d2) [0x7fff70fd19d2]
    2  /data/dia021/Software/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x62607) [0x7fff7350b607]
    3  /data/dia021/Software/mxnet/libmxnet.so(+0x30c099e) [0x7fffc41aa99e]
    4  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4aba) [0x7fffc41aeaba]
    5  /data/dia021/Software/mxnet/libmxnet.so(+0x30c4d2e) [0x7fffc41aed2e]
    6  /data/dia021/Software/mxnet/libmxnet.so(+0x30c104b) [0x7fffc41ab04b]
    7  /data/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fffa9f53678]
    8  /lib64/libpthread.so.0(+0x8724) [0x7ffff7bc6724]
    9  /lib64/libc.so.6(clone+0x6d) [0x7ffff7905e8d]
===================
[b006:28758:0] Process frozen...
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-4/mnist.zip successfully
[19:52:46] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:46] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
[b008:49822:0] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
INFO:root:downloaded http://data.mxnet.io/mxnet/data/mnist.zip into data-0/mnist.zip successfully
[19:52:49] src/io/iter_mnist.cc:113: MNISTIter: load 7500 images, shuffle=1, shape=[256,1,28,28]
[19:52:50] src/io/iter_mnist.cc:113: MNISTIter: load 10000 images, shuffle=1, shape=[256,1,28,28]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 49823 on node b008 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Run completed at:-
Mon Apr  8 19:58:07 AEST 2019

I think this is related to issue #985?