I recently set up a new computer and can no longer run my code on it. Every time I compute the loss and run backpropagation, I encounter this error:
```
KernelRestarter: restarting kernel (4/5), keep random ports
kernel c795ac26-b3b1-4c3c-94fe-34cd67a934a4 restarted
Traceback (most recent call last):
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "", line 2, in initialize
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 467, in initialize
    self.init_sockets()
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
```
After the code finishes running, I get another error:
```
Segmentation fault: 11
Stack trace returned 4 entries:
[bt] (0) /home/tianweiy/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x382eea) [0x7f4c8261eeea]
[bt] (1) /home/tianweiy/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31a3d76) [0x7f4c8543fd76]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f4cee389f20]
[bt] (3) [0x55c700a80e20]
```
My code is just a basic CNN example. I use a Jupyter notebook to debug, and I found that every time the program runs
```
from mxnet import autograd as ag

# AutoGrad: record the forward pass and the loss computation
with ag.record():
    output = [net(X) for X in data]
    loss = [loss_fn(yhat, y) for yhat, y in zip(output, label)]

# Backpropagation
for l in loss:
    l.backward()
```
the kernel fails and I get the error messages above.
I also tried the MXNet MNIST example, and it crashes at the forward pass:
```
label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
outputs = []
with ag.record():
    for x, y in zip(data, label):
        print(1)  # debug print: this is reached
        z = net(x)  # this line causes the error
```
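To see where in the Python code the segmentation fault happens, rather than only the native `libmxnet.so` frames, one option is Python's standard `faulthandler` module. The following is a minimal sketch, not something from my original notebook: the helper name `enable_crash_traceback` and the log file name are made up for illustration. Writing to a file is an assumption that suits Jupyter, where the kernel's stderr may not be visible.

```python
import faulthandler


def enable_crash_traceback(log_path="crash_traceback.log"):
    """Dump the Python traceback to log_path on a fatal signal such as SIGSEGV.

    Hypothetical helper. The returned file object must stay open for as long
    as the fault handler is enabled, so keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `log_file = enable_crash_traceback()` in the first notebook cell, before running the training loop, should leave a Python-level traceback in the log file if the kernel dies.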
System setup:
OS: Ubuntu 18.04
CUDA: 9.1, 9.2, and 10.0 installed (I select CUDA 9.2 through an env file that points to the path of that version)
cuDNN: v7.3.1 for Linux
I have tested my CUDA and cuDNN installation using the NVIDIA samples.
GPU: RTX 2080
CPU: AMD Ryzen 2600
RAM: 16GB
I don't think memory or CPU usage is the cause: memory use is only 50%, and I set the batch size and num_worker to very small values, but the kernel still failed.
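
Since three CUDA versions are installed side by side, it may be worth confirming which one the notebook process actually sees. The following is a rough sketch using only the standard library. The function name `cuda_entries` is made up, and it assumes the env file selects the version through `CUDA_HOME`, `PATH`, and/or `LD_LIBRARY_PATH`; the variable names may differ on other setups.

```python
import os


def cuda_entries(environ=None):
    """Return CUDA-related entries from common environment variables.

    Assumption: the CUDA version is selected via CUDA_HOME, PATH,
    and/or LD_LIBRARY_PATH. Adjust the names if your env file differs.
    """
    environ = os.environ if environ is None else environ
    report = {"CUDA_HOME": environ.get("CUDA_HOME")}
    for var in ("PATH", "LD_LIBRARY_PATH"):
        parts = environ.get(var, "").split(os.pathsep)
        # Keep only the path entries that mention CUDA.
        report[var] = [p for p in parts if "cuda" in p.lower()]
    return report


if __name__ == "__main__":
    for name, value in cuda_entries().items():
        print(name, "=", value)
```

Running this inside the notebook and in a plain terminal shows whether both environments point to the same CUDA installation.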