Kernel dies when classifying test set and saving result

frankmei · March 11, 2019, 11:58pm

Has anyone encountered the issue where the notebook kernel just dies when trying to run the last block to classify the test set? This is really frustrating because it also clears our trained network, and hours of work are completely gone…
If anyone knows how to resolve it, it’s much appreciated. Thanks!

jesbu1 · March 12, 2019, 1:14am

Yeah this happens to me too, specifically it’s when it’s processing training set data between 222,000 and 223,000.

Might be one bad example or something in the test set that’s killing the kernel.

@ryantheisen @gold_piggy

jesbu1 · March 12, 2019, 1:19am

Here’s the error message:

"terminate called after throwing an instance of ‘cv::Exception’
what(): OpenCV(3.4.2) /home/travis/build/dmlc/mxnet-distro/deps/opencv-3.4.2/modules/imgcodecs/src/loadsave.cpp:737 error: (-215:Assertion failed) !buf.empty()&& buf.isContinuous() in function ‘imdecode_’

Aborted (core dumped)"

jesbu1 · March 12, 2019, 1:28am

Found the file in question: it’s test file “223065.png”. We’ll probably just have to guess a random prediction for this one.

EDIT: More issues at 223066, 223067. I think the test set got screwed up past 223065, please fix the test set

EDIT 2: It’s not a test set problem, nvm. Perhaps an mxnet bug? Other people don’t have issues with the same test set on the kaggle challenge.
@ryantheisen @gold_piggy @smolix @mli

Seebarsh7 · March 12, 2019, 3:45am

Same problem here, stuck for nearly two days. I think it is probably because of a memory limitation. When I use AWS, everything seems fine (so far).

frankmei · March 12, 2019, 3:58am

I think a possible solution is to break up the saving parts so that it appends to submission.csv one batch at a time. Haven’t fully tested this out yet because I’m re-training my model:(
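
A minimal sketch of appending one batch at a time, assuming a two-column submission file. The function name `append_predictions` and the column names `id` and `label` are hypothetical and should be adjusted to whatever the notebook and the competition expect. This would not prevent a crash inside the data loader itself; it would only keep the rows already written to disk.

```python
import csv
import os


def append_predictions(path, ids, labels):
    """Append one batch of (id, label) rows to a CSV file.

    The header row is written only when the file does not exist yet
    or is still empty, so repeated calls produce a single header.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["id", "label"])  # assumed column names
        writer.writerows(zip(ids, labels))
```

Calling it once per batch inside the prediction loop means a later kernel death loses at most the batch in flight.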

jesbu1 · March 12, 2019, 4:39am

Doesn’t work, I tried doing that and it will consistently crash after 223065. Tried starting from 224000, still crashes.

It’s not a memory limitation, I htop’d the system it was running on and there was like 50 GB RAM free at crash, and 6GB free on the GPU at crash.

jesbu1 · March 12, 2019, 5:03am

Seems to be an issue in MXNET Gluon, similar to this:

gold_piggy · March 12, 2019, 5:29am

Hey, were you running on GPU? Did it run through the last cell successfully before?

I did not run into any issues after running through the whole notebook…

jesbu1 · March 12, 2019, 5:37am

Yes, running on GPU. The crash occurs before the data is even loaded onto the GPU, however. It’s from the test_iter loading the image and applying the transforms.
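
Since the failure happens while decoding images, one way to look for candidates without going through the data loader is to scan the test folder using only the standard library. The sketch below rests on an assumption: that the `!buf.empty()` assertion is triggered by a file that reads as an empty buffer, or by one that is not a real PNG. The function name `find_suspect_pngs` is hypothetical. A clean result would not rule out a loader bug.

```python
import os

# First 8 bytes of every valid PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_suspect_pngs(folder):
    """Return sorted paths of .png files that are empty or lack the PNG signature."""
    suspects = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(".png"):
                continue  # ignore anything that is not a .png file
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                # A zero-byte file would hand the decoder an empty buffer.
                suspects.append(path)
                continue
            with open(path, "rb") as f:
                if f.read(8) != PNG_SIGNATURE:
                    suspects.append(path)
    return sorted(suspects)
```

Running `find_suspect_pngs` on the test directory and comparing the result with the file numbers at which the kernel dies would show whether the two line up.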

frankmei · March 12, 2019, 5:40am

@gold_piggy I was running on GPU and mine crashed at 227328. I tried to save the result from each batch one at a time, but as @jesbu1 pointed out, it did not work.

Not sure what you mean by “run through the last cell successfully before”. If you mean the original code, it runs on the tiny demo dataset, so there isn’t any issue.

ryantheisen · March 12, 2019, 5:51am

Can you email me code producing the error so I can see if I get the same on my machine?

jesbu1 · March 12, 2019, 5:56am

Found a fix:

Install PyTorch from here: pytorch.org

replace transform_test with

import torch
import torchvision
import torchvision.transforms as transforms

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010]),
])

Replace test_ds with
test_ds = torchvision.datasets.ImageFolder('PATH-TO-FILES', transform=transform_test)

Replace test_iter with
test_iter = torch.utils.data.DataLoader(test_ds, batch_size=batch_size, shuffle=False)

In the last block, make the first two lines of the for loop through test_iter this:

for X, _ in test_iter:
    X = nd.array(X.numpy())

This is a Gluon bug: it doesn’t catch an OpenCV error raised in C/C++ on some of the garbage test images (Kaggle generates 290,000 garbage test images, since there are only 10,000 real test images). For some reason, however, the PyTorch dataloader works fine.

Andrew · March 13, 2019, 5:20am

I was able to run the last cell without any errors (using mxnet) so YMMV
