In a research project we're fine-tuning a ResNet on very small image sets with model.fit, and some behavior appears that I don't understand. In particular, when the number of images matches the batch size (i.e. 256 images and batch size 256), I observe the following:
- If I shuffle the dataset differently (shuffle the .lst file, create the .rec file from it, and don't shuffle during training), I get a discrepancy of up to 10% in training and validation accuracy (best values across the epochs). Example: 0.88 train / 0.81 validation accuracy vs. 0.73 train / 0.71 validation accuracy. With a single batch, it logically shouldn't matter whether or how I shuffle, so what could be causing the randomness here? This is trained on one GPU. (My only guess so far is sketched after this list.)
- If I train with a different number of GPUs on the same .rec file (without shuffling), I also get a discrepancy of about 5-10% (a possible mechanism is sketched further below).
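For the shuffle effect, here is my speculation (unverified): if the pipeline applies per-sample random augmentation (e.g. rand_crop / rand_mirror in ImageRecordIter), the RNG draws are consumed in read order, so reordering the .lst gives each image a different augmentation even though the batch contains the same image set. A minimal NumPy sketch of that order effect, where `augment` is a hypothetical stand-in for the per-sample draw:

```python
import numpy as np

def augment(image_id, rng):
    # Hypothetical stand-in for a per-sample augmentation parameter:
    # one RNG draw per sample, consumed in read order.
    return image_id, round(rng.rand(), 2)

order_a = [0, 1, 2, 3]
order_b = [3, 1, 0, 2]                 # same images, different .lst order

rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)
batch_a = sorted(augment(i, rng_a) for i in order_a)
batch_b = sorted(augment(i, rng_b) for i in order_b)

print(batch_a)  # [(0, 0.37), (1, 0.95), (2, 0.73), (3, 0.6)]
print(batch_b)  # [(0, 0.73), (1, 0.95), (2, 0.6), (3, 0.37)]
# Same seed, same image set, but each image receives a different
# augmentation, hence a different gradient for the "same" single batch.
```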
On the other hand, if I run training on the same file with a fixed number of GPUs multiple times, I get no variation at all, so each configuration is deterministic in itself.
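My current guess for the GPU-count effect (again unverified): with data parallelism the 256-image batch is split into per-device slices, and standard BatchNorm computes its statistics per device rather than synchronized across devices, so the forward pass itself changes with the number of GPUs even on identical data. A small NumPy sketch of the difference:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(256, 8)                  # one batch, 8 features

def batchnorm(z, eps=1e-5):
    # Normalize with the mean/var of whatever slice the device sees.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

y_1gpu = batchnorm(x)                  # whole batch on one device
y_4gpu = np.concatenate(               # four 64-sample device slices
    [batchnorm(s) for s in np.split(x, 4)])

# The normalized activations differ, so forward pass and gradients
# differ between the 1-GPU and 4-GPU runs even with identical data.
print(np.abs(y_1gpu - y_4gpu).max())   # noticeably > 0
```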
Does anyone have an idea what is behind this behavior and how one could mitigate it? I was going to try always having a sufficient number of batches (i.e. in this case, reducing the batch size), but it would be nice to be able to train with a single batch as well if required.
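For reference, this is the workaround I was planning to try, sketched with placeholder paths and with augmentation disabled to remove one source of randomness while debugging:

```python
import mxnet as mx

mx.random.seed(1)  # fix MXNet's RNG so repeated runs stay comparable

train_iter = mx.io.ImageRecordIter(
    path_imgrec='train.rec',     # placeholder path to my record file
    data_shape=(3, 224, 224),
    batch_size=64,               # 256 images -> 4 batches per epoch
    shuffle=False,               # order is already fixed in the .rec
    rand_crop=False,             # disable augmentation while debugging
    rand_mirror=False,
)
```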