Training with one batch gives different training/validation accuracies when shuffled

In a research problem we are fine-tuning a ResNet on very small sets of images with model.fit, and some things happen that I don't understand. In particular, when the number of images matches the batch size (i.e. 256 images and batch size 256), I find the following (a rough sketch of the setup follows the list):

  1. If I shuffle the dataset differently (shuffle the lst file, create the rec file from it, and do not shuffle during training), I get a discrepancy of up to 10% in training and validation error (lowest value over the epoch). Example points: 0.88 train accuracy / 0.81 validation accuracy vs. 0.73 train accuracy / 0.71 validation accuracy. With a single batch it logically should not matter whether or how I shuffle, so what could be causing the randomness here? This is trained on one GPU.
  2. If I train with a different number of GPUs on the same rec file (without shuffling), I also get a discrepancy of about 5-10%.
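
For reference, here is a minimal sketch of the kind of single-batch setup described above. It is not the actual code from the experiment: the file names, checkpoint prefix, and hyperparameters are assumptions, and it uses the MXNet Module API with a pre-built rec file, shuffling disabled, and the seed fixed, so that any remaining difference between runs should come from the file ordering or the device configuration rather than from the iterator.

```python
import mxnet as mx

mx.random.seed(42)  # fix MXNet's global seed so repeated runs are comparable

BATCH_SIZE = 256  # equals the number of training images -> exactly one batch per epoch

# The rec file was created from an already-shuffled lst file;
# no shuffling or random augmentation happens at training time.
train_iter = mx.io.ImageRecordIter(
    path_imgrec="train_256.rec",   # hypothetical path
    data_shape=(3, 224, 224),
    batch_size=BATCH_SIZE,
    shuffle=False,
    rand_crop=False,
    rand_mirror=False,
)
val_iter = mx.io.ImageRecordIter(
    path_imgrec="val.rec",         # hypothetical path
    data_shape=(3, 224, 224),
    batch_size=BATCH_SIZE,
    shuffle=False,
)

# Pre-trained ResNet checkpoint being fine-tuned (prefix and epoch are assumptions).
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet-50", 0)

mod = mx.mod.Module(symbol=sym, context=[mx.gpu(0)])  # or a list of GPUs
mod.fit(
    train_iter,
    eval_data=val_iter,
    arg_params=arg_params,
    aux_params=aux_params,
    allow_missing=True,
    optimizer="sgd",
    optimizer_params={"learning_rate": 0.01},
    eval_metric="acc",
    num_epoch=50,
)
```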

On the other hand, if I run training on the same rec file multiple times with a fixed number of GPUs, I do not get any variation at all (one way to check this is sketched below).
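
In case it is useful, this is how one could verify that claim bit-for-bit rather than by eyeballing the logged accuracy: save a checkpoint from each run and compare the parameter arrays (the file names here are made up).

```python
import mxnet as mx
import numpy as np

# Checkpoints saved by two separate runs on the same rec file with the same GPU count.
run_a = mx.nd.load("run_a-0010.params")
run_b = mx.nd.load("run_b-0010.params")

identical = all(
    np.array_equal(run_a[key].asnumpy(), run_b[key].asnumpy()) for key in run_a
)
print("parameters identical across runs:", identical)
```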

Does anyone have an idea what could be behind this behavior and how to mitigate it? My plan was to just always use a sufficient number of batches (i.e. reduce the batch size in this case), but it would be nice if training with a single batch were also possible when required. A small sketch of that fallback follows.
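
A trivial sketch of that fallback, just to make the arithmetic explicit (the target number of batches is an arbitrary assumption):

```python
NUM_IMAGES = 256
TARGET_BATCHES = 8  # anything > 1 gives several weight updates per epoch

# Largest batch size that still splits the data into the desired number of batches.
batch_size = NUM_IMAGES // TARGET_BATCHES
print(batch_size)  # 32
```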

That is indeed odd. What is also odd is that your training accuracy is worse than validation accuracy. Is it possible for you to share a (simplified) code sample that reproduces this problem?

@fanny can you provide the code that is producing these results?