Determine training sample influence

Recently there has been increased interest in efficient ways to determine the influence of individual training samples, in order to cleanse the training data. An efficient, automated method is important if you train on a huge number of samples and want to find out which training samples are responsible for the misclassification of important test samples.

I have been reading “Efficient Estimation of Influence of a Training Instance” (https://aclanthology.org/2020.sustainlp-1.6.pdf). It describes a seemingly easy and efficient way to gain some insight into the influence of individual training samples, using a fixed dropout mask per training sample. This would of course require the dropout layer to know some sample id, so that it can regenerate the same fixed mask for that sample, including during inference. Does anybody know how this could be implemented on top of mxnet? I also don’t understand how this would work with the usual mini-batches, because the dropout masks must be fixed per sample, not per batch.
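To make the mini-batch part of the question concrete, here is a rough sketch of how I imagine it could work, in plain NumPy rather than mxnet. The idea is to derive each mask deterministically from the sample id, so a batch of ids becomes a batch of masks with one row per sample. All function names and parameters (`sample_mask`, `batch_masks`, `masked_dropout`, `base_seed`, `keep_prob`) are made up for illustration, and the `flip` option reflects my reading of the paper (using the complementary mask at inference), which may not match it exactly.

```python
import numpy as np

def sample_mask(sample_id, num_units, keep_prob=0.5, base_seed=0):
    """Deterministic dropout mask for one sample (hypothetical helper).

    Seeding the generator with (base_seed, sample_id) means the same
    sample id always yields the same mask, in training and at inference.
    """
    rng = np.random.default_rng([base_seed, sample_id])
    return (rng.random(num_units) < keep_prob).astype(np.float32)

def batch_masks(sample_ids, num_units, keep_prob=0.5, base_seed=0):
    """Stack per-sample masks into a (batch_size, num_units) array."""
    return np.stack([
        sample_mask(int(i), num_units, keep_prob, base_seed)
        for i in sample_ids
    ])

def masked_dropout(activations, masks, keep_prob=0.5, flip=False):
    """Apply precomputed per-sample masks to a batch of activations.

    With flip=True the complementary mask is used (assumption: this is
    how the sub-network that never saw the sample would be selected).
    """
    if flip:
        masks = 1.0 - masks
        scale = 1.0 / (1.0 - keep_prob)
    else:
        scale = 1.0 / keep_prob
    return activations * masks * scale

# Example: a mini-batch carries its sample ids next to the data.
ids = np.array([7, 42, 7])
activations = np.ones((3, 16), dtype=np.float32)
dropped = masked_dropout(activations, batch_masks(ids, 16))
```

In mxnet I would expect the mask array to be passed into the network as an additional input alongside the data batch and multiplied in where the dropout layer would otherwise sit, but I have not verified that this is the best way to do it.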

Is there any other way to determine which training samples negatively influence the classification of test samples, one that works on training sets of more than 100,000 images?

Any best-practice advice for mxnet users would be very welcome!