Load multiple rec files for shuffling and training

Hi, I have a very large dataset, about 500 GB of data, that I would like to pack into rec files. My questions are:

  1. Is it better to pack the data into a single rec file or multiple rec files?
  2. To the best of my knowledge, MXNet only supports loading a single rec file. Does it support loading multiple rec files and doing shuffling and training across them?

Thank you for your answers in advance.

hi,

  1. IMO the whole point of having a rec file is to store your data in a single file for faster reading, so that should be preferred over multiple RecordIO files.

  2. you can still use multiple rec files by creating a custom Dataset class that extends gluon.data.Dataset and implements __getitem__ and __len__

For example:

from mxnet import gluon


class CustomCombinedDataset(gluon.data.Dataset):
    """
    A dataset that accepts several datasets and serves
    them as one.
    """

    def __init__(self, datasets):
        self.datasets = datasets

        # Pre-compute the (start, end) range each dataset occupies
        # in the combined index space.
        self.lengths = []
        start = 0
        for d in datasets:
            end = start + len(d)
            self.lengths.append((start, end))
            start = end

        self.length = sum(len(d) for d in datasets)

    def __getitem__(self, idx):
        # Find the dataset whose range contains idx and translate the
        # global index into a local one.
        for i, (start, end) in enumerate(self.lengths):
            if idx < end:
                return self.datasets[i][idx - start]
        raise IndexError('index {} is out of range'.format(idx))

    def __len__(self):
        return self.length

where each dataset in datasets is a gluon.data.RecordFileDataset
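For instance, here is a minimal sketch of feeding the combined dataset to a DataLoader for shuffled training. The file names part-0.rec and part-1.rec are hypothetical, and it uses gluon.data.vision.ImageRecordDataset (the image-decoding variant of RecordFileDataset) so that samples can be batched:

from mxnet import gluon
from mxnet.gluon.data.vision import transforms

# Hypothetical file names; each rec file becomes its own dataset.
# ImageRecordDataset decodes image records into (image, label) pairs;
# plain RecordFileDataset would return raw record bytes instead.
parts = ['part-0.rec', 'part-1.rec']
datasets = [gluon.data.vision.ImageRecordDataset(p) for p in parts]

combined = CustomCombinedDataset(datasets)

# Resize/ToTensor so every sample has the same shape and can be batched;
# shuffle=True shuffles indices across all the rec files at once.
transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])
loader = gluon.data.DataLoader(combined.transform_first(transform),
                               batch_size=32, shuffle=True)

for data, label in loader:
    pass  # forward/backward pass goes here

Since the DataLoader shuffles over the indices of the combined dataset, you get shuffling across all rec files without merging them on disk.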


Appreciate your answer. I get your point: build an abstract dataset and provide the APIs required by the MXNet data loader. Thanks again.

Hi, have you implemented this feature yet? Could you share your code? I ran into the same problem. Thanks!