The dataset is logically partitioned into chunks, and the shuffle_chunk_size
argument specifies the size of each chunk used for shuffling. It defaults to 64 MB in ImageRecordIOParser. Chunking also enables pre-fetching: see the data loading architecture doc, which describes this in detail and discusses ThreadIterators, which use queues and their own threads to pre-fetch data ahead of time.
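To make the thread-plus-queue idea concrete, here is a minimal Python sketch of a pre-fetching iterator. It is only an illustration of the pattern, not MXNet's actual implementation (which lives in C++); the class name `PrefetchIter` and the `depth` parameter are made up for this example.

```python
import threading
import queue

class PrefetchIter:
    """Wrap any iterator and pre-fetch items on a background thread.

    A sketch of the thread-plus-queue pattern, not MXNet's real code.
    """

    _SENTINEL = object()  # marks the end of the wrapped iterator

    def __init__(self, source, depth=4):
        # Bounded queue: the producer can run at most `depth` items ahead.
        self._queue = queue.Queue(maxsize=depth)
        self._thread = threading.Thread(
            target=self._produce, args=(source,), daemon=True)
        self._thread.start()

    def _produce(self, source):
        for item in source:
            self._queue.put(item)   # blocks when the queue is full
        self._queue.put(self._SENTINEL)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()    # blocks until the producer catches up
        if item is self._SENTINEL:
            raise StopIteration
        return item
```

The bounded queue is what limits read-ahead: the background thread fills it while the consumer trains on the current batch, and blocks once it gets too far ahead.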
The splitting is done via an InputSplit (also covered in the data loading doc), and a split can logically span multiple files.
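A rough sketch of how logical splits can span file boundaries: given the byte sizes of the .rec files, carve the concatenated stream into fixed-size chunks, where each chunk is a list of (file_index, offset, length) pieces. The function `input_splits` is a hypothetical helper for illustration, not MXNet's InputSplit API.

```python
def input_splits(file_sizes, chunk_size):
    """Partition a list of file sizes into fixed-size logical chunks.

    Returns, for each chunk, the (file_index, offset, length) pieces it
    covers. Illustrative only -- not MXNet's actual InputSplit.
    """
    splits, current, remaining = [], [], chunk_size
    for idx, size in enumerate(file_sizes):
        offset = 0
        while offset < size:
            take = min(remaining, size - offset)
            current.append((idx, offset, take))
            offset += take
            remaining -= take
            if remaining == 0:          # chunk full: start a new one
                splits.append(current)
                current, remaining = [], chunk_size
    if current:                         # trailing partial chunk
        splits.append(current)
    return splits
```

For example, two files of 5 and 7 bytes with a 4-byte chunk size yield three chunks, and the middle chunk spans both files: `[(0, 4, 1), (1, 0, 3)]`.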
Hope that helps!
Vishaal
What exactly is the shuffle capability over a dataset of .rec files? Is the shuffle-by-chunk you mention (1) shuffling file order without shuffling in-file records, (2) shuffling records within files without shuffling file read order, or (3) shuffling both file order and in-file record order?
It is (2), with "file" replaced by "part".
The files are amalgamated and logically partitioned into parts whose size is set by the chunk size: if the chunk size is 10 MB, the parts are 10 MB. A part may start and end mid-instance, as in the picture.
The parts are read sequentially, but the images within each part are shuffled. I imagine there is some intelligent handling of the partial instances at part boundaries.
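The behavior described above can be sketched in a few lines: accumulate records into a part until the byte budget is reached, shuffle that part, emit it, and move on. This is a simplification under one stated assumption: a record that would straddle a part boundary is simply kept whole in the current part, which is not necessarily how ImageRecordIOParser resolves partial instances internally.

```python
import random

def chunk_shuffled(records, chunk_size, seed=0):
    """Yield records with shuffling confined to fixed-size parts.

    Parts are formed and emitted in order, but records inside each part
    come out in random order. Sketch only: a record that would straddle
    a part boundary is kept whole in the current part, unlike the real
    byte-level chunking in ImageRecordIOParser.
    """
    rng = random.Random(seed)
    part, filled = [], 0
    for rec in records:
        part.append(rec)
        filled += len(rec)
        if filled >= chunk_size:        # part is full: shuffle and emit
            rng.shuffle(part)
            yield from part
            part, filled = [], 0
    rng.shuffle(part)                   # trailing partial part
    yield from part
```

Note what this buys you: with a dataset too large to shuffle in memory, only one part's worth of records is ever held at once, at the cost of randomness being local to each part rather than global.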
Vishaal