Read .rec into memory and get data stats


I have a .rec file storing my training data, and I'd like to know how many examples it contains. How can I do that? It sounds trivial, but 30 minutes of reading through the documentation did not turn up an answer.


MXRecordIO is essentially an iterator, so the only way to determine the number of records is to call .read() repeatedly until it returns None, keeping count as you go. If you use an MXIndexedRecordIO instead, you can take the length of its keys member variable to get the number of records.
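Both approaches can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names (count_records, count_rec_file, count_indexed_rec_file) and the file paths are made up for the example, and it assumes an index file exists alongside the .rec file for the indexed variant.

```python
def count_records(reader):
    """Count records by calling reader.read() until it returns None."""
    count = 0
    while reader.read() is not None:
        count += 1
    return count


def count_rec_file(rec_path):
    """Count records in a plain .rec file by reading it sequentially."""
    import mxnet as mx  # imported here so count_records works without MXNet

    reader = mx.recordio.MXRecordIO(rec_path, 'r')
    try:
        return count_records(reader)
    finally:
        reader.close()


def count_indexed_rec_file(idx_path, rec_path):
    """Count records using the index file, without reading the payloads."""
    import mxnet as mx

    reader = mx.recordio.MXIndexedRecordIO(idx_path, rec_path, 'r')
    try:
        return len(reader.keys)
    finally:
        reader.close()
```

For example, count_rec_file('train.rec') would scan the whole file (hypothetical path), while count_indexed_rec_file('train.idx', 'train.rec') only needs the index, so it should be much faster on large datasets.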

However, you'd usually want to pass your data to a data loader that provides batches to you. For each batch, you can inspect the first axis of its shape to get the number of samples in that batch. For example,

import mxnet as mx
from mxnet import autograd

# net, criterion, trainer, ctx and train_data_loader are assumed to be defined elsewhere
epochs = 5
for epoch in range(epochs):
    cumulative_train_loss = mx.nd.zeros(1, ctx=ctx)
    training_samples = 0
    for batch_idx, (data, label) in enumerate(train_data_loader):
        data = data.as_in_context(ctx).reshape((-1, 784))  # flatten 28x28 images to 784 features
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss = criterion(output, label)
        loss.backward()
        trainer.step(data.shape[0])  # normalize the update by the batch size
        cumulative_train_loss += loss.sum()
        training_samples += data.shape[0]      ### this is the number of samples ###
    # average loss per sample over the epoch
    train_loss = cumulative_train_loss.asscalar() / training_samples
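If you only need the count and not a training run, the same idea can be isolated into a small helper. This is a sketch under the assumption that the loader yields (data, label) pairs and that data exposes a shape whose first axis is the batch dimension; count_samples is a name made up for this example.

```python
def count_samples(data_loader):
    """Sum the first axis of every data batch to get the total sample count."""
    total = 0
    for data, _label in data_loader:
        total += data.shape[0]  # first axis = samples in this batch
    return total
```

Note that this still iterates over (and decodes) the whole dataset once, so for a large .rec file the index-based count is likely cheaper.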