I find this a bit confusing.
Several places across the MXNet documentation read: "Any data iterator that can read/write data from a local drive can also read/write data from S3."
Probably I am missing something, but I guess this is true ONLY IF the data has been pre-packaged into a .rec file.
Reading raw JPGs from S3 does not work for me.
In an ideal world I would like to run the following:
data_iter = mxnet.gluon.data.vision.ImageFolderDataset("s3://my-bucket-containing-jpg-images")
which evidently does not work.
Am I obliged to turn my entire dataset into .rec and .lst files before streaming it from S3? Is no raw format supported?
Thanks
I think there’s confusion here because data iterators (from the Module API) and Datasets (from the Gluon API) are different things. S3 support is in the Module API as far as I can see, but with the Gluon API it’s relatively easy to implement yourself.
Are you sure you want to be reading individual files from an S3 bucket to create each batch, though? I would expect this to get expensive given the number of requests to S3. It seems to me that the best practice would be to download the dataset from S3 once and then load it as usual, and to think about scaling up disk space if you’re dealing with a really large dataset (e.g. on AWS, increase the EBS volume size).
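That download-once approach is easy to script. A minimal sketch, assuming `boto3` is installed and AWS credentials are configured (the bucket name and prefix in the commented-out call are placeholders):

```python
from pathlib import Path


def local_path_for_key(key, prefix, dest_root):
    """Map an S3 object key to a local path, preserving the
    class-folder structure that ImageFolderDataset expects."""
    relative = key[len(prefix):].lstrip('/')
    return Path(dest_root) / relative


def download_prefix(bucket_name, prefix, dest_root):
    # One-off sync of every object under the prefix to local disk.
    import boto3
    bucket = boto3.resource('s3').Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=prefix):
        target = local_path_for_key(obj.key, prefix, dest_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        bucket.download_file(obj.key, str(target))


# download_prefix('my-bucket', 'datasets/images', './data')
# dataset = mx.gluon.data.vision.ImageFolderDataset('./data')
```

This way you pay the S3 request cost once per dataset, rather than once per image per epoch.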
But it’s totally possible to create a custom Dataset to do this. ImageFolderDataset only supports local file systems (quite a few os calls feature in its implementation, e.g. os.listdir(path)), but switching these out for boto3 calls would give you something like:
```python
import cv2
import boto3
import mxnet as mx
from pathlib import Path
import numpy as np


class S3ImageFolderDataset(mx.gluon.data.Dataset):
    def __init__(self, bucket_name, prefix):
        """
        Uses the same folder format as ImageFolderDataset.
        """
        self._s3_bucket_name = bucket_name
        self._s3_prefix = prefix
        self._s3 = boto3.resource('s3')
        self._s3_bucket = self._s3.Bucket(bucket_name)
        # Collect object keys relative to the prefix, e.g. 'class0/sample0.jpeg'
        self._s3_objects = []
        for s3_object in self._s3_bucket.objects.filter(Prefix=prefix):
            self._s3_objects.append(s3_object.key[len(prefix) + 1:])
        # Class folder names become labels; sort for a stable ordering
        self.synset = sorted(set(o.split('/')[0] for o in self._s3_objects))
        self._label_idx_map = {name: i for i, name in enumerate(self.synset)}

    def __getitem__(self, idx):
        s3_object = self._s3_objects[idx]
        s3_object_key = self._s3_prefix + '/' + s3_object
        obj = self._s3.Object(self._s3_bucket_name, s3_object_key)
        contents = obj.get()['Body'].read()
        # Decode the raw bytes into an image array
        data_arr = np.frombuffer(contents, dtype='uint8')
        data = cv2.imdecode(data_arr, -1)
        label = self._label_idx_map[s3_object.split('/')[0]]
        return data, label

    def __len__(self):
        return len(self._s3_objects)


dataset = S3ImageFolderDataset(bucket_name='test_bucket', prefix='test/upload')
```
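The synset/label-mapping logic in `__init__` can be sanity-checked without touching S3 at all; a quick sketch with hypothetical object keys:

```python
# Hypothetical keys, as bucket.objects.filter(Prefix=prefix) might return them
prefix = 'test/upload'
keys = [
    'test/upload/class0/sample0.jpeg',
    'test/upload/class1/sample1.jpeg',
    'test/upload/class0/sample2.jpeg',
]

# Strip the prefix plus the trailing '/', exactly as __init__ does
objects = [k[len(prefix) + 1:] for k in keys]

# The first path component is the class folder; sorting makes labels stable
synset = sorted(set(o.split('/')[0] for o in objects))
label_idx_map = {name: i for i, name in enumerate(synset)}

print(synset)                   # ['class0', 'class1']
print(label_idx_map['class1'])  # 1
```

From there, the dataset drops straight into mx.gluon.data.DataLoader like any other Gluon Dataset.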
Oh, and you might also find these useful to test things out. I wrote some code to upload 100 samples of CIFAR10 to an S3 bucket in the format required by ImageFolderDataset.
```python
# save files to local disk
samples = 100
dataset = mx.gluon.data.vision.CIFAR10()
for idx in range(samples):
    sample = dataset[idx]
    filepath = Path('./test/upload/class{}/sample{}.jpeg'.format(sample[1], idx))
    filepath.parent.mkdir(parents=True, exist_ok=True)
    cv2.imwrite(str(filepath), sample[0].asnumpy())

# upload files to S3 (object keys keep the 'test/upload/...' folder structure)
folder = Path('./test/upload')
files = [f for f in folder.glob('**/*.jpeg')]
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket('test_bucket')
for file in files:
    s3_bucket.upload_file(str(file), str(file))
```
You rock as usual man! Thanks a lot