Action recognition performance drops when using ROI-Pooling

Recently I have been trying to implement a deep learning method for action recognition in single images.

To do this, I use a model based on ResNet-50 that comes in two variants:

  • a vanilla version
  • a version that incorporates ROI-Pooling as the first layer of the network

For the vanilla ResNet-50 version of the model:

I train the model using:

  • the Training partition of the action recognition task of the PASCAL VOC 2012 Dataset
  • from each image, I crop the primary regions containing persons, each of whom is assigned an action label

I test the model using:

  • the Validation partition of the action recognition task of the PASCAL VOC 2012 Dataset
  • from each image, I crop the primary regions containing persons, each of whom is assigned an action label

All images are rescaled to 224x224 by the data iterator.
No data augmentation takes place.
The loss function is softmax cross-entropy.
The batch size is set to 64 images.
The performance I get by feeding the person crops to the vanilla ResNet-50 is 71.51% mAP.
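
For reference, the crop-and-resize step for the vanilla model looks roughly like the sketch below (a minimal illustration only; image_path and the box coordinates are hypothetical placeholders, not my actual VOC loading code):

import mxnet as mx

# Hypothetical example for one annotated person in one image
image_path = 'VOC2012/JPEGImages/example.jpg'   # placeholder path
x1, y1, x2, y2 = 48, 30, 200, 310               # placeholder person bounding box

img = mx.image.imdecode(open(image_path, 'rb').read())   # HWC, uint8 NDArray
# Crop the primary (person) region and rescale it to 224x224
crop = mx.image.fixed_crop(img, x1, y1, x2 - x1, y2 - y1, size=(224, 224))
# The crop is then transposed to CHW and fed to the vanilla ResNet-50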

Then I introduce ROI-Pooling into the network.

  • As mentioned, ROI-Pooling is introduced as the first layer of the network
  • The dataset partitions I use for training and testing are the same
  • Images are provided to the data iterator without any cropping and are rescaled to 224x224
  • ROI-Pooling outputs 224x224 image crops for the primary regions (depicting persons), as in the sketch below
  • Action labels are assigned correctly to the ROI-Pooling crops
  • The number of ROI crops varies per image
  • Each batch loads 10 images, each with a varying number of ROIs
  • No data augmentation is performed
  • The loss function is again softmax cross-entropy

The performance I get with the ROI-Pooling ResNet-50 is 65.93% mAP.
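
For clarity, mx.nd.ROIPooling expects the ROIs as rows of [batch_index, x1, y1, x2, y2], and with spatial_scale=1.0 the coordinates are taken directly in the pixel space of the 224x224 input. A minimal shape-only sketch (the ROI values are made up for illustration):

import mxnet as mx

# Hypothetical batch of 2 images, already rescaled to 224x224 (NCHW)
images = mx.nd.zeros((2, 3, 224, 224))

# Two ROIs for image 0 and one for image 1, each as [batch_index, x1, y1, x2, y2]
rois = mx.nd.array([[0, 10, 20, 120, 200],
                    [0, 60, 15, 210, 220],
                    [1, 30, 40, 180, 223]])

# One 224x224 "crop" per ROI; output shape is (num_rois, 3, 224, 224)
crops = mx.nd.ROIPooling(images, rois, pooled_size=(224, 224), spatial_scale=1.0)
print(crops.shape)   # (3, 3, 224, 224)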

Why is there a difference of approximately 6% mAP in my results?

P.S. The ROI-Pooling network seems to overfit heavily, and very quickly…

Here is a subset of my code:

import math
import timeit

import mxnet as mx
import numpy as np
from mxnet import gluon, nd
from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision
from sklearn.metrics import accuracy_score

# ctx, LR_Schedule, get_image_iterators and evaluate_mAP are defined elsewhere in my code

number_of_images_per_batch = 10
num_classes = 11
init_lr = 0.0001
step_epochs = [3]
schedule_lr = LR_Schedule(init_lr)

train_iter, val_iter, num_samples = get_image_iterators(number_of_images_per_batch, num_classes)

resnet = vision.resnet50_v2(pretrained=True, ctx=ctx)  # pretrained ResNet-50 v2, source of weights

net = vision.resnet50_v2(classes=num_classes, ctx=ctx)  # target network with num_classes outputs



# Rebuild the classifier head: reuse the pretrained classifier layers and
# replace the final Dense layer with one that has num_classes outputs
net_cl = nn.HybridSequential(prefix='resnetv20')
with net_cl.name_scope():
    for l in range(4):
        net_cl.add(resnet.classifier._children[l])
    net_cl.add(nn.Dense(num_classes, in_units=resnet.classifier._children[-1]._in_units))

net.classifier = net_cl
net.classifier[-1].collect_params().initialize(mx.init.Xavier(rnd_type='gaussian', factor_type="in", magnitude=2), ctx=ctx)
net.features = resnet.features
net.collect_params().reset_ctx(ctx)

epoch_size = int(math.ceil(float(num_samples) / number_of_images_per_batch))  # batches per epoch
steps = [epoch_size * x for x in step_epochs]
lr_scheduler = mx.lr_scheduler.MultiFactorScheduler(steps, factor=0.1)
trainer = gluon.Trainer(net.collect_params(), optimizer='sgd', optimizer_params={'learning_rate': init_lr,
                                                                                 'momentum':0.9,
                                                                                 'lr_scheduler': lr_scheduler,
                                                                                 'wd': 0.0005})
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
epochs = 60000
smoothing_constant, moving_loss_tr, moving_loss_val = .01, 0.0, 0.0
patience, lr_drops = 0, 0

best_mAP = 0

batch_primary_regions = []

for e in range(epochs):
    train_iter.reset()
    val_iter.reset()
    start_time = timeit.default_timer()

    predicts, labels = [], []

    for i, batch in enumerate(train_iter):
        # batch.data[0]: full 224x224 images (NCHW); batch.data[2]: primary regions
        # as [batch_index, x1, y1, x2, y2]; batch.data[3]: one action label per region
        batch_primary_regions = mx.nd.array(batch.data[2]).as_in_context(ctx)
        batch_primary_labels = mx.nd.array(batch.data[3]).as_in_context(ctx)
        # ROI-Pooling acts as the first "layer": one 224x224 crop per primary region
        data = mx.nd.ROIPooling(mx.nd.array(batch.data[0]).as_in_context(ctx), batch_primary_regions, pooled_size=(224, 224), spatial_scale=1.0)

        with mx.autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, batch_primary_labels)
        loss.backward()
        trainer.step(data.shape[0])  # the batch size passed here is the number of ROI crops, not images
        prob_predictions = nd.softmax(output)

        curr_loss = nd.mean(loss).asscalar()
        moving_loss_tr = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss_tr + smoothing_constant * curr_loss)


        predicts.extend(prob_predictions.asnumpy())

        labels.extend(nd.one_hot(batch_primary_labels, num_classes).asnumpy())

    predicts, labels = np.array(predicts), np.array(labels)
    train_accuracy = accuracy_score(np.argmax(labels, axis=1), np.argmax(predicts, axis=1))
    train_mAP, train_APs = evaluate_mAP(labels, predicts)


    predicts_val, labels_groundtruth_val = [], []
    # Validation: same ROI-pooled forward pass, without gradient recording
    for i, batch in enumerate(val_iter):

        batch_primary_regions = mx.nd.array(batch.data[2]).as_in_context(ctx)
        batch_primary_labels = mx.nd.array(batch.data[3]).as_in_context(ctx)
        data = mx.nd.ROIPooling(mx.nd.array(batch.data[0]).as_in_context(ctx), batch_primary_regions, pooled_size=(224, 224), spatial_scale=1.0)

        output = net(data)
        prob_predictions = nd.softmax(output)
        loss = softmax_cross_entropy(output, batch_primary_labels)
        curr_loss = nd.mean(loss).asscalar()
        moving_loss_val = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss_val + smoothing_constant * curr_loss)

        predicts_val.extend(prob_predictions.asnumpy())
        labels_groundtruth_val.extend(nd.one_hot(batch_primary_labels, num_classes).asnumpy())

    predicts_val, labels_groundtruth_val = np.array(predicts_val), np.array(labels_groundtruth_val)
    test_accuracy = accuracy_score(np.argmax(labels_groundtruth_val, axis=1), np.argmax(predicts_val, axis=1))
    val_mAP, val_APs = evaluate_mAP(labels_groundtruth_val, predicts_val)

I want to clarify the pooled_size parameter. When you pass pooled_size but your primary_regions are smaller than pooled_size, the regions of interest are resized to match pooled_size. Is that what you expect to achieve?

My hypothesis is that your regions of interest are actually smaller than 224x224, and when this resizing happens it adds noise to the data, which decreases mAP.
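
To make that concrete, here is a toy sketch (arbitrary values): when a region is smaller than pooled_size, mx.nd.ROIPooling maps several output bins onto the same input pixel, so the region is effectively upsampled with repeated values rather than gaining any new detail.

import mxnet as mx

# Toy single-channel 8x8 "image" with distinct values
data = mx.nd.arange(64).reshape((1, 1, 8, 8))

# One small 3x3 region of interest: [batch_index, x1, y1, x2, y2]
rois = mx.nd.array([[0, 0, 0, 2, 2]])

# Pooling the 3x3 region into a 6x6 output: several bins fall on the same
# source pixel, so values are duplicated (i.e. the region is upsampled)
pooled = mx.nd.ROIPooling(data, rois, pooled_size=(6, 6), spatial_scale=1.0)
print(pooled[0][0])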

If I reduce the pooled size, performance drops even further…