Action recognition performance drops when using ROI-Pooling

Recently I have been trying to implement a deep learning method for action recognition in single images.

To do this, I use a model based on ResNet-50 that comes in two variants:

  • a vanilla version
  • a version that incorporates ROI-Pooling as the first layer of the network

For the vanilla ResNet-50 version of the model:

I train the model using:

  • the Training partition of the action recognition task of the PASCAL VOC 2012 Dataset
  • from each image, I crop the primary regions containing persons, each of whom is assigned an action label

I test the model using:

  • the Validation partition of the action recognition task of the PASCAL VOC 2012 Dataset
  • from each image, I crop the primary regions containing persons, each of whom is assigned an action label

All images are rescaled to 224x224 by the data iterator.
No data augmentation takes place.
The loss function is softmax cross-entropy.
The batch size is set to 64 images.
The performance I get by feeding the person crops to the vanilla ResNet-50 is 71.51% mAP.
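
For reference, the crop-and-resize step for the vanilla model looks roughly like the sketch below (a minimal illustration only; image_path and the box coordinates are hypothetical placeholders, not my actual VOC loading code):

import mxnet as mx

# Hypothetical example for one annotated person in one image
image_path = 'VOC2012/JPEGImages/example.jpg'   # placeholder path
x1, y1, x2, y2 = 48, 30, 200, 310               # placeholder person bounding box

img = mx.image.imdecode(open(image_path, 'rb').read())   # HWC, uint8 NDArray
# Crop the primary (person) region and rescale it to 224x224
crop = mx.image.fixed_crop(img, x1, y1, x2 - x1, y2 - y1, size=(224, 224))
# The crop is then transposed to CHW and fed to the vanilla ResNet-50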

Then I introduce ROI-Pooling into the network.

  • As mentioned, ROI-Pooling is introduced as the first layer of the network
  • The dataset partitions I use for training and testing are the same
  • Images are provided to the data iterator without any cropping and are rescaled to 224x224
  • ROI-Pooling outputs 224x224 image crops for the primary regions (depicting persons), as in the sketch below
  • Action labels are assigned correctly to the ROI-Pooling crops
  • The number of ROI crops varies per image
  • Each batch loads 10 images, each with a varying number of ROIs
  • No data augmentation is performed
  • The loss function is again softmax cross-entropy

The performance I get with the ROI-Pooling ResNet-50 is 65.93% mAP.
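
For clarity, mx.nd.ROIPooling expects the ROIs as rows of [batch_index, x1, y1, x2, y2], and with spatial_scale=1.0 the coordinates are taken directly in the pixel space of the 224x224 input. A minimal shape-only sketch (the ROI values are made up for illustration):

import mxnet as mx

# Hypothetical batch of 2 images, already rescaled to 224x224 (NCHW)
images = mx.nd.zeros((2, 3, 224, 224))

# Two ROIs for image 0 and one for image 1, each as [batch_index, x1, y1, x2, y2]
rois = mx.nd.array([[0, 10, 20, 120, 200],
                    [0, 60, 15, 210, 220],
                    [1, 30, 40, 180, 223]])

# One 224x224 "crop" per ROI; output shape is (num_rois, 3, 224, 224)
crops = mx.nd.ROIPooling(images, rois, pooled_size=(224, 224), spatial_scale=1.0)
print(crops.shape)   # (3, 3, 224, 224)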

Why is there a difference of approximately 6% mAP in my results?

P.S. The ROI-Pooling network seems to overfit heavily, and very quickly…

Here is a subset of my code:

import math
import timeit

import mxnet as mx
import numpy as np
from mxnet import gluon, nd
from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision
from sklearn.metrics import accuracy_score

# ctx, LR_Schedule, get_image_iterators and evaluate_mAP are defined elsewhere in my code

number_of_images_per_batch = 10
num_classes = 11
init_lr = 0.0001
step_epochs = [3]
schedule_lr = LR_Schedule(init_lr)

train_iter, val_iter, num_samples = get_image_iterators(number_of_images_per_batch, num_classes)

resnet = vision.resnet50_v2(pretrained=True, ctx=ctx)  # pretrained ResNet-50 v2, source of weights

net = vision.resnet50_v2(classes=num_classes, ctx=ctx)  # target network with num_classes outputs



# Rebuild the classifier head: reuse the pretrained classifier layers and
# replace the final Dense layer with one that has num_classes outputs
net_cl = nn.HybridSequential(prefix='resnetv20')
with net_cl.name_scope():
    for l in range(4):
        net_cl.add(resnet.classifier._children[l])
    net_cl.add(nn.Dense(num_classes, in_units=resnet.classifier._children[-1]._in_units))

net.classifier = net_cl
net.classifier[-1].collect_params().initialize(mx.init.Xavier(rnd_type='gaussian', factor_type="in", magnitude=2), ctx=ctx)
net.features = resnet.features
net.collect_params().reset_ctx(ctx)

epoch_size = int(math.ceil(float(num_samples) / number_of_images_per_batch))  # batches per epoch
steps = [epoch_size * x for x in step_epochs]
lr_scheduler = mx.lr_scheduler.MultiFactorScheduler(steps, factor=0.1)
trainer = gluon.Trainer(net.collect_params(), optimizer='sgd', optimizer_params={'learning_rate': init_lr,
                                                                                 'momentum':0.9,
                                                                                 'lr_scheduler': lr_scheduler,
                                                                                 'wd': 0.0005})
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
epochs = 60000
smoothing_constant, moving_loss_tr, moving_loss_val = .01, 0.0, 0.0
patience, lr_drops = 0, 0

best_mAP = 0

batch_primary_regions = []

for e in range(epochs):
    train_iter.reset()
    val_iter.reset()
    start_time = timeit.default_timer()

    predicts, labels = [], []

    for i, batch in enumerate(train_iter):
        # batch.data[0]: full 224x224 images (NCHW); batch.data[2]: primary regions
        # as [batch_index, x1, y1, x2, y2]; batch.data[3]: one action label per region
        batch_primary_regions = mx.nd.array(batch.data[2]).as_in_context(ctx)
        batch_primary_labels = mx.nd.array(batch.data[3]).as_in_context(ctx)
        # ROI-Pooling acts as the first "layer": one 224x224 crop per primary region
        data = mx.nd.ROIPooling(mx.nd.array(batch.data[0]).as_in_context(ctx), batch_primary_regions, pooled_size=(224, 224), spatial_scale=1.0)

        with mx.autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, batch_primary_labels)
        loss.backward()
        trainer.step(data.shape[0])  # the batch size passed here is the number of ROI crops, not images
        prob_predictions = nd.softmax(output)

        curr_loss = nd.mean(loss).asscalar()
        moving_loss_tr = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss_tr + smoothing_constant * curr_loss)


        predicts.extend(prob_predictions.asnumpy())

        labels.extend(nd.one_hot(batch_primary_labels, num_classes).asnumpy())

    predicts, labels = np.array(predicts), np.array(labels)
    train_accuracy = accuracy_score(np.argmax(labels, axis=1), np.argmax(predicts, axis=1))
    train_mAP, train_APs = evaluate_mAP(labels, predicts)


    predicts_val, labels_groundtruth_val = [], []
    # Validation: same ROI-pooled forward pass, without gradient recording
    for i, batch in enumerate(val_iter):

        batch_primary_regions = mx.nd.array(batch.data[2]).as_in_context(ctx)
        batch_primary_labels = mx.nd.array(batch.data[3]).as_in_context(ctx)
        data = mx.nd.ROIPooling(mx.nd.array(batch.data[0]).as_in_context(ctx), batch_primary_regions, pooled_size=(224, 224), spatial_scale=1.0)

        output = net(data)
        prob_predictions = nd.softmax(output)
        loss = softmax_cross_entropy(output, batch_primary_labels)
        curr_loss = nd.mean(loss).asscalar()
        moving_loss_val = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss_val + smoothing_constant * curr_loss)

        predicts_val.extend(prob_predictions.asnumpy())
        labels_groundtruth_val.extend(nd.one_hot(batch_primary_labels, num_classes).asnumpy())

    predicts_val, labels_groundtruth_val = np.array(predicts_val), np.array(labels_groundtruth_val)
    test_accuracy = accuracy_score(np.argmax(labels_groundtruth_val, axis=1), np.argmax(predicts_val, axis=1))
    val_mAP, val_APs = evaluate_mAP(labels_groundtruth_val, predicts_val)

I want to clarify the pooled_size parameter. When you pass pooled_size but your primary_regions are smaller than pooled_size, the regions of interest are resized to match pooled_size. Is that what you expect to achieve?

My hypothesis is that your regions of interest are actually smaller than 224x224, and when this resizing happens it adds noise to the data, which decreases mAP.
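
To make that concrete, here is a toy sketch (arbitrary values): when a region is smaller than pooled_size, mx.nd.ROIPooling maps several output bins onto the same input pixel, so the region is effectively upsampled with repeated values rather than gaining any new detail.

import mxnet as mx

# Toy single-channel 8x8 "image" with distinct values
data = mx.nd.arange(64).reshape((1, 1, 8, 8))

# One small 3x3 region of interest: [batch_index, x1, y1, x2, y2]
rois = mx.nd.array([[0, 0, 0, 2, 2]])

# Pooling the 3x3 region into a 6x6 output: several bins fall on the same
# source pixel, so values are duplicated (i.e. the region is upsampled)
pooled = mx.nd.ROIPooling(data, rois, pooled_size=(6, 6), spatial_scale=1.0)
print(pooled[0][0])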

If I reduce the pooled size, performance drops even further…