Faster R-CNN, YOLO and SSD performance

I trained the three algorithms on a custom dataset, using the scripts provided on the tutorial page. The problem is that when I run the script below, I get the following output:

Validation set size: 110
SSD
Total running time: 2781.013 ms
Running time per example: 25.282 ms
Frames per second: 39.554

Faster R-CNN
Total running time: 2437.742 ms
Running time per example: 22.161 ms
Frames per second: 45.124

YOLOv3
Total running time: 2469.476 ms
Running time per example: 22.450 ms
Frames per second: 44.544

I ran the script many times, and the results always follow this pattern.
According to these results, Faster R-CNN runs faster than the other two algorithms, which is really strange, since FRCNN is supposed to be the slowest.

Is there any detail in the implementations that might lead to these results, or is there something wrong with my code?

I made a minimal, reproducible version of the code:

from gluoncv import model_zoo
from gluoncv.data.transforms import presets
from traindet.utils import Dataset
import mxnet as mx
import time

ctx = mx.gpu()

root = '/path/to/dataset'
val_ds = Dataset(root, train=False)
print('Validation set size:', len(val_ds))

ssd = model_zoo.get_model('ssd_512_resnet50_v1_coco', ctx=ctx)
ssd.initialize(force_reinit=True, ctx=ctx)
ssd.reset_class(classes=val_ds.classes)
ssd.load_parameters('checkpoints_ssd/_best.params', ctx=ctx)

frcnn = model_zoo.get_model('faster_rcnn_resnet50_v1b_coco', ctx=ctx)
frcnn.initialize(force_reinit=True, ctx=ctx)
frcnn.reset_class(classes=val_ds.classes)
frcnn.load_parameters('checkpoints_frcnn/_best.params', ctx=ctx)

yolo = model_zoo.get_model('yolo3_darknet53_coco', ctx=ctx)
yolo.initialize(force_reinit=True, ctx=ctx)
yolo.reset_class(classes=val_ds.classes)
yolo.load_parameters('checkpoints_yolo416/yolo3_darknet53_custom_best.params', ctx=ctx)

ssd.hybridize(static_alloc=True)
frcnn.hybridize(static_alloc=True)
yolo.hybridize(static_alloc=True)

ssd_times = []
tic = time.time()
for img, label in val_ds:
    tic1 = time.time()
    x, npimg = presets.ssd.transform_test(img, short=512)
    ids, scores, bboxes = ssd(x.as_in_context(ctx))
    ssd_times.append(time.time() - tic1)
tac = time.time()
ssd_total = tac - tic
print('SSD')
print(f'Total running time: {ssd_total * 1000:.3f} ms')
print(f'Running time per example: {ssd_total/len(val_ds)* 1000:.3f} ms')
print(f'Frames per second: {len(val_ds)/ssd_total:.3f}')
print('\n')

rcnn_times = []
tic = time.time()
for img, label in val_ds:
    tic1 = time.time()
    x, npimg = presets.rcnn.transform_test(img, short=600)
    ids, scores, bboxes = frcnn(x.as_in_context(ctx))
    rcnn_times.append(time.time() - tic1)
tac = time.time()
frcnn_total = tac - tic
print('Faster R-CNN')
print(f'Total running time: {frcnn_total * 1000:.3f} ms')
print(f'Running time per example: {frcnn_total/len(val_ds) * 1000:.3f} ms')
print(f'Frames per second: {len(val_ds)/frcnn_total:.3f}')
print('\n')

yolo_times = []
tic = time.time()
for img, label in val_ds:
    tic1 = time.time()
    x, npimg = presets.yolo.transform_test(img, short=416)
    ids, scores, bboxes = yolo(x.as_in_context(ctx))
    yolo_times.append(time.time() - tic1)
tac = time.time()
yolo_total = tac - tic
print('YOLOv3')
print(f'Total running time: {yolo_total * 1000:.3f} ms')
print(f'Running time per example: {yolo_total/len(val_ds) * 1000:.3f} ms')
print(f'Frames per second: {len(val_ds)/yolo_total:.3f}')
print('\n')

MXNet’s backend runs asynchronously: operations are queued, and the Python call returns immediately. To get proper timings you need to call mx.nd.waitall() in your code. This forces Python to block until all queued operations have actually been executed in the backend. Your code should look like the following:

ssd_times = []
tic = time.time()
for img, label in val_ds:
    tic1 = time.time()
    x, npimg = presets.ssd.transform_test(img, short=512)
    ids, scores, bboxes = ssd(x.as_in_context(ctx))
    mx.nd.waitall()
    ssd_times.append(time.time() - tic1)
tac = time.time()

For more detail, have a look at this thread: Whether or when to use ndarray.waitall, or at this tutorial: https://mxnet.incubator.apache.org/versions/master/tutorials/python/profiler.html

Thank you, now the results make sense.