I am using gluon’s I3D and SlowFast models for activity recognition on my own dataset (training and testing). Both models work fine out of the box, but there is something I don’t understand and would like to clear up.
Both I3D and SlowFast are supposed to be two-stream models. In the case of I3D, the two streams process the RGB and optical-flow modalities; in the case of SlowFast, one pathway operates on a small number of frames sampled over time, while the other operates on a larger number of frames but uses a lighter architecture.
I guess gluon’s implementations are one-stream, and one has to manually combine two instances in order to obtain a two-stream model (as in the original papers)? Is there any example of that?
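To make the question concrete, here is a minimal sketch of what I mean by "manually combining two instances": test-time late fusion of per-class scores from two separately trained one-stream networks, as in the original two-stream papers. The function and variable names are hypothetical, and it assumes you already have softmax outputs from an RGB network and a flow network.

```python
import numpy as np

def fuse_scores(rgb_scores, flow_scores, flow_weight=1.5):
    """Weighted average of per-class softmax scores from two streams.

    Hypothetical sketch: the two-stream papers often weight the flow
    stream somewhat higher at test time; flow_weight is that knob.
    """
    rgb = np.asarray(rgb_scores, dtype=float)
    flow = np.asarray(flow_scores, dtype=float)
    fused = rgb + flow_weight * flow
    # Renormalize so the fused scores still sum to 1.
    return fused / (1.0 + flow_weight)

# Toy example with 4 classes (made-up scores, not real model output):
rgb_scores = [0.1, 0.6, 0.2, 0.1]
flow_scores = [0.2, 0.5, 0.2, 0.1]
fused = fuse_scores(rgb_scores, flow_scores)
predicted_class = int(np.argmax(fused))
```

Is score-level averaging like this the intended way to pair up two gluon model instances, or is there a supported feature-level fusion path?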
And what would an implementation of the original SlowFast look like, with two different architectures (one per pathway)?