I have been coding and testing neural networks for a while, but so far I have only used IMAGE datasets (i.e., I have M training images and N testing images).
Some datasets, however, are video datasets. The UCSD Peds dataset, for example, comes in the following variants:
- Peds 1: 34 training videos (200 frames each) and 36 testing videos (200 frames each)
- Peds 2: 16 training videos (varying numbers of frames) and 12 testing videos (varying numbers of frames)
So basically, in this case, I have M SETS of training images and N SETS of testing images.
I know that the input to a neural network is arbitrary and anything can be fed in, but I do not really understand how to feed in a set of videos.
Should we merge all 34 x 200 = 6800 training frames together and all 36 x 200 = 7200 testing frames together, and then use the resulting sets just like we use MNIST, etc.?
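To make that option concrete, here is a rough sketch of the "merge everything" idea as I picture it; the frame size `H, W` is a placeholder I made up, and the zero arrays stand in for the real frames (loading code omitted):

```python
import numpy as np

H, W = 158, 238  # assumed frame size; treat this as a placeholder

# Stand-ins for the real clips; in practice each clip would be read from
# the dataset's frame images, here they are just zero arrays of that shape.
train_videos = [np.zeros((200, H, W), dtype=np.float32) for _ in range(34)]
test_videos  = [np.zeros((200, H, W), dtype=np.float32) for _ in range(36)]

# Drop the video boundaries and stack all frames into one big
# image dataset, exactly like MNIST.
X_train = np.concatenate(train_videos, axis=0)  # (34 * 200, H, W) = (6800, H, W)
X_test  = np.concatenate(test_videos, axis=0)   # (36 * 200, H, W) = (7200, H, W)
print(X_train.shape, X_test.shape)
```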
How do we feed a set of training videos to a neural network?
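The alternative I can imagine is keeping each clip intact, so one training sample is a whole video and the network can use temporal context. A rough sketch of the shapes I have in mind (again, the frame size is a made-up placeholder):

```python
import numpy as np

H, W = 158, 238  # assumed frame size, as above
train_videos = [np.zeros((200, H, W), dtype=np.float32) for _ in range(34)]

# One sample = one whole clip of shape (frames, H, W). np.stack only works
# here because every Peds 1 clip has the same length; the variable-length
# Peds 2 clips would need padding or fixed-size sub-clips instead.
X_train_seq = np.stack(train_videos)  # shape (34, 200, H, W)
print(X_train_seq.shape)
```

Is one of these the right way to set it up, or is there a standard approach I am missing?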
What if I want to detect anomalies in the test videos? For example, suppose a neural network is trained on UCSD videos that contain only pedestrians and is then tested on videos that also contain cars and bikes; those cars and bikes are the anomalies.
I'd want to classify an entire video according to whether it contains only normal elements or some anomalous elements as well.
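Concretely, the scheme I imagine is scoring each frame for "anomalousness" and then aggregating the frame scores into one label for the whole clip. A rough sketch of that idea; everything here, including the thresholds and the scoring function, is a made-up placeholder rather than a known method for this dataset:

```python
import numpy as np

def classify_video(score_frame, video, frame_threshold=0.5, anomaly_fraction=0.05):
    """Label a whole clip from per-frame anomaly scores.

    `score_frame` is a placeholder for whatever per-frame score a trained
    network might produce (e.g. the reconstruction error of an autoencoder
    trained only on normal, pedestrian-only frames).
    """
    scores = np.array([score_frame(frame) for frame in video])
    # Flag the clip if more than, say, 5% of its frames look anomalous;
    # both thresholds are arbitrary numbers for illustration.
    return "anomalous" if (scores > frame_threshold).mean() > anomaly_fraction else "normal"

# Toy usage with a dummy scorer and a dummy clip:
dummy_clip = np.zeros((200, 158, 238), dtype=np.float32)
print(classify_video(lambda frame: frame.mean(), dummy_clip))  # -> "normal"
```

Is aggregating per-frame decisions like this a reasonable way to get a video-level label, or is there a better-established way to frame this?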