A Bilinear Model for Person Detection in Multi-Modal Data
We propose a bilinear model for processing multimodal data using a deep neural network architecture. The multimodal data comprise low-resolution videos and associated seismic time-series data that capture targets, such as people, moving in the sensor field. We consider the problem of detecting moving targets in this sensor field. The camera sensors have overlapping fields of view with the seismic sensors and, in general, may cover more than one seismic sensor. The primary challenge is working with low-resolution videos in which the objects of interest span fewer than 50 pixels. The proposed approach consists of deep convolutional neural networks (CNNs) that are first trained on the individual sensor modalities. The bilinear model then takes the vector outer product of the resulting CNN descriptors, preserving the spatial and temporal localization information implicit in the individual CNN outputs. We propose a pooling method to compress the high-dimensional feature vectors produced by the outer product. This is followed by a final training step in which a 3D CNN is trained for detection. Experimental results demonstrate a significant performance gain from fusing the multimodal data.
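The core fusion step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the descriptor dimensions, the block-average pooling, and all variable names (`video_feat`, `seismic_feat`, `block_avg_pool`) are assumptions chosen only to show how an outer product of two per-modality descriptors yields a joint feature map that a pooling step then compresses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality CNN descriptors (dimensions are illustrative):
video_feat = rng.standard_normal(64)    # descriptor from the video CNN
seismic_feat = rng.standard_normal(32)  # descriptor from the seismic CNN

# Bilinear fusion: the vector outer product gives a 64 x 32 joint feature map
# in which entry (i, j) couples video feature i with seismic feature j.
bilinear = np.outer(video_feat, seismic_feat)

# One simple compression choice (an assumption, not the paper's exact pooling):
# average-pool the outer product over non-overlapping k x k blocks.
def block_avg_pool(x, k=4):
    h, w = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

pooled = block_avg_pool(bilinear)  # 16 x 8 compressed map
features = pooled.ravel()          # compact fused descriptor for the 3D CNN
print(bilinear.shape, pooled.shape, features.shape)
```

In a full pipeline, descriptors like these would be extracted per spatial/temporal location rather than once per clip, and the pooled maps stacked over time before the final 3D CNN; the sketch shows only the per-location fusion arithmetic.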