A Bilinear Model for Person Detection in Multi-Modal Data

Abstract

We propose a bilinear model for processing multimodal data using a deep neural network architecture. The multimodal data includes low-resolution videos and associated seismic sensor time-sequence data that capture targets, such as people, moving in the sensor field. We consider the problem of detecting moving targets in this sensor field. The camera sensors have fields of view that overlap with the seismic sensors, and in general a single camera may cover more than one seismic sensor. The primary challenge is working with low-resolution videos that have fewer than 50 pixels on the objects of interest. The proposed approach uses convolutional neural networks (CNNs) that are first trained on the individual sensor modalities. The bilinear model then takes the vector outer product of the resulting CNN descriptors, maintaining the spatial and temporal localization information implicit in the individual CNN outputs. We propose a pooling method to further compress the high-dimensional feature vectors produced by the vector outer product. This is followed by a final training step in which a 3D CNN is trained for detection. Experimental results demonstrate that fusing the multimodal data significantly improves detection performance.
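
The outer-product fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the video and seismic CNN descriptors have already been resampled onto a common (T, H, W) grid, and all array shapes and the function names `bilinear_fuse` and `block_sum_pool` are hypothetical. The abstract does not specify the pooling method, so the block-sum pooling shown here is only a stand-in to show how the outer-product dimension might be compressed.

```python
import numpy as np


def bilinear_fuse(video_feat, seismic_feat):
    """Outer product of the two descriptors at every (t, y, x) location.

    video_feat:   array of shape (T, H, W, Cv)
    seismic_feat: array of shape (T, H, W, Cs)
    returns:      array of shape (T, H, W, Cv, Cs)

    The (T, H, W) axes are left untouched, so the spatial and temporal
    layout of the inputs is preserved in the fused output.
    """
    return np.einsum("thwi,thwj->thwij", video_feat, seismic_feat)


def block_sum_pool(fused, block):
    """Compress the (Cv, Cs) grid by summing non-overlapping block x block tiles.

    This is an assumed stand-in for the pooling step, not the paper's method.
    returns: array of shape (T, H, W, (Cv // block) * (Cs // block))
    """
    T, H, W, Cv, Cs = fused.shape
    if Cv % block or Cs % block:
        raise ValueError("channel counts must be divisible by block")
    tiles = fused.reshape(T, H, W, Cv // block, block, Cs // block, block)
    pooled = tiles.sum(axis=(4, 6))
    return pooled.reshape(T, H, W, -1)
```

With hypothetical descriptors of 8 video channels and 6 seismic channels, `bilinear_fuse` yields 48 values per location, and `block_sum_pool(fused, 2)` reduces that to 12. The pooled volume keeps its (T, H, W) layout, so it could in principle be passed to a 3D CNN for the final detection stage.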

Oytun Ulutan, Benjamin Riggan, Nasser Nasrabadi, B.S. Manjunath,
IEEE Winter Conf. on Applications of Computer Vision (WACV 2018), March 12-14, 2018, Lake Tahoe, NV/CA.
Node ID: 711, Lab: VRL, Target: Conference