Activity Analysis in Unconstrained Surveillance Videos


We detect seven activities defined by the TRECVID SED task: CellToEar, Embrace, ObjectPut, PeopleMeet, PeopleSplitUp, PersonRuns, and Pointing. We employ two different strategies to detect these activities, chosen according to their characteristics. Activities such as CellToEar, Embrace, ObjectPut, and Pointing result from the articulated motion of human body parts. We therefore use a bag-of-words strategy based on local spatio-temporal interest point (STIP) features for these activities. Visual vocabularies are constructed from the STIP features, and each activity is described by a histogram of visual words. We also construct an activity probability map for each camera-activity pair, which reflects the spatial distribution of that activity in that camera's view. We train a discriminative SVM classifier with a Gaussian kernel for each camera-activity pair. During evaluation, we employ a sliding-window technique: we slide spatio-temporal cuboids along both the spatial and temporal dimensions to find likely activities. Each cuboid is likewise described by a histogram of visual words, and the final decision is made using the SVM classifier and the activity probability map. For activities such as PeopleMeet, PeopleSplitUp, and PersonRuns, the trajectories of the persons involved are discriminative. For instance, trajectories in PeopleMeet converge over time, while those in PeopleSplitUp diverge. We therefore use a track-based string of feature graphs (SFG) to recognize these activities. Results of our experimental runs on the evaluation videos are comparable with those of other participants; our performance on every activity is among the top five teams.
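The histogram-of-visual-words description mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional descriptors, and the two-word vocabulary are all hypothetical, and nearest-neighbor assignment under squared Euclidean distance is assumed as the quantization rule. In practice the descriptors would be STIP features and the vocabulary would come from clustering them.

```python
def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor,
    using squared Euclidean distance (an assumed quantization rule)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, vocabulary):
    """Describe a set of local descriptors (e.g. those falling inside one
    spatio-temporal cuboid) as a normalized histogram of visual words."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        # No interest points in this cuboid: return an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


# Hypothetical toy data: a two-word vocabulary and four descriptors.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
cuboid_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [0.0, 0.0]]
histogram = bow_histogram(cuboid_descriptors, vocabulary)
```

A histogram of this kind is the sort of fixed-length vector that could then be passed to a kernel SVM, one per camera-activity pair.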
Mahmudul Hasan, Yingying Zhu, Santhoshkumar Sunderrajan, Niloufar Pourian, B.S. Manjunath and Amit Roy Chowdhury,
“Activity Analysis in Unconstrained Surveillance Videos”,
TRECVID Technical Report, 2012.