Activity Analysis in Unconstrained Surveillance Videos
We detect the seven activities defined by the TRECVID SED task: CellToEar, Embrace, ObjectPut, PeopleMeet, PeopleSplitUp, PersonRuns, and Pointing. We employ two different strategies to detect these activities based on their characteristics. Activities such as CellToEar, Embrace, ObjectPut, and Pointing result from articulated motion of human body parts. Therefore, we employ a bag-of-words strategy based on local spatio-temporal interest point (STIP) features for these activities. Visual vocabularies are constructed from the STIP features, and each activity is described by a histogram of visual words. We also construct an activity probability map for each camera-activity pair that reflects the spatial distribution of the activity in that camera's view. We train a discriminative SVM classifier with a Gaussian kernel for each camera-activity pair. During evaluation we employ a sliding-window technique: we slide spatio-temporal cuboids in both the spatial and temporal directions to find likely activities. Each cuboid is likewise described by a histogram of visual words, and the final decision is made using the SVM classifier together with the activity probability map. For activities such as PeopleMeet, PeopleSplitUp, and PersonRuns, the trajectories of the persons involved are discriminative. For instance, trajectories of PeopleMeet converge over time, while those of PeopleSplitUp diverge. Therefore, we use a track-based string of feature graphs (SFG) to recognize these activities. Results of our experimental runs on the evaluation videos are comparable with those of other participants, and our performance on every activity places us among the top five teams.
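The bag-of-words pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the report does not specify the vocabulary size, the clustering method, or the STIP descriptor dimensionality, so those (and the synthetic data standing in for real STIP descriptors) are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Synthetic stand-ins for STIP descriptors (e.g., HOG/HOF around each
# interest point); one row per detected interest point.
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(500, 72))

# 1) Build a visual vocabulary by clustering descriptors (size assumed).
vocab_size = 16
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(train_descriptors)

def bow_histogram(descriptors):
    """Quantize descriptors to visual words; return a normalized histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

# 2) Describe labeled activity clips (synthetic here) as word histograms
#    and train a Gaussian-kernel (RBF) SVM, as in the report.
clips = [rng.normal(loc=l, size=(60, 72)) for l in (0.0, 1.0) for _ in range(10)]
labels = [0] * 10 + [1] * 10
X = np.array([bow_histogram(c) for c in clips])
svm = SVC(kernel="rbf", probability=True).fit(X, labels)

# 3) At evaluation time, each sliding spatio-temporal cuboid is scored the
#    same way; the report additionally weights this by the per-camera
#    activity probability map, which is omitted in this sketch.
score = svm.predict_proba(bow_histogram(clips[0])[None, :])[0, 1]
```

In the full system this SVM score would be combined with the activity probability map for the camera before thresholding the detection.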
“Activity Analysis in Unconstrained Surveillance Videos”,
TRECVID Technical Report, 2012.