The TVSeries Dataset is a realistic, large-scale dataset for action detection. It consists of 16 hours of video from six recent TV series. Thirty action classes are defined, and all occurrences are marked with their start and end times. For every action instance, we provide metadata (single person, occluded, part of the action missing...) that can be used to analyze the performance of a method on specific difficult cases.
Download
The annotations of the TVSeries Dataset can be downloaded here. The provided files contain detailed information on all actions of the 30 defined action classes in the selected TV series episodes: their start and end time, as well as the metadata.
The split of the dataset over training, validation and test set is also included. If you encounter any problem with the annotations, do not hesitate to send us an e-mail.
The dataset consists of the first few episodes of six recent TV series. The names of the series can be found in the paper. We encourage everyone to buy the first season of the series on DVD and rip the annotated episodes to obtain the video material. Our annotations can then be used with your video files. Please note that the exact content of a DVD can depend on the region. The position of episode six of Modern Family in particular can vary. We encourage everyone to check a few annotations for every episode to make sure no problems exist.
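The spot check suggested above can be partly automated. The sketch below is only an illustration: it assumes the annotations have already been loaded into dictionaries with the hypothetical keys `episode`, `action`, `start` and `end` (times in seconds), which is not necessarily the layout of the provided files. It picks a few annotations per episode and formats their start times so they can be looked up in a video player.

```python
import random


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as H:MM:SS.mmm for seeking in a player."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def sample_for_spot_check(annotations, per_episode=3, seed=0):
    """Pick a few annotations per episode to verify by eye against the video.

    `annotations` is a list of dicts with the hypothetical keys
    "episode", "action", "start" and "end" (times in seconds).
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    by_episode = {}
    for ann in annotations:
        by_episode.setdefault(ann["episode"], []).append(ann)
    picked = {}
    for episode, anns in by_episode.items():
        k = min(per_episode, len(anns))
        picked[episode] = sorted(rng.sample(anns, k), key=lambda a: a["start"])
    return picked


if __name__ == "__main__":
    # Made-up example records, not taken from the dataset.
    demo = [
        {"episode": "series_a_ep1", "action": "Drink", "start": 61.2, "end": 63.0},
        {"episode": "series_a_ep1", "action": "Run", "start": 3725.5, "end": 3730.0},
    ]
    for episode, anns in sample_for_spot_check(demo).items():
        for ann in anns:
            print(episode, ann["action"], format_timestamp(ann["start"]))
```

If the sampled actions do not appear at the printed times, the episode order or content of the DVD probably differs from the one that was annotated.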
If you cannot buy the DVDs, please print, fill out and sign this form (in which you promise to use the video data only for research purposes). Send us an e-mail with a scan of the completed form attached.
The CNN and LSTM models used for online action detection on this dataset can be found on GitHub.
The TVSeries Dataset is used in our paper on Online Action Detection. If you use this dataset or the models, please cite our ECCV paper (read it on arXiv):
De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C. & Tuytelaars, T. (2016). Online Action Detection. ECCV 2016.
Action classes
We defined 30 action classes (see Table 2). All episodes are manually annotated; in total, 6231 action instances were found. Only the start and end frames of each instance are annotated, not its spatial position. Actions can overlap in time.
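Because actions can overlap in time, a single frame may carry more than one label. The following sketch shows one way to turn temporal annotations into per-frame label sets and to list overlapping instances. The `ActionInstance` type, its field names and the inclusive end frame are assumptions made for illustration; they do not describe the format of the provided annotation files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInstance:
    """Hypothetical in-memory form of one annotated action."""
    action: str
    start_frame: int  # first frame of the action (assumed inclusive)
    end_frame: int    # last frame of the action (assumed inclusive)


def frame_labels(instances, num_frames):
    """Return, for each frame, the set of action classes active in that frame."""
    labels = [set() for _ in range(num_frames)]
    for inst in instances:
        first = max(inst.start_frame, 0)
        last = min(inst.end_frame, num_frames - 1)
        for frame in range(first, last + 1):
            labels[frame].add(inst.action)
    return labels


def overlapping_pairs(instances):
    """Return all pairs of instances that share at least one frame."""
    ordered = sorted(instances, key=lambda i: i.start_frame)
    pairs = []
    for idx, a in enumerate(ordered):
        for b in ordered[idx + 1:]:
            if b.start_frame > a.end_frame:
                break  # every later instance starts even later, so none overlaps a
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Made-up instances, not taken from the dataset.
    demo = [
        ActionInstance("Drink", 10, 20),
        ActionInstance("Sit down", 15, 30),
        ActionInstance("Run", 40, 50),
    ]
    labels = frame_labels(demo, num_frames=60)
    print(sorted(labels[17]))            # both Drink and Sit down are active here
    print(len(overlapping_pairs(demo)))  # one overlapping pair
```

Frames with an empty label set correspond to background, where none of the 30 classes is active.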
Table 2: Action classes of the TVSeries Dataset and their number of occurrences.
| Action class | #instances | Action class | #instances |
| Pick something up | 937 | Go up stairway | 119 |
| Point | 557 | Throw something | 119 |
| Drink | 440 | Get in/out of car | 112 |
| Stand up | 411 | Hang up phone | 105 |
| Run | 395 | Eat | 98 |
| Sit down | 314 | Answer phone | 96 |
| Read | 302 | Clap | 95 |
| Smoke | 290 | Dress up | 95 |
| Drive car | 248 | Undress | 95 |
| Open door | 237 | Kiss | 79 |
| Give something | 211 | Fall/trip | 77 |
| Use computer | 169 | Wave | 71 |
| Write | 149 | Pour | 62 |
| Go down stairway | 124 | Punch | 53 |
| Close door | 121 | Fire weapon | 50 |
| Total | 6231 | | |
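The counts in Table 2 show that the classes are far from balanced. As a small illustration, the sketch below stores the counts from the table, checks that they add up to the stated total, and computes the relative frequency of each class. The names `INSTANCE_COUNTS` and `class_frequencies` are introduced here only for this example.

```python
# Instance counts per action class, copied from Table 2.
INSTANCE_COUNTS = {
    "Pick something up": 937, "Point": 557, "Drink": 440, "Stand up": 411,
    "Run": 395, "Sit down": 314, "Read": 302, "Smoke": 290, "Drive car": 248,
    "Open door": 237, "Give something": 211, "Use computer": 169, "Write": 149,
    "Go down stairway": 124, "Close door": 121, "Go up stairway": 119,
    "Throw something": 119, "Get in/out of car": 112, "Hang up phone": 105,
    "Eat": 98, "Answer phone": 96, "Clap": 95, "Dress up": 95, "Undress": 95,
    "Kiss": 79, "Fall/trip": 77, "Wave": 71, "Pour": 62, "Punch": 53,
    "Fire weapon": 50,
}


def class_frequencies(counts):
    """Return the fraction of all instances that belongs to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Sanity checks against the numbers stated in the text.
    assert len(INSTANCE_COUNTS) == 30
    assert sum(INSTANCE_COUNTS.values()) == 6231

    freqs = class_frequencies(INSTANCE_COUNTS)
    most = max(INSTANCE_COUNTS, key=INSTANCE_COUNTS.get)
    least = min(INSTANCE_COUNTS, key=INSTANCE_COUNTS.get)
    print(f"most frequent:  {most} ({freqs[most]:.1%})")
    print(f"least frequent: {least} ({freqs[least]:.1%})")
```

The most frequent class has more than 18 times as many instances as the least frequent one, which is worth keeping in mind when comparing per-class results.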
Metadata
Every action instance is annotated with extra metadata. This metadata can be used to compare the performance of different methods on specific difficult cases. We provide the following metadata:
Single person: is there only one person visible during the action?
Atypical: is the action performed in an atypical, unusual way?
Shotcut: is there a shotcut during the action?
Moving camera: is the camera moving during the action?
Small or background: is the person performing the action very small or in the background?
Frontal viewpoint: is the action captured from a frontal viewpoint?
Side viewpoint: is the action captured from the side?
Special viewpoint: is the action captured from a special, unusual viewpoint?
Occlusion: is the action (partly) occluded by another person/object/...?
Spatial truncation: is the action truncated by the frame border, i.e., does it spatially extend beyond the frame?
Beginning missing: is the beginning of the action not visible or not recorded?
End missing: is the end of the action not visible or not recorded?
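The metadata flags above make it possible to evaluate a method on a subset of difficult cases. The sketch below illustrates the idea with a simple recall computed over the instances that have a given flag set. The `AnnotatedInstance` type, the snake_case flag names and the `detected` field are hypothetical, and plain recall is used only as a stand-in; it is not necessarily the evaluation metric used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedInstance:
    """Hypothetical record: one action instance plus its metadata flags."""
    action: str
    detected: bool  # whether the method under test found this instance
    metadata: dict = field(default_factory=dict)  # e.g. {"occlusion": True}


def recall_on_subset(instances, flag, value=True):
    """Recall over the instances whose metadata `flag` equals `value`.

    A missing flag is treated as False. Returns None for an empty subset.
    """
    subset = [i for i in instances if i.metadata.get(flag, False) == value]
    if not subset:
        return None
    return sum(i.detected for i in subset) / len(subset)


if __name__ == "__main__":
    # Made-up results, not measured on the dataset.
    demo = [
        AnnotatedInstance("Drink", True, {"occlusion": True}),
        AnnotatedInstance("Drink", False, {"occlusion": True}),
        AnnotatedInstance("Run", True, {"occlusion": False, "moving_camera": True}),
        AnnotatedInstance("Read", True, {}),
    ]
    print("occluded:    ", recall_on_subset(demo, "occlusion"))
    print("not occluded:", recall_on_subset(demo, "occlusion", value=False))
```

Comparing the score on a subset with the score on its complement, as in the usage lines, indicates how much a particular difficulty affects a method.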