Multi-view Tracking, Re-ID, and Social Network Analysis

of a Flock of Visually Similar Birds in an Outdoor Aviary

Shiting Xiao* Yufu Wang* Ammon Perkes Bernd G. Pfrommer Marc F. Schmidt Kostas Daniilidis Marc Badger
We present a pipeline for cowbird tracking and recognition. (A) A synchronized set of raw videos from multiple views is processed frame by frame. (B) Segmentation masks of bird instances are obtained using a Mask R-CNN network and background subtraction. (C) Point clouds are reconstructed by multi-view matching, triangulation, and clustering. (D) Tracking, implemented with a Lagrangian Particle Tracking (LPT) algorithm, links point clouds over time to form tracklets; re-tracking then associates 3D tracklets into longer 3D tracks. (E) Individual identities are recognized using the FastReID framework. (F) Output from the pipeline can then be used for social network analysis.
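For readers implementing a similar system, the triangulation step in (C) can be illustrated with a standard direct linear transform (DLT). The sketch below is a minimal example under assumed inputs (per-view 3x4 projection matrices from calibration and matched 2D detections), not the pipeline's actual implementation.

```python
import numpy as np

def triangulate_point(projection_matrices, points_2d):
    """Triangulate one 3D point from matched 2D detections in multiple views
    using the direct linear transform (DLT).

    projection_matrices: list of 3x4 camera matrices P = K [R | t]
    points_2d: list of (x, y) pixel coordinates, one per view
    """
    rows = []
    for P, (x, y) in zip(projection_matrices, points_2d):
        # Each view adds two linear constraints on the homogeneous 3D point X:
        # x * (P[2] @ X) - (P[0] @ X) = 0  and  y * (P[2] @ X) - (P[1] @ X) = 0
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize to a 3D point
```

Applying this to every matched detection across views yields the per-frame point cloud that is then clustered into bird instances.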

Abstract

The ability to capture detailed interactions among individuals in a social group is foundational to our study of animal behavior and neuroscience. Recent advances in deep learning and computer vision are driving rapid progress in methods that can record the actions and interactions of multiple individuals simultaneously. Many social species, such as birds, however, live deeply embedded in a three-dimensional world. This world introduces additional perceptual challenges, such as occlusions, orientation-dependent appearance, large variation in apparent size, and poor sensor coverage for 3D reconstruction, that are not encountered by applications studying animals that move and interact only on 2D planes. Here we introduce a system for studying the behavioral dynamics of a group of songbirds as they move throughout a 3D aviary. We study the complexities that arise when tracking a group of closely interacting animals in three dimensions and introduce a novel dataset for evaluating multi-view trackers. Finally, we analyze captured ethogram data and demonstrate that social context affects the distribution of sequential interactions between birds in the aviary.


Dataset

Our dataset captures the social interactions of 15 cowbirds housed together in an outdoor aviary over the course of a three-month mating season. Our dataset for multi-view multi-object tracking originates from four 15-minute segments drawn from one day in early April and two days in mid-May. We chose these months because we expected to see rapid change in the social network across this period: the social network, including pair bonds, is not yet formed in April but solidifies by mid-May.
We collected 986 motion sequences and formed our “Where’d It LanD” (WILD) challenge. Motion sequences are typically between 15 and 200 frames long (a), with endpoints separated by 0-6 meters (b). In a reconstruction of all stationary sequence start and end points (c), areas of high point density reveal the perch geometry and the ground plane. Lines between motion sequence start and end points (d) reveal flights from perch to perch and between perches and the ground. Lines connect start and end points belonging to the same sequence; they do not indicate the actual trajectories. Points in (c) and (d) are colored by bird ID. Large spheres show the locations of the camera centers. An example from the dataset (e) shows the target bird’s start location (green), approximate flight trajectory (blue), and ending location (red). Image border colors denote the camera view and correspond to the large sphere colors in (c, d).
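For illustration only, each motion sequence in a dataset like this can be summarized by its bird ID, frame span, and 3D endpoints. The schema below is a hypothetical sketch, not the released annotation format, and the displacement it reports is the straight-line distance between endpoints, not the flight path length.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSequence:
    bird_id: int
    start_frame: int
    end_frame: int
    start_xyz: np.ndarray  # 3D take-off location (meters)
    end_xyz: np.ndarray    # 3D landing location (meters)

    @property
    def num_frames(self) -> int:
        return self.end_frame - self.start_frame + 1

    @property
    def displacement(self) -> float:
        # Straight-line distance between endpoints, not the path length.
        return float(np.linalg.norm(self.end_xyz - self.start_xyz))
```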

Results

Our tracking pipeline (shown at the top) produces good qualitative tracks for a variety of motion sequences. (a) Examples of detected bird instances with variations in pose, shape, lighting, scale, occlusion, and motion blur. (b) Example of a successful short track (56 frames), followed by its 2D projections in 3 different views. Colors indicating the camera views are consistent with those in the figure above. The green cube/circle marks the start position in 3D/2D and the red cube/circle marks the end position. Dots in the 2D images become smaller as the bird moves farther away and larger as it approaches. (c) Example of a successful long track (375 frames), during which the bird hops onto the wall and briefly pauses for 1-2 seconds. Examples in (b) and (c) are from video segments drawn from different days, demonstrating variable time of day and lighting.
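The 2D overlays in (b) and (c) come from projecting the 3D track into each calibrated camera view. The sketch below shows one way to compute such a projection, assuming pinhole intrinsics K and world-to-camera extrinsics (R, t); it is illustrative and not the pipeline's plotting code.

```python
import numpy as np

def project_track(track_xyz, K, R, t):
    """Project an (N, 3) array of 3D track points into one camera view.

    K: 3x3 intrinsics, R: 3x3 rotation, t: (3,) translation (world-to-camera).
    Returns (N, 2) pixel coordinates and per-point depth.
    """
    cam = (R @ track_xyz.T).T + t        # world -> camera coordinates
    depth = cam[:, 2]                    # distance along the optical axis
    pix_h = (K @ cam.T).T                # homogeneous pixel coordinates
    pix = pix_h[:, :2] / pix_h[:, 2:3]   # perspective divide
    return pix, depth
```

Scaling the plotted dot radius inversely with the returned depth reproduces the effect described above, where dots shrink as the bird moves away from the camera.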

Bird re-identification results. We use a ResNet50 network supervised with triplet and ID losses to predict the identity of perched birds.
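A minimal sketch of this supervision is shown below, assuming a PyTorch ResNet50 backbone with an embedding head trained jointly with a triplet loss and an identity (cross-entropy) classification loss; the margin, embedding dimension, and equal loss weighting are illustrative choices, and the actual training follows the FastReID framework.

```python
import torch
import torch.nn as nn
import torchvision

class BirdReID(nn.Module):
    """ResNet50 backbone with an embedding head (triplet loss) and an
    identity classifier head (ID / cross-entropy loss)."""

    def __init__(self, num_ids: int, emb_dim: int = 512):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.embed = nn.Linear(2048, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_ids)

    def forward(self, images):
        feats = self.features(images).flatten(1)  # (N, 2048) pooled features
        emb = self.embed(feats)                   # embedding used for retrieval
        logits = self.classifier(emb)             # per-identity logits
        return emb, logits

# Combined objective: triplet loss on embeddings plus ID loss on logits.
triplet_loss = nn.TripletMarginLoss(margin=0.3)
id_loss = nn.CrossEntropyLoss()

def reid_loss(anchor_emb, positive_emb, negative_emb, logits, labels):
    return triplet_loss(anchor_emb, positive_emb, negative_emb) + id_loss(logits, labels)
```

At test time, the identity of a perched-bird crop can be read from the classifier logits or by nearest-neighbor matching in the embedding space, depending on the evaluation protocol.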

Using our dataset, we analyzed the birds’ social network and investigated how their behavior depends on social context. We show that interaction transition probabilities differ between pair-bonded (a, n = 163 transitions) and non-pair-bonded (b, n = 187 transitions) males and females.
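For context, transition probabilities of the kind compared in (a) and (b) can be estimated by counting consecutive interaction types within each dyad's ethogram and row-normalizing the counts. The sketch below assumes a flat list of interaction labels per dyad; the labels and helper name are hypothetical.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def transition_probabilities(interactions):
    """Estimate first-order transition probabilities from an ordered list of
    interaction labels observed for one dyad.

    Returns {current_label: {next_label: probability}}.
    """
    counts = Counter(pairwise(interactions))
    probs = {}
    for (cur, nxt), n in counts.items():
        probs.setdefault(cur, {})[nxt] = n
    for cur, row in probs.items():
        total = sum(row.values())
        probs[cur] = {nxt: n / total for nxt, n in row.items()}
    return probs

# Hypothetical ethogram excerpt for one male-female pair.
example = ["approach", "song", "approach", "displacement", "approach", "song"]
print(transition_probabilities(example))
```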


Acknowledgements

We thank Henry Korpi, Ana Alonso, Greg Forkin, and Marcelina Martynek for helpful discussions and their many contributions to the dataset annotations. We gratefully acknowledge support through the following grants: National Science Foundation IOS-1557499, National Science Foundation MRI 1626008, and National Science Foundation NCS-FO 2124355.

The design of this project page was based on this website.