DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition
A COLLABORATION BETWEEN
Istanbul Medipol Üniversitesi, SRMIST
We are proud to introduce DetReIDX, a new benchmark dataset built for real-world, long-range human recognition. It supports key computer vision tasks such as person detection, re-identification (ReID), multi-view tracking, and action recognition, all captured in complex outdoor scenes using drones and ground cameras.
The dataset starts with an indoor session where each person is photographed from three angles — left, front, and right — and recorded while walking, enabling motion-based recognition like gait analysis.
In the outdoor sessions, drones capture videos from 18 viewpoints per subject, across different heights (up to 120m), distances, and camera angles (30°, 60°, 90°). Each person wears different outfits across sessions to simulate real-world variation.
Every frame is labeled with bounding boxes and 16 soft biometric attributes (like age, gender, clothing, action), offering fine-grained details for deep analysis.
With more than 13 million annotations and rich visual diversity, DetReIDX sets a new standard for evaluating human-centric AI in aerial and surveillance scenarios.
Figure 1: Comparison between publicly available datasets (ground-ground, aerial-aerial, and aerial-ground) and the DetReIDX dataset introduced in this paper. Unlike its counterparts, DetReIDX includes clothing variation, detection and tracking annotations, action labels, and wide aerial altitude coverage (5.8m–120m), making it well suited for long-range surveillance tasks.
DetReIDX is one of the most comprehensive UAV-based person identification datasets, featuring multi-altitude, multi-distance, and multi-session recordings.
- Altitude: 5.8m to 120m, providing unique perspective variation
- Distance: 10m to 120m, testing capability at extreme distances
- 2 sessions, with different clothing and conditions

DetReIDX exposes critical challenges in person recognition that are overlooked in traditional datasets but common in real-world UAV surveillance:
- Person ROIs range from full-HD indoor captures to sub-10px silhouettes at 120m altitude, testing resolution robustness.
- Subjects wear different outfits across sessions, requiring models to learn identity beyond superficial appearance cues.
- 18 unique UAV perspectives across three pitch angles (30°, 60°, 90°) challenge current view-specific approaches.
- Aerial-to-ground matching requires bridging vastly different capture modalities and perspectives.
- Real-world interference arises from motion blur, atmospheric conditions, and partial visibility.
- Multi-day sessions with environmental changes test long-term recognition capabilities.
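The resolution challenge above can be made concrete with a back-of-the-envelope pinhole-camera calculation. The field of view and frame width below are illustrative assumptions (roughly a wide-angle UAV camera recording at 1080p), not DetReIDX calibration values:

```python
import math

def person_height_px(person_h_m, distance_m, fov_deg=94.0, frame_w_px=1920):
    """Approximate on-screen height of a person via the pinhole model.

    fov_deg and frame_w_px are assumed camera parameters, not dataset values.
    """
    # focal length in pixels, derived from the horizontal field of view
    f_px = (frame_w_px / 2) / math.tan(math.radians(fov_deg / 2))
    return f_px * person_h_m / distance_m

# a 1.7 m person seen from ~120 m shrinks to on the order of ten pixels
px_far = person_height_px(1.7, 120)
px_near = person_height_px(1.7, 10)
```

Under these assumed parameters the far capture is roughly an order of magnitude smaller than the near one, which is why detectors trained only on close-range data degrade so sharply at the longest collection points.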
DetReIDX significantly exceeds prior datasets in altitude span, viewpoint coverage, identity diversity, and annotation richness. The table below compares key features across benchmark datasets for person detection, ReID, tracking, and action recognition.
No. | Dataset | Camera | View | Format | Detection | Tracking | ReID | Search | Action | PIDs | BBox | Height (m) | Distance (m) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRID-2011 [1] | UAV | Aerial | Still | ✗ | ✗ | ✓ | ✗ | ✗ | 1581 | 40K | 20~60 | - |
2 | CUHK03 [2] | CCTV | Ground | Still | ✗ | ✗ | ✓ | ✗ | ✗ | 1467 | 13K | - | - |
3 | iLIDS-VID [3] | CCTV | Ground | Video | ✗ | ✗ | ✓ | ✗ | ✗ | 300 | 42K | - | - |
4 | MRP [4] | UAV | Aerial | Video | ✓ | ✓ | ✓ | ✗ | ✗ | 28 | 4K | <10 | - |
5 | PRAI-1581 [5] | UAV | Aerial | Still | ✗ | ✗ | ✓ | ✗ | ✗ | 1581 | 39K | 20~60 | - |
6 | CSM [6] | Various | Aerial | Video | ✗ | ✗ | ✓ | ✗ | ✗ | 1218 | 11M | - | - |
7 | Market1501 [7] | CCTV | Ground | Still | ✓ | ✓ | ✓ | ✗ | ✗ | 1501 | 32.6K | <10 | - |
8 | Mini-drone [8] | UAV | Aerial | Video | ✓ | ✓ | ✗ | ✗ | ✓ | - | >27K | <10 | - |
9 | Mars [9] | CCTV | Ground | Video | ✓ | ✓ | ✓ | ✗ | ✗ | 1261 | 20K | - | - |
10 | AVI [10] | UAV | Aerial | Still | ✓ | ✓ | ✓ | ✓ | ✓ | 5124 | 10K | 2~8 | - |
11 | DUKEMTMC [11] | CCTV | Ground | Video | ✓ | ✓ | ✓ | ✗ | ✗ | 1812 | 815K | - | - |
12 | iQIYI-VID [12] | Various | Aerial | Video | ✓ | ✓ | ✓ | ✓ | ✗ | 5000 | 600K | - | - |
13 | DRone HIT [13] | UAV | Aerial | Still | ✓ | ✗ | ✓ | ✓ | ✗ | 101 | 40K | - | - |
14 | LTCC [14] | CCTV | Ground | Still | ✓ | ✗ | ✓ | ✓ | ✗ | 152 | 17K | - | - |
15 | P-DESTRE [15] | UAV | Aerial | Video | ✓ | ✓ | ✓ | ✓ | ✓ | 269 | >14.8M | 5.8~6.7 | - |
16 | UAVHuman [16] | UAV | Aerial | Still | ✗ | ✗ | ✓ | ✗ | ✗ | 1144 | 41K | 2~8 | - |
17 | AG-ReID.v2 [17] | UAV + CCTV | Ground + Aerial | Still | ✓ | ✓ | ✓ | ✓ | ✗ | 1615 | 100.6K | 15~45 | - |
18 | G2APS-ReID [18] | UAV + CCTV | Ground + Aerial | Still | ✓ | ✓ | ✓ | ✓ | ✗ | 2788 | 200.8K | 20~60 | - |
19 | DetReIDX (Ours) | DSLR + UAV | Ground + Aerial | Video + Still | ✓ | ✓ | ✓ | ✓ | ✓ | 334 | 13M | 5~120 | 10~120 |
DetReIDX is built through a multi-institutional collaboration across Portugal, Angola, Turkey, and India. It captures both indoor and outdoor scenarios to support robust evaluation of person re-identification, tracking, action recognition, and gait analysis.
Data was collected using high-resolution drones (e.g., DJI Phantom 4) and DSLR cameras under diverse altitudes (5–120m), distances (10–120m), and pitch angles (30°, 60°, 90°), across controlled labs and open campuses.
Setting | Environment | Altitude | Distance | Data Type |
---|---|---|---|---|
Indoor | Lab | Ground | Close | Profile Images, Gait Videos |
Outdoor | Campus | 5–120m | 10–120m | Multi-view Videos, Action Clips |
Our data collection process consists of two major phases:
Indoor data is captured with DSLR cameras, while all outdoor data is collected using **drones** for precise and consistent data gathering.
The indoor data collection consists of capturing three profile images of each subject: Right (R), Front (F), and Left (L), ensuring full-body visibility.
Additionally, gait videos are captured to analyze the subject's movement patterns.
The outdoor collection is conducted in two sessions, where participants wear different outfits in each session. Both sessions have 18 collection points, covering various distances and heights.
Both sessions use the same 18 collection points:

Point | Distance (m) | Height (m) |
---|---|---|
1 | 10 | 5.8 |
2 | 20 | 11.5 |
3 | 30 | 17.3 |
4 | 40 | 23.1 |
5 | 80 | 40 |
6 | 120 | 60 |
7 | 10 | 15 |
8 | 20 | 30 |
9 | 30 | 45 |
10 | 40 | 60 |
11 | 80 | 75 |
12 | 120 | 90 |
13 | 0 | 10 |
14 | 0 | 20 |
15 | 0 | 30 |
16 | 0 | 40 |
17 | 0 | 80 |
18 | 0 | 120 |
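Each collection point pairs a ground distance with an altitude, which together determine the camera's slant range and elevation angle. A small helper (illustrative, not part of any official dataset tooling) makes this geometry explicit; note that point 1 (10 m, 5.8 m) works out to roughly the nominal 30° pitch angle, and the zero-distance points are nadir (90°) views:

```python
import math

def view_geometry(distance_m, height_m):
    """Slant range and elevation angle for a UAV collection point.

    distance_m: horizontal ground distance to the subject
    height_m:   UAV altitude above the subject
    """
    slant = math.hypot(distance_m, height_m)
    elevation = math.degrees(math.atan2(height_m, distance_m))
    return slant, elevation

# Point 1 (10 m, 5.8 m): elevation ≈ 30°, close to the nominal pitch angle
# Point 13 (0 m, 10 m):  elevation = 90°, camera directly overhead
for d, h in [(10, 5.8), (10, 15), (0, 10)]:
    s, e = view_geometry(d, h)
    print(f"distance={d:>3} m, height={h:>4} m -> slant={s:.1f} m, elevation={e:.1f} deg")
```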
The DetReIDX dataset includes over 13 million manually annotated bounding boxes for 509 unique identities, collected across 36 UAV collection points per subject (18 in each of the two sessions) plus indoor references. All annotations were performed in CVAT and verified by multiple annotators to ensure frame-level accuracy and identity consistency.
Each subject is annotated with 16 soft biometric features, including demographic (e.g., age, gender), appearance (e.g., hair, clothing, accessories), and physical traits (e.g., height, body type). These enable fine-grained analysis beyond just visual appearance.
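Since the detection annotations are distributed in standard YOLO and COCO formats, they can be consumed with ordinary tooling. The sketch below parses one YOLO-format label line; the example values and the 1920×1080 frame size are illustrative, not taken from the dataset:

```python
from dataclasses import dataclass

@dataclass
class YoloBox:
    class_id: int
    cx: float  # box centre x, normalised to [0, 1]
    cy: float  # box centre y, normalised to [0, 1]
    w: float   # box width, normalised
    h: float   # box height, normalised

def parse_yolo_line(line: str) -> YoloBox:
    """Parse one line of a YOLO-format label file: 'class cx cy w h'."""
    cls, cx, cy, w, h = line.split()
    return YoloBox(int(cls), float(cx), float(cy), float(w), float(h))

# hypothetical label line; convert to pixels for a 1920x1080 frame
box = parse_yolo_line("0 0.512 0.430 0.031 0.092")
w_px, h_px = box.w * 1920, box.h * 1080
```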
Split | #Videos | #Images | #Annotations | Formats |
---|---|---|---|---|
Train | 120 | 131,580 | 5,095,539 | YOLO, COCO |
Validation | 56 | 63,591 | 2,483,836 | YOLO, COCO |
Test | 109 | 108,252 | 4,217,824 | YOLO, COCO |
Total | 285 | 303,423 | 11,797,199 | — |
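The split counts above are internally consistent, which is easy to check programmatically; the numbers below are copied directly from the table:

```python
# Per-split counts as published in the DetReIDX split table
splits = {
    "train":      {"videos": 120, "images": 131_580, "annotations": 5_095_539},
    "validation": {"videos": 56,  "images": 63_591,  "annotations": 2_483_836},
    "test":       {"videos": 109, "images": 108_252, "annotations": 4_217_824},
}

# Sum each column across splits and compare with the published totals
totals = {k: sum(s[k] for s in splits.values())
          for k in ("videos", "images", "annotations")}
# totals == {'videos': 285, 'images': 303423, 'annotations': 11797199}
```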
Scenario | #Query | #Gallery | Total Images |
---|---|---|---|
Train (Indoor + UAV) | -- | -- | 289,392 |
A2A (UAV → UAV) | 52,926 | 52,552 | 105,478 |
A2G (UAV → Indoor) | 106,927 | 7,959 | 114,886 |
G2A (Indoor → UAV) | 7,959 | 106,927 | 114,886 |
We conducted several experiments to evaluate the robustness and generalization of person detection models under varying conditions.
The tables below report person detection performance (AP50) under three settings: a baseline trained and tested on all conditions, interpolation and extrapolation across pitch angles, and cross-distance train/test combinations over distance groups D1–D3. Percentages give the change relative to each model's ALL/ALL baseline.
Experiment | Train | Test | AP50 |
---|---|---|---|
Baseline | ALL | ALL | 0.734 |
Interpolation | 30° & 90° | 60° | 0.669 ↓ (-8.86%) |
Extrapolation | 30° & 60° | 90° | 0.503 ↓ (-31.5%) |
D1 | D1 | D1 | 0.914 ↑ (+24.5%) |
D1 | D1 | D2 | 0.793 ↑ (+8.04%) |
D1 | D1 | D3 | 0.137 ↓ (-81.3%) |
D2 | D2 | D1 | 0.694 ↓ (-5.45%) |
D2 | D2 | D2 | 0.890 ↑ (+21.2%) |
D2 | D2 | D3 | 0.315 ↓ (-57.1%) |
D3 | D3 | D1 | 0.015 ↓ (-97.9%) |
D3 | D3 | D2 | 0.411 ↓ (-44.0%) |
D3 | D3 | D3 | 0.581 ↓ (-20.8%) |
Experiment | Train | Test | AP50 |
---|---|---|---|
Baseline | ALL | ALL | 0.608 |
Interpolation | 30° & 90° | 60° | 0.564 ↓ (-7.24%) |
Extrapolation | 30° & 60° | 90° | 0.474 ↓ (-22.04%) |
D1 | D1 | D1 | 0.857 ↑ (+40.9%) |
D1 | D1 | D2 | 0.380 ↓ (-37.5%) |
D1 | D1 | D3 | 0.008 ↓ (-98.7%) |
D2 | D2 | D1 | 0.582 ↓ (-4.28%) |
D2 | D2 | D2 | 0.776 ↑ (+27.6%) |
D2 | D2 | D3 | 0.111 ↓ (-81.75%) |
D3 | D3 | D1 | 0.004 ↓ (-99.3%) |
D3 | D3 | D2 | 0.274 ↓ (-54.9%) |
D3 | D3 | D3 | 0.408 ↓ (-32.9%) |
Experiment | Train | Test | AP50 |
---|---|---|---|
Baseline | ALL | ALL | 0.620 |
Interpolation | 30° & 90° | 60° | 0.514 ↓ (-17.1%) |
Extrapolation | 30° & 60° | 90° | 0.403 ↓ (-35.0%) |
D1 | D1 | D1 | 0.839 ↑ (+35.3%) |
D1 | D1 | D2 | 0.428 ↓ (-30.9%) |
D1 | D1 | D3 | 0.009 ↓ (-98.5%) |
D2 | D2 | D1 | 0.668 ↑ (+7.74%) |
D2 | D2 | D2 | 0.770 ↑ (+24.2%) |
D2 | D2 | D3 | 0.150 ↓ (-75.8%) |
D3 | D3 | D1 | 0.002 ↓ (-99.7%) |
D3 | D3 | D2 | 0.261 ↓ (-57.9%) |
D3 | D3 | D3 | 0.280 ↓ (-54.8%) |
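The arrows in the tables above report change relative to each model's ALL/ALL baseline. A small helper (illustrative, printing two decimal places where the tables sometimes round to one) reproduces that calculation for the first table's baseline of 0.734:

```python
def rel_change(ap50: float, baseline: float) -> str:
    """Format an AP50 score with its percent change versus a baseline."""
    pct = (ap50 - baseline) / baseline * 100
    arrow = "↑" if pct >= 0 else "↓"
    return f"{ap50:.3f} {arrow} ({pct:+.2f}%)"

baseline = 0.734  # the first detector's ALL/ALL AP50
print(rel_change(0.669, baseline))  # 0.669 ↓ (-8.86%)
print(rel_change(0.914, baseline))  # 0.914 ↑ (+24.52%)
```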
The tables below compare the performance of PersonViT, SeCap, and CLIP-ReID on the DetReIDX-ReID dataset across aerial and ground viewpoints. Metrics include mAP and CMC scores at Rank-1, Rank-5, and Rank-10.
Experiment | mAP | Rank-1 | Rank-5 | Rank-10 |
---|---|---|---|---|
Aerial → Aerial (LongTerm) | 9.9% | 8.8% | 14.4% | 17.6% |
Aerial → Ground | 22.3% | 19.6% | 24.8% | 27.6% |
Ground → Aerial | 23.3% | 51.9% | 59.4% | 63.0% |
Experiment | mAP | Rank-1 | Rank-5 | Rank-10 |
---|---|---|---|---|
Aerial → Aerial (LongTerm) | 11.16% | 8.20% | 13.03% | 16.16% |
Aerial → Ground | 20.49% | 18.08% | 21.50% | 23.43% |
Ground → Aerial | 21.23% | 50.89% | 57.68% | 60.72% |
Experiment | mAP | Rank-1 | Rank-5 | Rank-10 |
---|---|---|---|---|
Aerial → Aerial (LongTerm) | 9.5% | 8.9% | 12.8% | 15.3% |
Aerial → Ground | 22.0% | 19.7% | 24.0% | 26.2% |
Ground → Aerial | 20.8% | 58.1% | 63.1% | 65.2% |
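The mAP and Rank-k CMC numbers above are computed from a query-to-gallery distance matrix over learned embeddings. Below is a minimal sketch of that computation (a simplified protocol without the camera-ID filtering that full ReID evaluation applies); the toy distances and identities are made up for illustration:

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, topk=(1, 5, 10)):
    """CMC rank accuracies and mAP from a (num_query, num_gallery) distance matrix."""
    order = np.argsort(dist, axis=1)           # gallery indices sorted per query
    matches = g_ids[order] == q_ids[:, None]   # True where the correct identity appears
    cmc = {k: float(np.mean(matches[:, :k].any(axis=1))) for k in topk}
    aps = []                                   # average precision per query
    for row in matches:
        hits = np.flatnonzero(row)             # ranks (0-based) of correct matches
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))

# toy example: 2 queries, 4 gallery images
dist = np.array([[0.5, 0.1, 0.4, 0.8],
                 [0.7, 0.2, 0.6, 0.3]])
q_ids = np.array([0, 1])
g_ids = np.array([0, 1, 0, 1])
cmc, mean_ap = cmc_map(dist, q_ids, g_ids)
# Rank-1 is 0.5 here: the first query's nearest neighbour has the wrong identity
```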
Figure: Distribution of image resolutions (width × height) captured at different aerial distances. Each dot represents one image. Colors indicate image size relative to the mean area: Small, Medium, and Large. Min and Max images are marked with blue "X" and red "P" respectively.
The DetReIDX dataset is organized by task, with separate partitions for detection, tracking, ReID, and action recognition. Access is strictly limited to researchers from academic or government institutions, and only the requested module will be shared; to access the full dataset, submit a separate access request for each module.
We acknowledge and give credit to the following universities for their contributions (sorted in A–Z order): Istanbul Medipol University, J.N.N College of Engineering, SRM Institute of Science and Technology, Swami Ramanand Teerth Marathwada University, Nanded, Universidade Beira Interior, Universidade de Luanda.