My Research Journey in 3D Human Digitization

How It All Started

My research journey began with a focus on object detection during my early graduate studies. Working on weakly semi-supervised methods, I developed Point-Teaching, which leveraged point annotations to bridge the gap between fully supervised and weakly supervised detection — a problem that fascinated me because of its practical implications for reducing annotation costs.

Diving into Human Pose Estimation

The natural next step was understanding human pose. With Poseur, published at ECCV 2022, we explored a regression-based approach using transformers for 2D human pose estimation. This work challenged the dominant heatmap-based paradigm and showed that direct regression could be competitive with heatmap-based methods.

The Perspective Distortion Problem

A key insight came when we noticed that existing 3D human mesh reconstruction methods failed badly under perspective distortion — the kind you see in close-up photos or wide-angle lenses. This led to Zolly (ICCV 2023 Oral), which explicitly modeled focal length effects in human mesh recovery.

\[\mathbf{x}_{2d} = \Pi(f, \mathbf{X}_{3d}) = f \cdot \frac{\mathbf{X}_{3d}}{Z} + \mathbf{c}\]

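The equation above captures the core idea. Here \(Z\) is the depth of the 3D point \(\mathbf{X}_{3d}\), \(\mathbf{c}\) is the principal point, and only the first two components of \(\mathbf{X}_{3d} / Z\) are kept for the 2D location. The projection function \(\Pi\) is parameterized by focal length \(f\), and getting this wrong leads to systematic distortions in the reconstructed mesh.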
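To make the effect concrete, here is a minimal sketch of the projection above, assuming an ideal pinhole camera with no lens distortion. The function names `project` and `near_to_far_ratio`, the distances, and the focal lengths are all made up for illustration; this is not code from Zolly.

```python
def project(point_3d, focal_length, principal_point=(0.0, 0.0)):
    """Ideal pinhole projection: x_2d = f * (X, Y) / Z + c."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    cx, cy = principal_point
    return (focal_length * x / z + cx, focal_length * y / z + cy)


def near_to_far_ratio(camera_distance, focal_length, reach=0.3, lateral=0.3):
    """Image-space offset of a near point relative to a far point.

    Both points sit `lateral` metres off the optical axis. The far point is at
    depth `camera_distance` (e.g. the torso); the near point is `reach` metres
    closer to the camera (e.g. an outstretched hand).
    """
    far_x, _ = project((lateral, 0.0, camera_distance), focal_length)
    near_x, _ = project((lateral, 0.0, camera_distance - reach), focal_length)
    return near_x / far_x


# Hypothetical setups: the focal length (in pixels) scales with distance, so
# the torso-plane point lands at the same pixel offset in both cases.
close_up = near_to_far_ratio(camera_distance=0.8, focal_length=800.0)
telephoto = near_to_far_ratio(camera_distance=8.0, focal_length=8000.0)

print(f"close-up ratio:  {close_up:.3f}")
print(f"telephoto ratio: {telephoto:.3f}")
```

The ratio reduces to the far depth divided by the near depth, so it is roughly 1.6 for the close-up and roughly 1.04 for the distant camera, even though the torso appears at the same size in both images. A model that assumes a single fixed focal length cannot account for that difference, which is the kind of systematic error described above.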

Scaling Up with Synthetic Data

One persistent challenge in 3D human reconstruction is the lack of diverse training data. In HumanWild, we tackled this by using generative models to create synthetic training data with realistic appearance and accurate 3D annotations. This work, published in TPAMI 2025, showed that models trained on carefully designed synthetic data can match or exceed the performance of models trained on real data.

Current Directions

My current work focuses on two exciting directions:

4D Scene Reconstruction: With POMATO, we’re exploring how to reconstruct dynamic 3D scenes from monocular video, combining pointmap matching with temporal motion modeling.
Geometry Estimation: GeoBench provides a comprehensive benchmark for evaluating monocular geometry estimation models, helping the community understand what works and why.

Lessons Learned

A few principles that have guided my research:

Start with the failure cases — they often reveal the most interesting problems
Synthetic data is underrated — with the right generation pipeline, it can be incredibly powerful
Benchmarks matter — good evaluation drives good research

If you’re interested in any of these topics or potential collaborations, feel free to reach out at yongtao.ge@adelaide.edu.au.