We introduce ArtiPoint, a framework that estimates motion models of articulated objects at the scene level from in-the-wild RGB-D data using deep point tracking and factor graph optimization. Additionally, we present Arti4D, the first ego-centric demonstration dataset of articulated object interactions with detailed articulation labels and camera poses.
Teaser Image

Abstract

Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic, unconstrained environments. In contrast, humans effortlessly infer articulation modes by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework capable of inferring articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset capturing articulated object interactions at the scene level, accompanied by articulation labels and ground-truth camera poses. We benchmark ArtiPoint against a range of classical and modern deep learning baselines, demonstrating its superior performance on Arti4D. We make our code and Arti4D publicly available.

ArtiPoint Method

Overview of our approach
ArtiPoint: We take an ego-centric RGB-D video as input and employ hand tracking as a trigger signal to identify interaction segments (top left). We uniformly sample points around the hand masks and prompt a class-agnostic instance segmentation model (MobileSAM), which yields masks of nearby objects that may be undergoing articulation. Given these masks, we detect stable keypoints (bottom left) and feed them into an any-point tracking model (CoTracker3) to obtain point trajectories over each articulation segment (top right). Finally, we estimate the underlying articulation model of the object through a factor graph formulation that operates directly on the obtained point trajectories (bottom right).
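To make the final step more concrete, the sketch below shows how a revolute articulation axis can be recovered from tracked 3D point trajectories. It is a simplified nonlinear least-squares analogue of the factor graph step, not the implementation used in the paper; all function and variable names are illustrative assumptions.

import numpy as np
from scipy.optimize import least_squares

def rotation_about_axis(d, theta):
    # Rodrigues' formula: rotation by angle theta about unit axis d.
    K = np.array([[0.0, -d[2], d[1]],
                  [d[2], 0.0, -d[0]],
                  [-d[1], d[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def revolute_residuals(params, trajectories):
    # params = [axis point c (3), axis direction d (3), one angle per later frame].
    n_frames = trajectories.shape[1]
    c, d = params[:3], params[3:6]
    d = d / np.linalg.norm(d)
    angles = params[6:6 + n_frames - 1]
    residuals = []
    for t, theta in enumerate(angles, start=1):
        R = rotation_about_axis(d, theta)
        # Rotating the first-frame points about the axis should reproduce frame t.
        pred = (R @ (trajectories[:, 0] - c).T).T + c
        residuals.append((pred - trajectories[:, t]).ravel())
    return np.concatenate(residuals)

def estimate_revolute_axis(trajectories):
    # trajectories: (num_points, num_frames, 3) array of tracked 3D points
    # on one articulated part, expressed in a common world frame.
    n_frames = trajectories.shape[1]
    x0 = np.concatenate([trajectories[:, 0].mean(axis=0),   # initial axis point
                         np.array([0.0, 0.0, 1.0]),         # initial axis direction
                         np.full(n_frames - 1, 0.1)])       # small nonzero initial angles
    sol = least_squares(revolute_residuals, x0, args=(trajectories,))
    c, d = sol.x[:3], sol.x[3:6]
    return c, d / np.linalg.norm(d)

In ArtiPoint, the analogous quantities are estimated jointly within a factor graph, which also accounts for prismatic joints and camera motion; the sketch above only illustrates the core geometric constraint between point trajectories and a revolute axis.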

Arti4D Dataset

Arti4D comprises 45 RGB-D sequences captured across 4 environments, together with ground-truth camera poses, articulation axis labels, interaction difficulty annotations, and interaction intervals.
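For orientation, here is a hypothetical sketch of how a single Arti4D sequence could be represented in code; the released format may differ, and all field names are assumptions rather than the actual dataset schema.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class ArticulationLabel:
    joint_type: str                  # e.g., "revolute" or "prismatic" (assumed naming)
    axis_origin: np.ndarray          # a 3D point on the articulation axis
    axis_direction: np.ndarray       # unit vector along the articulation axis

@dataclass
class Arti4DSequence:
    environment: str                              # one of the 4 capture environments
    rgb_frames: List[np.ndarray]                  # H x W x 3 color images
    depth_frames: List[np.ndarray]                # H x W depth maps
    camera_poses: List[np.ndarray]                # 4 x 4 ground-truth camera-to-world poses
    articulation_labels: List[ArticulationLabel]  # one label per articulated joint
    interaction_intervals: List[Tuple[int, int]]  # (start_frame, end_frame) per interaction
    interaction_difficulty: str                   # annotated difficulty of the interaction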

Code & Dataset

We will make the Arti4D dataset and our code publicly available soon. Stay tuned!

Publications

If you find our work useful, please consider citing our paper:

Abdelrhman Werby, Martin Büchner, Adrian Roefer, Chenguang Huang, Wolfram Burgard, and Abhinav Valada
Articulated Object Estimation in the Wild
Conference on Robot Learning (CoRL), 2025.

(PDF) (BibTeX)

Authors

Abdelrhman Werby*

University of Freiburg

Martin Büchner*

University of Freiburg

Adrian Röfer*

University of Freiburg

Chenguang Huang

University of Technology Nuremberg

Wolfram Burgard

University of Technology Nuremberg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by an academic grant from NVIDIA and by the BrainLinks-BrainTools center of the University of Freiburg.