📝 Publications

* denotes equal contribution

BeingBeyond Series

arXiv

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Hao Luo*, Yicheng Feng*, Wanpeng Zhang*, Sipeng Zheng*, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu

Project | Code | Model

  • Being-H0 is the first VLA model pretrained on large-scale human videos with hand motion.
arXiv

RLPF: Reinforcement Learning from Physical Feedback: Aligning Large Motion Models with Humanoid Control
Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, Zongqing Lu

Project

  • RLPF translates text-driven human motions into executable actions for humanoid robots.
ICCV 2025

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Bin Cao*, Sipeng Zheng*, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu

Project

  • Being-M is the first large motion generation model to scale to million-level motion sequences.

Being-M0: Scaling Large Motion Models with Million-Level Human Motions (ICML 2025) | page

ICCV 2025 (Highlight)

Being-VL-0.5: Unified Multimodal Understanding via Byte-Pair Visual Encoding
Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Project | Code

  • Being-VL is the first large multimodal model built on compressed discrete visual representations via 2D byte-pair encoding (2D-BPE).

Being-VL-0: From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities (ICLR 2025) | page

🎙 Before BeingBeyond

ICLR 2024 (Spotlight, 5.02%)

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

Project | Code

CVPR 2022 (Oral, 4.14%)

VRDFormer: End-to-End Video Visual Relation Detection with Transformer
Sipeng Zheng, Shizhe Chen, Qin Jin

Code

📚 Paper List