Welcome to My Homepage!

I’m a partner at BeingBeyond, a startup dedicated to advancing foundation models for general-purpose humanoid robots, where I collaborate closely with Prof. Zongqing Lu. I currently lead the Embodied Multimodal Pretraining team at BeingBeyond, with projects including the Being-H, Being-M, and Being-VL series. Prior to this, I was a researcher at the Beijing Academy of Artificial Intelligence (BAAI). I obtained my PhD and bachelor’s degree from Renmin University of China (RUC), under the guidance of Prof. Qin Jin. My research primarily focuses on human behavior understanding, vision-and-language learning, and the development of open-world embodied agents. Currently, I am working towards building intelligent humanoid robots. For more details, please refer to my CV.

Join Us!

We are actively recruiting full-time researchers and interns to join our team. If you’re passionate about embodied AI, feel free to reach out.

Our research blogs: https://research.beingbeyond.com/

Research Interests

  • Large multimodal models
  • Human behavior and motion understanding
  • Vision-language-action models
  • Humanoid robots

🔥 News

  • 2026.01: ⭐ We release Being-H0.5, our latest cross-embodiment foundation VLA.
  • 2025.09: 🎉 Two papers are accepted to NeurIPS’25.
  • 2025.08: ⭐ We present Being-M0.5, an improved version of Being-M0 with real-time controllability.
  • 2025.07: ⭐ We release Being-VL-0.5, the next version of our LMM, including code and checkpoints.
  • 2025.07: ⭐ We release Being-H0, the first VLA pretrained from large-scale human videos with hand motion!
  • 2025.06: 🎉 Three papers are accepted to ICCV’25.
  • 2025.06: 🏆 We won 1st place in GemBench Challenge at CVPR 2025 Workshop GRAIL.
  • 2025.05: ⭐ We present Being-M0, our first million-level motion model, which is accepted to ICML 2025.
  • 2024.10: ⭐ We present Being-VL-0, which is accepted to ICLR 2025.

📝 Publications

* denotes equal contribution, † denotes project lead, ✉ denotes corresponding author

BeingBeyond Series

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
Hao Luo*, Ye Wang*, Wanpeng Zhang*, Sipeng Zheng*, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, Yicheng Feng, Zongqing Lu

arXiv

Blog | Code | HuggingFace Model

  • Robots do not just look different. They also act through different physical control languages: different kinematics, sensors, action conventions, and timing. Being-H0.5 is our attempt to make one Vision-Language-Action model travel across those differences without turning into a brittle collection of per-robot hacks. The model is trained on over 35,000 hours of data, including 16,000 hours of human videos and 14,000 hours of robot manipulation data covering 30+ embodiments.

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Hao Luo*, Yicheng Feng*, Wanpeng Zhang*, Sipeng Zheng*, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu

arXiv

Blog | Code | HuggingFace Model

  • Being-H0 is the first VLA pretrained from large-scale human videos with hand motion.

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Bin Cao*, Sipeng Zheng*, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu

ICCV25

Blog | Page

  • Being-M is the first large motion generation model scaling to million-level motion sequences.

Being-M0: Scaling Large Motion Models with Million-Level Human Motions (ICML 2025)

Being-VL-0.5: Unified Multimodal Understanding via Byte-Pair Visual Encoding
Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

ICCV25 (Highlight)

Blog | Code | Page

  • Being-VL is the first large multimodal model built on compressed discrete visual representations via 2D byte-pair encoding (2D-BPE).

Being-VL-0: From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities (ICLR 2025)

🎙 Before BeingBeyond

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

ICLR24 (Spotlight 5.02%)

Project | Code

VRDFormer: End-to-End Video Visual Relation Detection with Transformers
Sipeng Zheng, Shizhe Chen, Qin Jin

CVPR22 (Oral 4.14%)

Code

📚 Paper List

🎖 Honors and Awards

  • 2025 Ranked 1st in GemBench Challenge at CVPR 2025 Workshop GRAIL.
  • 2022 Ranked 3rd in CVPR 2022 Ego4D Natural Language Query Challenge.
  • 2021 Ranked 3rd in NIST TRECVID 2021 Ad-hoc Video Search (AVS) Challenge.
  • 2021 Ranked 2nd in CVPR 2021 HOMAGE Scene-graph Generation Challenge.
  • 2020 Ranked 2nd in ACM MM 2020 Video Relationship Understanding Grand Challenge.
  • 2019 Ranked 2nd in ACM MM 2019 Video Relationship Understanding Grand Challenge.
  • 2022 National Scholarship for Ph.D. Students.
  • 2019 Best Method Prize in ACM MM 2019 Grand Challenge.
  • 2019 First-Class Scholarship for Ph.D. Students, from 2018 to 2021.
  • 2015 First Prize in the National University Mathematical Modeling Competition, Beijing Area.

📖 Education

  • 2018.09 - 2023.06, PhD, Computer Science and Engineering, Renmin University of China, China.
  • 2014.09 - 2018.06, Undergraduate, Computer Science and Engineering, Renmin University of China, China.

💻 Work Experience

  • 2025.05 - now, Research Scientist; BeingBeyond, Beijing, China.
  • 2023.07 - 2025.05, Researcher; Beijing Academy of Artificial Intelligence, Beijing, China.
  • 2022.04 - 2022.10, Research Intern; Microsoft Research Asia, Beijing, China.
  • 2021.11 - 2022.04, Research Intern; Beijing Academy of Artificial Intelligence, Beijing, China.

🔧 Services

  • Conference Reviewer for CVPR, ICCV, ECCV, ACCV, NeurIPS, AAAI, ACM MM.
  • Journal Reviewer for IJCV, TCSVT, TMM, JATS.