📝 Publications
* denotes equal contribution
BeingBeyond Series

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Hao Luo*, Yicheng Feng*, Wanpeng Zhang*, Sipeng Zheng*, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu
- Being-H0 is the first VLA model pretrained on large-scale human videos with hand motion.

RLPF (RL from Physical Feedback): Aligning Large Motion Models with Humanoid Control
Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, Zongqing Lu
- RLPF translates text-driven human motions into executable actions for humanoid robots.

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Bin Cao*, Sipeng Zheng*, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu
ICCV25
- Being-M is the first large motion generation model scaling to million-level motion sequences.
Being-M0: Scaling Large Motion Models with Million-Level Human Motions (ICML 2025) | page

Being-VL0.5: Unified Multimodal Understanding via Byte-Pair Visual Encoding
Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu.
ICCV25 (Highlight)
- Being-VL is the first large multimodal model built on compressed discrete visual representations using 2D-BPE.
Being-VL-0: From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities (ICLR 2025) | page
🎙 Before BeingBeyond

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds, Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu.
ICLR24 (Spotlight 5.02%)

Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning, Sipeng Zheng, Shizhe Chen, Qin Jin.
ECCV22

VRDFormer: End-to-end video visual relation detection with transformer, Sipeng Zheng, Shizhe Chen, Qin Jin.
CVPR22 (Oral 4.14%)
📚 Paper List
[Arxiv 2025] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos, Hao Luo*, Yicheng Feng*, Wanpeng Zhang*, Sipeng Zheng*, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu. | Project
[Arxiv 2025] RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control, Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, Zongqing Lu. | Project
[NeurIPS 2025] OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data, Hao Luo, Zihao Yue, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Deheng Ye, Zongqing Lu.
[NeurIPS 2025] EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining, Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin.
[EMNLP 2025] Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning, Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu.
[ICCV 2025] Unified Multimodal Understanding via Byte-Pair Visual Encoding, Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu.
[ICCV 2025] MotionCtrl: A Real-time Controllable Vision-Language-Motion Model, Bin Cao*, Sipeng Zheng*, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu.
[ICCV 2025] VideoOrion: Tokenizing Object Dynamics in Videos, Yicheng Feng*, Yijiang Li*, Wanpeng Zhang, Sipeng Zheng, Zongqing Lu.
[Arxiv 2025] QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds, Yuting Mei*, Ye Wang*, Sipeng Zheng, Qin Jin. | Project
[ICML 2025] Scaling Large Motion Models with Million-Level Human Motions, Ye Wang*, Sipeng Zheng*, Bin Cao, Qianshan Wei, Weishuai Zeng, Qin Jin, Zongqing Lu. | Project
[ICLR 2025] EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?, Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin. | Code
[ICLR 2025] From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities, Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu. | Project
[3DV 2025] SPAFormer: Sequential 3D Part Assembly with Transformers, Boshen Xu, Sipeng Zheng, Qin Jin. | Code
[ECCV 2024] UniCode: Learning a Unified Codebook for Multimodal Large Language Models, Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu.
[ICLR 2024] Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds, Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu.
[NAACL 2024] LLaMA Rider: Spurring Large Language Models to Explore the Open World, Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu. | Project
[ACM-MM 2023] POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World, Boshen Xu, Sipeng Zheng, Qin Jin. | Code | Project
[AAAI 2023] No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection, Qi Zhang, Sipeng Zheng, Qin Jin. | Code
[CVPR 2023] Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework, Sipeng Zheng, Boshen Xu, Qin Jin.
[AAAI 2023] Accommodating Audio Modality in CLIP for Multimodal Processing, Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin.
[IEEC 2023] Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos, Sipeng Zheng, Bei Liu, Jianlong Fu, Wen-Huang Cheng. | Code
[ECCV 2022] Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning, Sipeng Zheng, Shizhe Chen, Qin Jin.
[CVPR 2022] VRDFormer: End-to-End Video Visual Relation Detection with Transformers, Sipeng Zheng, Shizhe Chen, Qin Jin. | Code
[CVPR 2022 Workshop] Exploring Anchor-Based Detection for Ego4D Natural Language Query, Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu. | Code
[ICME 2020] Skeleton-Based Interactive Graph Network for Human-Object Interaction Detection, Sipeng Zheng, Shizhe Chen, Qin Jin. | Code
[ACM-MM 2019] Visual Relation Detection with Multi-Level Attention, Sipeng Zheng, Shizhe Chen, Qin Jin.
[ACM-MM 2019] Relation Understanding in Videos, Sipeng Zheng, Xiangyu Chen, Shizhe Chen, Qin Jin.