Sipeng Zheng

I am Sipeng Zheng, a researcher in the Multimodal Interaction Research Group at the Beijing Academy of Artificial Intelligence (BAAI), where I collaborate with Prof. Zongqing Lu. I received my Ph.D. and bachelor’s degree from Renmin University of China (RUC) under the guidance of Prof. Qin Jin. My research focuses on human behavior understanding, vision-and-language learning, and open-world embodied agents. I am currently working toward building an intelligent humanoid robot. For more details, please refer to my CV.

Research Interests

  • Human behavior understanding
  • Large language models and large multimodal models
  • Open-world embodied agent learning

Work Experience

  • Jul. 2023 - Present: Research Scientist
    • Beijing Academy of Artificial Intelligence, Beijing, China
    • Duties included: large multimodal models, multi-agent learning.
  • Apr. 2022 - Oct. 2022: Research Intern
    • Microsoft Research Asia, Beijing, China
    • Duties included: temporal sentence grounding for long-term videos.
  • Nov. 2021 - Apr. 2022: Research Intern
    • Beijing Academy of Artificial Intelligence, Beijing, China
    • Duties included: multi-lingual language-vision-audio pre-training.

Education

  • B.Eng. in Computer Science and Engineering, Renmin University of China, China, 2023
  • Ph.D. in Computer Science and Engineering, Renmin University of China, China, 2023

Publications

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models
Ye Wang*, Sipeng Zheng*, Bin Cao, Qianshan Wei, Qin Jin, Zongqing Lu
arXiv
[pdf]

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu
arXiv
[pdf]

QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds
Yuting Mei*, Ye Wang*, Sipeng Zheng, Qin Jin
arXiv
[pdf] [page]

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?
Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin
arXiv
[pdf] [code]

SPAFormer: Sequential 3D Part Assembly with Transformers
Boshen Xu, Sipeng Zheng, Qin Jin
3DV 2025
[pdf] [code]

UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu
ECCV 2024
[pdf]

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu
ICLR 2024
[pdf] [code] [page]

LLaMA Rider: Spurring Large Language Models to Explore the Open World
Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu
NAACL 2024
[pdf] [code]

POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World
Boshen Xu, Sipeng Zheng, Qin Jin
ACM MM 2023
[pdf] [code] [page]

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Qi Zhang, Sipeng Zheng, Qin Jin
arXiv
[pdf] [code]

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Sipeng Zheng, Boshen Xu, Qin Jin
CVPR 2023
[pdf]

Accommodating audio modality in CLIP for multimodal processing
Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
AAAI 2023
[pdf]

Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos
Sipeng Zheng, Bei Liu, Jianlong Fu, Wen-Huang Cheng
ICCE 2023
[pdf] [code]

Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning
Sipeng Zheng, Shizhe Chen, Qin Jin
ECCV 2022
[pdf] [code]

VRDFormer: End-to-end video visual relation detection with transformer
Sipeng Zheng, Shizhe Chen, Qin Jin
CVPR 2022 (Oral)
[pdf] [code]

Exploring anchor-based detection for Ego4D natural language query
Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
CVPR Workshop 2022
[pdf] [code]

Skeleton-based interactive graph network for human object interaction detection
Sipeng Zheng, Shizhe Chen, Qin Jin
ICME 2020
[pdf] [code]

Visual relation detection with multi-level attention
Sipeng Zheng, Shizhe Chen, Qin Jin
ACM MM 2019
[pdf]

Relation understanding in videos
Sipeng Zheng, Xiangyu Chen, Shizhe Chen, Qin Jin
ACM MM Grand Challenge 2019
[pdf]

Awards

  • National Scholarship for Ph.D. Students.
  • 2022 Ranked 3rd in the CVPR 2022 Ego4D Natural Language Query Challenge.
  • 2021 Ranked 3rd in the NIST TRECVID 2021 Ad-hoc Video Search (AVS) Challenge (20+ teams).
  • 2021 Ranked 4th in the CVPR 2021 HOMAGE Scene-graph Generation Challenge.
  • 2020 Ranked 2nd in the ACM MM 2020 Video Relationship Understanding Grand Challenge.
  • 2019 Ranked 2nd in the ACM MM 2019 Video Relationship Understanding Grand Challenge.
  • Best Method Prize in ACM MM 2019 Grand Challenge.
  • 2018 - 2021 First-Class Scholarship for Ph.D. Students.
  • 2015 First Prize in the National University Mathematical Modeling Competition (Beijing Area).

Services

  • Conference Reviewer for CVPR, ICCV, ECCV, ACCV, NeurIPS, AAAI, ACM MM, ICME.
  • Journal Reviewer for IJCV, TCSVT, TMM, JATS.