Sipeng Zheng
I’m Sipeng Zheng, currently a researcher with the Multimodal Interaction Research Group at the Beijing Academy of Artificial Intelligence (BAAI), where I collaborate with Prof. Zongqing Lu. I obtained my Ph.D. and bachelor’s degrees from Renmin University of China (RUC) under the guidance of Prof. Qin Jin. My research primarily focuses on human behavior understanding, vision-and-language learning, and the development of open-world embodied agents. Currently, I am working toward building an intelligent humanoid robot. For more details, please refer to my CV.
Research Interests
- Human behavior understanding
- Large language models and large multimodal models
- Open-world embodied agent learning
Work Experience
- Jul. 2023 - Current: Research Scientist
- Beijing Academy of Artificial Intelligence, Beijing, China
- Duties include: large multimodal models, multi-agent learning.
- Apr. 2022 - Oct. 2022: Research Intern
- Microsoft Research Asia, Beijing, China
- Duties included: temporal sentence grounding for long-term videos.
- Nov. 2021 - Apr. 2022: Research Intern
- Beijing Academy of Artificial Intelligence, Beijing, China
- Duties included: multilingual language-vision-audio pre-training.
Education
- Ph.D. in Computer Science and Engineering, Renmin University of China, China, 2023
- B.Eng. in Computer Science and Engineering, Renmin University of China, China, 2023
Publications
Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds
EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?
SPAFormer: Sequential 3D Part Assembly with Transformers
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
LLaMA Rider: Spurring Large Language Models to Explore the Open World
POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Accommodating Audio Modality in CLIP for Multimodal Processing
Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos
Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
Exploring Anchor-Based Detection for Ego4D Natural Language Query
Skeleton-Based Interactive Graph Network for Human Object Interaction Detection
Visual Relation Detection with Multi-Level Attention
Relation Understanding in Videos
Awards
- National Scholarship for Ph.D. Students.
- 2022 Ranked 3rd in CVPR 2022 Ego4D Natural Language Query Challenge.
- 2021 Ranked 3rd in NIST TRECVID 2021 Ad-hoc Video Search (AVS) Challenge. (20+ teams)
- 2021 Ranked 4th in CVPR 2021 HOMAGE Scene-graph Generation Challenge.
- 2020 Ranked 2nd in ACM MM 2020 Video Relationship Understanding Grand Challenge.
- 2019 Ranked 2nd in ACM MM 2019 Video Relationship Understanding Grand Challenge.
- Best Method Prize in ACM MM 2019 Grand Challenge.
- First Class Scholarship for Ph.D. Students from 2018 to 2021.
- 2015 First Prize in the National University Mathematical Modeling Competition (Beijing Area).
Services
- Conference Reviewer for CVPR, ICCV, ECCV, ACCV, NeurIPS, AAAI, ACM MM, ICME.
- Journal Reviewer for IJCV, TCSVT, TMM, JATS.