Publications

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Published in arXiv preprint, 2023

This paper proposes the first large multi-modal model for open-world agents in Minecraft.

Recommended citation: S Zheng, J Liu, Y Feng, Z Lu. "Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds." 2023 arXiv preprint. arXiv:2310.13255. https://arxiv.org/abs/2310.13255

LLaMA Rider: Spurring Large Language Models to Explore the Open World

Published in arXiv preprint, 2023

This paper proposes a self-driven framework for LLM-based agents in open worlds.

Recommended citation: Y Feng, Y Wang, J Liu, S Zheng, Z Lu. "LLaMA Rider: Spurring Large Language Models to Explore the Open World." 2023 arXiv preprint. arXiv:2310.08922. https://arxiv.org/abs/2310.08922

POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World

Published in Proceedings of the 31st ACM International Conference on Multimedia, 2023

This paper propose a prompt-oriented view-agnostic learning framework for multi-view action understanding.

Recommended citation: B Xu, S Zheng, Q Jin. "POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World." 2023 Proceedings of the 31st ACM International Conference on Multimedia. 2807-2816. https://doi.org/10.1007/s00704-018-2617-z

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

This paper proposes an open-category pre-trained model for human-object interaction understanding.

Recommended citation: S Zheng, B Xu, Q Jin. "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework." 2023 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19392-19402. https://openaccess.thecvf.com/content/CVPR2023/papers/Zheng_Open-Category_Human-Object_Interaction_Pre-Training_via_Language_Modeling_Framework_CVPR_2023_paper.pdf

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

Published in arXiv preprint, 2023

This paper proposes an effective no-frill framework for temporal video grounding.

Recommended citation: Q Zhang, S Zheng, Q Jin. "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection." 2023 arXiv preprint. arXiv:2307.10567. https://arxiv.org/abs/2307.10567

Accommodating audio modality in CLIP for multimodal processing

Published in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

This paper proposes a large pretrained model for vision, text and audio modalities.

Recommended citation: L Ruan, A Hu, Y Song, L Zhang, S Zheng, Q Jin. "Accommodating audio modality in CLIP for multimodal processing." 2023 Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence. 7 - 14. https://ojs.aaai.org/index.php/AAAI/article/view/26153/25925

Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos

Published in 2023 IEEE International Conference on Consumer Electronics (ICCE), 2023

This paper proposes a anchor-based detection method for ego-centric natural language localization.

Recommended citation: B Liu, S Zheng, J Fu, WH Cheng. "Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos." 2023 IEEE International Conference on Consumer Electronics (ICCE). 01-04. https://ieeexplore.ieee.org/abstract/document/10043460/

Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning

Published in European Conference on Computer Vision, 2022

This paper proposes a hierarchical matching model for few-shot action recognition.

Recommended citation: S Zheng, S Chen, Q Jin. "Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning." 2022 European Conference on Computer Vision. 297-313. https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136640293.pdf

VRDFormer: End-to-end video visual relation detection with transformer

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

This paper proposes the first end2end framework for visual relation detection.

Recommended citation: S Zheng, S Chen, Q Jin. "VRDFormer: End-to-end video visual relation detection with transformer." 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18836-18846. https://openaccess.thecvf.com/content/CVPR2022/papers/Zheng_VRDFormer_End-to-End_Video_Visual_Relation_Detection_With_Transformers_CVPR_2022_paper.pdf

Exploring anchor-based detection for ego4d natural language query

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop 2022, 2022

This paper proposes an egocentric framework for natural language query.

Recommended citation: S Zheng, Q Zhang, B Liu, Q Jin, J Fu. "Exploring anchor-based detection for ego4d natural language query." 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop. https://arxiv.org/abs/2208.05375

Skeleton-based interactive graph network for human object interaction detection

Published in IEEE International Conference on Multimedia and Expo, 2020

This paper proposes a skeleton-based interactive graph network for human-object interaction.

Recommended citation: S Zheng, S Chen, Q Jin. "Skeleton-based interactive graph network for human object interaction detection." 2020 IEEE International Conference on Multimedia and Expo (ICME). 1-6. https://ieeexplore.ieee.org/document/9102755

Visual relation detection with multi-level attention

Published in Proceedings of the 27th ACM International Conference on Multimedia, 2019

This paper proposes a multi-level attention visual relation detection model for visual relation detection.

Recommended citation: S Zheng, S Chen, Q Jin. "Visual relation detection with multi-level attention." 2019 Proceedings of the 27th ACM International Conference on Multimedia. 121-129. https://dl.acm.org/doi/abs/10.1145/3343031.3350962

Relation understanding in videos

Published in Proceedings of the 27th ACM International Conference on Multimedia, 2019

This paper proposes a effective video relation detection model.

Recommended citation: S Zheng, X Chen, S Chen, Q Jin. "Relation understanding in videos." 2019 Proceedings of the 27th ACM International Conference on Multimedia. 2662-2666. https://dl.acm.org/doi/abs/10.1145/3343031.3356080

Sipeng Zheng

Publications