Publications

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

Published in arXiv preprint, 2024

This paper introduces EgoNCE++, a novel asymmetric contrastive objective for egocentric hand-object interaction (EgoHOI) learning, and proposes EgoHOIBench, an open-vocabulary benchmark that reveals the diminished performance of current egocentric video-language models (EgoVLMs) on fine-grained concepts.

Recommended citation: B Xu, Z Wang, Y Du, Z Song, S Zheng, Q Jin. "EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?" 2024 arXiv preprint. arXiv:2405.17719. https://arxiv.org/abs/2405.17719
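
As a rough illustration of what an asymmetric contrastive objective can look like, here is a minimal PyTorch sketch in which the video-to-text direction sees extra HOI-perturbed hard-negative captions while the text-to-video direction stays plain in-batch InfoNCE. This is a sketch under assumptions, not the paper's exact EgoNCE++ formulation; all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def asymmetric_nce(video_emb, text_emb, hard_neg_text_emb, temperature=0.07):
    """video_emb: (B, D), text_emb: (B, D) paired narrations,
    hard_neg_text_emb: (B, K, D) HOI-perturbed negative captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    n = F.normalize(hard_neg_text_emb, dim=-1)
    labels = torch.arange(v.size(0), device=v.device)

    # video -> text: in-batch texts plus K generated hard negatives per clip
    logits_hard = torch.einsum('bd,bkd->bk', v, n)            # (B, K)
    logits_v2t = torch.cat([v @ t.T, logits_hard], dim=1) / temperature
    loss_v2t = F.cross_entropy(logits_v2t, labels)

    # text -> video: kept as plain in-batch InfoNCE (the asymmetry)
    loss_t2v = F.cross_entropy(t @ v.T / temperature, labels)
    return 0.5 * (loss_v2t + loss_t2v)
```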

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Published in arXiv preprint, 2024

This paper proposes UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, textual, and potentially other types of signals.

Recommended citation: S Zheng, B Zhou, Y Feng, Y Wang, Z Lu. "UniCode: Learning a Unified Codebook for Multimodal Large Language Models" 2024 arXiv preprint. arXiv:2403.09072. https://arxiv.org/abs/2403.09072
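
The core idea of a unified codebook can be sketched as vector quantization: continuous features from any modality are snapped to their nearest code, so every modality shares one discrete token vocabulary. The sketch below assumes a simple nearest-neighbor VQ layer with a straight-through gradient; it is not UniCode's actual architecture, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class UnifiedCodebook(nn.Module):
    """One shared codebook for features from any modality: each continuous
    feature is snapped to its nearest code, yielding discrete token ids."""

    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, features):                         # (B, N, dim)
        flat = features.reshape(-1, features.size(-1))
        ids = torch.cdist(flat, self.codes.weight).argmin(dim=-1)
        ids = ids.view(features.shape[:-1])              # (B, N) token ids
        quantized = self.codes(ids)
        # straight-through estimator so gradients reach the feature encoder
        return ids, features + (quantized - features).detach()

vq = UnifiedCodebook()
ids, q = vq(torch.randn(2, 49, 256))   # e.g. 49 visual patches -> 49 token ids
```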

POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World

Published in Proceedings of the 31st ACM International Conference on Multimedia, 2023

This paper proposes a prompt-oriented view-agnostic learning framework for multi-view action understanding, targeting egocentric hand-object interaction.

Recommended citation: B Xu, S Zheng, Q Jin. "POV: Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World." 2023 Proceedings of the 31st ACM International Conference on Multimedia. 2807-2816.
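
Prompt-oriented tuning of a frozen backbone can be sketched as follows: a handful of learnable prompt tokens are prepended to the input sequence and are the only trained parameters. This is a generic visual prompt-tuning sketch under assumed shapes, not the paper's exact method.

```python
import torch
import torch.nn as nn

class ViewPrompts(nn.Module):
    """Sketch of prompt-oriented tuning: a frozen transformer backbone is
    adapted by prepending a few learnable prompt tokens to its input.
    Shapes assume a batch_first encoder; names are illustrative."""

    def __init__(self, backbone: nn.Module, num_prompts=8, dim=768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze everything but the prompts
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens):                 # tokens: (B, N, dim) patch features
        prompts = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))

# usage with a stand-in backbone (a real model would be pretrained)
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = ViewPrompts(nn.TransformerEncoder(layer, num_layers=2))
out = model(torch.randn(2, 16, 768))           # (2, 8 + 16, 768)
```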

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

This paper proposes an open-category pre-trained model for human-object interaction understanding, built on a language modeling framework.

Recommended citation: S Zheng, B Xu, Q Jin. "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework." 2023 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19392-19402. https://openaccess.thecvf.com/content/CVPR2023/papers/Zheng_Open-Category_Human-Object_Interaction_Pre-Training_via_Language_Modeling_Framework_CVPR_2023_paper.pdf
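
One way to read "language modeling framework" is Pix2Seq-style serialization: HOI triplets and boxes become token sequences that a language model learns to generate, so unseen ("open-category") verbs and objects are simply new text. A toy sketch under that assumption; the serialization format below is hypothetical, not the paper's actual one.

```python
# Hypothetical serialization of an HOI triplet into a token sequence;
# the paper's actual format may differ.
def hoi_to_sequence(subject, verb, obj, box_s, box_o):
    loc = lambda b: " ".join(f"<loc_{int(v * 100)}>" for v in b)  # coarse box bins
    return f"{subject} {loc(box_s)} {verb} {obj} {loc(box_o)}"

print(hoi_to_sequence("person", "riding", "bicycle",
                      (0.10, 0.20, 0.55, 0.90), (0.05, 0.50, 0.60, 0.95)))
```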

Accommodating Audio Modality in CLIP for Multimodal Processing

Published in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

This paper proposes a large pretrained model that accommodates the audio modality in CLIP, jointly covering vision, text, and audio.

Recommended citation: L Ruan, A Hu, Y Song, L Zhang, S Zheng, Q Jin. "Accommodating Audio Modality in CLIP for Multimodal Processing." 2023 Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence. 7-14. https://ojs.aaai.org/index.php/AAAI/article/view/26153/25925
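
A common way to accommodate a new modality in a CLIP-style space is to train a small audio encoder into the frozen joint embedding with a contrastive loss; the sketch below illustrates that general pattern. Module names and dimensions are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioHead(nn.Module):
    """Projects pooled audio (e.g. spectrogram) features into a frozen
    CLIP-style joint embedding space."""

    def __init__(self, in_dim=128, clip_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, clip_dim), nn.GELU(), nn.Linear(clip_dim, clip_dim)
        )

    def forward(self, audio_feats):                  # (B, in_dim)
        return F.normalize(self.proj(audio_feats), dim=-1)

def align_loss(audio_emb, text_emb, temperature=0.07):
    # InfoNCE pulling each audio clip toward its paired caption embedding
    logits = audio_emb @ F.normalize(text_emb, dim=-1).T / temperature
    labels = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return F.cross_entropy(logits, labels)
```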

Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos

Published in 2023 IEEE International Conference on Consumer Electronics (ICCE), 2023

This paper proposes an anchor-based detection method for natural language localization in egocentric videos.

Recommended citation: B Liu, S Zheng, J Fu, WH Cheng. "Anchor-Based Detection for Natural Language Localization in Ego-Centric Videos." 2023 IEEE International Conference on Consumer Electronics (ICCE). 01-04. https://ieeexplore.ieee.org/abstract/document/10043460/
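
Anchor-based temporal localization can be sketched in two steps: tile fixed-length candidate windows over the video, then score each window's pooled feature against the query embedding and keep the top-scoring span. The helper names, scales, and feature shapes below are hypothetical.

```python
import torch

def make_anchors(num_frames, scales=(8, 16, 32), stride=4):
    """Tile fixed-length candidate windows (anchors) over the timeline."""
    anchors = []
    for s in scales:
        for center in range(0, num_frames, stride):
            anchors.append((max(0.0, center - s / 2),
                            min(float(num_frames), center + s / 2)))
    return torch.tensor(anchors)                # (A, 2) start/end in frames

def score_anchors(anchor_feats, query_feat):
    """Dot-product score between each anchor's pooled clip feature
    and the sentence embedding of the natural-language query."""
    return anchor_feats @ query_feat            # (A,)

anchors = make_anchors(num_frames=128)          # (96, 2) with these scales
scores = score_anchors(torch.randn(len(anchors), 256), torch.randn(256))
best = anchors[scores.argmax()]                 # predicted (start, end) span
```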

VRDFormer: End-to-End Video Visual Relation Detection with Transformers

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

This paper proposes the first end-to-end framework for video visual relation detection.

Recommended citation: S Zheng, S Chen, Q Jin. "VRDFormer: End-to-End Video Visual Relation Detection with Transformers." 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18836-18846. https://openaccess.thecvf.com/content/CVPR2022/papers/Zheng_VRDFormer_End-to-End_Video_Visual_Relation_Detection_With_Transformers_CVPR_2022_paper.pdf
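
A minimal sketch of end-to-end, query-based relation detection in the DETR style: a fixed set of learned queries attends to encoded frame features, and each query is decoded into subject box, object box, and predicate logits. Dimensions and head layout here are assumptions, not VRDFormer's exact design.

```python
import torch
import torch.nn as nn

class RelationQueries(nn.Module):
    """DETR-style relation decoding: each learned query is read out as a
    (subject box, object box, predicate) triplet. Illustrative dimensions."""

    def __init__(self, dim=256, num_queries=100, num_predicates=132):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.subj_box = nn.Linear(dim, 4)     # subject box (cx, cy, w, h)
        self.obj_box = nn.Linear(dim, 4)      # object box
        self.predicate = nn.Linear(dim, num_predicates)

    def forward(self, frame_feats):           # (B, N, dim) encoded frame tokens
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        h = self.decoder(q, frame_feats)      # (B, num_queries, dim)
        return self.subj_box(h), self.obj_box(h), self.predicate(h)
```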

Exploring Anchor-based Detection for Ego4D Natural Language Query

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop 2022, 2022

This paper proposes an anchor-based detection framework for the Ego4D natural language query task in egocentric videos.

Recommended citation: S Zheng, Q Zhang, B Liu, Q Jin, J Fu. "Exploring Anchor-based Detection for Ego4D Natural Language Query." 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop. https://arxiv.org/abs/2208.05375