Bei Liu
Senior Researcher, Microsoft Research Asia, Beijing.
Beijing, China
Microsoft Research Asia
Visual Computing Group
My research focuses on Multimodal AI, Document Understanding, and AI Agents. I also serve as a Guest Associate Professor at Nagoya University in Japan. Before joining Microsoft, I earned my Ph.D. and Master's degrees from Kyoto University, Japan, under the guidance of Professors Katsumi Tanaka, Masatoshi Yoshikawa, and Makoto P. Kato. I hold a Bachelor's degree from Nanjing University, China.
My current interest is in enabling agents that actively read, navigate, and reason over complex documents, combining perception, planning, and tool use.
I am open to research collaboration, academic visits, and supervising interns working on multimodal agents. Feel free to reach out!
News
| Date | News |
|---|---|
| Jan 1, 2025 | One paper accepted to MMM 2025, awarded Best Paper! |
| Dec 1, 2024 | One paper accepted to ACM MMAsia 2024, awarded Best Student Paper Runner-Up. |
Selected Publications
- [MMM] RoLD: Robot Latent Diffusion for Multi-task Policy Modeling. In International Conference on Multimedia Modeling (MMM), 2025. Best Paper Award, MMM 2025. Cited by 2 (updated: 2026-03-17).
- [MM Asia] ViCo: Engaging Video Comment Generation with Human Preference Rewards. In Proceedings of ACM Multimedia Asia (MM Asia), 2024. Best Student Paper Runner-Up, ACM MM Asia 2024. Cited by 3 (updated: 2026-03-17).
- [ICLR] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. In International Conference on Learning Representations (ICLR), 2023. Cited by 258 (updated: 2026-03-17).
- [CVPR] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Cited by 306 (updated: 2026-03-17).
- [CVPR] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. Oral. Cited by 346 (updated: 2026-03-17).
- [NeurIPS] Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2021. Cited by 105 (updated: 2026-03-17).
- [ACM MM] Unifying Multimodal Transformer for Bi-directional Image and Text Generation. In Proceedings of the 29th ACM International Conference on Multimedia (MM), 2021. Cited by 73 (updated: 2026-03-17).
- [ACM MM] Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training. In Proceedings of the 26th ACM International Conference on Multimedia (MM), 2018. Best Paper Award, ACM Multimedia 2018. Cited by 111 (updated: 2026-03-17).
Awards & Honors
| Date | Award |
|---|---|
| 2025.1 | International Conference on Multimedia Modeling (MMM) 2025 – Best Paper Award |
| 2024.12 | ACM Multimedia Asia 2024 – Best Student Paper Runner-Up |
| 2022.7 | China Multimedia Company Innovation Technology Award |
| 2020 | IEEE Transactions on Multimedia – Outstanding Reviewer Award |
| 2019.6 | CVPR 2019, ActivityNet Challenge, ActivityNet Captions Track – 1st Place |
| 2019.5 | CVPR 2019, VQA and Dialog Workshop, VQA Challenge – 2nd Place |
| 2018.10 | ACM Multimedia 2018 – Best Paper Award |
| 2018.7 | FashionAI Challenge, Attribute Recognition Task (Alibaba) – 3rd Place / 2950 teams |
| 2014–2016 | Asian Future Leaders Scholarship – Fellowship funded by the Bai Xian Education Foundation |