Bei Liu
Senior Researcher, Microsoft Research Asia, Beijing.
Beijing, China
Microsoft Research Asia
Visual Computing Group
My research focuses on Multimodal AI, Document Understanding, and AI Agents. I also serve as a Guest Associate Professor at Nagoya University in Japan. Before joining Microsoft, I earned my Ph.D. and Master's degrees from Kyoto University, Japan, under the guidance of Professors Katsumi Tanaka, Masatoshi Yoshikawa, and Makoto P. Kato. I hold a Bachelor's degree from Nanjing University, China.
My current interest is in enabling agents that actively read, navigate, and reason over complex documents, combining perception, planning, and tool use.
I am open to research collaboration, academic visits, and supervising interns working on multimodal agents. Feel free to reach out!
News
| Date | News |
|---|---|
| Jan 1, 2025 | One paper accepted to MMM 2025, awarded Best Paper! |
| Dec 1, 2024 | One paper accepted to ACM MMAsia 2024, awarded Best Student Paper Runner-Up. |
Selected Publications
- [MMM] RoLD: Robot Latent Diffusion for Multi-task Policy Modeling. In International Conference on Multimedia Modeling (MMM), 2025. Best Paper Award, MMM 2025. Cited by 2 (updated: 2026-03-17).
- [MM Asia] ViCo: Engaging Video Comment Generation with Human Preference Rewards. In Proceedings of ACM Multimedia Asia (MM Asia), 2024. Best Student Paper Runner-Up, ACM MM Asia 2024. Cited by 3 (updated: 2026-03-17).
- [ICLR] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. In International Conference on Learning Representations (ICLR), 2023. Cited by 258 (updated: 2026-03-17).
- [CVPR] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Cited by 306 (updated: 2026-03-17).
- [CVPR] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. Oral. Cited by 346 (updated: 2026-03-17).
- [NeurIPS] Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2021. Cited by 105 (updated: 2026-03-17).
- [ACM MM] Unifying Multimodal Transformer for Bi-directional Image and Text Generation. In Proceedings of the 29th ACM International Conference on Multimedia (MM), 2021. Cited by 73 (updated: 2026-03-17).
- [ACM MM] Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training. In Proceedings of the 26th ACM International Conference on Multimedia (MM), 2018. Best Paper Award, ACM Multimedia 2018. Cited by 111 (updated: 2026-03-17).
Awards & Honors
| Date | Award |
|---|---|
| 2025.1 | International Conference on Multimedia Modeling (MMM) 2025 – Best Paper Award |
| 2024.12 | ACM Multimedia Asia 2024 – Best Student Paper Runner-Up |
| 2022.7 | China Multimedia Company Innovation Technology Award |
| 2020 | IEEE Transactions on Multimedia – Outstanding Reviewer Award |
| 2019.6 | CVPR 2019, ActivityNet Challenge, ActivityNet Captions Track – 1st Place |
| 2019.5 | CVPR 2019, VQA and Dialog Workshop, VQA Challenge – 2nd Place |
| 2018.10 | ACM Multimedia 2018 – Best Paper Award |
| 2018.7 | FashionAI Challenge, Attribute Recognition Task (Alibaba) – 3rd Place / 2950 teams |
| 2014–2016 | Asian Future Leaders Scholarship – Fellowship funded by the Bai Xian Education Foundation |