
Hi! I'm Boyuan Chen.

AI Researcher and Roboticist at MIT EECS.
Specializing in machine learning, generative models, reinforcement learning, and robotics.


About Me

I'm Boyuan Chen (陈博远), an AI researcher and roboticist at MIT. I am currently a fourth-year PhD student working with Prof. Russ Tedrake and Prof. Vincent Sitzmann. I am interested in model-based reinforcement learning, generative world models, and robotics. I hope to leverage video world models trained on internet-scale data as planners for general-purpose robots, replicating the success of LLMs in the visual world and, eventually, solving robotics.

Previously, I interned at Google DeepMind and Google X. I obtained my bachelor's degree in computer science and math at UC Berkeley, where I spent a significant amount of time doing research at Berkeley Artificial Intelligence Research (BAIR) on deep reinforcement learning and unsupervised learning. I also spent a year studying philosophy during my undergrad. I am a big fan of chess, robots, and boba.

Resume / CV

My Research

History-Guided Video Diffusion
Kiwhan Song*, Boyuan Chen*, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann
* Equal contribution
arXiv 2025

website | paper | abstract | bibtex
@misc{song2025historyguidedvideodiffusion,
  title={History-Guided Video Diffusion}, 
  author={Kiwhan Song and Boyuan Chen and Max Simchowitz and Yilun Du and Russ Tedrake and Vincent Sitzmann},
  year={2025},
  eprint={2502.06764},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.06764}, 
}
              

Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos.
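The CFG combination rule the abstract builds on is standard and fits in a few lines. The sketch below is a generic illustration of classifier-free guidance, not code from the paper; `cfg_combine` and the toy arrays are made up.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate the conditional noise
    prediction away from the unconditional one by guidance weight w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions for a single latent.
eps_u = np.array([0.1, -0.2])
eps_c = np.array([0.3, 0.0])
guided = cfg_combine(eps_u, eps_c, 2.0)  # w > 1 pushes past the conditional prediction
```

With w = 0 this reduces to the unconditional prediction and with w = 1 to the conditional one; History Guidance applies the same extrapolation idea, with the conditioning signal being a variable-length frame history rather than a class label or text prompt.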

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
NeurIPS 2024 (Conference on Neural Information Processing Systems)

website | paper | abstract | bibtex
@article{chen2025diffusion,
  title={Diffusion forcing: Next-token prediction meets full-sequence diffusion},
  author={Chen, Boyuan and Mart{\'\i} Mons{\'o}, Diego and Du, Yilun and Simchowitz, Max and Tedrake, Russ and Sitzmann, Vincent},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={24081--24125},
  year={2025}
}
              

This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.
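The core idea of independent per-token noise levels can be sketched as follows. This is an illustrative toy, assuming a made-up linear noise schedule and shapes; it is not the paper's implementation.

```python
import numpy as np

def noise_with_per_token_levels(tokens, num_levels, rng):
    """Corrupt each token of a sequence with its own independently
    sampled noise level (a toy sketch of Diffusion Forcing's noising;
    the linear schedule below is illustrative, not the paper's)."""
    T = tokens.shape[0]
    k = rng.integers(0, num_levels, size=T)        # independent level per token
    alpha = 1.0 - k / (num_levels - 1)             # 1.0 = clean, 0.0 = pure noise
    eps = rng.standard_normal(tokens.shape)
    noisy = np.sqrt(alpha)[:, None] * tokens + np.sqrt(1.0 - alpha)[:, None] * eps
    return noisy, k, eps

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))               # 8 tokens, 4-dim each
noisy, k, eps = noise_with_per_token_levels(tokens, num_levels=10, rng=rng)
```

A denoiser would then be trained to predict the noise given the corrupted sequence and the per-token levels k; at sampling time, past tokens can be kept near-clean while future ones start from pure noise, which is what enables causal rollouts beyond the training horizon.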

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia
CVPR 2024 (Conference on Computer Vision and Pattern Recognition)

website | paper | abstract | bibtex
@InProceedings{Chen_2024_CVPR,
    author    = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
    title     = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14455-14465}
}
              

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability.

DittoGym: Learning to Control Soft Shape-Shifting Robots
Suning Huang, Boyuan Chen, Huazhe Xu, Vincent Sitzmann
ICLR 2024 (International Conference on Learning Representations)

website | paper | abstract | bibtex
@misc{huang2024dittogym,
  title={DittoGym: Learning to Control Soft Shape-Shifting Robots}, 
  author={Suning Huang and Boyuan Chen and Huazhe Xu and Vincent Sitzmann},
  year={2024},
  eprint={2401.13231},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}
              

Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish the tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm.

Self-Supervised Reinforcement Learning that Transfers using Random Features
Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, Abhishek Gupta
NeurIPS 2023 (Conference on Neural Information Processing Systems)

website | paper | abstract | bibtex
@article{chen2024self,
  title={Self-supervised reinforcement learning that transfers using random features},
  author={Chen, Boyuan and Zhu, Chuning and Agrawal, Pulkit and Zhang, Kaiqing and Gupta, Abhishek},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
              

Reinforcement learning (RL) algorithms have the potential not only for synthesizing complex control behaviors, but also for transfer across tasks. Model-free RL excels in solving problems with high-dimensional observations or long horizons, but the learned policies do not transfer across different reward functions. Model-based RL, on the other hand, naturally enables transfer across different reward functions, but struggles in complex environments due to compounding error. In this work, we propose a new method for transferring behaviors across tasks with different rewards, combining the performance of model-free RL with the transferability of model-based RL. In particular, we show how model-free RL using a number of random features as the reward allows for implicit modeling of long-horizon environment dynamics. Model-predictive control using these implicit models enables fast adaptation to problems with new reward functions while avoiding the compounding error from model rollouts. Our method can be trained on offline datasets without reward labels, and quickly deployed on new tasks, making it more widely applicable than typical RL methods. We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains.
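The random features the abstract refers to can be illustrated with random Fourier-style features of the state. This is a generic sketch under assumed dimensions and names, not the paper's code: each output dimension of `phi` plays the role of one surrogate reward function.

```python
import numpy as np

def make_random_reward_features(state_dim, num_features, rng):
    """Draw a fixed random projection once; each output dimension then
    acts as one surrogate 'reward function' (illustrative sketch)."""
    W = rng.standard_normal((state_dim, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def phi(states):
        # Random Fourier-style features, bounded in [-1, 1].
        return np.cos(states @ W + b)

    return phi

rng = np.random.default_rng(0)
phi = make_random_reward_features(state_dim=3, num_features=16, rng=rng)
feats = phi(np.zeros((5, 3)))   # 5 states -> a 5 x 16 matrix of surrogate rewards
```

Value functions trained against such random surrogate rewards implicitly summarize long-horizon dynamics; a new task reward can then be regressed onto the features and planned over with model-predictive control, which is the transfer mechanism the abstract describes.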

Open-vocabulary Queryable Scene Representations for Real World Planning
Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S. Ryoo, Austin Stone, Daniel Kappler
ICRA 2023 (International Conference on Robotics and Automation)

website | paper | abstract | bibtex | talk video
@inproceedings{chen2023open,
  title={Open-vocabulary queryable scene representations for real world planning},
  author={Chen, Boyuan and Xia, Fei and Ichter, Brian and Rao, Kanishka and Gopalakrishnan, Keerthana and Ryoo, Michael S and Stone, Austin and Kappler, Daniel},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={11509--11522},
  year={2023},
  organization={IEEE}
}
              

Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate contextual information into LLM planners, allowing them to see and query available objects in the scene before generating a context-conditioned plan. NLMap first establishes a natural language queryable scene representation with Visual Language models (VLMs). An LLM based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. NLMap allows robots to operate without a fixed list of objects nor executable options, enabling real robot operation unachievable by previous methods.

Unsupervised Learning of Visual 3D Keypoints for Control
Boyuan Chen, Pieter Abbeel, Deepak Pathak
ICML 2021 (International Conference on Machine Learning)

website | paper | abstract | bibtex | code | talk video
@inproceedings{chen2021unsupervised,
  title={Unsupervised learning of visual 3d keypoints for control},
  author={Chen, Boyuan and Abbeel, Pieter and Pathak, Deepak},
  booktitle={International Conference on Machine Learning},
  pages={1539--1549},
  year={2021},
  organization={PMLR}
}
              

Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations. Prior works show that structured latent spaces such as visual keypoints often outperform unstructured representations for robotic control. However, most of these representations, whether structured or unstructured, are learned in a 2D space even though the control tasks are usually performed in a 3D environment. In this work, we propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. The input images are embedded into latent 3D keypoints via a differentiable encoder which is trained to optimize both a multi-view consistency loss and downstream task objective. These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space. The proposed approach outperforms prior state-of-the-art methods across a variety of reinforcement learning benchmarks.

Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation
Boyuan Chen*, Huazhe Xu*, Yang Gao and Trevor Darrell
ICRA 2021 (International Conference on Robotics and Automation)

website | paper | abstract | bibtex | code
@inproceedings{xu2021zero,
  title={Zero-shot policy learning with spatial temporal reward decomposition on contingency-aware observation},
  author={Xu, Huazhe and Chen, Boyuan and Gao, Yang and Darrell, Trevor},
  booktitle={2021 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={10786--10792},
  year={2021},
  organization={IEEE}
}
              

It is a long-standing challenge to enable an intelligent agent to learn in one environment and generalize to an unseen environment without further data collection and finetuning. In this paper, we consider a zero-shot generalization problem setup that complies with biological intelligent agents' learning and generalization processes. The agent is first presented with previous experiences in the training environment, along with a task description in the form of trajectory-level sparse rewards. Later, when it is placed in the new testing environment, it is asked to perform the task without any interaction with the testing environment. We find this setting natural for biological creatures and, at the same time, challenging for previous methods. Behavior cloning, state-of-the-art RL, and other zero-shot learning methods perform poorly on this benchmark. Given a set of experiences in the training environment, our method learns a neural function that decomposes the sparse reward into particular regions in a contingency-aware observation as a per-step reward. Based on such decomposed rewards, we further learn a dynamics model and use Model Predictive Control (MPC) to obtain a policy. Since the rewards are decomposed into finer-granularity observations, they are naturally generalizable to new environments that are composed of similar basic elements. We demonstrate our method on a wide range of environments, including a classic video game -- Super Mario Bros, as well as a robotic continuous control task. Please refer to the project page for more visualized results.

Discovering Diverse Multi-agent Strategic Behavior via Reward Randomization
Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu
ICLR 2021 (International Conference on Learning Representations)

website | paper | abstract | bibtex | code
@misc{tang2021discovering,
    title={Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization}, 
    author={Zhenggang Tang and Chao Yu and Boyuan Chen and Huazhe Xu and Xiaolong Wang and Fei Fang and Simon Du and Yu Wang and Yi Wu},
    year={2021},
    eprint={2103.04564},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
                            

We propose a simple, general, and effective technique, Reward Randomization, for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games (MonsterHunt and Escalation) and a real-world web game, Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms always converge to a fixed one with a sub-optimal payoff for every player, even using state-of-the-art exploration techniques (including RND, DIAYN, MAVEN). Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy from the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents.

MISC

  • Robots
  • Cooking
  • Teams
Robomooc Robotics Kit

I designed it with my friend, Kinsky. We sold it as an education kit to schools. You can ride on it!

RoboMaster ICRA Challenge

DJI RoboMaster robot for the ICRA AI Challenge. During my undergrad, I was the captain of the team, leading the development of autonomous algorithms for the robot shooting challenge.

Autonomous Bogie Rover

My personal robot that can handle a variety of terrains. I did everything from mechanical design and electronics to programming. It uses computer vision to autonomously follow me and avoid obstacles.

FRC 2017 Robot

In 2017, I founded my high school's first FRC team. We didn't have the mentorship or funding we needed, but the team did amazing work. I did the majority of the design.

PR2 in RLL

In 2021, I graduated from UC Berkeley, where I spent some amazing time doing research in the Robot Learning Lab.

Autonomous Drone

An autonomous drone that I built and coded. I installed a camera and a mini railgun on it to track and aim at a target I select.

FTC 2017 Robot

Our FTC competition robot in 2017, when I became the captain of the team. It's my team's first robot designed with CAD. The robot won the East China regional.

My first ftc robot

In 2016, I participated in a robotics competition for the first time. This super cool robot marks the beginning of my robotics journey.

FRC 2018 Robot

After my graduation from high school, I continued mentoring the team. My successor Xinpei designed the robot under my mentorship.

Robomaster Team

In 2019, I was the captain of Berkeley's team in the ICRA RoboMaster AI Challenge. I co-founded the team and led 20 students in developing autonomous robots.

MIT Chess Club

I became one of the execs at the MIT Chess Club in 2022. It was a great time organizing events and hanging out with the team!

FRC Team

In 2017, I founded my high school's first FRC team. I worked as both captain and mentor. We won the Rookie All Star Award at CRC 2017.

Chinese New Year 2022

To celebrate Chinese New Year 2022, I made a big dinner with my friend Maohao Shen at MIT. MITCSSA awarded me the title Master Chef MIT for my Peking duck in their cooking competition.

Home-Style Noodles with Braised Chicken
(黄焖鸡面)

I cooked 黄焖鸡 (braised chicken) during the COVID-19 quarantine!

Soy sauce braised pork

I made Dongpo pork (东坡肉) during Thanksgiving 2023. The best soy-sauce braised pork I've ever made!

Chicken Soup with Mushroom
(松茸鸡汤)

Traditional Chinese chicken soup with dried matsutake mushroom.

Chinese New Year 2023

I cooked five dishes for the 2023 Chinese New Year, and all of them were amazing: steamed eel, eggplant with minced meat, soy-sauce braised pork with bamboo shoots, Chinese chicken soup with bamboo mushroom, and stir-fried Chinese chives.

Birthday Noodle (长寿面)

I made my roommate and long-time friend Haoyuan a bowl of traditional birthday noodles in 2021, when he turned 23.

Potato Braised Beef Brisket

I cooked beef brisket (土豆炖排骨) during the COVID-19 lockdown.

Lamb Croutons

During the COVID-19 pandemic, I tried to make Lamb Croutons following Gordon Ramsay's tutorial.

XO sauce Tofu Stew with mushrooms

Tofu stew cooked with various mushrooms and XO sauce. The umami flavor bursts in your mouth; my friends finished it all within two minutes.

Blog

Best computer science schools ranked by boba

Many people underestimate the importance of boba when they choose grad school. For those who don't know what boba is ...

Read More 20/Jun/2022

Essay: How Should We Think About Embodied AI

Large models, exemplified by ChatGPT, have given us a glimpse of the future. Over the past year, robot foundation models have appeared in nearly every robotics company's slide deck. So, will the approach behind large language models bring us general-purpose robots? ...

Read More 16/Jun/2024

The epic quest to Embodied AI and general-purpose robots

ChatGPT has given us a glimpse of the future. So, will the same bring us general-purpose robots? ...

Read More 16/Jun/2024
View More

Boyuan Chen © 2022. All Rights Reserved.