As computer vision researchers, we believe that every pixel can tell a story. However, there seems to be a writer’s block settling into the field when it comes to dealing with large images. Large images are no longer rare: the cameras we carry in our pockets and those orbiting our planet snap pictures so big and detailed that they stretch our current best models and hardware to their breaking points. Generally, memory usage grows quadratically as a function of image size.
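To make that quadratic blow-up concrete, here is a rough back-of-the-envelope sketch of how large a plain vision transformer’s attention score matrix becomes as the image side length grows. The 16-pixel patch size and fp16 scores are hypothetical choices for illustration, not measurements from the paper.

```python
# Back-of-the-envelope illustration (hypothetical settings, not from the paper):
# size of a single N x N self-attention score matrix for a plain ViT,
# assuming 16-pixel patches and 2-byte (fp16) entries.
def attention_matrix_gib(side_px: int, patch: int = 16, bytes_per_el: int = 2) -> float:
    tokens = (side_px // patch) ** 2                # number of patch tokens
    return tokens ** 2 * bytes_per_el / 1024 ** 3   # GiB for one score matrix

for side in (512, 2048, 8192):
    print(f"{side:>5} px -> {attention_matrix_gib(side):8.3f} GiB per attention matrix")
#   512 px ->    0.002 GiB
#  2048 px ->    0.500 GiB
#  8192 px ->  128.000 GiB   (per head, per layer, before activations or gradients)
```

Doubling the side length quadruples the token count and multiplies the attention matrix by sixteen, which is why down-sampling and cropping have been the default escape hatches.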
Today, we make one of two sub-optimal choices when handling large images: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. We take another look at these approaches and introduce $x$T, a new framework to model large images end-to-end on contemporary GPUs while effectively aggregating global context with local details.
Architecture for the $x$T framework.
Why Bother with Big Images Anyway?
Why bother handling large images anyway? Picture yourself in front of your TV, watching your favorite football team. The field is dotted with players, but the action occurs on only a small portion of the screen at a time. Would you be satisfied, however, if you could only see a small region around where the ball currently was? Alternatively, would you be satisfied watching the game in low resolution? Every pixel tells a story, no matter how far apart they are. This is true in all domains, from your TV screen to a pathologist viewing a gigapixel slide to diagnose tiny patches of cancer. These images are treasure troves of information. If we can’t fully explore the wealth because our tools can’t handle the map, what’s the point?
Sports are fun when you know what's going on.
That’s precisely where the frustration lies today. The bigger the image, the more we need to simultaneously zoom out to see the whole picture and zoom in for the nitty-gritty details, making it a challenge to grasp both the forest and the trees at once. Most current methods force a choice between losing sight of the forest or missing the trees, and neither option is great.
How $x$T Tries to Fix This
Imagine trying to solve a massive jigsaw puzzle. Instead of tackling the whole thing at once, which would be overwhelming, you start with smaller sections, get a good look at each piece, and then figure out how they fit into the bigger picture. That’s basically what we do with large images with $x$T.
$x$T takes these gigantic images and chops them into smaller, more digestible pieces hierarchically. This isn’t just about making things smaller, though. It’s about understanding each piece in its own right and then, using some clever techniques, figuring out how these pieces connect on a larger scale. It’s like having a conversation with each part of the image, learning its story, and then sharing those stories with the other parts to get the full narrative.
Nested Tokenization
At the core of $x$T lies the concept of nested tokenization. In simple terms, tokenization in the realm of computer vision is akin to chopping up an image into pieces (tokens) that a model can digest and analyze. However, $x$T takes this a step further by introducing a hierarchy into the process—hence, nested.
Imagine you’re tasked with analyzing a detailed city map. Instead of trying to take in the entire map at once, you break it down into districts, then neighborhoods within those districts, and finally, streets within those neighborhoods. This hierarchical breakdown makes it easier to manage and understand the details of the map while keeping track of where everything fits in the larger picture. That’s the essence of nested tokenization: we split an image into regions, each of which can be split into further sub-regions depending on the input size expected by a vision backbone (what we call a region encoder), before being patchified to be processed by that region encoder. This nested approach allows us to extract features at different scales on a local level.
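As a rough sketch of the idea, nested tokenization can be thought of as two levels of non-overlapping unfolding: regions first, then patches within each region. The 256-pixel regions and 16-pixel patches below are made-up sizes for illustration; the released code is the authoritative implementation.

```python
import torch

def nested_tokenize(image: torch.Tensor, region: int = 256, patch: int = 16) -> torch.Tensor:
    """Toy sketch of nested tokenization (hypothetical sizes, not the released xT code).

    Splits an image (C, H, W) into non-overlapping regions, then splits each region
    into patches that a region encoder could consume. Returns a tensor of shape
    (num_regions, patches_per_region, C * patch * patch).
    """
    c, h, w = image.shape
    assert h % region == 0 and w % region == 0, "pad the image so regions tile it evenly"
    # First level: carve the image into regions.
    regions = image.unfold(1, region, region).unfold(2, region, region)
    regions = regions.permute(1, 2, 0, 3, 4).reshape(-1, c, region, region)
    # Second level: patchify each region independently.
    patches = regions.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(regions.shape[0], -1, c * patch * patch)
    return patches

tokens = nested_tokenize(torch.randn(3, 1024, 1024))
print(tokens.shape)  # torch.Size([16, 256, 768]): 16 regions, 256 patches each
```

Each region’s patches then go through the region encoder on their own, so the local step never has to hold the entire image’s tokens in memory at once.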
Coordinating Region and Context Encoders
Once an image is neatly divided into tokens, $x$T employs two types of encoders to make sense of these pieces: the region encoder and the context encoder. Each plays a distinct role in piecing together the image’s full story.
The region encoder is a standalone “local expert” which converts independent regions into detailed representations. However, since each region is processed in isolation, no information is shared across the image at large. The region encoder can be any state-of-the-art vision backbone. In our experiments, we have used hierarchical vision transformers such as Swin and Hiera, as well as CNNs such as ConvNeXt!
Enter the context encoder, the big-picture guru. Its job is to take the detailed representations from the region encoders and stitch them together, ensuring that the insights from one token are considered in the context of the others. The context encoder is generally a long-sequence model. We experiment with Transformer-XL (and our variant of it called Hyper) and Mamba, though you could use Longformer and other new advances in this area. Even though these long-sequence models are generally made for language, we demonstrate that it is possible to use them effectively for vision tasks.
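To make the division of labor concrete, here is a minimal sketch of the region-then-context pattern. Everything in it is a simplified stand-in: `region_encoder` could be any backbone that maps a region to a feature vector (Swin, Hiera, ConvNeXt), and a vanilla Transformer encoder plays the role of the long-sequence context model purely for illustration; consult our released code for the actual architecture.

```python
import torch
import torch.nn as nn

class RegionThenContext(nn.Module):
    """Simplified sketch of the two-encoder pattern (not the released xT code)."""

    def __init__(self, region_encoder: nn.Module, dim: int = 768, depth: int = 4):
        super().__init__()
        self.region_encoder = region_encoder  # "local expert": any vision backbone
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=depth)  # "big-picture guru"

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, C, H, W)
        b, r = regions.shape[:2]
        # 1) Encode every region independently; no information is shared yet.
        local = self.region_encoder(regions.flatten(0, 1))  # (b * r, dim)
        # 2) Treat the region features as one long sequence so they can exchange context.
        return self.context_encoder(local.view(b, r, -1))   # (b, r, dim)
```

The key design choice is that the expensive, high-resolution work happens per region, while only the much shorter sequence of region features has to be processed jointly.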
The magic of $x$T is in how these components (the nested tokenization, region encoders, and context encoders) come together. By first breaking down the image into manageable pieces and then systematically analyzing these pieces both in isolation and in conjunction, $x$T manages to maintain the fidelity of the original image’s details while also integrating long-distance, overarching context, all while fitting massive images, end-to-end, on contemporary GPUs.
Results
We evaluate $x$T on challenging benchmark tasks that range from well-established computer vision baselines to rigorous large-image tasks. In particular, we experiment with iNaturalist 2018 for fine-grained species classification, xView3-SAR for context-dependent segmentation, and MS-COCO for detection.
Powerful vision models used with $x$T set a new frontier on downstream tasks such as fine-grained species classification.
Our experiments show that $x$T achieves higher accuracy on all downstream tasks with fewer parameters while using much less memory per region than state-of-the-art baselines*. We are able to model images as large as 29,000 x 25,000 pixels on 40GB A100s, while comparable baselines run out of memory at only 2,800 x 2,800 pixels.
*Depending on your choice of context model, such as Transformer-XL.
Why This Matters More Than You Think
This approach isn’t just cool; it’s necessary. For scientists tracking climate change or doctors diagnosing diseases, it’s a game-changer. It means creating models which understand the full story, not just bits and pieces. In environmental monitoring, for example, being able to see both the broader changes over vast landscapes and the details of specific areas can help in understanding the bigger picture of climate impact. In healthcare, it could mean the difference between catching a disease early or not.
We are not claiming to have solved all the world’s problems in one go. We are hoping that with $x$T we have opened the door to what’s possible. We’re stepping into a new era where we don’t have to compromise on the clarity or breadth of our vision. $x$T is our big leap towards models that can juggle the intricacies of large-scale images without breaking a sweat.
There’s a lot more ground to cover. Research will evolve, and hopefully, so will our ability to process even bigger and more complex images. In fact, we are working on follow-ons to $x$T which will expand this frontier further.
In Conclusion
For a complete treatment of this work, please check out the paper on arXiv. The project page contains a link to our released code and weights. If you find the work useful, please cite it as below:
@article{xTLargeImageModeling,
  title={xT: Nested Tokenization for Larger Context in Large Images},
  author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
  journal={arXiv preprint arXiv:2403.01915},
  year={2024}
}
Every year, the Berkeley Artificial Intelligence Research (BAIR) Lab graduates some of the most talented and innovative minds in artificial intelligence and machine learning. Our Ph.D. graduates have each expanded the frontiers of AI research and are now ready to embark on new adventures in academia, industry, and beyond.
These fantastic individuals bring with them a wealth of knowledge, fresh ideas, and a drive to continue contributing to the advancement of AI. Their work at BAIR, ranging from deep learning, robotics, and natural language processing to computer vision, security, and much more, has contributed significantly to their fields and has had transformative impacts on society.
This website is dedicated to showcasing our colleagues, making it easier for academic institutions, research organizations, and industry leaders to discover and recruit from the newest generation of AI pioneers. Here, you’ll find detailed profiles, research interests, and contact information for each of our graduates. We invite you to explore the potential collaborations and opportunities these graduates present as they seek to apply their expertise and insights in new environments.
Join us in celebrating the achievements of BAIR’s latest PhD graduates. Their journey is just beginning, and the future they will help build is bright!
Thank you to our friends at the Stanford AI Lab for this idea!
Abdus Salam Azad
Email: salam_azad@berkeley.edu
Website: https://www.azadsalam.org/
Advisor(s): Ion Stoica
Research Blurb: My research interest lies broadly in the field of Machine Learning and Artificial Intelligence. During my PhD I have focused on Environment Generation/Curriculum Learning methods for training Autonomous Agents with Reinforcement Learning. Specifically, I work on methods that algorithmically generate diverse training environments (i.e., learning scenarios) for autonomous agents to improve generalization and sample efficiency. Currently, I am working on Large Language Model (LLM) based autonomous agents.
Jobs Interested In: Research Scientist, ML Engineer
Alicia Tsai
Email: aliciatsai@berkeley.edu
Website: https://www.aliciatsai.com/
Advisor(s): Laurent El Ghaoui
Research Blurb: My research delves into the theoretical aspects of deep implicit models, beginning with a unified "state-space" representation that simplifies notation. Additionally, my work explores various training challenges associated with deep learning, including problems amenable to convex and non-convex optimization. In addition to theoretical exploration, my research extends the potential applications to various problem domains, including natural language processing, and natural science.
Jobs Interested In: Research Scientist, Applied Scientist, Machine Learning Engineer
Catherine Weaver
Email: catherine22@berkeley.edu
Website: https://cwj22.github.io
Advisor(s): Masayoshi Tomizuka, Wei Zhan
Research Blurb: My research focuses on machine learning and control algorithms for the challenging task of autonomous racing in Gran Turismo Sport. I leverage my background in Mechanical Engineering to discover how machine learning and model-based optimal control can create safe, high-performance control systems for robotics and autonomous systems. A particular emphasis of mine has been how to leverage offline datasets (e.g., human players' racing trajectories) to inform better, more sample-efficient control algorithms.
Jobs Interested In: Research Scientist and Robotics/Controls Engineer
Chawin Sitawarin
Email: chawin.sitawarin@gmail.com
Website: https://chawins.github.io/
Advisor(s): David Wagner
Research Blurb: I am broadly interested in the security and safety aspects of machine learning systems. Most of my previous works are in the domain of adversarial machine learning, particularly adversarial examples and robustness of machine learning algorithms. More recently, I am excited about emerging security and privacy risks on large language models.
Jobs Interested In: Research scientist
Dhruv Shah
Email: shah@cs.berkeley.edu
Website: http://cs.berkeley.edu/~shah/
Advisor(s): Sergey Levine
Research Blurb: I train big(-ish) models and make robots smarter.
Jobs Interested In: Research scientist, roboticist
Eliza Kosoy
Email: eko@berkeley.edu
Website: https://www.elizakosoy.com/
Advisor(s): Alison Gopnik
Research Blurb: Eliza Kosoy works at the intersection of child development and AI with Prof. Alison Gopnik. Her work includes creating evaluative benchmarks for LLMs rooted in child development and studying how children and adults use GenAI models such as ChatGPT/DALL-E and form mental models about them. She’s an intern at Google working on the AI/UX team and previously with the Empathy Lab. She has published in NeurIPS, ICML, ICLR, CogSci, and Cognition. Her thesis work created a unified virtual environment for testing children and AI models in one place for the purposes of training RL models. She also has experience building startups and STEM hardware coding toys.
Jobs Interested In: Research Scientist (child development and AI), AI safety (specializing in children), User Experience (UX) Researcher (specializing in mixed methods, youth, AI, LLMs), Education and AI (STEM toys)
Fangyu Wu
Email: fangyuwu@berkeley.edu
Website: https://fangyuwu.com/
Advisor(s): Alexandre Bayen
Research Blurb: Under the mentorship of Prof. Alexandre Bayen, Fangyu focuses on the application of optimization methods to multi-agent robotic systems, particularly in the planning and control of automated vehicles.
Jobs Interested In: Faculty, or research scientist in control, optimization, and robotics
Frances Ding
Email: frances@berkeley.edu
Website: https://www.francesding.com/
Advisor(s): Jacob Steinhardt, Moritz Hardt
Research Blurb: My research focus is in machine learning for protein modeling. I work on improving protein property classification and protein design, as well as understanding what different protein models learn. I have previously worked on sequence models for DNA and RNA, and benchmarks for evaluating the interpretability and fairness of ML models across domains.
Jobs Interested In: Research scientist
Jianlan Luo
Email: jianlanluo@eecs.berkeley.edu
Website: https://people.eecs.berkeley.edu/~jianlanluo/
Advisor(s): Sergey Levine
Research Blurb: My research interests are broadly in scalable algorithms and practice of machine learning, robotics, and controls, particularly at their intersections.
Jobs Interested In: Faculty, Research Scientist
Kathy Jang
Email: kathyjang@gmail.com
Website: https://kathyjang.com
Advisor(s): Alexandre Bayen
Research Blurb: My thesis work has specialized in reinforcement learning for autonomous vehicles, focusing on enhancing decision-making and efficiency in applied settings. In future work, I'm eager to apply these principles to broader challenges across domains like natural language processing. With my background, my aim is to see the direct impact of my efforts by contributing to innovative AI research and solutions.
Jobs Interested In: ML research scientist/engineer
Kevin Lin
Email: k-lin@berkeley.edu
Website: https://people.eecs.berkeley.edu/~kevinlin/
Advisor(s): Dan Klein, Joseph E. Gonzalez
Research Blurb: My research focuses on understanding and improving how language models use and provide information.
Jobs Interested In: Research Scientist
Nikhil Ghosh
Email: nikhil_ghosh@berkeley.edu
Website: https://nikhil-ghosh-berkeley.github.io/
Advisor(s): Bin Yu, Song Mei
Research Blurb: I am interested in developing a better foundational understanding of deep learning and improving practical systems, using both theoretical and empirical methodology. Currently, I am especially interested in improving the efficiency of large models by studying how to properly scale hyperparameters with model size.
Jobs Interested In: Research Scientist
Olivia Watkins
Email: oliviawatkins@berkeley.edu
Website: https://aliengirlliv.github.io/oliviawatkins
Advisor(s): Pieter Abbeel and Trevor Darrell
Research Blurb: My work involves RL, BC, learning from humans, and using common-sense foundation model reasoning for agent learning. I’m excited about language agent learning, supervision, alignment & robustness.
Jobs Interested In: Research scientist
Ruiming Cao
Email: rcao@berkeley.edu
Website: https://rmcao.net
Advisor(s): Laura Waller
Research Blurb: My research is on computational imaging, particularly space-time modeling for dynamic scene recovery and motion estimation. I also work on optical microscopy techniques, optimization-based optical design, event camera processing, and novel view rendering.
Jobs Interested In: Research scientist, postdoc, faculty
Ryan Hoque
Email: ryanhoque@berkeley.edu
Website: https://ryanhoque.github.io
Advisor(s): Ken Goldberg
Research Blurb: Imitation learning and reinforcement learning algorithms that scale to large robot fleets performing manipulation and other complex tasks.
Jobs Interested In: Research Scientist
Sam Toyer
Email: sdt@berkeley.edu
Website: https://www.qxcv.net/
Advisor(s): Stuart Russell
Research Blurb: My research focuses on making language models secure, robust and safe. I also have experience in vision, planning, imitation learning, reinforcement learning, and reward learning.
Jobs Interested In: Research scientist
Shishir G. Patil
Email: shishirpatil2007@gmail.com
Website: https://shishirpatil.github.io/
Advisor(s): Joseph Gonzalez
Research Blurb: Gorilla LLM - Teaching LLMs to use tools (https://gorilla.cs.berkeley.edu/); LLM Execution Engine: Guaranteeing reversibility, robustness, and minimizing blast-radius for LLM-Agents incorporated into user and enterprise workflows; POET: Memory bound, and energy efficient fine-tuning of LLMs on edge devices such as smartphones and laptops (https://poet.cs.berkeley.edu/).
Jobs Interested In: Research Scientist
Suzie Petryk
Email: spetryk@berkeley.edu
Website: https://suziepetryk.com/
Advisor(s): Trevor Darrell, Joseph Gonzalez
Research Blurb: I work on improving the reliability and safety of multimodal models. My focus has been on localizing and reducing hallucinations for vision + language models, along with measuring and using uncertainty and mitigating bias. My interests lie in applying solutions to these challenges in actual production scenarios, rather than solely in academic environments.
Jobs Interested In: Applied research scientist in generative AI, safety, and/or accessibility
Xingyu Lin
Email: xingyu@berkeley.edu
Website: https://xingyu-lin.github.io/
Advisor(s): Pieter Abbeel
Research Blurb: My research lies in robotics, machine learning, and computer vision, with the primary goal of learning generalizable robot skills from two angles: (1) Learning structured world models with spatial and temporal abstractions. (2) Pre-training visual representation and skills to enable knowledge transfer from Internet-scale vision datasets and simulators.
Jobs Interested In: Faculty, or research scientist
Yaodong Yu
Email: yyu@eecs.berkeley.edu
Website: https://yaodongyu.github.io/
Advisor(s): Michael I. Jordan, Yi Ma
Research Blurb: My research interests are broadly in theory and practice of trustworthy machine learning, including interpretability, privacy, and robustness.
Jobs Interested In: Faculty