Building architectures that can handle the world’s data

Oct 11, 2021 | Data Science

Perceiver and Perceiver IO work as multi-purpose tools for AI

Most architectures used by AI systems today are specialists. A 2D residual network may be a good choice for processing images, but at best it is a loose fit for other kinds of data, such as the Lidar signals used in self-driving cars or the torques used in robotics. What’s more, standard architectures are often designed with only one task in mind, leading engineers to bend over backwards to reshape, distort, or otherwise modify their inputs and outputs in the hope that a standard architecture can learn to handle their problem correctly. Dealing with more than one kind of data, like the sounds and images that make up videos, is even more complicated and usually requires complex, hand-tuned systems built from many different parts, even for simple tasks. As part of DeepMind’s mission of solving intelligence to advance science and humanity, we want to build systems that can solve problems using many types of inputs and outputs, so we began to explore a more general and versatile architecture that can handle all types of data.

The Perceiver IO architecture maps input arrays to output arrays by means of a small latent array, which lets it scale gracefully even for very large inputs and outputs. Perceiver IO uses a global attention mechanism that generalizes across many different kinds of data.
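This encode-process-decode flow can be sketched in a few lines of NumPy. The array sizes below (10,000 inputs, 256 latents, 500 output queries) and the single-head, unlearned attention are illustrative assumptions for the sketch, not the actual Perceiver IO implementation:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: one output row per query row."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 64                                   # channel dimension (illustrative)
inputs  = rng.normal(size=(10_000, d))   # large input array (e.g. pixels)
latents = rng.normal(size=(256, d))      # small latent array (learned in practice)
queries = rng.normal(size=(500, d))      # output query array

# Encode: the latents cross-attend to the inputs, compressing them.
z = attention(latents, inputs, inputs)   # shape (256, d)
# Process: self-attention in the latent space, independent of input size.
z = attention(z, z, z)                   # shape (256, d)
# Decode: output queries cross-attend to the latents to produce outputs.
outputs = attention(queries, z, z)       # shape (500, d)
print(outputs.shape)
```

Because the inputs only ever appear as keys and values in the first cross-attention, the size of the output array is decoupled from the size of the input array: changing either only changes that one step.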

In a paper presented at ICML 2021 (the International Conference on Machine Learning) and published as a preprint on arXiv, we introduced the Perceiver, a general-purpose architecture that can process data including images, point clouds, audio, video, and their combinations. While the Perceiver could handle many varieties of input data, it was limited to tasks with simple outputs, like classification. A new preprint on arXiv describes Perceiver IO, a more general version of the Perceiver architecture. Perceiver IO can produce a wide variety of outputs from many different inputs, making it applicable to real-world domains like language, vision, and multimodal understanding as well as challenging games like StarCraft II. To help researchers and the machine learning community at large, we’ve now open sourced the code.

Perceiver IO processes language by first choosing which characters to attend to. The model learns to use several different strategies: some parts of the network attend to specific places in the input, while others attend to specific characters like punctuation marks.

Perceivers build on the Transformer, an architecture that uses an operation called “attention” to map inputs into outputs. By comparing all elements of the input, Transformers process inputs based on their relationships with each other and the task. Attention is simple and widely applicable, but because Transformers compare every pair of input elements, the cost of attention grows quadratically with the number of inputs. This means Transformers work well for inputs with at most a few thousand elements, but common forms of data like images, videos, and books can easily contain millions of elements. With the original Perceiver, we solved a major problem for a generalist architecture: scaling the Transformer’s attention operation to very large inputs without introducing domain-specific assumptions. The Perceiver does this by using attention to first encode the inputs into a small latent array. This latent array can then be processed further at a cost independent of the input’s size, so the Perceiver’s memory and computational needs grow gracefully as the input grows larger, even for especially deep models.
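To see the difference in scaling, compare the size of the attention maps each approach builds for an input of n elements. The latent size of 256 used here is an illustrative assumption:

```python
def self_attention_map_size(n):
    # Standard Transformer: every input element attends to every other,
    # so the attention map has n * n entries.
    return n * n

def perceiver_encode_map_size(n, num_latents=256):
    # Perceiver: the inputs appear only in one cross-attention with a
    # fixed set of latents, giving an n * num_latents attention map.
    return n * num_latents

for n in (1_000, 100_000, 1_000_000):
    print(n, self_attention_map_size(n), perceiver_encode_map_size(n))
```

At a million inputs, the quadratic map has a trillion entries while the cross-attention map has 256 million, and all subsequent latent self-attention layers cost the same regardless of n, which is what lets deep Perceivers handle raw images, video, and audio directly.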