
DeepMind’s Gato is mediocre, so why did they build it?

Written by Tiernan Ray, Contributing Writer

Tiernan Ray has been covering technology and business for 27 years. Most recently, he was technology editor for Barron’s, where he wrote daily market coverage for the Tech Trader blog and wrote the weekly print column of that name.


DeepMind’s Gato neural network performs a variety of tasks, including controlling robotic arms that stack blocks, playing Atari 2600 games, and captioning images.

DeepMind

The world is used to seeing headlines about the latest breakthrough by deep learning forms of artificial intelligence. The latest achievement from Google’s DeepMind division, however, might be summed up as “One AI program that does a so-so job at a lot of things.”

Gato, as the DeepMind program is called, was unveiled this week as a so-called multimodal program, one that can play video games, chat, write compositions, caption images, and control a robotic arm that stacks blocks. It is a single neural network that can work with many types of data to perform many kinds of tasks.

“With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more,” write lead author Scott Reed and colleagues in their paper, “A Generalist Agent,” posted on the arXiv preprint server.

DeepMind co-founder Demis Hassabis cheered on the team, tweeting: “Our most general agent yet!! Fantastic work from the team!”


The only catch is that Gato is actually not that great at several of its tasks.

On the one hand, the program does better than a dedicated machine learning program at controlling a robotic Sawyer arm that stacks blocks. On the other hand, it produces captions for images that in many cases are quite poor. Its ability to hold a standard chat dialogue with a human interlocutor is similarly mediocre, sometimes producing contradictory and nonsensical statements.

And its play on Atari 2600 video games falls below that of most dedicated ML programs designed to compete in the benchmark Arcade Learning Environment.

Why would you make a program that does some things quite well and a bunch of other things not so well? Precedent and expectation, according to the authors.

There is precedent for more general kinds of programs becoming the state of the art in AI, and there is an expectation that increasing amounts of computing power will in future make up for the shortcomings.

Generality can tend to triumph in AI. As the authors note, citing AI scholar Richard Sutton, “Historically, generic models that are better at leveraging computation have also tended to overtake more specialized domain-specific approaches eventually.”

As Sutton wrote in his own blog post, “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”

Putting it as a formal thesis, Reed and team write that “we here test the hypothesis that training an agent which is generally capable on a large number of tasks is possible; and that this general agent can be adapted with little extra data to succeed at an even larger number of tasks.”


The model, in this case, is indeed very general. It is a version of the Transformer, the dominant kind of attention-based model that has become the basis of numerous programs, including GPT-3. A transformer models the probability of some element given the elements that surround it, such as words in a sentence.

In the case of Gato, the DeepMind scientists are able to use the same conditional probability search on numerous data types.
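To make the idea of conditional probability concrete, here is a minimal sketch, not DeepMind's code. It assumes the left-to-right form used by GPT-style models, where each element is predicted from the elements before it. The function `next_token_logits` and the toy model `toy_logits` are hypothetical stand-ins for a trained network.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(tokens, next_token_logits):
    """Sum log p(token_t | tokens before t) over the whole sequence.

    `next_token_logits(prefix)` is a hypothetical stand-in for a trained
    network: it returns one score per vocabulary entry.
    """
    total = 0.0
    for t, token in enumerate(tokens):
        probs = softmax(next_token_logits(tokens[:t]))
        total += math.log(probs[token])
    return total

def toy_logits(prefix):
    """Toy model over a 3-token vocabulary (an assumption for illustration).

    It favours repeating the previous token and is uniform at the start.
    """
    scores = [0.0, 0.0, 0.0]
    if prefix:
        scores[prefix[-1]] = 2.0
    return scores

# The toy model finds "0, 0" more likely than "0, 1".
print(sequence_log_prob([0, 0], toy_logits))
print(sequence_log_prob([0, 1], toy_logits))
```

Nothing in the scoring loop cares what the integer tokens stand for, which is the property that lets one model be applied to several kinds of data.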

As Reed and colleagues describe Gato’s training task,

During the training phase of Gato, data from different tasks and modalities are serialised into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that Gato only predicts action and text targets.

Gato, in other words, does not treat tokens differently, whether they are words in a chat or movement vectors in a block-stacking exercise. It is all the same.
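The serialization and loss masking described in that passage can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the five-word vocabulary, the eight action bins, the uniform binning, and the `is_target` flag that marks which items the model must predict are all hypothetical choices made for readability.

```python
import math

# Toy text vocabulary and action binning (assumed values, for illustration).
TEXT_VOCAB = {"<pad>": 0, "stack": 1, "the": 2, "red": 3, "block": 4}
NUM_ACTION_BINS = 8
ACTION_OFFSET = len(TEXT_VOCAB)  # action token ids start after the text ids

def tokenize_text(sentence):
    """Map each word to its integer id."""
    return [TEXT_VOCAB[word] for word in sentence.split()]

def tokenize_action(values, low=-1.0, high=1.0):
    """Map each continuous value in [low, high] to one of NUM_ACTION_BINS ids."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)
        fraction = (clipped - low) / (high - low)
        bin_index = min(int(fraction * NUM_ACTION_BINS), NUM_ACTION_BINS - 1)
        tokens.append(ACTION_OFFSET + bin_index)
    return tokens

def serialize(episode):
    """Flatten (kind, payload, is_target) items into tokens plus a loss mask."""
    tokens, mask = [], []
    for kind, payload, is_target in episode:
        piece = tokenize_text(payload) if kind == "text" else tokenize_action(payload)
        tokens.extend(piece)
        mask.extend([1 if is_target else 0] * len(piece))
    return tokens, mask

def masked_cross_entropy(token_probs, mask):
    """Average -log p over positions where mask == 1.

    token_probs[i] is the probability the model assigned to the correct
    token at position i; masked-out positions are ignored entirely.
    """
    kept = [-math.log(p) for p, m in zip(token_probs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

episode = [
    ("text", "stack the red block", False),  # instruction: context only
    ("action", [-1.0, 0.0, 0.99], True),     # motor command: to be predicted
]
tokens, mask = serialize(episode)

# Pretend model confidences: 0.9 on the context tokens, 0.5 on the actions.
loss = masked_cross_entropy([0.9] * 4 + [0.5] * 3, mask)
print(tokens, mask, loss)
```

Words and motor commands end up in one integer sequence, and only the masked-in positions contribute to the loss, so the confidences on the instruction tokens have no effect on the result.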

Gato training scenario.

Reed et al. 2022

Buried within Reed and team’s hypothesis is a corollary, namely that more and more computing power will win out eventually. Right now, Gato is limited by the response time of the Sawyer robot arm that does the block stacking. At 1.18 billion network parameters, Gato is vastly smaller than very large AI models such as GPT-3. As deep learning models get bigger, performing inference introduces latency that can cause failures in the non-deterministic world of a real-world robot.

But Reed and his colleagues expect that limit to be surpassed as AI hardware gets faster at processing.

“We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato,” they wrote. “As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve.”

Hence, Gato is really a model for how scale of compute will continue to be the main vector of machine learning development, by making general models larger and larger. Bigger is better, in other words.

Gato gets better as the size of the neural network, measured in parameters, increases.

Reed et al. 2022

And the authors have some evidence for that. Gato does seem to get better as it gets bigger. They compare averaged scores across all the benchmark tasks for three sizes of model by parameter count: 79 million, 364 million, and the main model, 1.18 billion. “We can see that for an equivalent token count, there is a significant performance improvement with increased scale,” the authors write.

An interesting future question is whether a program that is a generalist is more dangerous than other kinds of AI programs. The authors spend a good deal of the paper discussing potential dangers that are not yet well understood.

The idea of a program that handles multiple tasks suggests to the layperson a kind of human adaptability, but that may be a dangerous misperception. “For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors,” Reed and team wrote.

“Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context.”

Hence, they write: “The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance.”

(As an interesting side note, the Gato paper uses a scheme for describing risk devised by former Google AI researcher Margaret Mitchell and colleagues, called Model Cards. Model cards give a concise summary of what an AI program is, what it does, and what factors affect how it operates. Last year, Mitchell wrote that she was forced out of Google for supporting her former colleague Timnit Gebru, whose ethical concerns about AI ran afoul of Google’s AI leadership.)

Gato is by no means unique in its generalizing tendency. It is part of a broad trend toward generalization and toward larger models that use buckets of horsepower. The world got its first taste of Google’s tilt in this direction last summer, with Google’s “Perceiver” neural network, which combined text Transformer tasks with images, sound, and LiDAR spatial coordinates.


Among its peers is PaLM, the Pathways Language Model, introduced this year by Google scientists, a 540-billion-parameter model that uses a new technology for coordinating thousands of chips, known as Pathways, also invented at Google. A neural network released in January by Meta, called “data2vec,” uses Transformers for image data, speech audio waveforms, and text language representations all in one.

What’s new about Gato, it would seem, is the intention to take AI used for non-robotics tasks and push it into the robotics realm.

The creators of Gato, noting the achievements of Pathways and other generalist approaches, see …