DIAMOND: Visual Details Matter in Atari and Diffusion for World Modeling

July 17, 2024

The idea of training a reinforcement learning agent inside a neural network world model was first introduced in 2018, and the principle was soon built upon by a series of world-model frameworks. Among the most prominent is Dreamer, which learns a policy from the latent space of a recurrent state-space model. DreamerV2 demonstrated that discrete latents can reduce compounding errors, and DreamerV3 achieved human-like performance on a range of tasks across different domains with a fixed set of hyperparameters.

Contents
  • DIAMOND: Methodology and Architecture
  • DIAMOND: Experiments and Results
  • Conclusion

Furthermore, parallels can be drawn between image generation models and world models, suggesting that progress in generative vision models can be carried over to benefit world models. As transformers gained popularity in natural language processing, frameworks such as DALL-E and VQGAN emerged; these frameworks use discrete autoencoders to convert images into discrete tokens and leverage the sequence-modeling abilities of autoregressive transformers to build highly powerful and efficient text-to-image generative models. At the same time, diffusion models gained traction, and today they have established themselves as a dominant paradigm for high-resolution image generation. Owing to the complementary capabilities of diffusion models and reinforcement learning, attempts are being made to combine the two approaches, exploiting the flexibility of diffusion models as trajectory models, reward models, planners, and policies for data augmentation in offline reinforcement learning.

World models offer a promising way to train reinforcement learning agents safely and efficiently. Traditionally, these models use sequences of discrete latent variables to simulate environment dynamics, but this compression can discard visual details that are crucial for reinforcement learning. At the same time, diffusion models have risen in popularity for image generation, challenging the traditional reliance on discrete latents. Inspired by this shift, this article covers DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained within a diffusion world model. We will explore the design choices needed to make diffusion suitable for world modeling and show that enhanced visual details lead to better agent performance. DIAMOND sets a new state of the art on the competitive Atari 100k benchmark, achieving a mean human-normalized score of 1.46, the highest for agents trained entirely within a world model.

World models, or generative models of environments, are emerging as an important component that allows agents to plan and reason about their environments. Although reinforcement learning has achieved considerable success in recent years, reinforcement learning methods are notoriously sample-inefficient, which significantly limits their real-world applications. World models, on the other hand, have demonstrated their ability to train reinforcement learning agents across diverse environments with significantly improved sample efficiency, allowing the agent to learn from limited real-world experience. Recent world-modeling frameworks usually model environment dynamics as a sequence of discrete latent variables, discretizing the latent space to avoid compounding errors over multi-step horizons. While this approach can deliver strong results, the discretization also loses information, hurting reconstruction quality and generality. That loss of information can become a significant roadblock in real-world scenarios where fine detail matters, such as training autonomous vehicles: small visual details like the color of a traffic light or the turn indicator of the vehicle in front can change an agent's policy. Increasing the number of discrete latents can help avoid information loss, but it increases computation costs significantly.

Furthermore, in recent years diffusion models have emerged as the dominant approach for high-quality image generation. Because they learn to reverse a noising process, they compete directly with the more established approaches based on modeling discrete tokens, and therefore offer a promising alternative that eliminates the need for discretization in world modeling. Diffusion models are also known for being easy to condition and for flexibly modeling complex, multi-modal distributions without mode collapse. Both attributes are crucial for world modeling: conditioning enables a world model to accurately reflect an agent's actions, leading to more reliable credit assignment, while modeling multimodal distributions offers a greater diversity of training scenarios for the agent, enhancing its overall performance.


Building on these characteristics, DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained within a diffusion world model. The framework makes careful design choices to ensure its diffusion world model remains efficient and stable over long time horizons, and provides a qualitative analysis to demonstrate the importance of these choices. DIAMOND sets a new state of the art with a mean human-normalized score of 1.46 on the well-established Atari 100k benchmark, the highest for agents trained entirely within a world model. Operating in image space allows DIAMOND's diffusion world model to seamlessly substitute for the environment, offering greater insight into world-model and agent behaviors. Notably, the improved performance in certain games is attributed to better modeling of critical visual details.

The DIAMOND framework models the environment as a standard POMDP, or Partially Observable Markov Decision Process, with a set of states, a set of discrete actions, and a set of image observations. The transition function describes the environment dynamics, and the reward function maps transitions to scalar rewards. The observation function describes the observation probabilities and emits image observations, which the agent relies on to perceive the environment, since it cannot directly access the underlying states. The goal is to obtain a policy mapping observations to actions that maximizes the expected discounted return for a given discount factor. World models are generative models of the environment: they can be used to build simulated environments, so that instead of training the reinforcement learning agent in the real environment, it is trained inside the world model. Figure 1 illustrates how the DIAMOND framework unrolls its imagination over time.
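To make "training inside the world model" concrete, the sketch below shows one way such an imagination rollout could be structured, in the spirit of Figure 1. All names here (`world_model.predict_next`, `reward_end_model`, `policy`) are hypothetical placeholders, not DIAMOND's actual interfaces.

```python
def imagine_rollout(world_model, reward_end_model, policy, obs_history, act_history, horizon=15):
    """Unroll an imagined trajectory inside a learned world model (illustrative sketch).

    world_model.predict_next(obs_history, act_history) -> next imagined observation
    reward_end_model(obs_history, act_history)         -> (reward, episode_end) estimates
    policy(obs)                                         -> action distribution for the agent
    """
    obs_history, act_history = list(obs_history), list(act_history)
    trajectory = []
    for _ in range(horizon):
        action = policy(obs_history[-1]).sample()                       # agent acts on the latest imagined frame
        act_history.append(action)
        next_obs = world_model.predict_next(obs_history, act_history)   # sampled by the generative model
        reward, done = reward_end_model(obs_history, act_history)
        trajectory.append((obs_history[-1], action, reward, done))
        obs_history.append(next_obs)
        if done:
            break
    return trajectory   # consumed by the reinforcement learning losses that train the agent
```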

DIAMOND: Methodology and Architecture

At their core, diffusion models are a class of generative models that produce a sample by reversing a noising process, drawing heavy inspiration from non-equilibrium thermodynamics. The DIAMOND framework considers a diffusion process indexed by a continuous time variable, with corresponding marginals and boundary conditions given by a tractable unstructured prior distribution. To obtain a generative model that maps noise to data, the framework must reverse this process; the reverse process is itself a diffusion process, running backwards in time. At any given time, however, estimating the score function is not trivial, since the framework does not have access to the true score function. The model overcomes this hurdle with a score-matching objective, which allows a score model to be trained without knowing the underlying score function. A score-based diffusion model on its own is an unconditional generative model, but a conditional generative model of environment dynamics is required to serve as a world model. To this end, the DIAMOND framework considers the general POMDP setting, in which past observations and actions can be used to approximate the unknown Markovian state. As demonstrated in Figure 1, the framework uses this history to condition the diffusion model and generate the next observation directly. Although any SDE or ODE solver could be used in theory, there is a trade-off between the number of function evaluations (NFE) and sample quality, which significantly impacts the inference cost of diffusion models.
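As a rough illustration of this conditional objective, the sketch below trains a denoising network to recover the clean next observation from a noised version, conditioned on the history of past observations and actions. The `denoiser` signature and the plain MSE weighting are illustrative assumptions; DIAMOND's actual objective follows the EDM formulation described next.

```python
import torch
import torch.nn.functional as F

def conditional_denoising_loss(denoiser, next_obs, obs_history, act_history, sigma):
    """Conditional denoising score-matching objective (illustrative sketch).

    denoiser(noisy_obs, sigma, obs_history, act_history) is assumed to output an
    estimate of the clean next observation; minimizing this reconstruction error
    is equivalent, up to a weighting, to matching the score of the noised data.
    """
    noise = torch.randn_like(next_obs) * sigma    # perturbation kernel: Gaussian noise at level sigma
    noisy_next_obs = next_obs + noise
    pred_clean = denoiser(noisy_next_obs, sigma, obs_history, act_history)
    return F.mse_loss(pred_clean, next_obs)
```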

Building on the above, let us now look at the practical realization of the DIAMOND framework as a diffusion-based world model, including the drift and diffusion coefficients corresponding to a particular choice of diffusion formulation. Instead of opting for DDPM, a natural candidate for the task, the DIAMOND framework builds on the EDM formulation and considers a perturbation kernel whose noise schedule is a real-valued function of diffusion time. The framework selects preconditioners that keep the network's input and output variance at unit scale for any noise level. The network's training target adaptively mixes signal and noise depending on the degradation level: when the noise is low, the target becomes the difference between the clean and the perturbed signal, i.e. the added Gaussian noise. Intuitively, this prevents the training objective from becoming trivial in the low-noise regime. In practice, this objective has high variance at the extremes of the noise schedule, so the model samples the noise level from a log-normal distribution chosen empirically to concentrate training around the medium-noise regions. The DIAMOND framework uses a standard 2D U-Net for the vector field and keeps a buffer of past observations and actions on which it conditions itself: the past observations are concatenated with the next noisy observation, and the actions are injected through adaptive group normalization layers in the U-Net's residual blocks.
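The preconditioners mentioned above follow the standard EDM formulation (Karras et al., 2022); the sketch below writes them out explicitly. The hyperparameter values (`sigma_data`, `p_mean`, `p_std`) are the usual EDM defaults, used here for illustration rather than as DIAMOND's exact settings.

```python
import torch

def edm_preconditioners(sigma, sigma_data=0.5):
    """EDM scalings that keep the network's input and output variance near unity at any noise level."""
    c_in    = 1 / torch.sqrt(sigma**2 + sigma_data**2)
    c_skip  = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out   = sigma * sigma_data / torch.sqrt(sigma**2 + sigma_data**2)
    c_noise = torch.log(sigma) / 4
    return c_in, c_skip, c_out, c_noise

def sample_noise_level(batch_size, p_mean=-1.2, p_std=1.2):
    """Log-normal noise sampling: concentrates training around medium noise levels."""
    return torch.exp(p_mean + p_std * torch.randn(batch_size))

def denoised_prediction(network, noisy_obs, sigma, cond):
    """Wrap the raw U-Net so its target adaptively mixes signal and noise (EDM parameterization)."""
    c_in, c_skip, c_out, c_noise = edm_preconditioners(sigma)
    # reshape scalings so they broadcast over image dimensions (B, C, H, W)
    c_in, c_skip, c_out = (c.view(-1, 1, 1, 1) for c in (c_in, c_skip, c_out))
    return c_skip * noisy_obs + c_out * network(c_in * noisy_obs, c_noise, cond)
```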


DIAMOND: Experiments and Results

For evaluation, the DIAMOND framework uses the Atari 100k benchmark, which consists of 26 games designed to test a wide range of agent capabilities. In each game, an agent is limited to 100k actions in the environment, roughly equivalent to 2 hours of human gameplay, before evaluation. For comparison, unconstrained Atari agents typically train for 50 million steps, a 500-fold increase in experience. DIAMOND was trained from scratch with 5 random seeds per game; each training run required around 12GB of VRAM and took approximately 2.9 days on a single Nvidia RTX 4090, amounting to 1.03 GPU-years in total. The following table provides the scores for all games along with the mean and interquartile mean (IQM) of human-normalized scores.
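For reference, both headline metrics are simple functions of the raw game scores. The sketch below computes the human-normalized score and the interquartile mean; the per-game random and human reference scores are inputs from the standard Atari evaluation protocol and are not reproduced here.

```python
import numpy as np

def human_normalized_score(agent_score, random_score, human_score):
    """HNS = (agent - random) / (human - random); 1.0 corresponds to human-level play on that game."""
    return (agent_score - random_score) / (human_score - random_score)

def interquartile_mean(scores):
    """IQM: the mean of the middle 50% of scores, more robust to outlier games and seeds than the mean."""
    scores = np.sort(np.asarray(scores).ravel())
    cut = len(scores) // 4
    return scores[cut:len(scores) - cut].mean()
```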

Given the limitations of point estimates, the DIAMOND framework also reports stratified bootstrap confidence intervals for the mean and the interquartile mean (IQM) of human-normalized scores, along with performance profiles and additional metrics, as summarized in the following figure.
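The stratified bootstrap resamples runs independently within each game (the strata) and recomputes the aggregate metric on each resample. The sketch below is a simplified, illustrative version of that procedure; it can be paired with the `interquartile_mean` helper above.

```python
import numpy as np

def stratified_bootstrap_ci(scores, metric, n_resamples=2000, alpha=0.05, seed=0):
    """scores: array of shape (n_games, n_seeds) of human-normalized scores.

    Seeds are resampled with replacement independently within each game, the metric
    is recomputed on every resample, and a percentile confidence interval is returned."""
    rng = np.random.default_rng(seed)
    n_games, n_seeds = scores.shape
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n_seeds, size=(n_games, n_seeds))   # per-game resampling
        stats.append(metric(np.take_along_axis(scores, idx, axis=1)))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# e.g. stratified_bootstrap_ci(hns_matrix, interquartile_mean) -> (lower, upper) bound on the IQM
```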

The results show that DIAMOND performs exceptionally well across the benchmark, surpassing human players in 11 games and achieving a superhuman mean HNS of 1.46, a new record for agents trained entirely within a world model. DIAMOND's IQM is comparable to STORM and exceeds all other baselines, and the framework excels in environments where capturing small details is crucial, such as Asterix, Breakout, and Road Runner. Furthermore, as discussed earlier, the DIAMOND framework can in principle use any diffusion model in its pipeline. Although it opts for the EDM approach, DDPM would have been a natural choice, since it is already used in numerous image generation applications. To compare the two, the DIAMOND framework trains an EDM and a DDPM variant with the same network architecture on the same static dataset of 100k frames collected with an expert policy. The number of denoising steps is directly related to the inference cost of the world model, so fewer steps reduce the cost of training an agent on imagined trajectories. To keep the world model computationally comparable with other baselines, such as IRIS which requires 16 NFE per timestep, the aim is to use at most a few tens of denoising steps, and preferably fewer. However, setting the number of denoising steps too low can degrade visual quality and lead to compounding errors. To assess the stability of the different diffusion variants, imagined trajectories generated autoregressively up to t = 1000 timesteps are shown in the following figure, using different numbers of denoising steps n ≤ 10.
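To make the trade-off concrete, a few-step sampler for the next observation might look like the sketch below: each imagined frame costs `len(sigmas)` network evaluations, so the length of the noise schedule directly sets the NFE per timestep. This Euler-style update over a decreasing noise schedule is a generic EDM-flavored sketch, not DIAMOND's exact sampler, and the schedule values are placeholders.

```python
import torch

def sample_next_obs(denoiser, obs_history, act_history, shape, sigmas=(80.0, 2.0, 0.1)):
    """Generate the next observation with len(sigmas) denoising steps (illustrative Euler sampler).

    denoiser(x, sigma, obs_history, act_history) is assumed to return an estimate of the
    clean next observation; sigmas is a decreasing noise schedule, here n = 3 steps."""
    x = torch.randn(shape) * sigmas[0]                # start from pure noise at the largest sigma
    schedule = list(sigmas) + [0.0]
    for sigma, sigma_next in zip(schedule[:-1], schedule[1:]):
        denoised = denoiser(x, torch.tensor(sigma), obs_history, act_history)
        d = (x - denoised) / sigma                    # probability-flow ODE derivative
        x = x + d * (sigma_next - sigma)              # Euler step toward lower noise
    return x                                          # the imagined next frame
```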

We observe that DDPM-based generation (a) in this regime suffers from severe compounding errors, causing the world model to quickly drift out of distribution, with smaller numbers of denoising steps leading to faster error accumulation. In contrast, the EDM-based diffusion world model (b) remains much more stable over long time horizons, even with a single denoising step (n = 1). The figure shows imagined trajectories from diffusion world models based on DDPM (left) and EDM (right); the initial observation at t = 0 is the same for both, and each row corresponds to a decreasing number of denoising steps n. The optimal single-step prediction is the expectation over possible reconstructions for a given noisy input, which can fall out of distribution when the posterior is multimodal. While some games, like Breakout, have deterministic transitions that can be accurately modeled with a single denoising step, other games exhibit partial observability and therefore multimodal observation distributions; in those cases an iterative solver is needed to guide the sampling procedure towards a specific mode, as illustrated in the game Boxing in the following figure. Consequently, the DIAMOND framework sets n = 3 in all experiments.


The above figure compares single-step (top row) and multi-step (bottom row) sampling in Boxing. The movements of the black player are unpredictable, so single-step denoising interpolates between possible outcomes, producing blurry predictions. In contrast, multi-step sampling yields a clear image by guiding the generation towards a specific mode. Interestingly, since the policy controls the white player, its actions are known to the world model, eliminating ambiguity; both single-step and multi-step sampling therefore correctly predict the white player's position.

In the above figure, the trajectories imagined by DIAMOND generally exhibit higher visual quality and are more faithful to the true environment compared to those imagined by IRIS. The trajectories generated by IRIS contain visual inconsistencies between frames (highlighted by white boxes), such as enemies being displayed as rewards and vice-versa. Although these inconsistencies may only affect a few pixels, they can significantly impact reinforcement learning. For instance, an agent typically aims to target rewards and avoid enemies, so these small visual discrepancies can make it more challenging to learn an optimal policy. The figure shows consecutive frames imagined with IRIS (left) and DIAMOND (right). The white boxes highlight inconsistencies between frames, which only appear in trajectories generated with IRIS. In Asterix (top row), an enemy (orange) becomes a reward (red) in the second frame, then reverts to an enemy in the third, and again to a reward in the fourth. In Breakout (middle row), the bricks and score are inconsistent between frames. In Road Runner (bottom row), the rewards (small blue dots on the road) are inconsistently rendered between frames. These inconsistencies do not occur with DIAMOND. In Breakout, the score is reliably updated by +7 when a red brick is broken. 

Conclusion

In this article, we have talked about DIAMOND, a reinforcement learning agent trained within a diffusion world model. The DIAMOND framework makes careful design choices to keep its diffusion world model efficient and stable over long time horizons, and its qualitative analysis demonstrates the importance of these choices. DIAMOND sets a new state of the art on the well-established Atari 100k benchmark with a mean human-normalized score of 1.46, the highest for agents trained entirely within a world model. Operating in image space allows DIAMOND's diffusion world model to seamlessly substitute for the environment, offering greater insight into world-model and agent behaviors, and the improved performance in certain games is attributed to better modeling of critical visual details.
