Exploring a Pixel-Maze with Evolution Strategies

2019-05-17 - Martin Renold


Since a few years it is possible to learn playing Atari games directly from pixels with reinforcement learning1. A more recent discovery was that evolution strategies are competitive2 for training deep neural networks on those tasks. And even a simple genetic algorithm3 can work.

Those methods are fun to play around with. I have created a maze-world and trained a neural network to explore it. This task is much simpler than Atari but still gives interesting results.

I’m using Covariance-Matrix Adaptation (CMA-ES) and have tried a few simpler methods. The focus is on results but code is available.


This is my generated 2D maze8, featuring:

  • Food (red)
  • Agents (green)
  • Traces (dark green)

The agents cannot die. They all use a copy of the same controller. The goal is to improve this controller so that the agents (together) find as much food as possible.

This video shows a mediocre controller in five random worlds.

Observations: The agents start moving in different directions, but later all drift in a random walk to the bottom right. Sometimes they eat food blobs. (They can actually move through walls with a very low probability9.)

Agent Controller

The controller is a neural network with fixed topology10 and learned weights. The architecture is pretty standard except for the memory11.

Inputs, Neural Network and Outputs of Controller


  • The agents are nearly blind. They only see items next to them.
  • The agents don’t see their previous action. They must to use memory to keep going into a new direction.
  • The trace count can detect crowded areas. It includes the agent’s own trace.12


  • The softmax makes it simple to output fixed probabilities over actions, like “60% down, 40% left”.
  • An output with high certainty (“99.9% down”) requires larger weights which can only appear later during training.

The network has a single hidden layer and roughly 1‘000 parameters (weights and biases).

The parameters are optimized directly. The setup violates some assumptions13 for standard (value-based) reinforcement learning, but a direct policy search does not rely on those assumptions.


Inspired by the Visual Guide to Evolution Strategies5 and by remarks in various papers I have used the Covariance-Matrix Adaptation Evolution Strategy (CMA-ES). It worked with practically no tuning. I have tried other black-box optimization methods (discussed below) but was unable to reach the same score.

I would compare CMA-ES to the Random Forest Classifier6: a strong baseline that works out-of-the-box and is difficult to beat. Though it only works up to a few thousand parameters.

CMA-ES tracks a Gaussian distribution (with full covariance) over the parameters. The user provides an initial estimate of the scale of each parameter (variance). CMA-ES will then sample a small population, evaluate it and sort by fitness. Only rank information is used to update the distribution.

For a better overview see What is an Evolution Strategy? from the same guide. The general principle is easy to understand, but the details are a bit involved. With CMA-ES you really should be using a library (pycma) and not implement from scratch.

Training plot:

training plot

Each generation is evaluated on a different set of random maps. There is a lot of noise because of this. The moving average shows a slight upwards trend at the end.

It takes about an hour to reach an average score of 1000 on my low-end PC, and about ten hours on an AWS c5.9xlarge instance to reach the score above14.

After 44‘000 evaluations (second video):

Observations: The agents now keep changing directions. Sometimes they jitter around for a moment, especially when they should be turning. They can escape from dead ends, but it often requires multiple attempts when the exit is narrow.

Observations: The winning strategy follows the left wall at a short distance. In empty space it prefers straight lines with random left-turns. At the end of a horizontal food line the agent usually turns 180 degrees and finds back to the remaining food blob.15


CMA-ES is known to work well with tens to hundreds of parameters. With just 1‘000 parameters we are already approaching its limits. This may be a bit disappointing, because the learned strategy does not look really complex.

To put the network size into context: six parameters are enough to learn the classic cart-pole balancing task, 5‘000 have been used to control a simple 2D-Hopper robot, and 33‘000 to play Atari games from pixels4, though four million have also been used3.

Other evolution strategies exist that can scale up. But CMA-ES can achieve amazing results with smart use of just a few hundred parameters, like this Muscle-Based Locomotion for Bipedal Creatures7. For the pixel-maze 1‘000 parameters was probably a bit of a brute-force approach. It should be possible to get a good score with fewer parameters, at least with some additional problem-specific engineering.

I have tried a few simpler methods: a simple GA, the Cross-Entropy Method (CEM) and differential evolution. They all got stuck in local optima.

Here is an impression of my attempts to find good CEM parameters:

plot comparing CMA-ES with CEM and differential evolution

The CMA-ES lines have more noise because they use a smaller population size of 23 per generation, while other methods use between 200 and 1000. The dots show the fitness per generation before filtering. This is a different (older) variant of the task which is faster to train.

As you can see, this is not a trivial optimization problem. One big challenge is the evaluation noise caused by the random maze generator. One lucky maze can have a lot of food exactly where the agents are going. Some algorithms will give this score too much credibility. They may use the parameters of that lucky controller as a starting point for mutations until the score gets beaten with an even more lucky evaluation. I think this is why differential evolution failed. It may be possible to tune and tweak those methods get better results. But CMA-ES works great even without tuning.

I have also tried variations on the neural network. The controller did surprisingly well when I removed the hidden layer. It finds good solutions much faster. But the advantage of using a hidden layer becomes clear after 3000 evaluations.

The performance difference between 20 and 40 hidden nodes is not that large, and 20 hidden nodes are easier to train (only 600 parameters total). The CMA-ES defaults worked fine with 20 hidden nodes, but I have found that increasing the population size from 23 (the default) to 200 helped for larger networks.

A good scaling of the initial weights can speed up the first learning phase a lot, and also normalization of the inputs and outputs. I have found that this advantage often disappears completely long before the final score is reached. This should become more important when additional hidden layers are used. It’s known to be critical for deep neural networks.


Code is available in my pixelcrawl GitHub repository. It is not very polished. However it is reasonably optimized (at the expense of flexibility). See the README for instructions.

The original version was pure Python. It was about 100x slower than the current Python/C++ mix. There is a lot access to individual pixels and math involving small arrays, something which Python is really slow at.


Can it survive in empty space? Or will it always stick to the walls? Let’s transfer the trained agent into a new environment:

Observations: The learned strategy is somewhat robust and looks well randomized. It has trouble getting into certain one-pixel cavities. But it keeps exploring.

  1. Playing Atari with Deep Reinforcement Learning, Mnih et al., 2013. 

  2. Evolution Strategies as a Scalable Alternative to Reinforcement Learning, OpenAI blog, 2017. With accompanying paper Salimans et al., 2017

  3. Deep Neuroevolution: Genetic Algorithms Are a Competitive…, Such et al., 2017. With related article Welcoming the Era of Deep Neuroevolution, Uber blog, 2017. 

  4. Trust Region Policy Optimization, Schulman et al., 2015. 

  5. A Visual Guide to Evolution Strategies, blog post by David Ha (hardmaru), 2017. 

  6. The Unreasonable Effectiveness Of Random Forest, blog post by Ahmed El Deeb, 2015. 

  7. Muscle-Based Locomotion for Bipedal Creatures, Geijtenbeek et al., 2013. Video and paper. 

  8. How do you generate a maze? Simple. Perlin Noise and a threshold. Start from a noise image and apply a 3x3 binary filter repeatedly. There are 2512 possible filters to choose from. Most of them transform noise into noise. Pick one that produces a low number of horizontal and vertical edges. Using this as a fitness measure, the Cross-Entropy method can find a good filter. If you’re not happy with the result, use novelty search to evolve filters with maximum diversity. To measure the diversity of 2D patterns, use the first few layers of a pre-trained ResNet50 to extract basic visual features. This should result in different textures. Pick one that looks like a maze. Simple as that. 

  9. In early experiments I used a higher probability to move through walls because it was possible to start inside of a wall. The agents liked this a bit too much, especially when I gave them more time to explore. Drifting through walls in a fixed direction is a sure strategy to explore the whole maze, including disconnected parts. Whenever they got stuck they moved into the next wall instead of searching a way around. 

  10. A well-known alternative would be to evolve the topology together with the weights using NEAT. I have not tried this, but it’s known to work well for small networks. During the deep learning hype this approach was not getting much attention. Now the evolution of topology seems to be coming back as architecture search for deep vision networks. There is also recent work happening on indirect encoding (e.g. evolving regular patterns). See Designing neural networks through neuroevolution (Stanley et al., 2019) for a recent overview.  

  11. Using memory[t+1] = 0.99 * memory[t] + 0.01 * outputs[t], so the memory is just a low-pass filtered version of the outputs. A standard recurrent neuron could have learnt this rule. The memory half-time of about 70 steps (1/-log2(0.99)) was found to work well. I have also attempted to learn individual half-times for all 6 memory cells, so far without success. The filter states are randomly initialized. I have not tested other structures, but it seems to be working okay. 

  12. The trace input is a left-over from early experiments with 200 agents swarming the maze. It was meant to detect and avoid crowded areas. But 200 agents were very confusing to watch and the strategy mostly resembled a flood-fill algorithm. I’m not sure if or how the current agents use this input. 

  13. Standard RL methods like Q-learning expect to be solving a Markov Decision Process (MDP) that is fully observable. However the nearest four pixels are lacking a lot of information about the state. A memory is evolved to extract relevant information from the past. For the evolution strategy this is just some additional parameters to optimize, and never mind the feedback loop this creates. When learning Atari games the inputs are usually extended to include some history, e.g. the last four video frames, in the hope that all relevant state information can be extracted from those. A different issue is that the network controls only a single agent. It would be straight-forward to maximize the reward of this agent with standard RL. But it is not clear how prevent competition and maximize the collective reward of all agents instead. 

  14. The run presented here actually took a few days, because I was evaluating each agent on 100 maps instead of 10. This was complete overkill. It helps, but not enough to justify 10x training time. Training was done on AWS spot instances because they are cheap. They can be interrupted at any time, and I’m just counting on my luck; so far I’ve seen only two interruptions. Previously I used dask distributed with a central scheduler and spot instances as workers for a different experiment. This was quite involved to manage. For now I’m back to plain dask and good old rsync and ssh. There is also an unanalyzed scaling issue: the 36 vCPUs of the c5.9xlarge instance are utilized only at 40%. On smaller machines it is close to 100%. There is no point going distributed if the largest AWS instance (72 vCPUs) cannot be saturated. The price per vCPU is nearly the same for small and large instances. 

  15. Remember that the agent cannot see when a food blob has ended, only that all four directions are empty. It must therefore be using its memory to prepare the correct turn-around. Also, the agent sometimes leaves crumbs of food that it should have noticed. Probably a fast exploration of the maze has higher priority than collecting every bit of food. It’s also likely that it lacks the memory capacity to remember long enough. The videos actually run twice as long as an episode, and the optimization goal is the amount of food collected in the first half of each video-episode. At this point there are usually big food blobs left, which are more important than the crumbs.