Reinforcement Learning Researcher

About me
I am an associate professor at the Polish Academy of Sciences, where I lead a research group. Currently, I am a visiting professor in the WhiRL group at the University of Oxford. I work in machine learning, focusing on *reinforcement learning*.

**Quick links:**

For the full list, see my Google Scholar profile or arXiv.

Three papers at NeurIPS 2021 (main track) and two at the DRL workshop.

Konrad Czechowski, Tomasz Odrzygóźdź, Marek Zbysiński, Michał Zawalski, Krzysztof Olejnik, Yuhuai Wu, Łukasz Kuciński, Piotr Miłoś

NeurIPS 2021, Mila tea talk

Humans excel in solving complex reasoning tasks through a mental process of moving from one idea to a related one. Inspired by this, we propose the Subgoal Search (kSubS) method. Its key component is a learned subgoal generator that produces a diversity of subgoals that are both achievable and closer to the solution. Using subgoals reduces the search space and induces a high-level search graph suitable for efficient planning. In this paper, we implement kSubS using a transformer-based subgoal module coupled with the classical best-first search framework. We show that a simple approach of generating k-th step ahead subgoals is surprisingly efficient on three challenging domains: two popular puzzle games, Sokoban and the Rubik's Cube, and an inequality proving benchmark, INT. kSubS achieves strong results, including state-of-the-art on INT within a modest computational budget.
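The overall search mechanism can be sketched as follows. This is a simplified illustration, not the paper's code: `generate_subgoals`, `value`, and `low_level_reach` are hypothetical stand-ins for the learned subgoal generator, a value heuristic, and the low-level check that a subgoal is achievable.

```python
import heapq

def subgoal_search(start, is_solved, generate_subgoals, value, low_level_reach, max_nodes=1000):
    """Best-first search over learned subgoals (a kSubS-style sketch).

    generate_subgoals(state) -> candidate states roughly k steps ahead;
    value(state)             -> heuristic score, higher means closer to the solution;
    low_level_reach(a, b)    -> True if a short low-level search reaches b from a.
    """
    frontier = [(-value(start), 0, start)]  # min-heap keyed by negated value
    tie = 1                                 # tie-breaker so states are never compared
    seen = {start}
    while frontier and tie < max_nodes:
        _, _, state = heapq.heappop(frontier)
        if is_solved(state):
            return state
        for sub in generate_subgoals(state):
            # Keep only subgoals that are actually achievable from `state`.
            if sub not in seen and low_level_reach(state, sub):
                seen.add(sub)
                heapq.heappush(frontier, (-value(sub), tie, sub))
                tie += 1
    return None
```

Because the generator proposes states several steps ahead, the induced high-level graph is much smaller than the raw state space, which is where the efficiency gain comes from.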

Maciej Wołczyk, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś

NeurIPS 2021

Continual learning (CL) -- the ability to continuously learn, building on previously acquired knowledge -- is a natural requirement for long-lived autonomous reinforcement learning (RL) agents. While building such agents, one needs to balance opposing desiderata, such as constraints on capacity and compute, the ability to not catastrophically forget, and to exhibit positive transfer on new tasks. Understanding the right trade-off is conceptually and computationally challenging, which we argue has led the community to overly focus on catastrophic forgetting. In response to these issues, we advocate for the need to prioritize forward transfer and propose Continual World, a benchmark consisting of realistic and meaningfully diverse robotic tasks built on top of Meta-World as a testbed. Following an in-depth empirical evaluation of existing CL methods, we pinpoint their limitations and highlight unique algorithmic challenges in the RL setting. Our benchmark aims to provide a meaningful and computationally inexpensive challenge for the community and thus help better understand the performance of existing and future solutions.

Łukasz Kuciński, Tomasz Korbak, Paweł Kołodziej, Piotr Miłoś

NeurIPS 2021

Communication is compositional if complex signals can be represented as a combination of simpler subparts. In this paper, we theoretically show that inductive biases on both the training framework and the data are needed to develop compositional communication. Moreover, we prove that compositionality spontaneously arises in signaling games where agents communicate over a \emph{noisy channel}. We experimentally confirm that a range of noise levels, which depends on the model and the data, indeed promotes compositionality. Finally, we provide a comprehensive study of this dependence and report results in terms of recently studied compositionality metrics: topographical similarity, conflict count, and context independence.

Michał Zawalski, Błażej Osiński, Henryk Michalewski, Piotr Miłoś

AAMAS 2022 (extended abstract), NeurIPS Deep RL workshop 2021

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite apparent similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded -- we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results on some of them.
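The off-policy correction at the heart of V-Trace, which MA-Trace carries over to the multi-agent setting, can be sketched as follows for a single trajectory. The argument names are illustrative, not the paper's API.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-Trace-style off-policy corrected value targets (single-trajectory sketch).

    rewards, values, rhos: arrays of length T, where rhos[t] = pi(a_t|x_t) / mu(a_t|x_t)
    is the likelihood ratio between the target policy pi and the behavior policy mu
    that generated the data (e.g. on a different worker).
    """
    T = len(rewards)
    clipped_rho = np.minimum(rho_bar, rhos)   # truncated IS weights for the TD error
    clipped_c = np.minimum(c_bar, rhos)       # truncated "trace-cutting" weights
    next_values = np.append(values[1:], bootstrap)
    deltas = clipped_rho * (rewards + gamma * next_values - values)
    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):              # backward recursion over the trajectory
        acc = deltas[t] + gamma * clipped_c[t] * acc
        vs[t] = values[t] + acc
    return vs
```

The truncation (`rho_bar`, `c_bar`) is what keeps the correction stable when worker policies lag behind the learner, which is why the computation distributes with no impact on training quality.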

Piotr Januszewski, Mateusz Olko, Michał Królikowski, Jakub Swiatkowski, Marcin Andrychowicz, Łukasz Kuciński, Piotr Miłoś

NeurIPS Deep RL workshop 2021

The growth of deep reinforcement learning (RL) has brought multiple exciting tools and methods to the field. This rapid expansion makes it important to understand the interplay between individual elements of the RL toolbox. We approach this task from an empirical perspective by conducting a study in the continuous control setting. We present multiple insights of a fundamental nature, including: the commonly used additive action noise is not required for effective exploration and can even hinder training; the performance of policies trained using existing methods varies significantly across training runs, epochs of training, and evaluation runs; the critics' initialization plays the major role in ensemble-based actor-critic exploration, while the training is mostly invariant to the actors' initialization; a strategy based on posterior sampling explores better than the approximated UCB combined with the weighted Bellman backup; the weighted Bellman backup alone cannot replace clipped double Q-learning. In conclusion, we show how existing tools can be brought together in a novel way, giving rise to the Ensemble Deep Deterministic Policy Gradients (ED2) method, which yields state-of-the-art results on continuous control tasks from \mbox{OpenAI Gym MuJoCo}. From the practical side, ED2 is conceptually straightforward, easy to code, and does not require knowledge outside of the existing RL toolbox.
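The posterior-sampling style of exploration mentioned above can be sketched as follows. This is a hypothetical illustration, not the authors' ED2 code: one ensemble member is drawn at random and the action it scores highest is executed, so exploration comes from disagreement between differently initialized critics rather than from additive action noise.

```python
import numpy as np

def ensemble_act(actor, critics, state, rng):
    """Posterior-sampling exploration with a critic ensemble (illustrative sketch).

    actor(state)      -> a list of candidate actions;
    critics           -> list of Q-functions, critics[k](state, action) -> float.
    """
    k = rng.integers(len(critics))          # one random member acts as a posterior sample
    candidates = actor(state)
    scores = [critics[k](state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```

Keeping the sampled critic fixed for a whole episode (rather than per step) is what makes the exploration temporally coherent.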

Piotr Kozakowski, Mikołaj Pacek, Piotr Miłoś

In this paper, we present the Adaptive Entropy Tree Search (ANTS) algorithm. ANTS builds on recent successes of maximum entropy planning while mitigating its arguably major drawback: sensitivity to the temperature setting. We endow ANTS with a mechanism that adapts the temperature to match a given range of action selection entropy in the nodes of the planning tree. With this mechanism, the ANTS planner enjoys remarkable hyper-parameter robustness, achieves high scores on the Atari benchmark, and is a capable component of a planning-learning loop akin to AlphaZero. We believe that all these features make ANTS a compelling choice for a general planner for complex tasks.
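The adaptation idea can be sketched as a simple feedback loop (a hypothetical illustration, not the paper's exact mechanism): raise the temperature when the action-selection entropy falls below the target range, and lower it when the entropy overshoots.

```python
import numpy as np

def softmax_entropy(q, tau):
    """Entropy (in nats) of a softmax policy over Q-values at temperature tau."""
    z = q / tau
    z -= z.max()                            # numerical stability
    p = np.exp(z)
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def adapt_temperature(q_values, ent_lo, ent_hi, tau=1.0, factor=1.5, steps=50):
    """Adjust tau until the entropy lands inside [ent_lo, ent_hi] (sketch)."""
    q = np.asarray(q_values, dtype=float)
    for _ in range(steps):
        h = softmax_entropy(q, tau)
        if h < ent_lo:
            tau *= factor                   # flatter distribution -> more entropy
        elif h > ent_hi:
            tau /= factor                   # sharper distribution -> less entropy
        else:
            break
    return tau
```

Targeting an entropy *range* rather than a fixed temperature is what removes the need to hand-tune the temperature per environment.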

Błażej Osiński, Piotr Miłoś, Adam Jakubowski, Paweł Zięcina, Michał Martyniak, Christopher Galias, Antonia Breuer, Silviu Homoceanu, Henryk Michalewski

Autonomous Driving Workshop NeurIPS 2020

This work introduces interactive traffic scenarios in the CARLA simulator, which are based on real-world traffic. We concentrate on tactical tasks lasting several seconds, which are especially challenging for current control methods. The CARLA Real Traffic Scenarios (CRTS) is intended to be a training and testing ground for autonomous driving systems. To this end, we open-source the code under a permissive license and present a set of baseline policies. CRTS combines the realism of traffic scenarios and the flexibility of simulation. We use it to train agents using a reinforcement learning algorithm. We show how to obtain competitive policies and evaluate experimentally how observation types and reward schemes affect the training process and the resulting agent's behavior.

Konrad Czechowski, Tomasz Odrzygóźdź, Michał Izworski, Marek Zbysiński, Łukasz Kuciński, Piotr Miłoś

IJCNN 2021, DRL Workshop, NeurIPS 2020

Planning in large state spaces inevitably needs to balance the depth and breadth of the search, a trade-off that has a crucial impact on performance and that most planners manage only implicitly. We present a novel method, *Shoot Tree Search (STS)*, which makes it possible to control this trade-off more explicitly. Our algorithm can be understood as an interpolation between two celebrated search mechanisms: MCTS and random shooting. It also lets the user control the bias-variance trade-off, akin to $TD(n)$, but in the tree search context. In experiments on challenging domains, we show that STS can get the best of both worlds, consistently achieving higher scores.
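The "shooting" half of the interpolation can be sketched as follows. This is an illustrative fragment under assumed interfaces (`env_step(state, action) -> (next_state, reward)` is a hypothetical model API): a tree leaf is evaluated by averaging the returns of several random rollouts, exactly the estimate that random shooting uses and that a tree search can plug in at its leaves.

```python
import random

def shoot_value(env_step, state, horizon, n_shoots, rng):
    """Monte Carlo "shoot" estimate of a leaf's value: mean return of
    n_shoots random rollouts of length `horizon`."""
    total = 0.0
    for _ in range(n_shoots):
        s, ret = state, 0.0
        for _ in range(horizon):
            a = rng.choice([0, 1])          # random action, as in random shooting
            s, r = env_step(s, a)
            ret += r
        total += ret
    return total / n_shoots
```

Varying the rollout length against the tree depth is one concrete knob for trading breadth (many cheap shoots) against depth (a deeper tree), in the spirit of the $TD(n)$ analogy above.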

Piotr Januszewski, Konrad Czechowski, Piotr Kozakowski, Łukasz Kuciński, Piotr Miłoś

IJCNN 2021, DRL Workshop, NeurIPS 2020


Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, Henryk Michalewski

ICLR 2020 (spotlight), also Generative Modeling and Model-Based Reasoning for Robotics and AI Workshop, ICML 2019

Our work advances the state-of-the-art in model-based reinforcement learning by introducing a system that, to our knowledge, is the first to successfully handle a variety of challenging games in the ALE benchmark. To that end, we experiment with several stochastic video prediction techniques, including a novel model based on discrete latent variables. We also present an approach, called Simulated Policy Learning (SimPLe), that utilizes these video prediction techniques and can train a policy to play the game within the learned model. With several iterations of dataset aggregation, where the policy is deployed to collect more data in the original game, we can learn a policy that, for many games, can successfully play the game in the real environment (see videos on the project webpage).
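The high-level shape of the SimPLe training loop described above can be written down schematically. The function arguments here are placeholders, not the paper's actual API.

```python
def simple_loop(collect, train_world_model, train_policy_in_model, policy, iterations=15):
    """Schematic SimPLe loop: alternate between learning a world model from real
    data, training the policy entirely inside that model, and deploying the
    improved policy to aggregate more real data."""
    data = collect(policy)                         # initial real-environment data
    for _ in range(iterations):
        model = train_world_model(data)            # learn a video-prediction model
        policy = train_policy_in_model(model, policy)  # RL inside the learned model
        data = data + collect(policy)              # dataset aggregation in the real game
    return policy
```

The dataset-aggregation step is essential: without fresh real data the policy would exploit inaccuracies of the learned model instead of improving in the real game.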

In our empirical evaluation, we find that SimPLe is significantly more sample-efficient than a highly tuned version of the state-of-the-art Rainbow algorithm on almost all games. In particular, in the low-data regime of $100$k samples, on more than half of the games our method achieves a score that Rainbow needs at least twice as many samples to match. In the best case, Freeway, our method is more than $10\times$ more sample-efficient.

Piotr Miłoś, Łukasz Kuciński, Konrad Czechowski, Piotr Kozakowski, Maciek Klimek

Uncertainty and Robustness in Deep Learning Workshop, ICML 2020

We propose a reinforcement learning framework for discrete environments in which an agent makes both strategic and tactical decisions. The former manifests itself through the use of a value function, while the latter is powered by a tree search planner. These tools complement each other: the planning module performs a local *what-if* analysis, which allows the agent to avoid tactical pitfalls and boosts backups of the value function, while the value function, being global in nature, compensates for the inherent locality of the planner. To further solidify this synergy, we introduce an exploration mechanism with two distinctive components: uncertainty modelling and risk measurement. To model the uncertainty we use value function ensembles, and to reflect risk we propose several functionals that summarize the uncertainty implied by the ensemble. We show that our method performs well on hard exploration environments: Deep-sea, toy Montezuma's Revenge, and Sokoban. In all cases, we obtain a speed-up in learning and a boost in performance.
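One plausible risk functional of the kind mentioned above is a mean-plus-deviation summary of the value ensemble. This is an illustrative choice, not necessarily one of the exact functionals used in the paper.

```python
import numpy as np

def risk_adjusted_value(ensemble_q, kappa=1.0):
    """Optimistic summary of an ensemble of value estimates: ensemble
    disagreement (standard deviation) acts as an exploration bonus."""
    q = np.asarray(ensemble_q, dtype=float)
    return float(q.mean() + kappa * q.std())
```

With `kappa > 0` the agent is drawn toward states the ensemble disagrees about (optimism under uncertainty); `kappa < 0` would instead yield risk-averse behavior.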

Błażej Osiński, Adam Jakubowski, Paweł Zięcina, Piotr Miłoś, Christopher Galias, Silviu Homoceanu, Henryk Michalewski

ICRA 2020, also Autonomous Driving Workshop, NeurIPS 2019

We use synthetic data and a reinforcement learning algorithm to train a driving system controlling a full-size real-world vehicle in a number of restricted driving scenarios. The driving policy uses RGB images as input.

We show how design decisions about perception, control and training impact the real-world performance.

Tomasz Korbak, Julian Zubek, Łukasz Kuciński, Piotr Miłoś, Joanna Rączaszek-Leonardi

NeurIPS 2019, Emergent Communication: Towards Natural Language Workshop

This paper explores a novel approach to achieving emergent compositional communication in multi-agent systems. We propose a training regime implementing *template transfer*, the idea of carrying over learned biases across contexts. In our method, a sender--receiver pair is first trained with disentangled loss functions and then the receiver is *transferred* to train a new sender with a standard loss. Unlike other methods (e.g. the obverter algorithm), our approach does not require imposing inductive biases on the architecture of the agents. We experimentally show the emergence of compositional communication using topographical similarity, zero-shot generalization and context independence as evaluation metrics. The presented approach is connected to an important line of work in semiotics and developmental psycholinguistics: it supports a conjecture that compositional communication is scaffolded on simpler communication protocols.
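The two-stage regime can be summarized schematically. The functions below are placeholders standing in for the training procedures, not the authors' code.

```python
def template_transfer(train_pair, train_sender, make_sender, make_receiver):
    """Schematic of the template-transfer regime: stage 1 pretrains a
    sender-receiver pair with disentangled losses; stage 2 freezes the trained
    receiver and uses it to train a fresh sender with a standard loss."""
    sender0, receiver = train_pair(make_sender(), make_receiver())  # stage 1
    sender1 = train_sender(make_sender(), receiver)                 # stage 2: receiver transferred
    return sender1, receiver
```

The compositional bias lives in the transferred receiver, which is why no architectural inductive bias on the agents is needed.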

Henryk Michalewski, Michał Gramulewicz, Piotr Miłoś

Deep RL Workshop NeurIPS 2018

We propose an expert-augmented actor-critic algorithm, which we evaluate on two environments with sparse rewards: Montezuma's Revenge and a demanding maze from the ViZDoom suite. In the case of Montezuma's Revenge, an agent trained with our method achieves very good results, consistently scoring above 27,000 points (in many experiments beating the first world). With an appropriate choice of hyperparameters, our algorithm surpasses the performance of the expert data. In a number of experiments, we observed a previously unreported bug in Montezuma's Revenge which allowed the agent to score more than 800,000 points.
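One natural way to combine the two signals is a policy-gradient objective augmented with a behavioral-cloning term on expert transitions. This is an illustrative sketch of the general idea, not the paper's exact objective.

```python
import numpy as np

def expert_augmented_loss(policy_logp, advantages, expert_logp, expert_coef=0.1):
    """Actor loss = policy-gradient term + expert imitation term (sketch).

    policy_logp, advantages: log-probs and advantages on the agent's own data;
    expert_logp: log-probs the current policy assigns to expert actions.
    """
    policy_logp = np.asarray(policy_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    expert_logp = np.asarray(expert_logp, dtype=float)
    pg_loss = -np.mean(policy_logp * advantages)   # standard actor-critic term
    bc_loss = -np.mean(expert_logp)                # raise likelihood of expert actions
    return float(pg_loss + expert_coef * bc_loss)
```

With sparse rewards the imitation term supplies gradient signal long before the agent collects any reward of its own, which is the point of the augmentation.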

Jakob E. Björnberg, Michał Kotowski, Benjamin Lees, Piotr Miłoś

Electronic Journal of Probability, 24, 2019

We consider an extension of the interchange process on the complete graph, in which a fraction of the transpositions are replaced by `reversals'. The model is motivated by statistical physics, where it plays a role in stochastic representations of $XXZ$-models. We prove convergence to $PD(1/2)$ of the rescaled cycle sizes, above the critical point for the appearance of macroscopic cycles. This extends a result of Schramm on convergence to $PD(1)$ for the usual interchange process.

Radosław Adamczak, Michał Kotowski, Piotr Miłoś

Ann. Inst. H. Poincaré Probab. Statist. 57, 2021

We study a family of random permutation models on the $2$-dimensional Hamming graph $H(2,n)$, containing the interchange process and the cycle-weighted interchange process with parameter $\theta>0$. This family contains the random representation of the quantum Heisenberg ferromagnet. We show that in these models the cycle structure of permutations undergoes a phase transition -- when the number of transpositions defining the permutation is at most $cn^2$, for small enough $c>0$, all cycles are microscopic, while for more than $Cn^2$ transpositions, for large enough $C>0$, macroscopic cycles emerge with high probability.

We provide bounds on values $C,c$ depending on the parameter $\theta$ of the model, in particular for the interchange process we pinpoint exactly the critical time of the phase transition. Our results imply also the existence of a phase transition in the quantum Heisenberg ferromagnet on $H(2,n)$, namely for low enough temperatures spontaneous magnetization occurs, while it is not the case for high temperatures.

At the core of our approach is a novel application of the cyclic random walk, which might be of independent interest. By analyzing explorations of the cyclic random walk, we show that sufficiently long cycles of a random permutation are uniformly spread on the graph, which makes it possible to compare our models to the mean-field case, i.e., the interchange process on the complete graph, extending the approach used earlier by Schramm.

Henryk Michalewski, Piotr Miłoś, Błażej Osiński

NIPS 2017 Learning to Run challenge (6th place)

In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.

Rafał Marks, Piotr Miłoś

Consider a branching system with particles moving according to an Ornstein-Uhlenbeck process with drift $\mu>0$ and branching according to a law in the domain of attraction of the $(1+\beta)$-stable distribution. The mean of the branching law is strictly larger than $1$ implying that the system is supercritical and the total number of particles grows exponentially at some rate $\lambda>0$.

It is known that the system obeys a law of large numbers. In the paper we study its rate of convergence.

We discover an interesting interplay between the branching rate $\lambda$ and the drift parameter $\mu$. There are three regimes of the second-order behavior:

* small branching, $\lambda<(1+1/\beta)$: the speed of convergence is the same as in the stable central limit theorem, but the limit is affected by the dependence between particles;
* critical branching, $\lambda=(1+1/\beta)$: the dependence becomes strong enough to make the rate of convergence slightly smaller, yet the qualitative behavior still resembles the stable central limit theorem;
* large branching, $\lambda>(1+1/\beta)$: the dependence manifests much more profoundly, the rate of convergence is substantially smaller, and, strikingly, the limit holds a.s.

Maciej Klimek, Henryk Michalewski, Piotr Miłoś

Proceedings of the 1st Annual Conference on Robot Learning, PMLR 78, 2017

In this work we introduce and evaluate a model of Hierarchical Reinforcement Learning with Parameters. In the first stage we train agents to execute relatively simple actions like reaching or gripping. In the second stage we train a hierarchical manager to compose these actions to solve more complicated tasks. The manager may pass parameters to the agents, thus controlling the details of the undertaken actions. The hierarchical approach with parameters can be used with any optimization algorithm.

In this work we adapt the methods described in Trust Region Policy Optimization to our setting. We show that their theoretical foundation, including the monotonicity of improvements, still holds. We experimentally compare hierarchical reinforcement learning with the standard, non-hierarchical approach and conclude that hierarchical learning with parameters is a viable way to improve the final results and the stability of learning.
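A single control step of the two-level scheme can be sketched as follows (hypothetical interfaces): the manager picks which low-level skill to run and which parameters to pass to it.

```python
def hierarchical_step(manager, sub_policies, state):
    """One step of parameterized hierarchical control (illustrative sketch).

    manager(state)      -> (skill_id, params): which skill to invoke and how;
    sub_policies[i]     -> callable (state, params) -> action.
    """
    skill_id, params = manager(state)
    return sub_policies[skill_id](state, params)
```

Passing `params` is what distinguishes this from plain options-style hierarchies: the manager controls not just *which* skill runs, but the details of *how* it runs.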

Piotr Miłoś, Batı Şengül

Electronic Journal of Probability 24, 2019

The interchange process on a finite graph is obtained by placing a particle on each vertex of the graph, then at rate $1$, selecting an edge uniformly at random and swapping the two particles at either end of this edge. In this paper we develop new techniques to show the existence of a phase transition of the interchange process on the $2$-dimensional Hamming graph. We show that in the subcritical phase, all of the cycles of the process have length $O(\log n)$, whereas in the supercritical phase a positive density of vertices lie in cycles of length at least $n^{2-\epsilon}$ for any $\epsilon>0$.
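The dynamics described above are simple enough to simulate directly. The sketch below is a discrete-time version of the rate-$1$ dynamics (at each step one uniformly chosen edge swaps its endpoints' particles), together with the cycle statistic whose phase transition the paper studies.

```python
import random

def interchange_process(edges, n_vertices, steps, rng):
    """Simulate the interchange process: start from the identity configuration
    and repeatedly swap the particles at the endpoints of a random edge."""
    perm = list(range(n_vertices))          # particle currently at each vertex
    for _ in range(steps):
        u, v = rng.choice(edges)
        perm[u], perm[v] = perm[v], perm[u]
    return perm

def cycle_lengths(perm):
    """Cycle structure of the resulting permutation."""
    seen, out = set(), []
    for i in range(len(perm)):
        if i in seen:
            continue
        j, length = i, 0
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        out.append(length)
    return out
```

On the Hamming graph $H(2,n)$ the theorem above says that for few swaps all cycles returned by `cycle_lengths` stay of order $\log n$, while past the critical number of swaps cycles of length at least $n^{2-\epsilon}$ appear.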

Bastien Mallein, Piotr Miłoś

Stoch. Proc. Appl. 129, 2019

The behavior of the maximal displacement of a supercritical branching random walk has been a subject of intense studies for a long time. But only recently the case of time-inhomogeneous branching has gained focus. The contribution of this paper is to analyze a time-inhomogeneous model with two levels of randomness. In the first step a sequence of branching laws is sampled independently according to a distribution on the set of point measures' laws. Conditionally on the realization of this sequence (called environment) we define a branching random walk and find the asymptotic behavior of its maximal particle. It is of the form $V_n-\varphi \log n + o_{\mathbf{P}}(\log n)$, where $V_n$ is a function of the environment that behaves as a random walk and $\varphi>0$ is a deterministic constant, which turns out to be bigger than the usual logarithmic correction of the homogeneous branching random walk.

I co-organize the Reinforcement Learning Seminar at IMPAN. Everyone is warmly welcome to attend, also remotely. After the seminar, participants are invited for informal discussions.

Contact me if you want to be added to the seminar's mailing list!

COVID-19 update: the seminar went fully online.

Ready to start an exciting RL project? I am constantly looking for postdocs, Ph.D. students, and master's students. Please contact me by email.

Students' research projects

Send me an email and I'll get back to you as soon as possible!

pmilos (at) mimuw.edu.pl