# Piotr Miłoś

Reinforcement Learning Researcher

I am an associate professor at the Institute of Mathematics, Polish Academy of Sciences, and a senior data scientist at deepsense.ai.

I started my scientific career specialising in probability and stochastic modelling, particularly in problems arising in mathematical physics.

Recently, I have been working in machine learning, focusing on reinforcement learning. I am the PI of an RL research grant funded by the National Science Center. Contact me if you are interested in Ph.D. or post-doc positions.

I am a co-host of the reinforcement learning seminar and co-organize the reinforcement learning course (see there also for a selection of students' research projects).

## Selected Publications

The site is under development. For the whole list see my Google Scholar profile or arXiv.

##### Emergence of compositional language in communication through noisy channel

Łukasz Kuciński, Paweł Kołodziej, Piotr Miłoś

Language in Reinforcement Learning, ICML 2020

In this paper, we investigate how communication through a noisy channel can lead to the emergence of compositional language. Our approach is end-to-end, allows for different inductive biases on the agents’ architecture, and trains without periodic resets of the networks’ weights. This relaxes some of the assumptions in recently developed methods. The impact on the structure of the resulting language is shown in the context of signaling games. We also develop a new metric for measuring the degree of compositionality.

##### Model Based Reinforcement Learning for Atari

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, Henryk Michalewski

ICLR 2020 (spotlight), also Generative Modeling and Model-Based Reasoning for Robotics and AI Workshop, ICML 2019

Our work advances the state-of-the-art in model-based reinforcement learning by introducing a system that, to our knowledge, is the first to successfully handle a variety of challenging games in the ALE benchmark. To that end, we experiment with several stochastic video prediction techniques, including a novel model based on discrete latent variables. We also present an approach, called Simulated Policy Learning (SimPLe), that utilizes these video prediction techniques and can train a policy to play the game within the learned model. With several iterations of dataset aggregation, where the policy is deployed to collect more data in the original game, we can learn a policy that, for many games, can successfully play the game in the real environment (see videos on the project webpage).
In our empirical evaluation, we find that SimPLe is significantly more sample-efficient than a highly tuned version of the state-of-the-art Rainbow algorithm on almost all games. In particular, in the low-data regime of $100$k samples, on more than half of the games, our method achieves a score that Rainbow needs at least twice as many samples to reach. In the best case of Freeway, our method is more than $10\times$ more sample-efficient.

##### Uncertainty-sensitive Learning and Planning with Ensembles

Piotr Miłoś, Łukasz Kuciński, Konrad Czechowski, Piotr Kozakowski, Maciek Klimek

Uncertainty and Robustness in Deep Learning Workshop, ICML 2020

We propose a reinforcement learning framework for discrete environments in which an agent makes both strategic and tactical decisions. The former manifests itself through the use of a value function, while the latter is powered by a tree search planner. These tools complement each other. The planning module performs a local what-if analysis, which allows the agent to avoid tactical pitfalls and boosts backups of the value function. The value function, being global in nature, compensates for the inherent locality of the planner. In order to further solidify this synergy, we introduce an exploration mechanism with two distinctive components: uncertainty modelling and risk measurement. To model the uncertainty we use value function ensembles, and to reflect risk we propose several functionals that summarize the uncertainty implied by the ensemble. We show that our method performs well on hard exploration environments: Deep-sea, toy Montezuma’s Revenge, and Sokoban. In all cases, we obtain a speed-up in learning and a boost in performance.
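The idea of summarizing a value-function ensemble with a risk functional can be illustrated with a minimal sketch. The functional used here (ensemble mean plus `kappa` times the ensemble standard deviation) is an assumption chosen for illustration, not necessarily one of the functionals from the paper. The names `risk_adjusted_value`, `pick_action` and `kappa` are hypothetical.

```python
from statistics import mean, pstdev


def risk_adjusted_value(ensemble_values, kappa=1.0):
    """Summarize an ensemble of value estimates as mean + kappa * std.

    Illustrative assumption: kappa > 0 rewards disagreement between
    ensemble members (optimistic exploration), kappa < 0 penalizes it.
    """
    return mean(ensemble_values) + kappa * pstdev(ensemble_values)


def pick_action(ensemble_values_per_action, kappa=1.0):
    """Return the action whose ensemble has the highest risk-adjusted value."""
    scores = {
        action: risk_adjusted_value(values, kappa)
        for action, values in ensemble_values_per_action.items()
    }
    return max(scores, key=scores.get)


# Hypothetical estimates: both actions have the same mean value,
# but the ensemble disagrees only about "right".
estimates = {
    "left": [1.0, 1.0, 1.0],
    "right": [0.0, 0.5, 2.5],
}

optimistic_choice = pick_action(estimates, kappa=1.0)
pessimistic_choice = pick_action(estimates, kappa=-1.0)
```

With a positive `kappa` the agent is drawn to the action the ensemble is unsure about, while a negative `kappa` makes it avoid that action.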

##### Simulation-based reinforcement learning for real-world autonomous driving

Błażej Osiński, Adam Jakubowski, Paweł Zięcina, Piotr Miłoś, Christopher Galias, Silviu Homoceanu, Henryk Michalewski

ICRA 2020, also NeurIPS 2019, Autonomous Driving Workshop

We use synthetic data and a reinforcement learning algorithm to train a driving system controlling a full-size real-world vehicle in a number of restricted driving scenarios. The driving policy uses RGB images as input.
We show how design decisions about perception, control and training impact the real-world performance.

##### Developmentally motivated emergence of compositional communication via template transfer

Tomasz Korbak, Julian Zubek, Łukasz Kuciński, Piotr Miłoś, Joanna Rączaszek-Leonardi

NeurIPS 2019, Emergent Communication: Towards Natural Language Workshop

This paper explores a novel approach to achieving emergent compositional communication in multi-agent systems. We propose a training regime implementing template transfer, the idea of carrying over learned biases across contexts. In our method, a sender-receiver pair is first trained with disentangled loss functions and then the receiver is transferred to train a new sender with a standard loss. Unlike other methods (e.g. the obverter algorithm), our approach does not require imposing inductive biases on the architecture of the agents. We experimentally show the emergence of compositional communication using topographical similarity, zero-shot generalization and context independence as evaluation metrics. The presented approach is connected to an important line of work in semiotics and developmental psycholinguistics: it supports a conjecture that compositional communication is scaffolded on simpler communication protocols.

##### Expert-augmented actor-critic for ViZDoom and Montezumas Revenge

Henryk Michalewski, Michał Gramulewicz, Piotr Miłoś

Deep RL Workshop NeurIPS 2018

We propose an expert-augmented actor-critic algorithm, which we evaluate on two environments with sparse rewards: Montezuma's Revenge and a demanding maze from the ViZDoom suite. In the case of Montezuma's Revenge, an agent trained with our method achieves very good results, consistently scoring above 27,000 points (in many experiments beating the first world). With an appropriate choice of hyperparameters, our algorithm surpasses the performance of the expert data. In a number of experiments, we have observed an unreported bug in Montezuma's Revenge which allowed the agent to score more than 800,000 points.

##### The interchange process with reversals on the complete graph

Jakob E. Björnberg, Michał Kotowski, Benjamin Lees, Piotr Miłoś

accepted to EJP

We consider an extension of the interchange process on the complete graph, in which a fraction of the transpositions are replaced by 'reversals'. The model is motivated by statistical physics, where it plays a role in stochastic representations of $XXZ$-models. We prove convergence to $PD(1/2)$ of the rescaled cycle sizes, above the critical point for the appearance of macroscopic cycles. This extends a result of Schramm on convergence to $PD(1)$ for the usual interchange process.

##### Phase transition for the interchange and quantum Heisenberg models on the Hamming graph

accepted to AIHP

We study a family of random permutation models on the $2$-dimensional Hamming graph $H(2,n)$, containing the interchange process and the cycle-weighted interchange process with parameter $\theta>0$. This family contains the random representation of the quantum Heisenberg ferromagnet. We show that in these models the cycle structure of permutations undergoes a phase transition: when the number of transpositions defining the permutation is $\leq cn^2$, for small enough $c>0$, all cycles are microscopic, while for $\geq Cn^2$ transpositions, for large enough $C>0$, macroscopic cycles emerge with high probability.
We provide bounds on the values of $C$ and $c$ depending on the parameter $\theta$ of the model; in particular, for the interchange process we pinpoint exactly the critical time of the phase transition. Our results also imply the existence of a phase transition in the quantum Heisenberg ferromagnet on $H(2,n)$: spontaneous magnetization occurs at low enough temperatures, but not at high temperatures.
At the core of our approach is a novel application of the cyclic random walk, which might be of independent interest. By analyzing explorations of the cyclic random walk, we show that sufficiently long cycles of a random permutation are uniformly spread on the graph, which makes it possible to compare our models to the mean-field case, i.e., the interchange process on the complete graph, extending the approach used earlier by Schramm.

##### Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments. Proximal Policy Optimization with Policy Blending

Henryk Michalewski, Piotr Miłoś, Błażej Osiński

NIPS 2017 Learning to Run challenge (6th place)

In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.

##### CLT for supercritical branching processes with heavy-tailed branching law

Rafał Marks, Piotr Miłoś

(submitted)

Consider a branching system with particles moving according to an Ornstein-Uhlenbeck process with drift $\mu>0$ and branching according to a law in the domain of attraction of the $(1+\beta)$-stable distribution. The mean of the branching law is strictly larger than $1$, implying that the system is supercritical and the total number of particles grows exponentially at some rate $\lambda>0$.
It is known that the system obeys a law of large numbers. In the paper we study its rate of convergence.
We discover an interesting interplay between the branching rate $\lambda$ and the drift parameter $\mu$. There are three regimes of the second order behavior:
⋅ small branching, $\lambda<(1+1/\beta)\mu$: the speed of convergence is the same as in the stable central limit theorem, but the limit is affected by the dependence between particles.
⋅ critical branching, $\lambda=(1+1/\beta)\mu$: the dependence becomes strong enough to make the rate of convergence slightly smaller, yet the qualitative behaviour still resembles the stable central limit theorem.
⋅ large branching, $\lambda>(1+1/\beta)\mu$: the dependence manifests much more profoundly, the rate of convergence is substantially smaller and, strangely, the limit holds a.s.

##### Hierarchical Reinforcement Learning with Parameters

Maciej Klimek, Henryk Michalewski, Piotr Miłoś

Proceedings of the 1st Annual Conference on Robot Learning, PMLR 78:301-313, 2017

In this work we introduce and evaluate a model of Hierarchical Reinforcement Learning with Parameters. In the first stage we train agents to execute relatively simple actions like reaching or gripping. In the second stage we train a hierarchical manager to compose these actions to solve more complicated tasks. The manager may pass parameters to agents thus controlling details of undertaken actions. The hierarchical approach with parameters can be used with any optimization algorithm.
In this work we adapt to our setting methods described in Trust Region Policy Optimization. We show that their theoretical foundation, including monotonicity of improvements, still holds. We experimentally compare the hierarchical reinforcement learning with the standard, non-hierarchical approach and conclude that the hierarchical learning with parameters is a viable way to improve final results and stability of learning.

##### Existence of a phase transition of the interchange process on the Hamming graph

Piotr Miłoś, Batı Şengül

Electronic Journal of Probability, Vol. 24, paper no. 64, 2019

The interchange process on a finite graph is obtained by placing a particle on each vertex of the graph, then at rate $1$, selecting an edge uniformly at random and swapping the two particles at either end of this edge. In this paper we develop new techniques to show the existence of a phase transition of the interchange process on the $2$-dimensional Hamming graph. We show that in the subcritical phase, all of the cycles of the process have length $O(\log n)$, whereas in the supercritical phase a positive density of vertices lie in cycles of length at least $n^{2-\epsilon}$ for any $\epsilon>0$.
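The process described above can be simulated with a short sketch. It makes two simplifying assumptions: time is measured by the number of swaps rather than by a continuous-time clock, and the graph is the $2$-dimensional Hamming graph $H(2,n)$, whose vertices are pairs $(i,j)$ and whose edges join pairs differing in exactly one coordinate. The function names are hypothetical.

```python
import random


def interchange_permutation(n, num_swaps, rng):
    """Apply num_swaps uniformly random edge transpositions on H(2, n).

    Requires n >= 2. Returns a dict mapping each vertex to the label of
    the particle that currently occupies it.
    """
    perm = {(i, j): (i, j) for i in range(n) for j in range(n)}
    for _ in range(num_swaps):
        # A uniform vertex, a uniform coordinate and a uniform different
        # value in that coordinate give a uniform edge (the graph is regular).
        u = (rng.randrange(n), rng.randrange(n))
        coord = rng.randrange(2)
        new_val = rng.randrange(n - 1)
        if new_val >= u[coord]:
            new_val += 1
        v = (new_val, u[1]) if coord == 0 else (u[0], new_val)
        # Swap the particles at the two ends of the edge.
        perm[u], perm[v] = perm[v], perm[u]
    return perm


def cycle_lengths(perm):
    """Return the cycle lengths of the permutation, largest first."""
    seen = set()
    lengths = []
    for start in perm:
        if start in seen:
            continue
        length = 0
        v = start
        while v not in seen:
            seen.add(v)
            v = perm[v]
            length += 1
        lengths.append(length)
    return sorted(lengths, reverse=True)


if __name__ == "__main__":
    n = 30
    for swaps in (n * n // 4, n * n, 4 * n * n):
        lengths = cycle_lengths(interchange_permutation(n, swaps, random.Random(0)))
        print(swaps, "swaps, longest cycle:", lengths[0])
```

Running the script for growing numbers of swaps gives a rough empirical feel for how the longest cycle changes. The swap counts in the usage example are arbitrary and are not the critical values from the paper.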

##### Maximal displacement of a supercritical branching random walk in a time-inhomogeneous random environment

Bastien Mallein, Piotr Miłoś

Stochastic Processes and their Applications, Vol. 129, Issue 9, p. 3239-3260, 2019

The behavior of the maximal displacement of a supercritical branching random walk has been a subject of intense studies for a long time. But only recently the case of time-inhomogeneous branching has gained focus. The contribution of this paper is to analyze a time-inhomogeneous model with two levels of randomness. In the first step a sequence of branching laws is sampled independently according to a distribution on the set of point measures' laws. Conditionally on the realization of this sequence (called environment) we define a branching random walk and find the asymptotic behavior of its maximal particle. It is of the form $V_n-\varphi \log n + o_{\mathbf{P}}(\log n)$, where $V_n$ is a function of the environment that behaves as a random walk and $\varphi>0$ is a deterministic constant, which turns out to be bigger than the usual logarithmic correction of the homogeneous branching random walk.

## Reinforcement learning seminar

I co-organize the Reinforcement Learning Seminar at IMPAN. Everyone is warmly welcome to attend, also remotely via Hangouts. After the seminar, participants are invited to informal discussions.
Write to me if you want to be added to the seminar's mailing list!

COVID-19 update: the seminar has moved fully online.

## Teaching

Ready to change your life and start a new, exciting RL project? I am looking for postdocs, Ph.D. and master's students, and other collaborators.
DeepMind's scholarships
Reinforcement learning course
Students' research projects

## Let's Get In Touch!

Send me an email and I'll get back to you as soon as possible!

pmilos (at) mimuw.edu.pl