# Piotr Miłoś

Reinforcement Learning Researcher

I am an assistant professor in the Faculty of Mathematics, Informatics and Mechanics, University of Warsaw. I was a post-doc at the Université de Genève working under the supervision of prof. Yvan Velenik. Further, I was a post-doc at the Prob-Lab working with dr Simon Harris and prof. Andreas Kyprianou.

I specialise in probability and stochastic modelling, particularly in problems arising in mathematical physics.

Recently, I have been working in machine learning, focusing on reinforcement learning. I am the PI of a 500k RL research grant funded by the National Science Center. Contact me if you are interested in Ph.D./post-doc positions.

## Selected Publications

The site is under development. For the whole list see my Google Scholar profile or arXiv.

##### Model Based Reinforcement Learning for Atari

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, Henryk Michalewski

(submitted)

Our work advances the state-of-the-art in model-based reinforcement learning by introducing a system that, to our knowledge, is the first to successfully handle a variety of challenging games in the ALE benchmark. To that end, we experiment with several stochastic video prediction techniques, including a novel model based on discrete latent variables. We also present an approach, called Simulated Policy Learning (SimPLe), that utilizes these video prediction techniques and can train a policy to play the game within the learned model. With several iterations of dataset aggregation, where the policy is deployed to collect more data in the original game, we can learn a policy that, for many games, can successfully play the game in the real environment (see videos on the project webpage).
In our empirical evaluation, we find that SimPLe is significantly more sample-efficient than a highly tuned version of the state-of-the-art Rainbow algorithm on almost all games. In particular, in the low-data regime of $100$k samples, on more than half of the games, our method achieves a score that Rainbow needs at least twice as many samples to reach. In the best case of Freeway, our method is more than $10\times$ more sample-efficient.

##### Expert-augmented actor-critic for ViZDoom and Montezuma's Revenge

Henryk Michalewski, Michał Gramulewicz, Piotr Miłoś

Deep RL Workshop NeurIPS 2018

We propose an expert-augmented actor-critic algorithm, which we evaluate on two environments with sparse rewards: Montezuma's Revenge and a demanding maze from the ViZDoom suite. In the case of Montezuma's Revenge, an agent trained with our method achieves very good results, consistently scoring above 27,000 points (in many experiments beating the first world). With an appropriate choice of hyperparameters, our algorithm surpasses the performance of the expert data. In a number of experiments, we have observed an unreported bug in Montezuma's Revenge which allowed the agent to score more than 800,000 points.

##### The interchange process with reversals on the complete graph

Jakob E. Björnberg, Michał Kotowski, Benjamin Lees, Piotr Miłoś

(submitted)

We consider an extension of the interchange process on the complete graph, in which a fraction of the transpositions are replaced by 'reversals'. The model is motivated by statistical physics, where it plays a role in stochastic representations of $XXZ$-models. We prove convergence to $PD(1/2)$ of the rescaled cycle sizes, above the critical point for the appearance of macroscopic cycles. This extends a result of Schramm on convergence to $PD(1)$ for the usual interchange process.

##### Phase transition for the interchange and quantum Heisenberg models on the Hamming graph

(submitted)

We study a family of random permutation models on the $2$-dimensional Hamming graph $H(2,n)$, containing the interchange process and the cycle-weighted interchange process with parameter $\theta>0$. This family contains the random representation of the quantum Heisenberg ferromagnet. We show that in these models the cycle structure of permutations undergoes a phase transition: when the number of transpositions defining the permutation is $\leq cn^2$, for small enough $c>0$, all cycles are microscopic, while for $\geq Cn^2$ transpositions, for large enough $C>0$, macroscopic cycles emerge with high probability.
We provide bounds on values $C,c$ depending on the parameter $\theta$ of the model, in particular for the interchange process we pinpoint exactly the critical time of the phase transition. Our results imply also the existence of a phase transition in the quantum Heisenberg ferromagnet on $H(2,n)$, namely for low enough temperatures spontaneous magnetization occurs, while it is not the case for high temperatures.
At the core of our approach is a novel application of the cyclic random walk, which might be of independent interest. By analyzing explorations of the cyclic random walk, we show that sufficiently long cycles of a random permutation are uniformly spread on the graph, which makes it possible to compare our models to the mean-field case, i.e., the interchange process on the complete graph, extending the approach used earlier by Schramm.

##### Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments. Proximal Policy Optimization with Policy Blending

Henryk Michalewski, Piotr Miłoś, Błażej Osiński

NIPS 2017 Learning to Run challenge (6th place)

In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.

##### CLT for supercritical branching processes with heavy-tailed branching law

Rafał Marks, Piotr Miłoś

(submitted)

Consider a branching system with particles moving according to an Ornstein-Uhlenbeck process with drift $\mu>0$ and branching according to a law in the domain of attraction of the $(1+\beta)$-stable distribution. The mean of the branching law is strictly larger than $1$ implying that the system is supercritical and the total number of particles grows exponentially at some rate $\lambda>0$.
It is known that the system obeys a law of large numbers. In the paper we study its rate of convergence.
We discover an interesting interplay between the branching rate $\lambda$ and the drift parameter $\mu$. There are three regimes of the second order behavior:

- small branching, $\lambda<(1+1/\beta)$: the speed of convergence is the same as in the stable central limit theorem, but the limit is affected by the dependence between particles.
- critical branching, $\lambda=(1+1/\beta)$: the dependence becomes strong enough to make the rate of convergence slightly smaller, yet the qualitative behaviour still resembles the stable central limit theorem.
- large branching, $\lambda>(1+1/\beta)$: the dependence manifests much more profoundly, the rate of convergence is substantially smaller and, strangely, the limit holds a.s.

##### Hierarchical Reinforcement Learning with Parameters

Maciej Klimek, Henryk Michalewski, Piotr Miłoś

Proceedings of the 1st Annual Conference on Robot Learning, PMLR 78:301-313, 2017

In this work we introduce and evaluate a model of Hierarchical Reinforcement Learning with Parameters. In the first stage we train agents to execute relatively simple actions like reaching or gripping. In the second stage we train a hierarchical manager to compose these actions to solve more complicated tasks. The manager may pass parameters to agents thus controlling details of undertaken actions. The hierarchical approach with parameters can be used with any optimization algorithm.
In this work we adapt to our setting methods described in Trust Region Policy Optimization. We show that their theoretical foundation, including monotonicity of improvements, still holds. We experimentally compare the hierarchical reinforcement learning with the standard, non-hierarchical approach and conclude that the hierarchical learning with parameters is a viable way to improve final results and stability of learning.

##### Existence of a phase transition of the interchange process on the Hamming graph

Piotr Miłoś, Batı Şengül

Accepted to Electronic Journal of Probability

The interchange process on a finite graph is obtained by placing a particle on each vertex of the graph, then at rate $1$, selecting an edge uniformly at random and swapping the two particles at either end of this edge. In this paper we develop new techniques to show the existence of a phase transition of the interchange process on the $2$-dimensional Hamming graph. We show that in the subcritical phase, all of the cycles of the process have length $O(\log n)$, whereas in the supercritical phase a positive density of vertices lie in cycles of length at least $n^{2-\epsilon}$ for any $\epsilon>0$.
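
The construction described above can be illustrated with a small simulation. The following is only a minimal sketch and is not taken from the paper: the function names (`random_hamming_edge`, `interchange_permutation`, `cycle_lengths`) are hypothetical, and the continuous-time clock is replaced by a fixed number of swaps `num_swaps`, which is a simplifying assumption. Vertices of the $2$-dimensional Hamming graph are taken to be pairs $(i,j)$ with $0\leq i,j<n$, and two vertices are adjacent when they differ in exactly one coordinate.

```python
import random


def random_hamming_edge(n, rng):
    """Pick a uniformly random edge of H(2, n) (requires n >= 2).

    An edge joins two vertices (i, j) that differ in exactly one coordinate.
    """
    i, j = rng.randrange(n), rng.randrange(n)
    axis = rng.randrange(2)  # which coordinate to change
    old = i if axis == 0 else j
    new = rng.randrange(n - 1)
    if new >= old:  # skip the current value so that new != old
        new += 1
    u = (i, j)
    v = (new, j) if axis == 0 else (i, new)
    return u, v


def interchange_permutation(n, num_swaps, seed=0):
    """Apply num_swaps random edge transpositions, starting from the identity.

    Assumption: a fixed number of swaps stands in for the continuous-time clock.
    """
    rng = random.Random(seed)
    # perm[v] is the label of the particle currently sitting at vertex v.
    perm = {(i, j): (i, j) for i in range(n) for j in range(n)}
    for _ in range(num_swaps):
        u, v = random_hamming_edge(n, rng)
        perm[u], perm[v] = perm[v], perm[u]
    return perm


def cycle_lengths(perm):
    """Return the cycle lengths of a permutation given as a dict, longest first."""
    seen = set()
    lengths = []
    for start in perm:
        if start in seen:
            continue
        length = 0
        v = start
        while v not in seen:
            seen.add(v)
            v = perm[v]
            length += 1
        lengths.append(length)
    return sorted(lengths, reverse=True)


if __name__ == "__main__":
    n = 30
    for swaps in (n * n // 10, 5 * n * n):
        lengths = cycle_lengths(interchange_permutation(n, swaps, seed=1))
        print(swaps, "swaps -> longest cycles:", lengths[:3])
```

Running the script for a small and a large number of swaps gives a rough, purely illustrative picture of how the longest cycles change; it says nothing rigorous about the critical point.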

##### Maximal displacement of a supercritical branching random walk in a time-inhomogeneous random environment

Bastien Mallein, Piotr Miłoś

Stochastic Processes and their Applications, 2018

The behavior of the maximal displacement of a supercritical branching random walk has been a subject of intense studies for a long time. But only recently the case of time-inhomogeneous branching has gained focus. The contribution of this paper is to analyze a time-inhomogeneous model with two levels of randomness. In the first step a sequence of branching laws is sampled independently according to a distribution on the set of point measures' laws. Conditionally on the realization of this sequence (called environment) we define a branching random walk and find the asymptotic behavior of its maximal particle. It is of the form $V_n-\varphi \log n + o_{\mathbf{P}}(\log n)$, where $V_n$ is a function of the environment that behaves as a random walk and $\varphi>0$ is a deterministic constant, which turns out to be bigger than the usual logarithmic correction of the homogeneous branching random walk.

##### Brownian motion and Random Walk above Quenched Random Wall

Bastien Mallein, Piotr Miłoś

Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 2018

We study the persistence exponent for the first passage time of a random walk below the trajectory of another random walk. More precisely, let $\{B_n\}$ and $\{W_n\}$ be two centered, weakly dependent random walks. We establish that $\mathbb{P}(\forall_{n\leq N} B_n\geq W_n|W)=N^{-\gamma+o(1)}$ for a non-random $\gamma \geq 1/2$. In the classical setting, $W_n \equiv 0$, it is well-known that $\gamma=1/2$. We prove that for any non-trivial $W$ one has $\gamma>1/2$ and the exponent $\gamma$ depends only on $\text{Var}(B_1)/\text{Var}(W_1)$.
Our result holds also in the continuous setting, when $B$ and $W$ are independent and possibly perturbed Brownian motions or Ornstein-Uhlenbeck processes. In the latter case the probability decays at exponential rate.

## Teaching

Ready to change your life and start a new, exciting RL project? I am looking for Ph.D. and master's students.

## Let's Get In Touch!

Send me an email and I'll get back to you as soon as possible!

pmilos (at) mimuw.edu.pl