Modern large-scale computational resources - supercomputers and data centers (the backbone of cloud computing) - are a crucial element of our increasingly digital world. This infrastructure hosts services we all use daily, such as online maps, weather forecasts, or popular websites; it also acts as a primary research instrument, running numerical models in computational biology, chemistry, and physics. These platforms are massively parallel: composed of thousands of individual machines and shared by tens to thousands of jobs at any given time. Key to their efficiency is the scheduler, the software that allocates jobs to machines. Scheduler policies are currently a mixture of optimization, heuristics, and rules of thumb - an approach that is neither scientific nor optimal.
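To make the scheduler's role concrete, here is a minimal sketch of one classic allocation heuristic, Longest Processing Time (LPT) list scheduling on identical machines; this is a textbook baseline for illustration, not an algorithm from the project itself:

```python
import heapq

def lpt_schedule(job_lengths, num_machines):
    """Greedy LPT list scheduling: assign each job, longest first,
    to the machine that is currently least loaded.
    Returns the makespan and the per-machine job assignment."""
    machines = [(0.0, i) for i in range(num_machines)]  # (load, machine id)
    heapq.heapify(machines)
    assignment = {i: [] for i in range(num_machines)}
    for job_id, length in sorted(enumerate(job_lengths), key=lambda x: -x[1]):
        load, m = heapq.heappop(machines)   # least-loaded machine
        assignment[m].append(job_id)
        heapq.heappush(machines, (load + length, m))
    makespan = max(load for load, _ in machines)
    return makespan, assignment

# Five jobs on two machines; LPT places the longest jobs first.
makespan, assignment = lpt_schedule([3, 5, 2, 7, 4], 2)
```

Even this simple greedy rule guarantees a makespan within 4/3 of the optimum on identical machines; the project targets settings where such clean classical guarantees no longer directly apply.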

Our research hypothesis is that the modern computational infrastructure can be managed in a significantly more efficient way through a wider application of optimization methods.

The principal objective of the project is to develop theoretically sound and practically relevant results on how a scheduler should manage the new aspects of modern infrastructure, such as: performance degradation resulting from colocating many jobs on a single node in data centers; jobs with ; or supercomputers with storage hierarchies (burst buffers).
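As an illustration of the colocation aspect, consider a toy degradation model (our own simplification for exposition, not the project's actual formulation): a job's effective runtime grows with the number of jobs sharing its node, here linearly with a hypothetical per-neighbor slowdown factor:

```python
from collections import Counter

def colocated_runtimes(base_times, assignment, degradation=0.2):
    """Toy colocation model: with k jobs sharing a node, each of them
    runs in base_time * (1 + degradation * (k - 1)).
    `assignment[j]` is the node hosting job j; the slowdown factor
    and the linear form are illustrative assumptions."""
    per_node = Counter(assignment)   # how many jobs share each node
    return [t * (1 + degradation * (per_node[node] - 1))
            for t, node in zip(base_times, assignment)]

# Two jobs sharing node 0, one running alone on node 1:
# the colocated jobs slow down by 20%; the isolated job is unaffected.
runtimes = colocated_runtimes([10.0, 10.0, 10.0], [0, 0, 1])
```

A scheduler that ignores such interference underestimates runtimes exactly on the most heavily packed nodes, which is why degradation-aware allocation models are worth studying in their own right.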

The project "Optimization Methods for Allocation of Large-Scale Computational Resources" is funded by the Polish National Science Center through an Opus grant (October 2018 - August 2021, UMO-2017/25/B/ST6/00116).


Although most of our research effort will be concentrated on theoretical approaches, we plan to employ the full spectrum of computer science methods. Inspired by systems problems, we strive for mathematically sound theoretical results extending scheduling, algorithmics, and game theory. Our goal is to formulate efficient algorithms that can be implemented and then validated through simulation on relevant traces and statistical analysis of the results.

We formulate combinatorial optimization models that emphasize the new features by minimally extending the classic models. We analyze their computational complexity and propose exact or approximation algorithms. We then extend these algorithms to cope with more standard aspects and test their average-case behavior through simulations on available logs from supercomputers and data centers. We perform the simulations first in a custom, event-driven simulator, and then in standard, open-source resource managers, which we extend with prototype implementations of our algorithms.
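The core loop of such an event-driven simulator can be sketched in a few lines; the following is a minimal illustration using first-come-first-served (FCFS) dispatch on identical machines, not the project's actual simulator:

```python
import heapq

def simulate_fcfs(jobs, num_machines):
    """Minimal event-driven simulation of FCFS scheduling on identical
    machines. `jobs` is a list of (arrival_time, runtime) pairs,
    sorted by arrival; returns each job's completion time in order."""
    free_at = [0.0] * num_machines   # next time each machine becomes free
    heapq.heapify(free_at)
    completions = []
    for arrival, runtime in jobs:
        earliest = heapq.heappop(free_at)   # machine that frees up soonest
        start = max(arrival, earliest)      # strict FCFS: no backfilling
        end = start + runtime
        completions.append(end)
        heapq.heappush(free_at, end)
    return completions

# Three jobs on one machine: arrivals 0, 1, 2 with runtimes 2, 2, 1.
completions = simulate_fcfs([(0, 2), (1, 2), (2, 1)], 1)  # [2, 4, 5]
```

Replaying a real supercomputer log through such a loop, with the dispatch rule swapped for a candidate algorithm, is exactly what makes average-case comparison of scheduling policies possible before touching a production resource manager.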

Expected impact

Research, business, and our daily lives are increasingly digital, and large-scale computational resources are what runs them. Humanity gathers ever more data; the computational models used by modern biology, chemistry, and physics are increasingly detailed and cover systems of ever larger scale. We thus all need ever more computational power to process it in time.

Supercomputers and data centers are expensive to build and to operate, so any gain in efficiency directly translates into reduced operational costs and, consequently, wider availability. In supercomputers, less idle time means faster results, shorter research cycles and, therefore, more meaningful science. In data centers, reduced performance degradation leads to a platform that is more robust and more stable, and thus suitable for even more applications.

Our results - efficient allocation algorithms with a strong theoretical basis - will demonstrate how to use these crucial resources more efficiently. In particular, we plan to show how data centers can colocate more jobs with less performance degradation, and how supercomputers can use burst buffers to speed up individual jobs without introducing idle time.