High-Performance Computing (HPC) has revolutionized science, enabling in silico experiments and analysis of experimental data on an unprecedented scale. Cloud computing is now transforming computing in engineering and business. However, the increase in computing power is not matched by a change in resource management. Supercomputers still use queuing systems that optimize efficiency of the whole system and neglect objectives of individual users. Fairness towards individual users is controlled by many configuration parameters setting bonuses and penalties for certain actions; the influence of these parameters on users' goals is unclear.
In kassate we propose a radically different approach: to use multi-objective scheduling algorithms that guarantee performance and reliability for each user. The objective of the project is to develop mathematical models of resource management in modern computing systems; fair and efficient algorithms; and prototype implementations of the scheduling software.
kassate is funded by the Polish National Science Center through a Sonata grant (July 2013–June 2016).
in silico experiments
Various scientific disciplines test their hypotheses by in silico experiments. For instance, in biology, models are used to describe phenomena ranging from protein folding through individual cells to interactions in whole ecosystems. In chemistry, physics and engineering, computational fluid dynamics (CFD) models are commonly used.
As the quality of sensors increases and their cost decreases, real-life experiments provide more data, and this data must be processed by complex methods to interpret the results of the experiment. For instance, a complete read of an individual's genome produced by a DNA sequencer consists of about 60 million sequences, each having approximately 100 symbols; to derive clinically applicable results, these sequences must be matched with a reference (depending on the case, from thousands to billions of symbols). Similar trends are also present in business: with the increased digitalization of everyday life, more data is available about each one of us; Google, Facebook and similar companies process and correlate the data to build more accurate profiles of their users.
Supercomputers are the only reasonable way to provide the computational power needed by in silico experiments and big data analysis. Modern supercomputers are massively parallel: hundreds to millions of processors (CPUs) are connected by a fast network. Supercomputers are expensive; thus in academia and science it is common for a supercomputer to be shared by many research groups. Sharing a larger supercomputer is more cost-effective than buying a smaller one, as a research group normally produces jobs in bursts rather than in a continuous stream. A common, shared infrastructure averages these bursts. Yet, individual users are rarely satisfied with the perceived performance.
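The averaging effect can be illustrated with a small sketch. The demand figures below are invented for illustration only (they are not measurements from any real system), and the group names are hypothetical. The sketch compares the capacity needed when each group buys a machine sized for its own peak with the capacity needed when all groups share one machine sized for the peak of their combined demand.

```python
# Hypothetical demand (processors requested) of three research groups
# over six time slots. All numbers are made up for illustration.
demand = {
    "group_a": [100, 0, 0, 0, 0, 10],
    "group_b": [0, 0, 90, 0, 0, 0],
    "group_c": [0, 10, 0, 0, 80, 0],
}

# Separate machines: each group must provision for its own peak.
separate_capacity = sum(max(series) for series in demand.values())

# Shared machine: provision for the peak of the summed demand per slot.
shared_capacity = max(sum(slot) for slot in zip(*demand.values()))

print("separate machines:", separate_capacity)  # 270
print("one shared machine:", shared_capacity)   # 100
```

Because the bursts in this toy data do not overlap, the shared machine needs far fewer processors. In general, the peak of the combined demand is never larger than the sum of the individual peaks; how much is saved depends on how often bursts coincide.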
Cloud computing has an enormous impact on the way information is processed in business environments. Rather than managing their own small-scale clusters, companies rent infrastructure from specialized cloud providers. The cloud computing providers operate massive datacenters, gathering hundreds of thousands of individual nodes. Economies of scale make purchasing and operating such large-scale datacenters comparatively cheaper than the small-scale servers that used to be common in enterprise IT. However, because of their scale and the high service level requested by clients, datacenters pose new challenges for the algorithms used to manage resources.
While it has a significant impact on business computing, cloud computing is unsuitable for large-scale scientific computing. Academic supercomputing centers give computational grants to projects based on expected scientific importance. If the projects were to be executed on commercial computational clouds, their budgets, rather than scientific merit, would count. Clouds benefit from economies of scale. However, a large supercomputing center hosting many scientific projects can exploit the same economies of scale. Thus, a standard academic High-Performance Computing (HPC) center will still be a viable solution for scientific projects.
There is a gap between theoretical scheduling models and the complex architectures of modern supercomputers. While many models exist, most theoretically sound works are based on simple models (homogeneous processors, no communication costs, etc.). Most of these problems are NP-hard, starting from the simplest case with two processors and sequential (one-processor) jobs. In contrast, real-life supercomputers typically use simple queueing heuristics coupled with complex rules for prioritization of jobs.
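To make the contrast concrete, here is a minimal sketch for the simplest case mentioned above: sequential jobs on two identical processors, with the goal of minimizing the completion time of the last job (the makespan). The choice of objective, the function names, and the job lengths are assumptions made for illustration; the project text does not prescribe them. The greedy rule shown is classic list scheduling (each job goes to the processor that becomes free first), which stands in for a simple queueing heuristic, and the exhaustive search shows why exact solutions scale poorly.

```python
from itertools import product
import heapq


def list_schedule(jobs, m):
    """Greedy list scheduling: take jobs in the given order and place each
    on the currently least-loaded of m identical processors.
    Returns the makespan (the largest processor load)."""
    loads = [0] * m
    heapq.heapify(loads)
    for length in jobs:
        least = heapq.heappop(loads)          # processor that frees up first
        heapq.heappush(loads, least + length)
    return max(loads)


def optimal_two_processors(jobs):
    """Exact optimum on two processors by trying all 2^n assignments.
    Exponential in the number of jobs, so only usable for tiny inputs."""
    best = sum(jobs)
    for assignment in product((0, 1), repeat=len(jobs)):
        load = [0, 0]
        for length, proc in zip(jobs, assignment):
            load[proc] += length
        best = min(best, max(load))
    return best


jobs = [3, 3, 2, 2, 2]  # hypothetical job lengths
fifo = list_schedule(jobs, 2)                        # queue order
lpt = list_schedule(sorted(jobs, reverse=True), 2)   # longest job first
opt = optimal_two_processors(jobs)

print("queue order:", fifo)    # 7
print("longest first:", lpt)   # 7
print("optimum:", opt)         # 6
```

On this instance both greedy orders finish at time 7, while the optimum is 6 (the two jobs of length 3 on one processor, the three jobs of length 2 on the other). The heuristic runs in O(n log m) time but can miss the optimum; the exact search finds it but doubles its work with every additional job.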
The goal of the project is to study the resource management problems present in modern supercomputing with formal mathematical tools. We aim to build resource management and scheduling algorithms that can be analyzed theoretically for worst-case performance guarantees. We also want to validate our algorithms further by implementing them in real-life schedulers and resource management software and by testing them on supercomputing infrastructure.