Improving Application Resilience through Probabilistic Task Replication

Authors: , John Dorband

Book Title: ACM International Workshop on Algorithmic and Application Error Resilience

Date:

Abstract: Maintaining performance in a faulty distributed computing environment is a major challenge in the design of future peta- and exa-scale class systems. Better defining application resilience as a function of scale is key to developing reliable software systems and programming methodologies. This paper defines the resilience of a task as the survivability of that task, i.e., how likely it is to survive until it completes. Resilience varies with mean time to failure (MTTF) and inversely with runtime. We develop an approach for defining a resilience index (RI) for applications running on a system with a fixed MTTF. Our approach, inspired by radioactive decay, defines an application as a collection of tasks, which we model as particles with an exponential decay rate and therefore a measurable half-life. We determine the probability of a given number of task failures for an application using a Poisson distribution over the interval of the task lifetime. Further, we have developed a distributed runtime system, ARRIA, that measures both system reliability and application performance at runtime and schedules and replicates tasks based on the probability of failure and expected runtime. We demonstrate that the index can help to better define the tradeoffs facing the designers of future systems and developers of parallel software. Thus, we propose a formulation of application resilience that results in a resilience index and evaluate some initial and fundamental properties of the index as they relate to application performance on high-performance computing systems composed of many components, each with varying degrees of reliability.
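
A minimal sketch of the decay model the abstract describes, assuming a constant per-task failure rate lambda = 1/MTTF (the notation below is ours, not necessarily the paper's):

% Survival probability of a task over its runtime t (exponential decay)
P_{\mathrm{survive}}(t) = e^{-\lambda t}, \qquad \lambda = \frac{1}{\mathrm{MTTF}}

% Corresponding task half-life
t_{1/2} = \frac{\ln 2}{\lambda} = \mathrm{MTTF} \cdot \ln 2

% Probability of k failures during a task lifetime of length t (Poisson)
P(k \text{ failures in } [0, t]) = \frac{(\lambda t)^{k} e^{-\lambda t}}{k!}

Under these assumptions, resilience improves as MTTF grows and degrades as task runtime t grows, consistent with the abstract's statement that resilience varies with MTTF and inversely with runtime.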

Type: Proceedings

Tags: fault tolerance, high performance computing, reliability, resilience, scheduling

