Fault Tolerance for High-Performance Computing Clusters and Applications
Summary | As today's distributed commercial and scientific applications grow in complexity and scale, providing fault tolerance becomes increasingly difficult. Faults can arise from multiple sources, such as software bugs, hardware errors, and unexpected runtime conditions, and can affect an application in different phases of its execution. The growing size of the largest supercomputers and data centers on which these applications run poses challenges for fault-tolerance techniques such as checkpointing and fault detection and localization. On one hand, these techniques must scale: they cannot become a bottleneck as the number of processes and the volume of input data increase. On the other hand, the overhead they add must be small enough that fault tolerance ultimately reduces, rather than increases, the end-to-end completion time of user applications.
Achieved Technical Goals |
Publications | |
Future Work | |
Students | |
Code & Data | |
Funding Source | |
465 Northwestern Avenue, West Lafayette, IN 47907 | dcsl@ecn.purdue.edu | +1 765 494 3510 |
Last Update: March 19, 2012 12:15 by GMHoward