Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems


dc.creator Morán, Marina
dc.creator Balladini, Javier
dc.creator Rexachs, Dolores
dc.creator Rucci, Enzo
dc.date 2024
dc.date.accessioned 2024-09-04T15:39:28Z
dc.date.available 2024-09-04T15:39:28Z
dc.identifier.uri http://rdi.uncoma.edu.ar/handle/uncomaid/18119
dc.description.abstract Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid the need for all processes to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communication indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. es_ES
dc.format application/pdf es_ES
dc.format.extent pp. 1-36 es_ES
dc.language eng es_ES
dc.publisher Elsevier es_ES
dc.relation.uri https://doi.org/10.1016/j.jpdc.2023.104797 es_ES
dc.rights Atribución-NoComercial-CompartirIgual 2.5 Argentina es_ES
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ es_ES
dc.source Journal of Parallel and Distributed Computing Volume 185, March 2024 es_ES
dc.subject Energy saving es_ES
dc.subject Fault tolerance methods es_ES
dc.subject Checkpoint es_ES
dc.subject Parallel applications es_ES
dc.subject ACPI es_ES
dc.subject DVFS es_ES
dc.subject.other Computer and Information Sciences es_ES
dc.title Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems es_ES
dc.type Articulo es
dc.type article eu
dc.type acceptedVersion eu
dc.description.fil Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. es_ES
dc.description.fil Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. es_ES
dc.description.fil Fil: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; España. es_ES
dc.description.fil Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina. es_ES
dc.cole Artículos es_ES



Except where otherwise noted, this item's license is described as Atribución-NoComercial-CompartirIgual 2.5 Argentina
