Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

dc.creator Morán, Marina
dc.creator Balladini, Javier
dc.creator Rexachs, Dolores
dc.creator Rucci, Enzo
dc.date 2024
dc.date.accessioned 2024-09-04T15:39:28Z
dc.date.available 2024-09-04T15:39:28Z
dc.identifier.uri http://rdi.uncoma.edu.ar/handle/uncomaid/18119
dc.description.abstract Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis because it includes processes that become blocked by communicating indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. es_ES
dc.format application/pdf es_ES
dc.format.extent pp. 1-36 es_ES
dc.language eng es_ES
dc.publisher Elsevier es_ES
dc.relation.uri https://doi.org/10.1016/j.jpdc.2023.104797 es_ES
dc.rights Attribution-NonCommercial-ShareAlike 2.5 Argentina es_ES
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ es_ES
dc.source Journal of Parallel and Distributed Computing, Volume 185, March 2024 es_ES
dc.subject Energy saving es_ES
dc.subject Fault tolerance methods es_ES
dc.subject Checkpoint es_ES
dc.subject Parallel applications es_ES
dc.subject ACPI es_ES
dc.subject DVFS es_ES
dc.subject.other Computer and Information Sciences es_ES
dc.title Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems es_ES
dc.type Article es
dc.type article eu
dc.type acceptedVersion eu
dc.description.fil Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. es_ES
dc.description.fil Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. es_ES
dc.description.fil Fil: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; España. es_ES
dc.description.fil Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina. es_ES
dc.cole Articles es_ES

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 2.5 Argentina.