Pioneering the Way: A Conversation with Franck Cappello, Charles Babbage Award Recipient

By IEEE Computer Society Team on March 12, 2024

Franck Cappello, a pioneer in the field of high-performance computing and currently R&D lead and senior computer scientist at Argonne National Laboratory, has built a long record of accomplishments over his career. With a Ph.D. from the University of Paris XI, Cappello embarked on a journey of groundbreaking research at the French National Center for Scientific Research (CNRS), where he worked on hybrid parallel programming and Desktop Grids. While at Inria, the French National Institute for Research in Digital Science and Technology, he led the R&D phase of Grid’5000, a large-scale experimental platform for parallel and distributed computing that is still used today. Cappello's contributions also extend to the United States, where he established the Joint Laboratory for Extreme Scale Computing (JLESC) at the University of Illinois, one of the largest and longest-lasting collaborative centers in supercomputing. Recognized as an IEEE Fellow and a recipient of numerous prestigious awards, Cappello has made contributions that span scientific domains, marking him as a luminary in the field.

In honor of his many achievements, he has received the 2024 IEEE CS Charles Babbage Award for “... pioneering contributions and inspiring leadership in distributed computing, high-performance computing, resilience, and data reduction.”

"Before responding to the questions, I would like to recognize all the researchers, postdocs, and students I have had the chance to work and collaborate with. The results (publications, software, events, etc.) for every research topic I initiated and developed could not have come from a single person. The main credit for the success of these different research directions should go to the colleagues, students, and friends who were involved in them," said Franck Cappello.

You’ve received various awards such as the Euro-Par Achievement Award, the IEEE TCPP Outstanding Service Award, and more recently, the IEEE CS Charles Babbage Award. How do you think achievements such as this can inspire the next generation of researchers?

Awards are important for researchers because they are decided by peers. They are external evaluations of a researcher’s achievement, contribution, and impact on her/his community; and they also mark the significance of topics in a research community. They are extremely encouraging and rewarding for the person who receives the award. Moreover, and maybe more important, awards work as beacons. For early- and mid-career researchers alike, the existence of awards is inspiring and reassures them that their work is on the right track, despite occasional setbacks. I never thought I would get one of these awards, and I hope that my experiences can be inspiring to others: I was not particularly good at school. The schools I attended were not the best ones, by far. I did not have linear academic training. I changed topics several times, from heavy electric machinery to electronics to computer science – not the straight curriculum that one can imagine for a researcher receiving several major awards. If I had only one message to send to early career researchers, I would say: “The level of achievement, contributions, and impact that these awards reflect is attainable: if I did it, you can do it!”


Your work has been fantastic for the community. You’ve made impactful contributions within the fields of high-performance computing, distributed computing, resilience, and data reduction! Tell us about a pivotal moment or project that shaped your path to becoming a leader in these fields.

Retrospectively, I think the easiest way to become a leader in some fields is to invest in a nascent domain where the methodology, techniques, and results are not substantially developed. It was true for all the research directions I worked on: experimental platforms for parallel and distributed computing, fault tolerance/resilience for high-performance parallel computing, and scientific data compression. When the field is almost empty, one can make a lot of impactful contributions. Of course, this is a high-risk high-reward game. So intuition and luck are at play here because just analyzing trends is not enough. The community may focus on one topic for several years and suddenly switch to some other topic. So identifying nascent and long-lasting topics is challenging. An interesting aspect is that if the field is almost empty, the rigorous research/evaluation methodology may not even be established yet. I believe that establishing methods is as important as, if not more important than, contributions addressing research problems when the methodology is established. I had three main pivotal moments in my career.

The first one was when I realized with other researchers in the French community working on Grid computing in 2000 that it was extremely difficult to perform controllable and reproducible experiments in large-scale distributed computing on production platforms. This community was previously focused on cluster computing, where the researchers had full control of the clusters as they ran their parallel/distributed computing experiments. They could make sure that they ran their experiments in isolation and with a carefully designed software stack. They essentially lost all of that capability of controlling and reproducing experiments when they switched their research topic to Grid computing and performed experiments on deployed Grid platforms. That’s when we realized that new research equipment was needed, and we launched Grid’5000.

My second pivotal moment was in 2007-2008. My team was already working on fault tolerance for Grid computing. But when the endeavor of making exascale a reality started, resilience was identified as one of the main challenges to solve to enable exascale parallel executions to complete and produce correct results. Performance projections and imbalance of the different parts of the future exascale system were making single-level synchronous checkpointing (the classic approach of parallel computing fault tolerance) impractical. There was also a fear that the level of silent data corruption would be so high that executions would produce incorrect results. These two aspects fueled over a decade of research in my teams and also in the HPC community. This resulted in the VeloC application multilevel asynchronous checkpointing environment developed during the DOE/NNSA Exascale Computing Project and deployed in several exascale systems.

The third pivotal moment was when I realized that, in the context of scientific data, the research topic of silent data corruption detection shared a fundamental aspect with lossy compression: they both alter scientific data, and for both, we need to design techniques to ensure that the same scientific outcomes could be derived as from non-altered data. At the same time, it became clear that a new generation of supercomputers (exascale) and instruments (update of light sources, for example) will generate so much data that there will not be enough storage space and network bandwidth to store and communicate the data. This situation motivated my research over the past several years on the SZ series of lossy compressors, now widely used in production in diverse areas.

How do centers such as the one you’ve established, the Joint Laboratory on Extreme-Scale Computing, build community and foster connections within the supercomputing domain?

The Joint Laboratory on Extreme-Scale Computing (JLESC) aims to develop collaborations between supercomputing centers and research institutions. The current configuration gathers seven members: Inria, UIUC/NCSA, Argonne National Laboratory, Barcelona Supercomputing Centre, Jülich Supercomputing Center, RIKEN Center for Computational Science, and University of Tennessee at Knoxville. The JLESC organizes an annual workshop where 100-120 people from the seven institutions meet to present their current research interests and report on their existing collaborations. The program is specifically organized to encourage and help develop collaborations between institutions. An important part of these workshops is the short talk presentations during which researchers, postdocs, and students present their latest results. These sessions are followed by one-on-one discussions between speakers and other researchers interested in exploring the potential of collaboration. Another important part of the workshop is the break-out sessions, where researchers gather to focus on a specific open topic, for example, quantum computing for HPC or AI foundation models for sciences. The researchers identify gaps and potential collaborations on these topics. That’s how JLESC joint projects start. In addition to workshops, the JLESC supports research visits where researchers, postdocs, and students from one institution visit another institution to work on identified joint research projects. The JLESC started in 2014, so it will celebrate its 10th anniversary in Kobe during the 2024 workshop. The JLESC is the expansion of another, smaller, joint laboratory between Inria and UIUC/NCSA that started in 2009. Together, these 15 years represent probably the longest-lasting international collaboration in computer science.

Being someone who has fostered the supercomputing community, what advice do you have for emerging researchers seeking to engage in collaborative initiatives?

The most common way to establish collaborations is to attend conferences and events where researchers, postdocs, and students present their research. Establishing contacts with other researchers can be difficult for a junior researcher, especially if the researcher one wants to approach is a well-known international researcher. Career level, international visibility, and mother tongue differences can be daunting. Of course, having a colleague “break the ice” through an appropriate introduction can be a help. However, I have found that, in general, other researchers are interested in talking to emerging researchers. My recommendation here is: “Just go for it! We have been in your shoes before.”

One way to establish larger collaboration communities is to organize tutorials, topic-specific workshops, and birds-of-a-feather sessions at conferences. The IEEE/ACM SC conference would be a good place to organize such events for emerging research in high-performance computing.

In my experience, the most impactful and long-lasting collaborations were/are around new bold ideas, not incremental ones. Collaborations on bold ideas in unexplored territories, essentially what I mentioned before as nascent research fields, are at the same time hard to establish (the initiator has to convince other researchers that the idea is sound and promising) and easy to develop (once the idea has attracted enough researchers – gained a critical mass – it is easier to attract even more researchers). This type of collaboration also has the potential to make the initiator the leader of the collaboration and leader of the field if the bold idea turns out to be successful. So my recommendation to early career researchers here is: “If you have bold ideas, don’t censor yourself. Get a few compelling use cases, and test your bold idea with fellow researchers.” Generally, bold ideas do not need a long presentation/explanation. They need to be clear, revolutionary, challenging, and rewarding (if successful).

As the leader of the Clusters and Grids team at the French National Center for Scientific Research (CNRS), you’ve made major contributions to the field of hybrid parallel programming. How have these contributions influenced the landscape of parallel programming, and what lessons did you learn from that experience?

The main contribution of the Clusters and Grids team to hybrid parallel programming was a paper published at the 2000 ACM/IEEE conference on Supercomputing. The fundamental contribution of this paper was not a tool or an algorithm. It was the analysis and understanding of the factors impacting the performance of a hybrid programming model using message passing and on-node parallelism. The conclusions of the paper stay true today and will likely remain so as long as parallel computing uses hybrid parallelism. The paper was the first to demonstrate that the superiority of hybrid programming models depends on (1) the level of shared-memory model parallelization (on-node parallelism), (2) the communication patterns (and their implementations), and (3) the memory access patterns. Before the SC2000 paper, several papers in the literature presented the implementation of applications using a hybrid model. However, none of them provided a deep understanding of why this model is better or worse than a unified one (MPI alone). Compared with the few papers on small-scale systems published before, this paper laid the groundwork for thinking about hybrid programming models mixing message passing and shared memory in parallel high-performance computers. The main lesson I learned from this experience is the unpredictability of a paper’s immediate and long-lasting impact. All I remember from the presentation at SC2000 is the extreme stress I felt as an early career researcher (6 years after my Ph.D.) presenting a paper at the flagship conference of a domain when the room was full of people, some of them standing in the back and many others trying to find a place. This was intimidating. I was certainly not prepared for that! The lesson I learned, however, was that good ideas deserve to be told, and people want to hear them. I am also pleasantly surprised that the paper is still cited 24 years after its publication.

Grid’5000, the experimental platform for parallel and distributed computing that you led at Inria, has had a lasting impact with over 2000 scientific publications! How did you address challenges during the R&D phase to ensure its success?

Grid’5000 was one of the first computer science testbeds designed as an experimental facility. Our model was physics instrument facilities. The fundamental question we had to solve was how to transform pieces of computing hardware into a secure and safe research platform that could enable hundreds of users (researchers and students) to perform controllable and reproducible experiments in distributed and parallel computing. One of the main challenges was that Grid’5000 was supposed to be a large-scale distributed system itself. Several sites were identified in different locations in France to run Grid’5000 nodes. The design of the integrated infrastructure was done with many Inria, university, and CNRS researchers and engineers in France. There was also an important partnership with RENATER, the national research and education network in France. The Grid’5000 project had many facets: buying the hardware equipment, developing the secured/safe network connection between the nodes (sites), and developing the whole software stack. In many respects, building Grid’5000 was a research project in itself. We had many meetings and started the development with a prototype made of scavenged PCs. We progressed gradually, running experiments while we were developing the testbed, learning during that process, and adapting the design based on lessons learned. The development of Grid’5000 also raised substantial coordination challenges to ensure continuous progressive development. Frankly speaking, I don’t think that anyone who participated in the early years of this fantastic adventure ever thought that it could last that long (more than 20 years) and have the level of impact it has enjoyed. In addition to the 2000+ publications, I would emphasize the 300+ Ph.D. theses that relied on Grid’5000.

Lossy compression for scientific data is an area where you've made significant contributions, including the development of the SZ lossy compressor. What motivated your interest in this field, and how do you see lossy compression shaping the handling of scientific data in the future?

When we started exploring lossy compression for scientific data in 2015, we had a fairly good understanding of how silent data corruptions (SDCs) affect the results of numerical simulations. We also developed techniques to detect and mitigate SDCs. It appeared that one of these techniques (the use of data prediction schemes) is commonly used in lossy compressors. I was also aware from the research on fault tolerance that exascale applications will generate extreme volumes of data. So the connection was simple: use our understanding of data prediction schemes and the impacts of data errors to develop lossy compressors for scientific data. In some sense, with SDC research we were suffering from externally generated errors and we needed to find a solution to detect and correct them, whereas for lossy compression we became the generator of errors that we could control – a much better position to be in! Initially, I thought that this would be an interesting research topic that could produce 2-5 papers. Only after a thorough analysis of the state of the art did I realize that this domain was almost an empty field. That observation changed my perception of the potential of this research direction. We now have published about 100 papers and won multiple awards including an R&D 100 on this topic. Another important point that makes lossy compression for scientific data so interesting is that it is a rich topic involving signal analysis, information theory, mathematics, computer science (including AI and optimization methods), and potentially all domain sciences because lossy compression affects scientific data and users need to understand its effect on their analysis.
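The connection described above, a data predictor combined with an error that the compressor itself controls, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the SZ implementation: it uses a simple previous-value predictor and uniform quantization of the prediction residual, and the names `compress`, `decompress`, and `error_bound` are hypothetical. Production compressors add far more elaborate predictors and an entropy-coding stage on the integer codes.

```python
import math


def compress(values, error_bound):
    """Turn a 1-D sequence of floats into small integer codes.

    Each value is predicted from the previously *reconstructed* value,
    and only the quantized prediction residual is kept.
    """
    step = 2.0 * error_bound  # quantization bin width
    codes = []
    prediction = 0.0  # assumed starting prediction for the first value
    for x in values:
        code = round((x - prediction) / step)
        codes.append(code)
        # Track what the decompressor will see, so errors do not accumulate.
        prediction = prediction + code * step
    return codes


def decompress(codes, error_bound):
    """Rebuild the approximate values from the integer codes."""
    step = 2.0 * error_bound
    restored = []
    prediction = 0.0
    for code in codes:
        prediction = prediction + code * step
        restored.append(prediction)
    return restored


# A smooth signal, loosely standing in for one simulation variable.
signal = [math.sin(0.01 * i) for i in range(1000)]
eb = 1e-3

codes = compress(signal, eb)
restored = decompress(codes, eb)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

Because the predictor works from reconstructed values rather than original ones, the pointwise error stays within the chosen bound (up to floating-point rounding) instead of drifting. On smooth data the codes are small integers, which is what makes a later entropy-coding stage effective.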

Lossy compression for scientific data is developing rapidly, and the research is supported by funding agencies (e.g., DOE, NSF) and industry. More teams are engaging in this topic every year. There is no sign that future generations of supercomputers and scientific instrument facilities will stop generating more data. On the contrary, new use cases of lossy compression for scientific data appear quickly. Some of the most recent ones are lossy compression between the CPU/GPU and memory, lossy compression of MPI communications, and lossy compression for AI (e.g., federated learning). I believe that this topic will develop further also because initial users are getting more used to it and are looking at extending its application to other scenarios. Moreover, important gaps still exist. For example, lossy compression of unstructured scientific data is a rather unexplored open research problem. Also, as scientists start using lossy compression on their data, they develop a better understanding of what quantities of interest (QOI) they want to keep and to what accuracy. I would say that several fundamental results are missing, such as the lossy compressibility bound of scientific data (considering given constraints in QOI preservation). Fundamentally, we should preserve only the data that supports the scientific information needed by the user and at the necessary and sufficient accuracy level. However, this would mean that we understand the notion of scientific information that resides in the data, which is not the case, so lossy compression for scientific data will see many flourishing years ahead!

More About Franck Cappello

Franck Cappello received his Ph.D. from the University of Paris XI in 1994 and joined the French National Center for Scientific Research (CNRS), where he led the Clusters and Grids team with contributions on hybrid parallel programming (MPI+OpenMP) and desktop Grids. In 2003, he moved to Inria, where he led until 2008 the R&D phase of Grid’5000, a large-scale experimental platform for parallel and distributed computing. Still actively used, Grid’5000 has produced more than 2000 scientific publications and served hundreds of researchers and Ph.D. students. In 2009, as a visiting research professor at the University of Illinois, Cappello (with Marc Snir) established the Joint Laboratory on Petascale Computing, now the Joint Laboratory on Extreme Scale Computing.

One of the largest and longest-lasting collaborative centers in supercomputing, it has supported hundreds of researchers and students on topics related to scientific computing, high-performance computing, and artificial intelligence. Cappello has also made seminal contributions to high-performance computing resilience, leading research activities in fault-tolerant protocols, checkpointing, silent data corruption detection, and failure prediction. As a member of the International Exascale Software Project, he led the roadmap and strategy efforts for work related to resilience at the extreme scale, and he subsequently became director of the VeloC checkpointing software project as part of the DOE Exascale Computing Project (ECP). Moreover, Cappello has become a leading figure in lossy compression for scientific data, directing the development of the SZ lossy compressor, the Z-checker compression error assessment tool, and the SDRBench repository of reference scientific datasets, all parts of the DOE ECP. Cappello is an IEEE Fellow and the recipient of the 2024 Euro-Par Achievement Award, the 2022 HPDC Achievement Award, two R&D 100 awards (2019 and 2021), the 2018 IEEE TCPP Outstanding Service Award, and the 2021 IEEE Transactions on Computers Award for Editorial Service and Excellence.
