^{3} …especially when using nondedicated resources. Such is the case for opportunistic computing,^{4} where we use only the shared machines' idle periods. In such a scenario, machines often fail and frequently change their state from idle to occupied, compromising the execution of applications. Unlike dedicated resources, whose mean time between failures is typically weeks or even months, nondedicated resources can become unavailable several times during a single day. In fact, some machines are unavailable more than they're available. Fault tolerance mechanisms, such as checkpoint-based rollback recovery,^{5} can help guarantee that applications execute properly amid frequent failures.

^{6} …and the addition of parity information.

^{7} We evaluate several strategies for the distributed storage of checkpoint data in opportunistic environments. (The "Related Work" sidebar discusses other recent work in this area.) We focus on the storage of checkpoint data inside a single cluster. We present a prototype implementation of a distributed checkpoint repository over InteGrade (http://www.integrade.org.br/portal),^{8} a multiuniversity grid middleware project to leverage the computing power of idle shared workstations. Using this prototype, we performed several experiments to determine the trade-offs among these strategies in computational overhead, storage overhead, and degree of fault tolerance.

This strategy consists of breaking the checkpoint into *m* slices and then calculating the parity over these slices. We divide a checkpoint vector, **U**, of size *n* into *m* slices of size *n*/*m*, given by **U** = **U**_{0}, **U**_{1}, …, **U**_{m-1}, and **U**_{k} = *u*^{k}_{0}, *u*^{k}_{1}, …, *u*^{k}_{n/m}, 0 ≤ *k* < *m*, where *k* is the slice number and *u*^{k}_{i} represents the elements of slice **U**_{k}. We calculate the elements *p*_{i}, 0 ≤ *i* < *n*/*m*, of parity information vector **P** as *p*_{i} = *u*^{0}_{i} ⊕ *u*^{1}_{i} ⊕ … ⊕ *u*^{m-1}_{i}. We then distribute the slices of **U** and parity vector **P** for storage on other nodes. Similarly, we can recover a missing fragment by performing the XOR operation over the recovered fragments.
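The split-XOR-recover cycle above can be sketched in a few lines of Python. This is an illustration of the technique, not InteGrade's implementation; the function names (`split_with_parity`, `recover_fragment`) are ours:

```python
from functools import reduce

def split_with_parity(data: bytes, m: int) -> list[bytes]:
    """Break a checkpoint into m slices and append one XOR parity slice."""
    assert len(data) % m == 0, "pad the checkpoint so that m divides its size"
    size = len(data) // m
    slices = [data[i * size:(i + 1) * size] for i in range(m)]
    # p_i is the XOR of the i-th element of every slice
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*slices))
    return slices + [parity]

def recover_fragment(fragments: list) -> bytes:
    """Rebuild the single missing fragment (a data slice or the parity) by XOR."""
    present = [f for f in fragments if f is not None]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*present))

# Example: lose slice 1 of a 30-byte checkpoint split into m = 3 slices
checkpoint = bytes(range(30))
fragments = split_with_parity(checkpoint, m=3)
damaged = [f if i != 1 else None for i, f in enumerate(fragments)]
assert recover_fragment(damaged) == fragments[1]
```

Because XOR is its own inverse, the same `recover_fragment` call rebuilds whichever single fragment is missing, whether it is a data slice or the parity slice itself.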

IDA^{6} allows coding a vector **U** of size *n* into *m* + *k* encoded vectors of size *n*/*m*, such that regenerating **U** is possible using only *m* of the encoded vectors. This encoding lets you achieve different fault tolerance levels by merely tuning the values of *m* and *k*. In practice, it's possible to tolerate *k* failures with a storage overhead of only (*k*/*m*)*n* elements.

IDA performs its arithmetic over *GF*(*q*), a finite field of *q* elements, where *q* is either a prime *p* or a power *p*^{x} of a prime number *p*. When using *q* = *p*^{x}, you carry out arithmetic operations over the field by representing the numbers as polynomials of degree smaller than *x* with coefficients in [0, *p* − 1]. You calculate sums with XOR operations, whereas you carry out multiplications by multiplying the polynomials modulo an irreducible polynomial of degree *x*. In our case, we use *p* = 2 and *x* = 8, so each field element represents a byte. To speed up calculations, we perform a simple table lookup for the multiplications.
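As a sketch of this byte-oriented arithmetic, the following Python builds log/antilog tables for *GF*(2^{8}) and multiplies by table lookup. The irreducible polynomial x^{8} + x^{4} + x^{3} + x^{2} + 1 (0x11D) is one common choice; the article doesn't specify which polynomial InteGrade uses:

```python
# GF(2^8): build antilog (EXP) and log (LOG) tables from the generator x (= 2),
# reducing by the irreducible polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
EXP = [0] * 512          # doubled so EXP[LOG[a] + LOG[b]] needs no modulo 255
LOG = [0] * 256
acc = 1
for i in range(255):
    EXP[i] = acc
    LOG[acc] = i
    acc <<= 1            # multiply by the generator x
    if acc & 0x100:      # degree reached 8: reduce modulo the polynomial
        acc ^= 0x11D
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_add(a: int, b: int) -> int:
    return a ^ b         # sums are plain XOR

def gf_mul(a: int, b: int) -> int:
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]   # one table lookup per product

# (x + 1)(x^2 + x + 1) = x^3 + 1, that is, 3 * 7 = 9 in GF(2^8)
assert gf_mul(3, 7) == 9
```

The two 256-entry tables cost half a kilobyte of memory and turn every field multiplication into an addition of logarithms plus one lookup, which is the speedup the text refers to.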

Coding requires *m* + *k* linearly independent vectors **α**_{i} of size *m*. We can easily generate these vectors by choosing *m* + *k* distinct values *a*_{i}, 0 ≤ *i* < *m* + *k*, and setting **α**_{i} = (1, *a*_{i}, …, *a*_{i}^{m-1}). We then organize these vectors as a matrix, **G**, defined as **G** = [**α**^{T}_{0}, **α**^{T}_{1}, …, **α**^{T}_{m+k-1}], where *T* indicates the transpose of vector **α**. We now break file *F* into *n*/*m* information words, **U**_{i}, of size *m* and generate *n*/*m* code words **V**_{i} of size *m* + *k*, where **V**_{i} = **U**_{i} × **G**. The *m* + *k* encoded vectors, **E**_{i}, 0 ≤ *i* < *m* + *k*, are given by **E**_{i} = **V**_{0}[*i*], **V**_{1}[*i*], …, **V**_{n/m}[*i*].

To recover the information words **U**_{i}, we need to recover *m* of the *m* + *k* encoded slices. We then construct code words **V**′_{j}, which are equivalent to the original code words **V**_{i} but contain only the components of the *m* recovered slices. Similarly, we construct matrix **G**′, containing only the elements relative to the recovered slices. We now recover **U**_{i} by multiplying the encoded words **V**′_{j} by the inverse of **G**′: **U**_{i} = **V**′_{j} × (**G**′)^{-1}.
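The coding and decoding steps above can be sketched as follows. For readability, this sketch works over the prime field *GF*(257) rather than the *GF*(2^{8}) the article uses, and all names are ours; it illustrates the technique, not InteGrade's implementation:

```python
P = 257  # prime field GF(257) stands in for GF(2^8) in this sketch

def vandermonde(m: int, cols: int) -> list[list[int]]:
    """G whose column i is alpha_i = (1, a_i, ..., a_i^(m-1)), with a_i = i."""
    return [[pow(c, r, P) for c in range(cols)] for r in range(m)]

def encode(word: list[int], G: list[list[int]]) -> list[int]:
    """V = U x G: a code word of size m + k from an information word of size m."""
    m, n = len(G), len(G[0])
    return [sum(word[r] * G[r][c] for r in range(m)) % P for c in range(n)]

def invert(A: list[list[int]]) -> list[list[int]]:
    """Gauss-Jordan inversion of an m x m matrix modulo the prime P."""
    n = len(A)
    M = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], P - 2, P)        # Fermat inverse of the pivot
        M[col] = [v * inv % P for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % P for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def decode(vp: list[int], kept: list[int], G: list[list[int]], m: int) -> list[int]:
    """U = V' x (G')^-1, where G' keeps only the columns of the m surviving slices."""
    Gp = [[G[r][c] for c in kept] for r in range(m)]
    Gi = invert(Gp)
    return [sum(vp[c] * Gi[c][j] for c in range(m)) % P for j in range(m)]

# m = 3 data components, k = 2 redundant ones; any 3 of the 5 suffice
m, k = 3, 2
G = vandermonde(m, m + k)
U = [10, 20, 30]
V = encode(U, G)
assert decode([V[c] for c in (0, 2, 4)], [0, 2, 4], G, m) == U  # slices 1, 3 lost
```

Because the **α**_{i} come from distinct values *a*_{i}, every choice of *m* columns of **G** forms an invertible Vandermonde matrix, which is what makes recovery from any *m* surviving slices possible.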

Encoding requires *O*[(*m* + *k*)*nm*] steps, and decoding requires *O*(*nm*^{2}) steps, in addition to the inversion of an *m* × *m* matrix. Qutaibah Malluhi and William Johnston proposed an algorithm that improves coding computation complexity to *O*(*nmk*) and also improves decoding.^{9} They showed that you can diagonalize the first *m* columns of **G** and still have a valid algorithm. Consequently, generating the first *m* fields of code words **V**_{i} involves simple data copying. Coding is necessary only for the last *k* fields. This approach reduces encoding complexity considerably.
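Continuing the *GF*(257) sketch, the following shows the diagonalization idea: row-reduce **G** so that its first *m* columns form the identity. Row operations multiply **G** on the left by an invertible matrix, so every choice of *m* columns stays invertible and the code remains valid, while the first *m* fields of each code word become verbatim copies of the data. Again, names are ours and this is only an illustration:

```python
P = 257  # prime field stand-in for GF(2^8), as in the earlier sketch

def vandermonde(m: int, cols: int) -> list[list[int]]:
    return [[pow(c, r, P) for c in range(cols)] for r in range(m)]

def diagonalize(G: list[list[int]]) -> list[list[int]]:
    """Row-reduce G (m x (m+k)) so its first m columns become the identity."""
    m = len(G)
    M = [row[:] for row in G]
    for col in range(m):
        piv = next(r for r in range(col, m) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], P - 2, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(m):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % P for a, b in zip(M[r], M[col])]
    return M

def encode(word: list[int], G: list[list[int]]) -> list[int]:
    m, n = len(G), len(G[0])
    return [sum(word[r] * G[r][c] for r in range(m)) % P for c in range(n)]

m, k = 3, 2
Gs = diagonalize(vandermonde(m, m + k))
U = [7, 8, 9]
V = encode(U, Gs)
assert V[:m] == U   # the first m fields are plain copies; only V[m:] is computed
```

With the diagonalized matrix, only the *k* redundant fields of each code word require field arithmetic, which is the source of the encoding speedup the text describes.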

Defining *m* as 10, the algorithm can tolerate the failure of one node with a 10 percent space overhead, two failures with a 20 percent overhead, and so on. The disadvantage of this approach is the computational complexity of implementing the algorithm and its higher computational overhead.

^{7}It consists of a collection of hierarchically organized InteGrade clusters. Here, we focus on checkpoint storage inside a single cluster.

^{10}The main modules for storing checkpoint data are the checkpointing library, the execution manager (EM), the cluster data repository manager (CDRM), and the autonomous data repositories (ADRs).

Here, IDA(*m*, *k*) represents IDA using *m* and *k* as described earlier.

*k*and the data size. The same was true for decoding. Therefore, recovering the data shouldn't take more than a few seconds. With further optimizations in vector multiplications, we can achieve even better results. Consequently, the results of this experiment were very satisfactory.

• *No storage.* The system generates checkpoints but doesn't store them.

• *Centralized repository.* The system stores checkpoints in a centralized repository.

• *Replication.* The system stores one copy of the checkpoint locally and another in a remote repository.

• *Parity over local checkpoints.* The system breaks the checkpoint into 10 slices, with one containing parity information, and stores them in distributed repositories.

• *IDA* (*m* = 9, *k* = 1). The system codes the checkpoint into 10 slices, from which nine are sufficient for recovery, and stores them in distributed repositories.

• *IDA* (*m* = 8, *k* = 2). The system codes the checkpoint into 10 slices, from which eight are sufficient for recovery, and stores them in distributed repositories.

The *x*-axis contains the six storage scenarios, and the *y*-axis shows the normalized execution time. We used nine nodes to perform the matrix multiplication and three matrix sizes: 1,350 × 1,350; 2,700 × 2,700; and 5,400 × 5,400. To perform the benchmark, we divided the total execution time into execution segments bounded by the checkpoint generation times. Table 1 gives these values for each matrix, along with the number of generated checkpoints and the size of local and global checkpoints.

Table 1. Execution parameters for the execution overhead experiment.



**Raphael Y. de Camargo**is a doctoral candidate in the Department of Computer Science at the University of São Paulo. His research interests include grid computing, fault tolerance, distributed storage, and distributed-object systems. He received his master's degree in physics from the University of São Paulo. Contact him at Rua Doutor Monteiro Tapajós, 70, São Paulo SP, Brazil, 04152040; rcamargo@ime.usp.br.

**Renato Cerqueira**is a research scientist and a project manager at the Computer Graphics Technology Group of the Pontifical Catholic University of Rio de Janeiro (PUC-Rio), where he is also an assistant professor of computer science. His research interests include component-based development, object-oriented languages, middleware platforms, distributed programming, ubiquitous computing, and grid computing. He received his PhD in computer science from PUC-Rio. He's a member of the ACM and IEEE Computer Society. Contact him at Computer Science Dept., PUC-Rio, Rua Marques de São Vicente 225, 407RDC, Rio de Janeiro RJ, Brazil 22453900; rcerq@inf.pucrio.

**Fabio Kon**is an associate professor in the Department of Computer Science at the University of São Paulo. His research interests include distributed-object systems, reflective middleware, dynamic reconfiguration and adaptation, mobile agents, computer music, multimedia, grid computing, and agile methods. He received his PhD in computer science from the University of Illinois at Urbana-Champaign. He's a member of the ACM and the Hillside Group. Contact him at Departamento de Ciência da Computação, Rua do Matão, 1010, São Paulo SP, Brazil, 05508090; kon@ime.usp.br.
