January 2006 (Vol. 7, No. 1)
1541-4922/06/$26.00 © 2006 IEEE

Published by the IEEE Computer Society
Book Reviews: Simplifying Large-Volume Data Sharing in Distributed Environments
Pedro Gama, INESC-ID
Distributed Data Management for Grid Computing
By Michael Di Stefano
312 pages
US$84.95
John Wiley & Sons, 2005
ISBN: 0-471-68719-7
People are increasingly using grid platforms to share resources at numerous levels: via intranets, across organizations, and even worldwide. With present-day technologies, users can share not only computing power but also storage, network bandwidth, and even software licenses. However, the sheer amount of information in transit over networks poses issues that current technology can't cope with efficiently. Data volume can range from a few gigabytes in a corporate application to several petabytes in scientific applications such as world weather modeling. Distributed Data Management for Grid Computing by Michael Di Stefano deals specifically with the complex problems related to data sharing in distributed environments.
Overview and details
Di Stefano gives a thorough overview of data grid technology. He elaborates on distributed computing's evolution, focusing on the benefits that such advances provide to organizations. In this context, he describes SOA (service-oriented architecture), SONA (service-oriented network architecture), and utility computing—three important and related concepts (although not fundamental for understanding the core chapters). After reading these introductory chapters, you will have a reasonable picture of data sharing in heterogeneous environments. This part of the book will be interesting for a CIO or IT manager looking for alternative evolutions to corporate IT infrastructures—less expensive, more manageable, and more flexible. In addition, you can easily skim through the technical chapters and get to the practical usage scenarios for a bird's eye view of the technology.
Di Stefano subsequently presents traditional data management as the baseline on which to discuss data grids, considered the next evolution for the domain. Although useful for the distributed-systems novice, these chapters might be a bit too introductory for a data management audience—that is, readers who are SQL-savvy.
The book's core chapters are on data regionalization, synchronization, integration, and affinity. Di Stefano successfully explains these concepts in detail, pinpointing key issues (such as quality-of-service levels), alternative architectures (for example, centralized versus peer-to-peer), and business drivers.
He dedicates the rest of the book to practical applications of data grid concepts. You might notice a bias toward financial services examples, but I guess that's unavoidable due to the author's profession (he's CEO and cofounder of IntegraSoft, a data grid software provider, and has extensive experience in the financial market). From data mining to the virtualization of data centers, the examples clarify the concepts Di Stefano presented in the first chapters.
Useful but flawed
Di Stefano combines expertise with technical explanations of the key concepts and practical insights into their applications, leaving no doubt that he knows his business. An undergraduate distributed systems course could use this book as an auxiliary resource for a distributed data management discussion.
However, this book isn't for someone looking for implementation details. Most examples are characterized by a pseudoarchitecture and a custom notation for application definitions that I didn't particularly appreciate.
Of the extensive references, I would pay the most attention to the "additional reading" chapter. Divided by subject (Grid computing, Web services, and data affinity), it contains numerous references you can use to delve into specific aspects of data grid management. Some URLs are outdated, but with Google just a click away, who cares? You could skip the chapter on programming examples altogether if you're not interested in developing with IntegraSoft's products (even if you are, I doubt that chapter will be useful by itself). The white paper on data movement patterns inside a grid is interesting, but Di Stefano could have integrated it into the book's main text.
An important shortcoming is a lack of references to standards and implementations in the field. I would definitely have appreciated guidance in associating key concepts with work going on in the standards bodies. Organizations in this domain—such as OASIS (Organization for the Advancement of Structured Information Standards), W3C (World Wide Web Consortium), IETF (Internet Engineering Task Force), and especially the Global Grid Forum—are quite active, and standards are being proposed as we speak, so this absence significantly reduces the book's usefulness outside of academic discussion.
Additionally, I found the book a bit repetitive in some sections, which made me lose focus on the main subjects. Also, some analogies are extended into unnecessary detail. Nevertheless, the text is well written, and you can read the book from cover to cover without significant effort.
Conclusion
I would recommend Distributed Data Management for Grid Computing to anyone interested in the overall concepts behind data grids, such as data affinity and regionalization. This is the most complete book I've read to date on the subject, and the lack of technical expertise requirements makes it a fair choice for a wide audience. However, you will need to do further research on architecture and standards to apply the concepts.
As a complementary text, I strongly advise readers to check the de facto grid bible: The Grid 2: Blueprint for a New Computing Infrastructure by Ian Foster and Carl Kesselman (Morgan Kaufmann, 2003), which has an interesting section on data grids.
Pedro Gama is a PhD student at INESC-ID. Contact him at pedro.gama@inesc-id.pt.