Distributed Data-Analysis Approach Gains Popularity
by George Lawton
The more data that organizations collect, the more they must process. At a certain point, they face the enormous challenge of making sense of the information in a reasonable amount of time.
Running large data sets serially through a single computer can be very time consuming, so organizations prefer to distribute the workload over multiple machines and process it in parallel. Many companies also want to take a parallel processing approach because they store related data in multiple sources.
Traditional relational database techniques can do this but work only with structured data, which is information that can be organized in rows and columns as part of tables. However, a lot of the data that organizations gather — such as word-processing documents and Web-activity logs — is unstructured.
Now, though, Google has popularized the MapReduce parallel-programming framework for developing distributed-processing applications that run on clusters of commodity hardware. MapReduce applications simplify the processing and analysis of large data sources and also enable new analytic models and techniques.
There are now numerous MapReduce implementations and many commercial database systems that include or work with the technology, said Google Fellow Jeffrey Dean, one of the approach's developers.
Google uses the approach to process petabytes of data every day, both to analyze its business activities and to index the Web.
MapReduce also promises to help in working with other types of large data sets, such as those found in financial applications, bioinformatics, and military-intelligence analysis.
Yahoo! and Facebook have also helped popularize the use of such parallel programming frameworks by developing an open source implementation called Hadoop. They use Hadoop internally and make it available to other organizations for free.
Companies such as Amazon, IBM, and Oracle have created their own custom Hadoop implementations. The Amazon tool can be accessed only via the company's cloud services. IBM's and Oracle's implementations work with the companies' databases.
Database vendors Aster Data Systems and Greenplum have developed MapReduce-based tools.
Nonetheless, there are still concerns regarding MapReduce.
The Need
Researchers and vendors pioneered relational database technology in the early 1970s to provide a framework for storing, managing, and accessing structured data using the Structured Query Language (SQL).
A relational database is a set of tables containing structured data that fits into predefined categories. Each table contains one or more data categories in columns. Each row contains data for the categories defined by the columns. SQL was designed to work with such data.
Relational tools' reliance on structured data is a problem because organizations now frequently want to make sense of distributed, unstructured data, such as Web-activity logs from multiple servers, mixtures of text and data, and social-networking information, said Michael Olsen, CEO of Cloudera, which supports companies that deploy the Apache Software Foundation's Hadoop implementation.
A Close Look
Parallel programming frameworks for databases have existed for about 20 years but have been used mostly in academic research, said Michael Stonebraker, chief technical officer of database vendor Vertica and adjunct professor at the Massachusetts Institute of Technology.
This changed in 2004, when Google Fellows Jeffrey Dean and Sanjay Ghemawat gave a conference presentation about the MapReduce programming framework that they developed and that Google uses to manage the petabytes of data its Web crawlers gather every day.
Google uses its MapReduce framework only on its hardware for internal applications.
The Technology
MapReduce — a way to write distributed-data-analysis software — was inspired by the map and reduce functions used in Lisp and other functional programming languages.
In these languages, the map function applies a given operation to a set of elements and returns a list of results. The reduce function aggregates the results from multiple operations.
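The two functional building blocks can be illustrated in a few lines of Python, whose built-in map and functools.reduce follow the same idea as their Lisp counterparts. This is only a minimal sketch; the word list and variable names are invented for illustration.

```python
from functools import reduce

# map: apply one operation (len) to every element and collect the results.
word_lengths = list(map(len, ["map", "reduce", "lisp"]))

# reduce: fold the list of results into a single aggregate value.
total_length = reduce(lambda running, n: running + n, word_lengths, 0)

print(word_lengths)   # [3, 6, 4]
print(total_length)   # 13
```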
In a MapReduce application's map function, a master node — one of the servers in a distributed database system — divides the work required by a query into subproblems and distributes them to worker nodes, said Greenplum president Scott Yara.
The worker nodes send their results back to the master node. They can also further divide and distribute their subproblems, creating a multilevel tree structure.
In the reduce step, the master node retrieves the subproblems' answers. Nodes serving as reducers aggregate the results of multiple workers into a running tally. Reducers further process multiple aggregate results until yielding the final result.
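The map and reduce steps described above can be sketched as a single-process Python program. This is an illustration under simplifying assumptions, not Google's or Hadoop's implementation: the function names (split_work, map_worker, reduce_counts, run_job) are hypothetical, the "nodes" are ordinary function calls, and the job is a simple word count.

```python
from collections import Counter

def split_work(documents, num_workers):
    """Master (hypothetical): divide the input into one subproblem per worker."""
    return [documents[i::num_workers] for i in range(num_workers)]

def map_worker(chunk):
    """Worker (hypothetical): count words in its own chunk only."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts

def reduce_counts(partials):
    """Reducer (hypothetical): fold partial counts into a running tally."""
    tally = Counter()
    for partial in partials:
        tally.update(partial)
    return tally

def run_job(documents, num_workers=3):
    """Run the map step on every chunk, then the reduce step on the results."""
    chunks = split_work(documents, num_workers)
    partials = [map_worker(chunk) for chunk in chunks]
    return reduce_counts(partials)

result = run_job(["the web", "index the web", "the log"])
print(result["the"])  # 3
```

In a real cluster, each call to map_worker would run on a separate machine, and several reducers could each merge a share of the partial results before a final merge.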
In a MapReduce application, each subproblem must be independent of the others — as occurs in activities such as searches for specific types of records or the counting of keywords on Web pages — so nodes can process them in parallel.
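Because each subproblem is independent, the subproblems can be handed to parallel workers in any order. The following sketch counts a keyword across Web pages, with a thread pool from Python's standard library standing in for worker nodes. The function names and sample pages are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def count_keyword(page_text, keyword="hadoop"):
    """Count occurrences of one keyword in a single page (one subproblem)."""
    return page_text.lower().split().count(keyword)

def count_across_pages(pages, keyword="hadoop", max_workers=4):
    """Process every page independently, then add up the per-page counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_page = list(pool.map(lambda page: count_keyword(page, keyword), pages))
    return sum(per_page)

pages = ["Hadoop runs jobs", "no match here", "hadoop and hadoop"]
print(count_across_pages(pages))  # 3
```

No page's count depends on any other page's count, which is what lets the pool process them concurrently.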
Worker nodes don't share memory and require no intermediate communication. This allows nodes to function independently. It also reduces network traffic, as well as communications-related overhead and problems.
The master node reassigns a task to another worker node if the original assignee fails to either report back periodically or return the results of its work. This reduces the impact of failed nodes.
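The reassignment idea can be sketched in Python as follows. This simplified model covers only the case in which a worker fails to return a result; it doesn't model periodic status reports. All names (WorkerFailed, run_with_reassignment, and the two sample workers) are hypothetical.

```python
class WorkerFailed(Exception):
    """Raised by a simulated worker that does not return a result."""

def run_with_reassignment(task, workers):
    """Master (hypothetical): try each worker in turn until one finishes the task."""
    for worker in workers:
        try:
            return worker(task)
        except WorkerFailed:
            continue  # reassign the same task to the next worker
    raise RuntimeError("no worker completed the task")

def failed_worker(task):
    """Simulated node that never returns its result."""
    raise WorkerFailed("node did not report back")

def healthy_worker(task):
    """Simulated node that completes the task (here, summing numbers)."""
    return sum(task)

print(run_with_reassignment([1, 2, 3], [failed_worker, healthy_worker]))  # 6
```

Reassignment is safe here because the task has no side effects, so running it again on another node yields the same answer.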
As is the case with Hadoop, MapReduce runs on a parallel file system, which is a single file system spread over several servers.
Google says it has used MapReduce to sort a terabyte of data on 1,000 computers in 68 seconds.
Users have created libraries of functions that can be used to write MapReduce applications in languages such as C, C++, C#, Erlang, F#, Java, Python, R, and Ruby.
Uses and Advantages
MapReduce and similar approaches work best on certain types of distributable problems.
For example, MapReduce is good at natural language processing involving multiple documents, keyword extraction across Web pages, and analysis of data from different social-networking-activity logs, said Olsen.
MapReduce can work with both structured and unstructured data.
MapReduce and Hadoop don't impose a data model on information. Developers can thus specify the data type — such as structured or unstructured — their applications will work with when writing them, noted Olsen.
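One way to picture this is that the framework hands each map function a raw record, and the developer's code decides how to interpret it. In the sketch below, one hypothetical map function treats a record as a structured, comma-separated row, while another treats a record as free text. The field layout (user, page, seconds) and the function names are assumptions made for illustration.

```python
import csv
from io import StringIO

def map_structured(line):
    """Interpret the raw record as a CSV row: user, page, seconds."""
    user, page, seconds = next(csv.reader(StringIO(line)))
    return (page, int(seconds))

def map_unstructured(line):
    """Interpret the raw record as free text and emit (word, 1) pairs."""
    return [(word, 1) for word in line.lower().split()]

print(map_structured("alice,/home,12"))    # ('/home', 12)
print(map_unstructured("Error on Login"))  # [('error', 1), ('on', 1), ('login', 1)]
```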
Because they function in parallel, MapReduce applications can efficiently cleanse and organize data from multiple servers for processing by business-intelligence tools, a potentially tricky process, noted Dan Graham, data warehouse marketing manager at vendor Teradata.
MapReduce applications don't require communications among nodes and thus are less sensitive to network latency. Hence, they are relatively easy to implement on cloud services, which run on the Internet and thus can experience latency.
Implementations
MapReduce's popularity has led to a number of similar implementations, many of them open source and based on Hadoop.
Apache Hadoop Project. Yahoo! and Facebook developed the open source Hadoop in 2006. The Apache Software Foundation manages its ongoing development.
Google had earlier publicly released the MapReduce algorithm, although not the source code or file system that the framework used. Yahoo! and Facebook thus used the algorithm and wrote the rest of Hadoop themselves.
Hadoop code runs on Linux or Windows servers. Developers can write applications in C, C++, or Java, or build them from any Linux shell command.
Yahoo! has the largest Hadoop implementation with about 4,000 nodes.
Researchers at 33 US universities are using a hosted cloud platform, part of IBM's Academic Initiative for cloud computing, to write Hadoop applications for weather analysis, social-network analysis, and speech translation, noted Jay Subrahmonia, IBM's director of advanced customer solutions.
Amazon Elastic MapReduce. Amazon created an implementation of Hadoop that other organizations can run on the company's cloud services.
For example, eHarmony has used it to build an application for analyzing data about the users of its matchmaking services.
IBM. IBM created a plug-in that simplifies the creation of MapReduce applications in the Eclipse multilanguage software-development environment.
Others. Aster Data and Greenplum have created MapReduce implementations to run on top of their proprietary parallel databases.
Yale University has announced a prototype called HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html), an open source project that combines features of Hadoop and relational databases. This would let a programmer use either SQL or MapReduce on the same database, depending on which works better.
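The idea of answering the same question with either approach can be illustrated with Python's standard library. This sketch doesn't use HadoopDB or its interfaces. It simply computes one aggregate twice over the same sample rows, once with SQL (via the built-in sqlite3 module) and once with hand-written map and reduce functions. The table, rows, and function names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Sample data: (page, seconds spent on the page). Invented for illustration.
rows = [("/home", 12), ("/search", 30), ("/home", 8)]

# SQL route: load the rows into an in-memory table and aggregate with GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, seconds INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
sql_totals = dict(conn.execute("SELECT page, SUM(seconds) FROM visits GROUP BY page"))
conn.close()

# Map/reduce route: emit (key, value) pairs, then sum the values per key.
def map_row(row):
    """Emit the row as a (page, seconds) pair."""
    page, seconds = row
    return (page, seconds)

def reduce_pairs(pairs):
    """Add up the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mr_totals = reduce_pairs(map(map_row, rows))

print(sql_totals == mr_totals)  # True
```

For a simple aggregate over structured rows like this one, the SQL version is shorter; the map/reduce version becomes attractive when the records need custom parsing first.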
Mapping Out Concerns
Most database experts agree that MapReduce can play a valuable role in certain applications, such as analyzing Web-activity logs. However, some have raised concerns about how widely it can be used.
For example, Vertica's Stonebraker said MapReduce isn't optimal for running queries on or statistical analysis of structured data because SQL tools are more efficient.
MapReduce is still a relatively new technology. Thus, Graham said, it will take time before the approach's report-writing and data-analysis tools are as capable and accepted as more established SQL-based business-intelligence tools.
He added that current MapReduce implementations require Java programmers, which means most business users who aren't experienced developers can't use them.
Use of the new technology may increase in the future because Google has been working with to create a MapReduce curriculum now offered at about 30 US universities.
Cloudera's Olsen predicted that a strong group of independent software vendors will grow around the Hadoop platform.
And once basic MapReduce technology has become established, developers will create a range of new applications for military intelligence, bioinformatics, financial services, and retail operations.
However, Stonebraker stated, MapReduce is not appropriate for all uses. "One size does not fit all," he explained.
Nonetheless, Cloudera's Olsen said, "With MapReduce, enterprises will wake up to the value that is locked within data. Amazon, Google and Facebook are successful because they have collected more information so they can tease out truths that are invisible to the world."
George Lawton is a freelance technology writer based in Guerneville, California. Contact him at glawton@glawton.com.