NEWS


Computing Now Exclusive Content — February 2010

News Archive

July 2012

Gig.U Project Aims for an Ultrafast US Internet

June 2012

Bringing Location and Navigation Technology Indoors

May 2012

Plans Under Way for Roaming between Cellular and Wi-Fi Networks

Encryption System Flaw Threatens Internet Security

April 2012

For Business Intelligence, the Trend Is Location, Location, Location

Corpus Linguistics Keep Up-to-Date with Language

March 2012

Are Tomorrow's Firewalls Finally Here Today?

February 2012

Spatial Humanities Brings History to Life

December 2011

Could Hackers Take Your Car for a Ride?

November 2011

What to Do about Supercookies?

October 2011

Lights, Camera, Virtual Moviemaking

September 2011

Revolutionizing Wall Street with News Analytics

August 2011

Growing Network-Encryption Use Puts Systems at Risk

New Project Could Promote Semantic Web

July 2011

FBI Employs New Botnet Eradication Tactics

Google and Twitter "Like" Social Indexing

June 2011

Computing Commodities Market in the Cloud

May 2011

Intel Chips Step up to 3D

Apple Programming Error Raises Privacy Concerns

Thunderbolt Promises Lightning Speed

April 2011

Industrial Control Systems Face More Security Challenges

Microsoft Effort Takes Down Massive Botnet

March 2011

IP Addresses Getting Security Upgrade

February 2011

Studios Agree on DRM Infrastructure

January 2011

New Web Protocol Promises to Reduce Browser Latency

To Be or NAT to Be?

December 2010

Intel Gets inside the Helmet

Tuning Body-to-Body Networks with RF Modeling

November 2010

New Wi-Fi Spec Simplifies Connectivity

Expanded Top-Level Domains Could Spur Internet Real Estate Boom

October 2010

New Weapon in War on Botnets

September 2010

Content-Centered Internet Architecture Gets a Boost

Gesturing Going Mainstream

August 2010

Is Context-Aware Computing Ready for the Limelight?

Flexible Routing in the Cloud

Signal Congestion Rejuvenates Interest in Cell Paging-Channel Protocol

July 2010

New Protocol Improves Interaction among Networked Devices and Applications

Security for Domain Name System Takes a Big Step Forward

The ROADM to Smarter Optical Networking

Distributed Cache Goes Mainstream

June 2010

New Application Protects Mobile-Phone Passwords

WiGig Alliance Reveals Ultrafast Wireless Specification

Cognitive Radio Adds Intelligence to Wireless Technology

May 2010

New Product Uses Light Connections in Blade Server

April 2010

Browser Fingerprints Threaten Privacy

New Animation Technique Uses Motion Frequencies to Shake Trees

March 2010

Researchers Take Promising Approach to Chemical Computing

Screen-Capture Programming: What You See Is What You Script

Research Project Sends Data Wirelessly at High Speeds via Light

February 2010

Faster Testing for Complex Software Systems

IEEE 802.1Qbg/h to Simplify Data Center Virtual LAN Management

Distributed Data-Analysis Approach Gains Popularity

Twitter Tweak Helps Haiti Relief Effort

January 2010

2010 Rings in Some Y2K-like Problems

Infrastructure Sensors Improve Home Monitoring

Internet Search Takes a Semantic Turn

December 2009

Phase-Change Memory Technology Moves toward Mass Production

IBM Crowdsources Translation Software

Digital Ants Promise New Security Paradigm

November 2009

Program Uses Mobile Technology to Help with Crises

More Cores Keep Power Down

White-Space Networking Goes Live

Mobile Web 2.0 Experiences Growing Pains

October 2009

More Spectrum Sought for Body Sensor Networks

Optics for Universal I/O and Speed

High-Performance Computing Adds Virtualization to the Mix

ICANN Accountability Goes Multinational

RFID Tags Chat Their Way to Energy Efficiency

September 2009

Delay-Tolerant Networks in Your Pocket

Flash Cookies Stir Privacy Concerns

Addressing the Challenge of Cloud-Computing Interoperability

Ephemeralizing the Web

August 2009

Bluetooth Speeds Up

Grids Get Closer

DCN Gets Ready for Production

The Sims Meet Science

Sexy Space Threat Comes to Mobile Phones

July 2009

WiGig Alliance Makes Push for HD Specification

New Dilemmas, Same Principles:
Changing Landscape Requires IT Ethics to Go Mainstream

Synthetic DNS Stirs Controversy:
Why Breaking Is a Good Thing

New Approach Fights Microchip Piracy

Technique Makes Strong Encryption Easier to Use

New Adobe Flash Streams Internet Directly to TVs

June 2009

Aging Satellites Spark GPS Concerns

The Changing World of Outsourcing

North American CS Enrollment Rises for First Time in Seven Years

Materials Breakthrough Could Eliminate Bootups

April 2009

Trusted Computing Shapes Self-Encrypting Drives

March 2009

Google, Publishers to Try New Advertising Methods

Siftables Offer New Interaction Model for Serious Games

Hulu Boxed In by Media Conglomerates

February 2009

Chips on Verge of Reaching 32 nm Nodes

Hathaway to Lead Cybersecurity Review

A Match Made in Heaven: Gaming Enters the Cloud

January 2009

Government Support Could Spell Big Year for Open Source

25 Reasons For Better Programming

Web Guide Turns Playstation 3 Consoles into Supercomputing Cluster

Flagbearers for Technology: Contemporary Techniques Showcase US Artifact and European Treasures

December 2008

.Tel TLD Debuts As New Way to Network

Science Exchange

November 2008

The Future Is Reconfigurable

Distributed Data-Analysis Approach Gains Popularity

by George Lawton

The more data that organizations collect, the more they must process. At a certain point, they face the enormous challenge of making sense of the information in a reasonable amount of time. 

Running large data sets serially through a single computer can be very time consuming, so organizations prefer to distribute the workload over multiple machines and process it in parallel. Many companies also want to take a parallel processing approach because they store related data in multiple sources.

Traditional relational database techniques can do this but work only with structured data, which is information that can be organized in rows and columns as part of tables. However, a lot of the data that organizations gather — such as word-processing documents and Web-activity logs — is unstructured.

Now, though, Google has popularized the MapReduce parallel-programming framework for developing distributed-processing applications that run on clusters of commodity hardware. MapReduce applications simplify the processing and analysis of large data sources and also enable new analytic models and techniques. 

There are now numerous MapReduce implementations and many commercial database systems that include or work with the technology, said Google Fellow Jeffrey Dean, one of the approach's developers. 

Google uses the approach to process petabytes of data every day, analyzing its business activities and indexing the Web. 

MapReduce also promises to help in working with other types of large data sets, such as those found in financial applications, bioinformatics, and military-intelligence analysis.

Yahoo! and Facebook have also helped popularize the use of such parallel programming frameworks by developing an open source implementation called Hadoop. They use Hadoop internally and make it available to other organizations for free.

Companies such as Amazon, IBM, and Oracle have created their own custom Hadoop implementations. The Amazon tool can be accessed only via the company's cloud services. IBM's and Oracle's implementations work with the companies' databases. 

Database vendors Aster Data Systems and Greenplum have developed MapReduce-based tools. 

Nonetheless, there are still concerns regarding MapReduce.

The Need

Researchers and vendors pioneered relational database technology in the early 1970s to provide a framework for storing, managing, and accessing structured data using the Structured Query Language (SQL). 

A relational database is a set of tables containing structured data that fits into predefined categories. Each table contains one or more data categories in columns. Each row contains data for the categories defined by the columns. SQL was designed to work with such data.

This is a problem because organizations now frequently want to make sense of distributed, unstructured data such as Web-activity logs from multiple servers, as well as mixtures of text and data, and social-networking information, said Michael Olson, CEO of Cloudera, which supports companies that deploy the Apache Software Foundation's Hadoop implementation. 
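To make the contrast concrete, here is a minimal sketch (my illustration, not code from the article) of structured data that fits predefined columns and can be queried with SQL, using Python's built-in sqlite3 module. A free-form Web-activity log line has no such predefined shape.

```python
# Illustration only: structured data fits predefined columns, so SQL can query it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("alice", "widget", 9.99), ("bob", "gadget", 24.50), ("alice", "gadget", 24.50)],
)

# Every row matches the table's columns, so SQL can aggregate it directly.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(customer, total)

# By contrast, a line such as "server restarted by admin" from a log file has
# no predefined columns for a relational table to hold.
```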

A Close Look

Parallel programming frameworks for databases have existed for about 20 years but have been used mostly in academic research, said Michael Stonebraker, chief technology officer of database vendor Vertica and adjunct professor at the Massachusetts Institute of Technology.

This changed in 2004, when Google Fellows Jeffrey Dean and Sanjay Ghemawat presented a conference paper on the MapReduce programming framework that they developed and that Google uses to manage the petabytes of data its Web crawlers gather every day. 

Google uses its MapReduce framework only on its hardware for internal applications.

The Technology

MapReduce — a way to write distributed-data-analysis software — was inspired by the map and reduce functions used in Lisp and other functional programming languages. 

In these languages, the map function applies a given operation to a set of elements and returns a list of results. The reduce function aggregates the results from multiple operations. 
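As a rough illustration of that heritage (illustrative only, not code from Google or from Lisp), Python's built-in map() and functools.reduce() behave the same way:

```python
# Illustration only: Lisp-style map and reduce in Python.
from functools import reduce

lengths = map(len, ["apple", "fig", "cherry"])   # map: apply len to each element
total = reduce(lambda a, b: a + b, lengths, 0)   # reduce: aggregate the results
print(total)                                     # prints 14
```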

In a MapReduce application's map function, a master node — one of the servers in a distributed database system — divides the work required by a query into subproblems and distributes them to worker nodes, said Greenplum president Scott Yara.

The worker nodes send their results back to the master node. They can also further divide and distribute their subproblems, creating a multilevel tree structure.

In the reduce step, the master node retrieves the subproblems' answers. Nodes serving as reducers aggregate the results of multiple workers into a running tally. Reducers further process multiple aggregate results until they yield the final result.

In a MapReduce application, each subproblem must be independent of the others — as occurs in activities such as searches for specific types of records or the counting of keywords on Web pages — so nodes can process them in parallel. 

Worker nodes don't share memory and require no intermediate communication. This allows nodes to function independently. It also reduces network traffic, as well as communications-related overhead and problems.

The master node reassigns a task to another worker node if the original assignee fails to either report back periodically or return the results of its work. This reduces the impact of failed nodes. 
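The sketch below is a simplified, single-machine illustration of that flow rather than Google's or Hadoop's code: it counts keywords across Web pages, with Python's multiprocessing.Pool standing in for the worker nodes and each page serving as an independent subproblem.

```python
# A minimal map/shuffle/reduce sketch: count words across pages in parallel.
from collections import Counter, defaultdict
from multiprocessing import Pool

def map_page(page_text):
    """Map step: count the words on one page, with no shared state."""
    return Counter(page_text.lower().split())

def reduce_counts(partials):
    """Reduce step: merge the workers' partial counts into a running tally."""
    totals = defaultdict(int)
    for partial in partials:
        for word, count in partial.items():
            totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    pages = [
        "mapreduce splits the work",
        "the work runs in parallel",
        "parallel work scales",
    ]
    with Pool() as pool:                       # the "master" hands pages to workers
        partial_counts = pool.map(map_page, pages)
    print(reduce_counts(partial_counts))
```

Because each call to map_page depends only on its own page, the pool can run the calls on any worker in any order, which is the same property that lets a MapReduce master reassign the work of a failed node.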

As is the case with Hadoop, MapReduce runs on a parallel file system, which is a single file system spread over several servers.

Google says it has used MapReduce to sort a terabyte of data on 1,000 computers in 68 seconds. 

Users have created libraries of functions that can be used to write MapReduce applications in languages such as C, C++, C#, Erlang, F#, Java, Python, R, and Ruby.

Uses and Advantages

MapReduce and similar approaches work best on certain types of distributable problems. 

For example, the approach is well suited to natural language processing involving multiple documents, keyword extraction across Web pages, and analysis of data from different social-networking-activity logs, said Olson.

MapReduce can work with both structured and unstructured data. 

MapReduce and Hadoop don't impose a data model on information. Developers can thus specify the data type — such as structured or unstructured — that their applications will work with when writing them, noted Olson.
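As a hedged illustration of that flexibility (the log format here is invented), the map function itself decides how to interpret each record, pulling structure out of lines that have it and skipping those that don't:

```python
# Illustration only, with a made-up log format: the framework imposes no data
# model, so the mapper decides how to parse each record.
def map_log_line(line):
    """Emit (url, 1) for lines shaped like 'timestamp ip METHOD /path ...';
    skip anything that doesn't parse, such as free-form text."""
    parts = line.split()
    if len(parts) >= 4 and parts[2] in ("GET", "POST"):
        yield parts[3], 1   # the requested URL becomes the key

lines = [
    "2010-02-01T12:00:00 10.0.0.5 GET /index.html HTTP/1.1",
    "server restarted by admin",                     # unstructured noise
    "2010-02-01T12:00:02 10.0.0.9 POST /login HTTP/1.1",
]
for line in lines:
    for url, one in map_log_line(line):
        print(url, one)
```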

Because they function in parallel, MapReduce applications can efficiently cleanse and organize data from multiple servers for processing by business-intelligence tools, a potentially tricky process, noted Dan Graham, data warehouse marketing manager at vendor Teradata.

MapReduce applications don't require communication among nodes and are thus less sensitive to network latency. Hence, they are relatively easy to implement on cloud services, which run over the Internet and can experience latency.

Implementations

MapReduce's popularity has led to a number of similar implementations, many of them open source and Hadoop-based. 

Apache Hadoop Project. Yahoo! and Facebook developed the open source Hadoop in 2006. The Apache Software Foundation manages its ongoing development.

Google had earlier publicly released the MapReduce algorithm, although not the source code or file system that the framework used. Yahoo! and Facebook thus used the algorithm and wrote the rest of Hadoop themselves. 

Hadoop code runs on Linux or Windows servers. Developers can write applications in C, C++, or Java, or supply Linux shell commands and other executables as mappers and reducers.
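Hadoop's streaming interface is what makes the last option possible: any executable that reads records on standard input and writes tab-separated key/value pairs to standard output can act as a mapper or reducer. The word-count sketch below assumes that interface; the exact jar name and command-line options vary by Hadoop version.

```python
#!/usr/bin/env python
# A hedged sketch of a Hadoop Streaming word count. Invoked roughly as:
#   hadoop jar hadoop-streaming.jar -input pages/ -output counts/ \
#       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
# (the jar path and options depend on the Hadoop installation).
import sys

def map_stdin():
    """Mapper: emit 'word<TAB>1' for every word read from standard input."""
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())

def reduce_stdin():
    """Reducer: sum counts per word; streaming delivers input sorted by key."""
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    map_stdin() if sys.argv[1:] == ["map"] else reduce_stdin()
```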

Yahoo! has the largest Hadoop implementation with about 4,000 nodes. 

Researchers at 33 US universities are using a hosted cloud platform — part of IBM's Academic Initiative for cloud computing — to write Hadoop applications for weather analysis, social-network analysis, and speech translation, noted Jay Subrahmonia, IBM's director of advanced customer solutions.

Amazon Elastic MapReduce. Amazon created an implementation of Hadoop that other organizations can run on the company's cloud services.

For example, eHarmony has used it to build an application for analyzing data about the users of its matchmaking services.

IBM. IBM created a plug-in that simplifies the creation of MapReduce applications in the Eclipse multilanguage software-development environment. 

Others. Aster Data and Greenplum have created MapReduce implementations to run on top of their proprietary parallel databases. 

Yale University has announced a prototype called HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html), an open source project that combines features of Hadoop and relational databases. This would let a programmer use either SQL or MapReduce on the same database, depending on which works better.
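The sketch below is not HadoopDB's actual interface; it simply shows the same question, a count of visits per page, expressed once as SQL and once as a map/reduce pair over the same records.

```python
# Illustration only: the same query written two ways.
#   SQL:  SELECT page, COUNT(*) FROM visits GROUP BY page;
from collections import defaultdict

visits = [("/index.html", "10.0.0.5"), ("/login", "10.0.0.9"), ("/index.html", "10.0.0.7")]

def map_visit(record):
    page, _ip = record
    yield page, 1                      # map: one (page, 1) pair per visit

def reduce_pages(pairs):
    totals = defaultdict(int)          # reduce: sum the pairs per page
    for page, one in pairs:
        totals[page] += one
    return dict(totals)

pairs = [pair for record in visits for pair in map_visit(record)]
print(reduce_pages(pairs))             # {'/index.html': 2, '/login': 1}
```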

Mapping Out Concerns

Most database experts agree that MapReduce can play a valuable role in certain applications, such as analyzing Web-activity logs. However, some have raised concerns about how widely it can be used. 

For example, Vertica's Stonebraker said MapReduce isn't optimal for running queries on, or performing statistical analysis of, structured data because SQL tools are more efficient. 

MapReduce is still a relatively new technology. Thus, Graham said, it will take time before the approach's report-writing and data-analysis tools are as capable and accepted as more established SQL-based business-intelligence tools. 

He added that current MapReduce implementations require Java programmers, which means most business users who aren't experienced developers can't use it.

Use of the new technology may increase in the future because Google has helped create a MapReduce curriculum now offered at about 30 US universities.

Cloudera's Olson predicted that a strong group of independent software vendors will grow up around the Hadoop platform. 

And once basic MapReduce technology has become established, developers will create a range of new applications for military intelligence, bioinformatics, financial services, and retail operations.

However, Stonebraker stated, MapReduce is not appropriate for all uses. "One size does not fit all," he explained.

Nonetheless, Cloudera's Olson said, "With MapReduce, enterprises will wake up to the value that is locked within data. Amazon, Google and Facebook are successful because they have collected more information so they can tease out truths that are invisible to the world."

George Lawton is a freelance technology writer based in Guerneville, California. Contact him at glawton@glawton.com.