# Not Just for the Birds: Archiving Massive Data Sets

Pam Frost Gorder

Pages: pp. 3-7

Among the 7,000 bird species whose songs are recorded at the Cornell Laboratory of Ornithology's Macaulay Library, the Prothonotary Warbler doesn't seem like a standout. It sings a simple tune—a one-note, "sweet, sweet, sweet, sweet, sweet." But when Macaulay Library engineer Bob Grotke wants to test new audio equipment, this warbler's song is one of his top choices.

"It's high-frequency, with rapid frequency sweeps—and very 'bursty,'" he says, referring to the sharp way the bird punctuates every high-pitched tweet. He calls it, "a difficult little bird to capture accurately."

Accuracy is paramount for Grotke and his colleagues, who are converting the world's largest animal recording collection from analog to digital to preserve as much information as possible for future scientists. Since 1999, they've digitized one-third of the collection and amassed 4 terabytes (Tbytes) of data. The sound collection dates back to 1929, and many of the earliest magnetic tape recordings have long since passed their expected lifetimes. Tapes are degrading—literally losing magnetic particles—with every play.

Macaulay Library engineers are engaged in a particularly dramatic race against time, but they aren't alone in their need to preserve massive amounts of information. From high-energy physics to climate science to biology, new instruments are gathering more experimental data that need to be retained for the long term. Meanwhile, other scientists need to retain results from huge computer simulations.

Right now, all around the world, data is being lost. To preserve its collections, the Macaulay Library is blazing a trail that others will have to follow.

## From Byrd to the Birds

"Music is tough enough to record," Grotke says. He should know: he once engineered a Tony Bennett album and recorded a host of jazz greats, including guitarist Charlie Byrd. In spite of this experience, he claims, "bird song is a technical nightmare." Animals communicate across broad frequency ranges—many with very complex vocal structures, some of which humans can't hear. Then there are "bursty" animals, such as the Prothonotary Warbler, which doesn't start singing low and quiet like many birds—it screams at the top of its lungs for the entire song. Human voices and musical instruments are tame by comparison. Grotke says he came to the Macaulay Library in part to confront this challenge, but also to preserve a precious resource.

"The presence or absence of birds in a given habitat is a well-known indicator of the health of that environment. Accurately preserving these sounds and making them available to future generations for research and global monitoring efforts is something I am very passionate about," he says. When the Ivory-Billed Woodpecker—once thought to be extinct—was recently sighted in the Arkansas woods, scientists used 70-year-old recordings from the Macaulay Library to identify the bird's call.

In a recent issue of The Auk (vol. 122, no. 3, 2005), Macaulay Library engineers described the technical issues they face. The key to preserving the sounds is digitizing them at a high enough sample rate to accommodate the frequency range and then storing them so that all the information can be retrieved easily. Typical CD-quality sound is sampled at 44.1 kilohertz (kHz) with a 16-bit data stream, but many Macaulay Library tapes far exceed the bandwidth this format offers. Grotke opted for 96-kHz sampling for most birds, 192 kHz for bats and marine mammals, and a 24-bit data stream. He pieced together an analog-to-digital converter and a digital audio workstation that fully supported this unique data structure.
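The trade-off between sample rate, bit depth, and storage can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes uncompressed single-channel PCM audio (the article does not state the channel count or file format), and the function names are hypothetical. It applies the standard rule that a sample rate can represent frequencies up to half its value (the Nyquist limit).

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency that a given sample rate can represent."""
    return sample_rate_hz / 2


def pcm_bytes_per_hour(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Storage needed for one hour of uncompressed PCM audio.

    Assumes whole-byte samples and, by default, a single channel.
    """
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * 3600


# Formats named in the article: (sample rate in Hz, bit depth).
FORMATS = {
    "CD quality": (44_100, 16),
    "most birds": (96_000, 24),
    "bats and marine mammals": (192_000, 24),
}


def summarize(formats=FORMATS):
    """Return (label, Nyquist limit in Hz, bytes per hour) for each format."""
    return [
        (label, nyquist_hz(rate), pcm_bytes_per_hour(rate, bits))
        for label, (rate, bits) in formats.items()
    ]


if __name__ == "__main__":
    for label, limit_hz, size in summarize():
        print(f"{label}: up to {limit_hz / 1000:.2f} kHz, {size / 1e9:.2f} GB per hour")
```

Under these assumptions, moving from CD quality to 96 kHz at 24 bits roughly triples the storage per hour, and 192 kHz doubles it again, which is part of why the choice of storage medium matters so much.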

For data storage, the engineers needed a technology that would last—and something that future librarians could easily retransfer to new data systems. DVD-Audio seemed the natural choice, until the researchers realized the music industry's copy-protection technology would automatically downsample any sound to 48 kHz (16 bits) when they tried to play it back. Because the whole point of preservation is to make the full frequency range available for study, the engineers turned to DVD-ROMs, storing the high-resolution audio files as data. So far, they've filled three and a half of the library's 12 DVD "jukeboxes," each of which holds 480 disks. A similar effort to archive the library's relatively new video collection takes up half as much disk space. Duplicates reside in a safe location off campus.

## Who Cooks for You?

The ultra-high-quality recordings are used only in-house, but the library will soon offer downsampled copies on the Web (www.animalbehaviorarchive.org). Right now, users can listen to a few birds, frogs, and marine mammals, but the Web site will ultimately link all the information scientists will need to study these animals. Critical data include the locations in which recordings were made; maps of the animals' habitats; sound and video recordings; graphical representations of song frequency, called spectrograms; and even the mnemonics that field scientists memorize to help them recognize an animal's call when they hear it. (The Prothonotary Warbler's call is often translated as, "sweet, sweet, sweet, sweet, sweet," but the mnemonic for the Barred Owl's hoot is more typical of birds' (and birders') creativity: "WHO cooks for YOOOU? WHO cooks for YOOOU-ALL?")

Large data sets are often idiosyncratic, says Jeff Dozier, professor of snow hydrology, Earth system science, and remote sensing in the Donald Bren School of Environmental Science and Management at the University of California, Santa Barbara. Speaking at a symposium on data preservation and management at the February 2006 meeting of the American Association for the Advancement of Science in St. Louis, Missouri, Dozier outlined some of the challenges from the data author's perspective. For his own work, he builds environmental models of snow accumulation and snow melt in the Sierra Nevada mountains using satellite data from NASA's Earth Observing System. He downloads 36 Mbytes of data per day—an amount that's easy enough to store on disk—but the challenge is to conveniently store the data in a way that other scientists can use for future research.

The way people access massive data sets is also changing, Dozier says. Rather than simply obtaining data from major data centers such as government agencies, researchers are beginning to share their own data products with each other via the Internet. In this scheme, the data's lineage becomes important. "If you use my data product, you want to know what went into it. In best practices, there would be an electronic version of a research notebook that preserves that information," he says.

An archive should also contain a description of the computations performed on the data so that others can reanalyze them, fill in missing information, or correct errors. Dozier uses a wrapper script that works passively in the background of his applications to store this information. His research group will have their archived data products available online soon at www.snow.ucsb.edu.
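The article does not describe how Dozier's wrapper script works, so the following is only a minimal sketch of the general idea: wrap each processing step so that its name, inputs, a fingerprint of its output, and a timestamp are recorded automatically. Every name here (`record_provenance`, `PROVENANCE_LOG`, `scale_snow_depth`) is hypothetical, and the in-memory list stands in for whatever persistent "electronic research notebook" a real archive would use.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical in-memory notebook; a real archive would persist this to disk.
PROVENANCE_LOG = []


def record_provenance(func):
    """Decorator that logs each call to a processing step without changing it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            # Fingerprint of the output, so later users can check what they received.
            "result_sha256": hashlib.sha256(repr(result).encode("utf-8")).hexdigest(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def scale_snow_depth(depths_cm, factor):
    """Hypothetical processing step: rescale a list of snow depths."""
    return [d * factor for d in depths_cm]


if __name__ == "__main__":
    scale_snow_depth([10.0, 20.0], factor=0.5)
    print(json.dumps(PROVENANCE_LOG, indent=2))
```

Because the decorator leaves the wrapped function's behavior untouched, the record keeping stays in the background, which matches the passive approach described above.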

## That's No Siren

At the Borror Laboratory of Bioacoustics at Ohio State University, biology graduate student Miles Spathelf sits at a mixing table. He's digitizing reel-to-reel audio tape of sounds recorded in a Chinese rain forest. A shrill cry cuts through the forest backchatter, growing in pitch like an ambulance siren. "That's gibbon," Spathelf says. Along with the National Sound Archive at the British Library, the Macaulay Library and the Borror Lab are the world's main repositories of animal sounds. All are digitizing their analog data.

Jill Soha, the Borror Lab's curator, points to a shelf of gold-coated CDs that make up the lab's growing collection. As they do at the Macaulay Library, her staff keeps copies on an in-house server and stashes duplicate CDs off site for safety. They're also sharing downsampled versions of the sounds through a statewide educational network called OhioLink (http://worlddmc.ohiolink.edu/media/borror/blbLogin/).

Borror Lab director Doug Nelson coauthored the Auk paper with the Macaulay Library engineers. As computer memory gets cheaper, he sees bioacoustics labs moving toward solid-state storage. With the advent of portable digital recorders, fewer scientists are dragging typewriter-sized reel-to-reel recorders into the field, and some data are coming to the lab already in digital form. "In the future, we'll probably just record straight onto hard drives and memory cards," he says.

Soha sits at a computer and plays one of more than 1,300 Song Sparrow calls from OhioLink. The mnemonic for this sprightly tune, Nelson says, is "maids, maids, put your TEA kettle-ettle ON!" Whereas Nelson studies subtle differences in birds' regional dialects, Soha listens for a change in the song—something to indicate what was happening to the bird at that moment. "You hear that? Right now he's singing a different one," she says a minute later. The bird probably uses the same call to attract a mate and ward off rivals. Other calls could signal a food sighting or the swoop of a predatory Cooper's Hawk from above.

Nelson, Soha, and their colleagues are assembling a kind of avian sociology, and their work gets to the heart of theories in animal cognition. They want to know how birds learn their songs, and whether females prefer males with local accents.

The Borror staff spent three years digitizing their analog tapes; now they want to help smaller labs digitize their collections, too. That's something Grotke would like to do as well. With two-thirds of the Macaulay collection still unarchived, and thousands of reels aging on shelves around the world, he ventures that his staff could stay busy for decades. One problem, though: the manufacturers of tape equipment are retiring the technology, and spare parts will soon be hard to come by. "I suspect that we won't be able to keep our hardware going for much more than 10 years," Grotke says. "So we've got a lot of work to do."

Bob Grotke sees a need for new computer algorithms that compress massive audio and video data losslessly. At the Macaulay Library, the problem is one of sheer bandwidth and storage capacity, but other types of research have data sets with many variables and dimensions that must be "squashed" together for storage and then retrieved intact.
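The defining property of lossless compression is that the restored data match the original byte for byte. The sketch below demonstrates that round trip using Python's general-purpose `zlib` module as a stand-in; it is not an audio-specific codec and not a method the article attributes to the library. For simplicity it synthesizes a 16-bit test tone rather than the 24-bit recordings described earlier, and the function names are hypothetical.

```python
import math
import struct
import zlib


def synth_pcm16(num_samples=9600, sample_rate_hz=96_000, tone_hz=4_000.0):
    """Build raw little-endian 16-bit PCM bytes for a pure test tone."""
    samples = [
        int(20000 * math.sin(2 * math.pi * tone_hz * n / sample_rate_hz))
        for n in range(num_samples)
    ]
    return struct.pack(f"<{num_samples}h", *samples)


def roundtrip(raw: bytes):
    """Compress and decompress, returning both so the caller can compare."""
    compressed = zlib.compress(raw, 9)  # 9 = strongest compression setting
    restored = zlib.decompress(compressed)
    return compressed, restored


if __name__ == "__main__":
    raw = synth_pcm16()
    compressed, restored = roundtrip(raw)
    print("original bytes:  ", len(raw))
    print("compressed bytes:", len(compressed))
    print("bit-exact restore:", restored == raw)
```

A pure tone is highly repetitive and so compresses far better than a real field recording would; the point of the example is the exact-equality check, not the ratio.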

Raymond L. Orbach, director of the Office of Science at the US Department of Energy (DOE), was also at the American Association for the Advancement of Science (AAAS) symposium on data collections. His office plans to initiate a long-term research program to address this so-called "curse of dimensionality." As data sets have grown larger, researchers have grown frustrated, he says. Data mining is more cumbersome, meaning that important information might be missed.

During his presentation, Bruce Schatz of the University of Illinois, Urbana-Champaign, put it a bit differently. "This is a problem that hits researchers where they live," he says, because "data is disappearing." A professor of library and information science, Schatz co-leads the university's effort to build an online information system called BeeSpace (www.igb.uiuc.edu/beespace/). The project will analyze genes and behavior, using the Western Honey Bee as a guide. Scientists will compile a detailed database of gene expressions for hundreds of individual bees and link the genes to each bee's unique societal role.

For its part, the DOE plans to boost data storage funding from US$34 million to US$37.6 million for 2007, Orbach says. His office boasts 100 petabytes (Pbytes) of data storage, which he expects to more than double by 2009 to make room for burgeoning experimental and simulation data. Then comes the challenge of getting the data to the people who analyze them—ideally, in real time. "This is not a dead archival issue," he adds.

Anita Jones, the Lawrence R. Quarles Professor of Engineering and Applied Science at the University of Virginia, chaired a 2003 US National Science Board workshop on this issue. The resulting report, "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century," is available on the Web (www.nsf.gov/pubs/2005/nsb0540/). At AAAS, she commented that the task of preserving large digital data sets is the purview of all science and engineering. "It may well be the best thing for science that NSF [US National Science Foundation] and other agencies make a long-term investment" in data collection, she says.