The Evergreen State College
Corporation for National Research Initiatives
Pages: pp. 8-10
Abstract—This special issue presents researchers' latest efforts to help the scientific community manage increasingly large repositories of data.
Keywords—scientific data management; big data; database management systems; DBMS; scientific computing
Big Data is characterized not only by the enormous volume or the velocity of its generation, but also by the heterogeneity, diversity, and complexity of the data.
—Suzi Iacono, US Interagency Big Data Senior Steering Group
Although 2012 saw Big Data enter American business and popular culture, 1,2 the phenomenon was definitely not news to the scientific community or to CiSE readers. For years, scientists have grappled with exponential growth in data acquisition and generation, and 18 months ago the CiSE Big Data issue emphasized that although Big Data was creating "an extremely exciting time for scientific discovery," many challenges remained before scientists could make optimal use of data-intensive scientific computing. 3 Guest editors Francis J. Alexander, Adolfy Hoisie, and Alexander Szalay concluded that the sheer scale of massive datasets precluded them from being easily moved about for analysis, and that heterogeneous and idiosyncratic data types (documented or not, structured or unstructured) were so prevalent that "even computational scientists now agree that simply faster disk space and more and faster CPU cycles will not solve [Big Data] problems." 3
These Big Data challenges haven't diminished since 2011. Indeed, in the US, the National Science Foundation 4,5 and the computer science research community are aware of these many significant challenges and are responding, and similar initiatives are underway internationally. This issue of CiSE features articles by five leading research teams whose scientific data-management projects respond to the challenges outlined in the CiSE Big Data issue and elsewhere. In choosing from among the many exciting researchers who regularly present their work at the annual Scientific and Statistical Database Management Conference (http://ssdbm.org), we settled on computer scientists who work particularly closely with domain researchers from across the natural and physical sciences on what we consider the root of the Big Science Data challenge: the data. As said so well in Raw Data Is an Oxymoron, data is anything but "raw" and we should "think of it as … a cultural resource … to be generated, protected, and interpreted." 6 Indeed, a focus on generating, protecting, and interpreting data is a precursor to maximizing the yield of data collected by diverse science and engineering activities.
In this special issue on Scientific Data Management, Tamás Budavári and his colleagues describe SkyQuery, a system that enables astronomers to take advantage of the massive data provided by numerous telescopes. The system, which has its antecedents in the Sloan Digital Sky Survey (made possible through the efforts of Jim Gray, one of the foremost computer science researchers of the last 20 years), is helping produce a paradigm shift in astronomy. The article describes the effort to build a scalable query engine that dynamically federates the largest all-sky catalogs in parallel on a cluster of relational databases.
In "Data Near Here: Bringing Relevant Data Closer to Scientists," V.M. Megler and David Maier discuss the difficulty of knowing where and how to find and access relevant data in large scientific repositories. They call for an improvement in the tools used to archive and find such data, because unless scientists can easily access the information, large scientific repositories increasingly run the risk of losing value as their holdings expand. The authors go on to describe their novel information retrieval research and its implementation for a major oceanography project, but make the case that the approach is widely applicable to (and needed in) other scientific domains.
As guest editors of this issue, we first thank the contributing authors not only for the effort they put into preparing these articles specifically for CiSE, but also for their dedication to helping scientists realize the promise of Big Data science and engineering. We also thank the heretofore anonymous reviewers who provided helpful feedback to both authors and editors as we compiled this issue: Shawn Bowers, Gonzaga University; James Frew, University of California-Santa Barbara; Carole Goble, University of Manchester; Richard Hooper, Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI); Eduard Hovy, Carnegie Mellon University; Rebecca Koskela, DataONE at the University of New Mexico; Christine Laney, University of Texas-El Paso; Peter McCartney, National Science Foundation; Jim Myers, Rensselaer Polytechnic Institute; Margaret O'Brien, University of California-Santa Barbara; Frank Olken, Arlington, Virginia; Eric Schulman, Institute for Defense Analyses; Mark Servilla, Long-Term Ecological Research (LTER) Network Office at the University of New Mexico; Robert Tawa, US National Ecological Observatory Network (NEON); Nancy Wiegand, University of Wisconsin-Madison; and Bruce Wilson, Oak Ridge National Laboratory.
We look forward to your comments on this issue!