For computational simulations, the era of “big data” ended before it began. We’re actually living in the era of infinite data, in which the stream pouring forth from computational models and simulations can be as voluminous as outputting every value at every timestep, drowning disks and researchers alike in high-cadence, arbitrarily large datasets. Rather than struggling to make models bigger, the challenge is now to keep them under control.
Complexity and Volume
Simulators face the challenge of generating more data than they can process. Distilling it into meaningful, relevant information requires an “understanding” of the data and careful curation of the stream a simulation produces. A common strategy, driven largely by necessity, is to pass data in memory during a calculation to an analysis task, avoiding the cost of writing full checkpoints to disk. Yet even then, the analysis must complete before the next set of data can be processed. This Sisyphean cycle of processing, waiting, processing, and waiting continues as the simulation progresses.
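The interleaved, in-memory analysis pattern described above can be sketched in a few lines. This is a toy illustration only: the diffusion step and the `analyze` reduction are hypothetical stand-ins, not any particular simulation code or in-situ framework.

```python
import numpy as np

def advance(state, dt):
    """Stand-in for one simulation timestep (here: simple 1D diffusion
    on a periodic grid)."""
    return state + dt * (np.roll(state, 1) - 2 * state + np.roll(state, -1))

def analyze(state, step):
    """In-memory analysis task: reduce the full field to a few summary
    values rather than writing a full checkpoint to disk."""
    return {"step": step, "max": float(state.max()), "mean": float(state.mean())}

state = np.exp(-np.linspace(-3, 3, 256) ** 2)  # initial condition
summaries = []
for step in range(100):
    state = advance(state, dt=0.1)
    # The simulation blocks here until the analysis completes -- the
    # processing-and-waiting cycle described above.
    summaries.append(analyze(state, step))
```

The point of the sketch is the coupling: each timestep hands its data directly to the analysis, so only compact summaries, not full snapshots, ever need to touch the disk.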
My own work is in computational astrophysics, where I study the formation of the first stars. These stars formed deep within the gravitational potential wells of dark matter halos. Although the simulations must advance time from the early universe across millions of years, the relevant timescales within the centers of these halos are days or even hours. Making sense of the data means determining not only where to look, but how often we should be looking there.
Yet the inexorable march of progress in computational simulations isn’t just toward bigger and bigger: simulations are becoming richer, with physical models that account for more variables and processes, and the questions we can ask of our data are growing correspondingly more complex. The challenge of synthesizing information scales with this complexity, so simulators must develop more sophisticated tools and techniques to interrogate the data — tools that can handle both complexity and quantity. For this month’s theme on processing, visualizing, and understanding HPC data, I have selected articles that address these challenges.
Computing Now’s May theme opens with Hank Childs and his colleagues’ “Research Challenges for Visualization Software,” which clearly and concisely enumerates the difficulties in visualizing vast quantities of data. The authors, luminaries in the field of visualization, identify challenges presented by user and technical requirements.
I’ve also selected “Importance-Driven Isosurface Decimation for Visualization of Large Simulation Data Based on OpenCL,” a recent article from Computing in Science & Engineering that addresses a common problem with large datasets: how can we make the complexity tractable while still preserving the important features? Authors Yi Peng, Li Chen, and Jun-Hai Yong apply this to isosurfaces and detail how they used OpenCL kernels to implement their algorithm.
As I noted earlier, the choices made during the visualization process can reveal scientifically relevant features — or miss them entirely. In “Activity Detection in Scientific Visualization,” Sedat Ozer and his colleagues describe a mechanism for sifting through data to find relevant information and features to examine.
In “Ultrascale Visualization of Climate Data,” Dean Williams and his colleagues describe the challenges of visualizing vast datasets, drawn from both real and simulated climate data. They present new methods for querying the data, as well as ways to efficiently and correctly track the provenance of visualizations.
Figuring out the right questions to answer can be just as challenging as finding the answers — and often requires considerable thought and development. In “A Novel Approach to Visualizing Dark Matter Simulations,” Ralf Kähler, Oliver Hahn, and Tom Abel address the challenges of understanding the phase space distribution of dark matter in cosmological simulations. Dark matter, a collisionless fluid, is discretized in simulations as particles; most visualization techniques present this collisionless fluid as a collection of points of data. In this article, the authors present a new method for visualizing the dark matter distribution, relying on an understanding of how it moves in phase space, resulting in a much higher-fidelity and physically motivated depiction of the simulation.
Finally, “Adaptive Extraction and Quantification of Geophysical Vortices” describes the process of drawing out features from a complex simulation. These processes must be fast, highly accurate, and motivated by a physical understanding of the system. Sean Williams and his colleagues describe a mechanism for identifying vortices in a simulation, which offers better understanding of the underlying data.
The challenges of the “era of infinite data” are fascinating, and I hope you find the approaches presented in these articles to be as exciting and interesting as I do.
M. Turk, “Processing, Visualizing, and Understanding Data from High-Performance Computing,” Computing Now, vol. 7, no. 5, May 2014, IEEE Computer Society [online]; http://www.computer.org/publications/tech-news/computing-now/processing-visualizing-and-understanding-data-from-high-performance-computing.
Matthew Turk is an associate research scientist at Columbia University, studying the formation of the first stars in the Universe and developing the cyberinfrastructure for large-scale simulation and analysis of physical phenomena. He is the CN liaison to Computing in Science & Engineering magazine. Contact him at matthewturk at gmail dot com.