, Bavarian Research Center for Knowledge-Based Systems
Abstract—A new system, optimized toward high-performance computing, extends the RasDaMan (Raster Data Management) database management system to allow flexible management of multidimensional spatiotemporal data and to reduce tertiary storage access time.
Large-scale scientific experiments and supercomputing simulations often generate huge multidimensional data sets. Data volume can reach hundreds of terabytes (up to petabytes). An archival mass-storage (tertiary) system permanently stores these data sets as files on thousands of magnetic tapes, cartridges, or optical disks. Access and transfer times of such tertiary storage devices, even if robotically controlled, are relatively slow. Nevertheless, tertiary storage systems are the common state of the art today for storing large volumes of data because magnetic tapes are far cheaper than hard-disk devices. This trend will continue, even if hard disks' cost decreases and capacity increases, because magnetic tapes with more capacity (greater than 1 Tbyte) are already on the way. The development of new satellites, sensors, parameters, and so on will dramatically increase the amount of data that such devices must store, but magnetic tapes are well prepared for the task.
For data access in high-performance computing (HPC), tertiary storage systems' main drawbacks are high access latency compared to hard-disk devices and no random access capability. A major problem for scientific applications is that these systems don't allow access to specific data subsets. Accessing a subset of a large data set requires transferring the entire file from the tertiary storage media. Considering the time required to load, search, read, rewind, and unload several cartridges, such retrieval can take many hours. Furthermore, processing data across a multitude of data sets—for example, time slices—is difficult to support. Analysis along different dimensions is often contrary to storage patterns and necessitates network transfer of each required data set, implying a prohibitively immense amount of data to be shipped. Hence, many interesting, important evaluations are impossible with current technology. 1 Another disadvantage is that access to data sets occurs at an inadequate semantic level. Applications accessing HPC data must deal with directories, file names, and data formats rather than accessing multidimensional data in terms of area of interest and time interval. Data access is also inefficient because data sets are stored according to their generation process—for example, in time slices. Different access patterns, such as spatial regions, require data-intensive extraction processes with a high CPU workload. 1
The European Spatio-Temporal Data Infrastructure for High-Performance Computing (Estedi) project, funded by the European Commission (EC), addresses the delivery bottleneck of large HPC results to users, providing flexible data management for spatio-temporal data. Estedi, an initiative of European database developers, software vendors, and supercomputing centers, seeks to provide a solution for the storage and retrieval of multidimensional HPC array data. The HPC areas that Estedi addresses include climate-modeling simulations, cosmological experiments, atmospheric data transmitted by satellites, flow modeling of chemical reactors, computational fluid dynamics, and simulation of gene expressions' dynamics. Estedi can model such natural phenomena as spatio-temporal array data of some specific dimensionality. Each of these HPC areas requires storing a huge amount (hundreds of terabytes) of multidimensional discrete data (MDD). As part of the Estedi project, we developed a system that extends RasDaMan (Raster Data Management), the first commercial multidimensional-array database management system (DBMS). Our system solves the shortcomings of data access in HPC and provides flexible data management of spatio-temporal data.
Smart management of large-scale data sets in tertiary storage systems begins by combining the efficient retrieval and data set manipulation capabilities of a multidimensional-array DBMS with the huge storage capacity of tertiary storage media such as magnetic tape. This DBMS must be extendable with easy-to-use functionalities to automatically store and retrieve data to and from tertiary storage systems without user interaction. 2
Within the Estedi project, we implemented such intelligent concepts, optimized toward HPC, 1 and we integrated them into the kernel of RasDaMan. Although RasDaMan was the outcome of an EC-funded project, since 1999 it has been a commercial product of RasDaMan GmbH. The Bavarian Research Center for Knowledge-Based Systems designed RasDaMan for generic multidimensional-array data of arbitrary size and dimensionality. By generic, we mean that functionality and architecture are not tied to particular application areas. Figure 1 depicts the extended RasDaMan system's architecture with tertiary storage connection.
Figure 1 Extended RasDaMan architecture with tertiary storage connection.
The original RasDaMan architecture included a client, a server, and an underlying conventional DBMS (such as Oracle) that served as a storage and transaction manager. The additional components for the tertiary storage interface are the tertiary storage (TS) manager, the file storage (FS) manager, and the hierarchical storage management (HSM) system. The RasDaMan server includes the TS manager and the FS manager. The HSM system is a conventional product such as Sun Microsystems' Storage Archiving System or Legato Systems' DiskXtender. It's essentially a normal file system with unlimited storage capacity. In reality, the HSM system's virtual file system has two main sections: a limited cache, in which the user loads or stores data, and a tertiary storage system with robot-controlled tape libraries. The HSM system automatically migrates or stages data to and from the tertiary storage media, as necessary.
To overcome the major bottleneck in accessing specific subsets of multidimensional data (MDD), we introduced the tiling concept of the RasDaMan DBMS. An MDD object includes an array of cells of some base type (integer, float, or arbitrary complex types) located on a regular multidimensional grid. An often-discussed approach is to chunk or tile large data sets. This approach is common for multidimensional arrays in different application areas. 3,4 Basically, chunking means subdividing multidimensional arrays into disjoint subarrays. All chunks have the same shape, size, and dimensionality, and are therefore aligned. Tiling is more general than chunking because it doesn't require that subarrays be aligned or have the same size. 5,6 RasDaMan allows subdividing MDD into regular or arbitrary tiles. Consequently, an MDD object is a set of multidimensional tiles. RasDaMan stores every tile as a single binary large object (BLOB) in the underlying relational DBMS. This makes it possible to transfer only a subset of large MDD from the DBMS (or tertiary storage media) to client applications because a single tile governs access granularity. Moreover, RasDaMan's tiling strategy circumvents the problem of inefficient access to data sets stored according to their generation process order. This reduces access time and network traffic. The query response time scales with the query box size rather than with the MDD size. The tiling strategy should be chosen to suit users' typical access patterns; for access patterns that it does not match, tiling is inefficient.
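The effect of tile-level access granularity can be illustrated with a small sketch. It assumes regular (aligned) tiling only and uses hypothetical names (`tiles_for_query`, `query_box`, `tile_shape`); it is not RasDaMan code, and arbitrary tiling would need an index lookup instead of integer division.

```python
from itertools import product

def tiles_for_query(query_box, tile_shape):
    """Return grid indices of the regular tiles that intersect a query box.

    query_box:  one inclusive (lo, hi) cell range per dimension.
    tile_shape: tile edge length in cells per dimension.
    """
    ranges = []
    for (lo, hi), edge in zip(query_box, tile_shape):
        # First and last tile index touched along this dimension.
        ranges.append(range(lo // edge, hi // edge + 1))
    return list(product(*ranges))

# Hypothetical 3D object of 1000 x 1000 x 1000 cells in 100 x 100 x 100 tiles.
hits = tiles_for_query([(150, 260), (0, 99), (990, 999)], (100, 100, 100))
total_tiles = 10 * 10 * 10
print(len(hits), "of", total_tiles, "tiles must be read")
```

Only the tiles returned by the function have to leave the storage layer, which is why the response time follows the query box size rather than the object size.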
We must distinguish two possibilities for storing data in RasDaMan. First, we can explicitly store data on hard disks in the underlying DBMS. This is useful if the data access time is critical, for example, if some users require frequent access to the data sets. After several months, if the data sets are less important, these users can use the query language to export them. To allow such exporting, we integrated a new statement (export from <object> where <condition>) into RasDaMan's query language. The second possibility for storing data is to export data sets automatically and independent of user interaction to tertiary storage media.
The export of data sets to tertiary storage media has two steps. The first step is to migrate the data sets from RasDaMan to the cache area—RAID (redundant array of independent disks) system—of the HSM system. Transferring data sets from the hard disk of RasDaMan's underlying DBMS to a RAID system is very fast. The second step is the migration of the data sets from the HSM system's cache area to the tertiary storage media. The HSM system handles this step without affecting RasDaMan's I/O workload. In parallel with this process, RasDaMan can perform other tasks, such as executing another export process or handling user requests. Exported data sets aren't displaced immediately in the underlying DBMS. RasDaMan classifies the data sets as cached data and evicts them only if the cache size reaches its upper limit.
We can distinguish three well-defined areas regarding data accessibility: online, near-line, and offline (see Figure 1). Online access means data sets are stored on hard disk, so access time is very fast. Near-line access means data sets are stored on magnetic tape integrated in robot-controlled libraries. Access is far slower than with online access, but data retrieval is automatic. Offline access means data sets (for example, backed-up or rarely used data) are stored on magnetic tapes, which are not integrated in robot-controlled libraries. In this case, retrieving data sets requires user interaction. Before someone can request such data, the HSM system administrator must fetch the magnetic tape and put it manually into the HSM system.
The new RasDaMan tertiary storage functionality is based on the TS manager module, which we implemented and integrated into the RasDaMan kernel. During query execution, the TS manager knows (by metadata) whether the needed data sets are stored on hard disk (online, in a DBMS or an HSM cache) or on tertiary storage media (near-line or offline). If the data sets are on hard disk, the TS manager processes the query according to the RasDaMan system's original procedure without a tertiary storage connection. If the data sets are stored on tertiary storage media, the TS manager must first import the data sets into the database system (cache area for tertiary storage data). The TS manager automatically handles the import of data sets stored on tertiary storage media whenever a user executes a query and requests those data sets. After the import process, RasDaMan can handle the data sets as it normally would. The complexity of the RasDaMan storage hierarchy (hard-disk devices and tertiary storage media) is completely hidden from the user. Only a query's response time is different, depending on whether a user requests data sets held on hard-disk devices or on tertiary storage media.
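The decision the TS manager makes during query execution can be sketched as follows. This is a simplified illustration, not the RasDaMan kernel code: the `Location` enum, the `catalog` dictionary standing in for the metadata, and the step descriptions are all hypothetical. It also assumes that data in the HSM cache still needs an import into the DBMS cache area, as the performance discussion later in the article suggests.

```python
from enum import Enum

class Location(Enum):
    DBMS = "online, DBMS"
    HSM_CACHE = "online, HSM cache"
    NEAR_LINE = "near-line, tape in robot library"
    OFFLINE = "offline, tape outside the library"

def plan_access(super_tile_id, catalog):
    """Return the steps needed before a query can run on one super tile."""
    location = catalog[super_tile_id]  # metadata lookup
    if location is Location.DBMS:
        return ["execute query"]
    steps = []
    if location is Location.OFFLINE:
        steps.append("administrator inserts tape into the library")
    if location in (Location.OFFLINE, Location.NEAR_LINE):
        steps.append("HSM stages file from tape to HSM cache")
    steps.append("import super tile into DBMS cache area")
    steps.append("execute query")
    return steps

catalog = {"st1": Location.DBMS, "st2": Location.NEAR_LINE}
for name in catalog:
    print(name, "->", plan_access(name, catalog))
```

From the user's point of view only the number of steps, and therefore the response time, differs; the query text is the same in every case.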
RasDaMan's algebraic query language, RasQL, extends SQL (Structured Query Language) with powerful multidimensional features, such as geometric, induced, and aggregation operators. 2,6 The primary benefit of such a complex query language is that it minimizes data transfer between database server and client. RasQL lets users specify areas of interest with geometric operators, and allows execution of complex calculations on the server side. Only the result, not the entire object, goes to the client. 1,7 Because RasDaMan need transfer only a minimum amount of data to the client, this feature overcomes the shortcoming of processing data across a multitude of data sets. Furthermore, RasQL provides data access at an adequate semantic level. Users can formulate queries such as, "average temperature on the earth surface of altitude y in the area of latitude x and longitude z."
If data sets are stored on hard-disk devices, most queries on multidimensional array data are CPU bound due to the query execution time of multidimensional operations. 7 For data sets stored on tertiary storage media, most queries are I/O bound due to the access time to load, rewind, and unload the medium, and then to search, read, and transfer the data. The high access latency of tertiary storage systems, compared to hard-disk devices, requires techniques for reducing tertiary storage access time.
The access time for tape systems is orders of magnitude longer than for hard-disk devices. Thus, data management techniques must support efficient retrieval of arbitrary areas of interest from large data sets. Our techniques partition data sets into clusters on the basis of optimized data access patterns and storage device characteristics to reduce access time for tertiary storage devices.
In the RasDaMan DBMS, tiles (BLOBs) are the smallest data access unit. They typically range in size from 64 Kbytes to 1 Mbyte and are optimized for hard-disk access. 5 This is far too small for data sets held on tertiary storage media. We must choose different granularities for hard disks and tape access because they differ significantly in their access characteristics. Hard disks have fast random access, whereas tape systems have sequential access with much higher access latency. The average access time for tape systems (20 seconds to 180 seconds) is several orders of magnitude longer than for hard-disk drives (5 ms to 12 ms), whereas the difference between their transfer rates isn't so significant (a factor of 2 or 3). 8,9 For this reason, we exploit tertiary storage systems' good transfer rate while preserving the tiling concept's advantages. The main goal is to minimize the number of media load and search operations.
The size of the data blocks, which move from the tape to the hard-disk cache, is particularly critical. On the one hand, to minimize network traffic, it's preferable to transfer only small data blocks over the network. On the other hand, the transfer rate of tertiary storage systems is acceptable, whereas we must minimize the number of tape accesses. Consequently, the size of data blocks stored on tertiary storage media should be larger than 100 Mbytes. For example, the average access time of a DLT 8000 tape drive is about 60 s, and the transfer rate is about 6 Mbytes/s. Reading 100 Mbytes from tape takes 76.7 s (60 s access time plus 16.7 s transfer time). Reading only 10 Mbytes from tape requires 61.7 s (60 s access time plus 1.7 s transfer time). Obviously, we need to avoid such extreme dominance of access time over transfer time.
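The DLT 8000 arithmetic above follows from a simple cost model: total time equals a fixed access time plus size divided by transfer rate. The sketch below applies that model using the 60 s and 6 Mbytes/s figures from the text; the function name and the additional block sizes are illustrative assumptions, and real drives will deviate from such a constant-latency model.

```python
def tape_read_seconds(size_mbytes, access_s=60.0, rate_mbytes_per_s=6.0):
    """Fixed access latency plus sequential transfer time."""
    return access_s + size_mbytes / rate_mbytes_per_s

for size in (10, 100, 200, 1000):
    total = tape_read_seconds(size)
    transfer = size / 6.0
    share = transfer / total
    print(f"{size:5d} Mbytes: {total:6.1f} s total, "
          f"{share:5.1%} of the time spent transferring data")
```

The share of time spent on useful transfer grows with the block size, which is the motivation for reading large blocks from tape rather than individual tiles.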
Increasing the tile size of RasDaMan's MDD (64 Kbytes to 1 Mbyte) would be unreasonable because then we would lose the advantage of reducing transfer volumes when accessing data on hard disk. The solution is to introduce an additional data granularity, which a super tile provides. We developed an algorithm that intelligently combines several small MDD tiles into one super tile, thus minimizing tertiary storage access cost. 2,6 In this way, we exploit the good transfer rate of tertiary storage devices while preserving the advantages of other concepts, such as data clustering. Figure 2a illustrates a 3D MDD with a super tile and tile granularity.
Figure 2 The super tile concept: (a) visualization of a multidimensional object with super tile and tile granularity; (b) example R+ tree index of a multidimensional object with super tile nodes.
Our algorithm computes super tiles by combining tiles in the spatial neighborhood surrounding the multidimensional object. The algorithm uses information from the RasDaMan R+ tree index (which is a multidimensional index structure based on the well-known B tree). 10 We extended the multidimensional DBMS' conventional R+ tree index structure to handle super tiles stored on tertiary storage media. This means that whether tiles are stored on hard disk or on tertiary storage media, the TS manager integrates the information into the index. The algorithm combines tiles of the same subindex of the R+ tree into a super tile and stores them within a single file on a tertiary storage medium (see Figure 2b). Super tile nodes can exist at arbitrary levels of the R+ tree.
To detect super tile nodes, users can predefine the size of the super tiles, optimizing them for the data and tertiary storage access characteristics they want. 11 If the user does not define a super tile size, the default is 200 Mbytes. Extensive tests within the Estedi project have shown that this size gives the best performance characteristics for typical scenarios. Super tiles govern the access (import-export) granularity of MDD on tertiary storage media, preserving the advantages of the RasDaMan tiling concept (minimizing the data load) and exploiting the good transfer rates of tertiary storage devices.
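One way to picture the detection of super tile nodes is a top-down walk over the index tree in which the highest subtree that fits the size limit becomes one super tile. The sketch below is a deliberately simplified stand-in, not the published algorithm: the `Node` class is a hypothetical replacement for R+ tree nodes, bounding boxes are omitted, and no attempt is made to merge small sibling subtrees.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    tile_bytes: int = 0                      # set only on leaves (tiles)
    children: list = field(default_factory=list)

    def size(self):
        if not self.children:
            return self.tile_bytes
        return sum(child.size() for child in self.children)

    def tiles(self):
        if not self.children:
            return [self.name]
        return [t for child in self.children for t in child.tiles()]

def find_super_tiles(node, max_bytes):
    """Return one list of tile names per super tile."""
    # A subtree that fits (or a single oversized tile) is one super tile.
    if node.size() <= max_bytes or not node.children:
        return [node.tiles()]
    result = []
    for child in node.children:
        result.extend(find_super_tiles(child, max_bytes))
    return result

MB = 10**6
root = Node("root", children=[
    Node("A", children=[Node(f"t{i}", 60 * MB) for i in (1, 2, 3)]),
    Node("B", children=[Node(f"t{i}", 90 * MB) for i in (4, 5, 6)]),
])
for group in find_super_tiles(root, 200 * MB):
    print(group)
```

Because the walk stops at the first subtree that fits, super tile nodes can end up at different levels of the tree, and tiles that share a subtree (and are therefore spatial neighbors) stay together in one file.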
For tertiary storage systems in which the device's positioning time is high, clustering is important. The main goal is to minimize the number of search and media load operations and reduce the access time of clusters read from the tertiary storage system when subsets are needed. 12 Clustering exploits the spatial neighborhood of tiles in data sets. Clustering tiles according to spatial neighborhoods on one disk or tertiary storage system proceeds one step further in preserving spatial proximity. This is important for the typical access patterns of array data because users often request data using range queries, for which the spatial-neighborhood concept is suitable.
We can illustrate the importance of clustering through a worst-case scenario: storing data sets on tertiary storage media without managing clustering. In this scenario, super tiles are randomly distributed on magnetic tapes, and tiles are randomly combined into super tiles. Now let's say that one user requests data using a range query with a result containing 23 tiles. Without clustering, all the tiles could be included in different super tiles. So, 23 super tiles would have to be loaded from magnetic tape. Thus, many seek, rewind, and load operations would have to occur because the super tiles are scattered on the tape. With clustering (which exploits the spatial-neighborhood concept) all tiles are included in one or two contiguous super tiles on the magnetic tape. So, loading is considerably faster. Consequently, the access time for these two cases varies significantly.
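The scenario above can be reproduced in miniature by counting how many super tiles a range query touches under two tile-to-super-tile assignments. The grid size, block size, and both assignment functions below are assumptions chosen for illustration; the scattered case uses a deterministic round-robin assignment in place of a random one so that the result is repeatable, and the query covers 24 tiles rather than the 23 in the text.

```python
GRID = 20    # hypothetical 2D object of 20 x 20 tiles
BLOCK = 4    # a clustered super tile holds 4 x 4 neighboring tiles
NUM_SUPER_TILES = (GRID // BLOCK) ** 2   # 25 super tiles of 16 tiles each

def clustered(row, col):
    """Neighboring tiles share a super tile."""
    return (row // BLOCK) * (GRID // BLOCK) + (col // BLOCK)

def scattered(row, col):
    """Round-robin assignment that ignores spatial neighborhood."""
    return (row * GRID + col) % NUM_SUPER_TILES

def super_tiles_touched(rows, cols, assign):
    return len({assign(r, c) for r in rows for c in cols})

# Range query covering 4 x 6 = 24 tiles.
rows, cols = range(0, 4), range(0, 6)
with_clustering = super_tiles_touched(rows, cols, clustered)
without_clustering = super_tiles_touched(rows, cols, scattered)
print("super tiles to load with clustering:   ", with_clustering)
print("super tiles to load without clustering:", without_clustering)
```

Each additional super tile implies at least one more positioning operation on tape, so the gap between the two counts translates directly into access time.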
The R+ tree index already defines the clustering of the stored MDD. Our super tile algorithm lets us distinguish intra- from inter-super tile clustering. 6 This algorithm maintains the R+ tree index's predefined clustering of subtrees (super tile nodes) to achieve intra-super tile clustering (see Figure 2a). The export algorithm (which exports super tiles to tertiary storage) implements the inter-super tile clustering in a multidimensional object. Our system writes super tiles of a given object to tertiary storage media according to the predefined R+ tree clustering order.
Our system optimizes inter- and intra-super tile clustering of objects for a typical access pattern, according to the chosen (predefined) tiling strategy of a single multidimensional object. Normally, users' access patterns don't change dramatically. Consequently, the physical storage pattern is acceptable for most scenarios. Nevertheless, if users' access patterns dramatically change and no longer match the physical storage patterns, they can reconfigure the tiling strategy of one multidimensional object. After the reconfiguration, the system will export this object to the tape again.
To reduce expensive tertiary storage media access, we use RasDaMan's underlying DBMS as a hard-disk cache for data sets held on tertiary storage media. The general goal of caching tertiary storage data (super tile granularity) is to minimize expensive loading, rewinding, and reading operations from slower storage levels (for example, magnetic tape). The tertiary storage version of RasDaMan migrates requested data sets held on tertiary storage media to the underlying DBMS (see Figure 1). The migrated super tiles are now cached in the DBMS. After the migration, the RasDaMan server transfers only requested tiles from the DBMS to the client application.
The advantage of caching is that it eliminates the need to import data from tertiary storage media to the DBMS cache area for multiple requests. Such imports are extremely expensive because the retrieval from tertiary storage systems is slow. Furthermore, requests accessing this data are noticeably fast because there's already data in the DBMS cache area. The tertiary storage cache manager evicts data (super tile granularity) from the DBMS cache area only if the cache size reaches its upper limit. The current implementation of our system supports the LRU (least recently used) and FIFO (first in, first out) replacement strategies. In our access scenarios, the system performs better with the LRU strategy than with FIFO. Along with the HSM system's caching component, we built a caching hierarchy (see Figure 1). If data sets (super tiles) are cached in the RasDaMan DBMS cache area, we can achieve the fastest data access, identical to data stored directly in RasDaMan. Accessing data held in the HSM cache typically increases access time by a factor of 1.8 to 3.
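The difference between the two replacement strategies can be seen on a small access trace. The sketch below is a generic LRU/FIFO simulation, not the RasDaMan cache manager: it measures capacity in number of super tiles rather than bytes, and the trace and the `count_imports` helper are hypothetical.

```python
from collections import OrderedDict

def count_imports(trace, capacity, policy):
    """Count imports from tertiary storage for a sequence of requests."""
    cache = OrderedDict()   # oldest entry first
    imports = 0
    for super_tile in trace:
        if super_tile in cache:
            if policy == "LRU":
                # A hit refreshes the entry; FIFO keeps insertion order.
                cache.move_to_end(super_tile)
            continue
        imports += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)   # evict only when the cache is full
        cache[super_tile] = True
    return imports

# Super tile "A" is requested repeatedly between other requests.
trace = ["A", "B", "A", "C", "A", "D", "A", "B"]
for policy in ("LRU", "FIFO"):
    print(policy, "imports:", count_imports(trace, capacity=2, policy=policy))
```

On this trace LRU keeps the frequently requested super tile in the cache while FIFO evicts it by age, so FIFO needs one more import. Whether LRU wins in practice depends on the access pattern.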
Additional techniques for reducing tertiary storage access time include scheduling, prefetching, and I/O parallelism. Scheduling tertiary storage media access means optimizing the media read order. This optimization reduces expensive media seek and exchange operations. The focus is on scheduling policies that bundle all requests on a (loaded) medium before exchanging it from the robotic library's loading station. The prefetching algorithm loads data (from tertiary storage media) needed in the near future into RasDaMan's cache area. If the system can predict these data requests, it can optimize the data-loading process. Additionally, I/O parallelism is possible because HSM systems have several read-write drives and robot arms for handling thousands of magnetic tapes. One elemental concept for realizing I/O parallelism is the technique of intelligently distributing files (super tiles) on more than one magnetic tape. 2
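A scheduling policy of the kind described, bundling all requests for a medium before it is exchanged, can be sketched as below. The request tuples, tape identifiers, and the rule of serving the currently loaded tape first are illustrative assumptions rather than the system's actual policy.

```python
from collections import defaultdict

def schedule(requests, loaded_tape=None):
    """Group requests by tape and order them by position on the tape.

    requests: list of (tape_id, position_on_tape, super_tile_id).
    Returns a list of (tape_id, [super_tile_id, ...]) in service order.
    """
    by_tape = defaultdict(list)
    for tape, position, name in requests:
        by_tape[tape].append((position, name))
    # Serve the already loaded tape first, then the remaining tapes.
    order = sorted(by_tape, key=lambda tape: (tape != loaded_tape, tape))
    return [(tape, [name for _, name in sorted(by_tape[tape])])
            for tape in order]

requests = [("T2", 40, "s7"), ("T1", 10, "s1"), ("T2", 5, "s4"),
            ("T1", 90, "s3"), ("T2", 20, "s5")]
plan = schedule(requests, loaded_tape="T2")
for tape, names in plan:
    print(tape, names)
```

Served in arrival order, these five requests would alternate between the two tapes and force four media exchanges; the bundled plan needs one, and reading each tape in position order avoids backward seeks.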
Figure 3 shows the performance of the tertiary storage version of RasDaMan. For the retrieval functionality, we distinguish three cases. In the first case, the data sets needed are already in RasDaMan's DBMS cache area (left bar). The system retrieves 3.1 Gbytes of data in 127 seconds, which is very fast, because the system doesn't need to load any data from the HSM system.
Figure 3 Performance comparison of different retrieval scenarios.
In the second case, the data sets required are in the HSM system's hard-disk cache. This is quite likely because the HSM cache's size is typically hundreds of Gbytes (up to 5 Tbytes). In this case, the import of the data sets is comparable to the export of the data sets, because it's not necessary to load the data sets from the tertiary storage media. This access is a factor of about 1.8 to 3 slower than normal DBMS access, depending on the transfer rate, network traffic, and so on.
In the third case (third bar from the left), the HSM cache does not hold the requested data sets. Hence, the HSM system must first transfer the data sets needed from the tertiary storage media (DLT 8000) to the HSM cache, and then transfer them to the RasDaMan system. Compared with DBMS access (DBMS cache area), there's an extreme slowdown: a factor of about 7 to 20, depending on the tertiary storage device, the transfer rate, and so on. The rightmost bar shows the traditional access method. In this case, the application must load the complete data set (21.3 Gbytes) from the tertiary storage media.
The main differences between our proposed solution and other systems are the integration of features of DBMS-like transaction mechanisms, the use of a declarative query language, and the optimization of access methods for tertiary storage devices. Here, we compare related storage and retrieval systems (see Table 1) with our system with respect to the following:
FileTek's StorHouse/RM system is optimized for relational data (as in data warehouses), with the possibility of storing MDD as BLOBs. 13 StorHouse/RM has no further optimization methods for retrieving MDD. CERA, 14 developed at the Max-Planck Institute of Meteorology, supports semantic-orientated data access and storage for climate simulations with a main focus on storing metadata. CERA supports only rudimentary storage of MDD as BLOBs within the Oracle DBMS. APRIL is more optimized toward HPC, supporting chunking and the retrieval of subsets. 15 APRIL also supports rudimentary clustering and query scheduling. However, our system also allows object tiling, thus increasing flexibility. Furthermore, ours is the only system that has a multidimensional query language (RasQL) and that fully supports concepts such as clustering, caching, query scheduling, and transfer compression.
We are currently focusing on extending RasDaMan's multidimensional query language using a new concept called object framing. With this extension, users will no longer be restricted to performing range queries in the shape of multidimensional hypercubes. Now they will be able to formulate range queries by indicating complex frames. Thus, object framing represents a generalization of geometric operations.
Another research topic is parallel query support for multidimensional data. We hope to exploit RasDaMan's inter- and intra-object parallelism techniques to improve tertiary storage access performance. Initial performance measurements support the validity of our concept. On a two-processor machine, speed increased by a factor of up to 1.8, which we consider an extremely good result.