Luis Vargas, University of Cambridge
Since they first became commercially available, database systems have played a well-defined role in enterprises: managing business-critical information and making it accessible to clients. During the past 40 years, the database research community has worked to significantly improve data storage, organization, management, and access, as reflected in many of today's most popular database management systems. Enterprises have been moving their critical data to (mainly relational) databases, and the number of databases, applications, and domains within each enterprise has multiplied.
Moreover, by enabling cross-enterprise communication, the Internet has helped convert closely controlled environments into highly distributed, dynamic, large-scale information spaces. In today's global marketplace, most enterprises must actively share information with their partners, suppliers, and customers to achieve their business goals.
The use of message-oriented middleware has become a popular approach to supporting enterprises' information-sharing needs. Increasingly, database systems must interact with MOM technologies to share and consume information. This has recently inspired the idea of implementing messaging features directly at the database. In this column, I discuss some of the research issues involved in integrating database and messaging functionalities.
Consider an online retailer's Web site that sells its own goods along with those of many different partners. Besides maintaining information about its own orders, the retailer hosts its partners' catalogs and storefronts, so it must maintain multiple prices for products when catalogs overlap. Once a client places an order, the order system must handle invoicing for each partner and send a fulfillment request to each partner's delivery system. Ideally, the system should check for the nearest warehouse to the customer that stocks each product and attempt to minimize the number of shipments required to fulfill the order. The system must monitor each step so that customers can track an order's status. In such a system, coordination and active information sharing between the retailer and partner systems are fundamental, both to maintain up-to-date catalogs of partners' products and to keep current information regarding each warehouse's contents.
Databases maintain most of a business's critical information and reflect its daily processes. So, for many years, enterprises have used numerous commercial and home-grown tools with their databases for information sharing. For example, database administrators use extract-transform-load tools such as Informatica's PowerCenter (http://www.informatica.com/products/powercenter/default.htm) and Sunopsis' DataConductor (http://www.sunopsis.com/corporate/us/products/sunopsis/snps_dc.htm) extensively to extract operational data, transform it against a common schema, and load the result somewhere for subsequent querying. Similarly, data federation tools, such as DB2 Information Integrator (http://www.ibm.com/us/) and Informix Enterprise Gateway Manager (http://www-306.ibm.com/software/data/informix/tools/egm/), enable the logical integration of multiple databases by setting them up as data sources for a global query federation engine.
However, the Internet and the need to share information in real time with systems outside the enterprise make these tools less appealing. Few companies will let external entities extract data from their operational systems. Also, consolidating and maintaining a view for every information-sharing need is often impossible, especially for a large number of domains or for enterprises collaborating on an ad hoc basis. For these reasons, one recommendation from the last Database Research Self Assessment at Lowell (http://research.microsoft.com/~gray/lowell/LowellDatabaseResearchSelfAssessment.htm) was for database researchers to focus on providing stronger facilities for cross-enterprise information sharing.
Using MOM is another way to connect databases and applications within and outside enterprises. MOM provides generic interfaces that let components pass messages asynchronously to one another—that is, they're loosely coupled. Java Message Service (http://java.sun.com/products/jms/) and WebSphere MQ (http://www-306.ibm.com/software/integration/wmq/) are popular MOM representatives. In general, MOM offers different delivery semantics (such as exactly-once and at-least-once) and reliability guarantees (persisted/nonpersisted). A few years ago, MOM focused just on message queuing and point-to-point communication, but it now supports many-to-many communication—for example, following the publish-subscribe paradigm. More recent offerings include security, transactional messaging, and message transformation functions. Most enterprise application integration architectures today are based on MOM combined with data representation technologies such as XML. MOM's primary disadvantage is that it requires additional components in the architecture. As with any system, adding components can reduce performance and reliability and make the system more difficult and expensive to maintain.
The methods for database information sharing via MOM differ depending on whether the database system acts as a message publisher or consumer. Consider the online retailer example. Keeping an updated version of partners' catalogs at the retailer's database requires each partner's database to publish catalog changes as messages and the retailer's database to consume and apply them.
For a partner, this involves creating a set of triggers on updates of the relevant product tables, where each trigger gathers the required product information, encodes it as a message, and publishes it via the MOM API. In many cases, limitations in the database trigger system (such as the inability to invoke Java code) make it impossible to publish messages into the MOM directly. We can generally handle this by pushing updates into intermediate tables that an external application polls periodically; that application then encodes the updates as messages and publishes them into the MOM. For the retailer, keeping track of catalog changes involves maintaining a MOM client application that subscribes to messages related to vendors, catalogs, or products of interest. After a message's delivery, the application accesses the database (for example, via Java Database Connectivity) to update the relevant catalog tables.
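This intermediate-table ("outbox") pattern can be sketched with an embedded SQLite database standing in for the partner's operational store. The table names, the JSON payload format, and the `publish` callback here are illustrative assumptions, not part of any particular MOM API; a real poller would hand each payload to, say, a JMS producer.

```python
import json
import sqlite3

# In-memory database standing in for a partner's product catalog.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL);
    -- Intermediate "outbox" table that an external poller drains.
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
    -- Trigger capturing price changes as outbox rows.
    -- (json_object needs SQLite's JSON support, bundled with modern Python.)
    CREATE TRIGGER product_update AFTER UPDATE OF price ON product
    BEGIN
        INSERT INTO outbox (payload)
        VALUES (json_object('sku', NEW.sku, 'price', NEW.price));
    END;
""")

conn.execute("INSERT INTO product VALUES ('A-1', 'Widget', 9.99)")
conn.execute("UPDATE product SET price = 8.49 WHERE sku = 'A-1'")
conn.commit()

def poll_and_publish(conn, publish):
    """Drain the outbox, handing each row to a MOM publish callback."""
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))              # e.g. a JMS/MQ send
        conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    conn.commit()

messages = []
poll_and_publish(conn, messages.append)
print(messages)   # [{'sku': 'A-1', 'price': 8.49}]
```

Because the trigger writes the outbox row inside the same transaction as the catalog update, the poller never sees a change that was rolled back.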
Large-scale, loosely coupled systems architectures are emerging as a natural pattern for wide-area information sharing. We're also seeing database systems increasingly interacting with established messaging technologies such as MOM. These developments introduce new research directions for the database community and present new challenges.
Ten years ago, Jim Gray made the argument that messages are data too, and queues are tables with special characteristics.1 More recently, this has inspired the idea of implementing queue-based messaging features directly at the database system (as in Oracle Streams; http://www.oracle.com/technology/products/dataint/htdocs/streams_fo.html). This allows sending, receiving, and processing messages to and from queues within the database system. Such integration not only brings operational simplicity to the system but also allows new levels of functionality, security, scalability, and reliability. Two basic integration approaches exist: building the messaging layer on top of the database's existing abstractions, with queues stored as ordinary tables, or enhancing the database engine itself with native messaging support.
The first approach's obvious drawback is that databases aren't designed to be message-queuing systems, inherently limiting their scalability and functionality. In the second approach, you must enhance the database system in one or more of the following ways to support messaging.
The database system must be able to act as a (message publisher/subscriber/propagator) node in a distributed messaging system. At a low level, this requires supporting TCP sockets, HTTP, SOAP, or some other communication protocol for sending and receiving messages from the database. At a high level, it involves realizing the relevant node's functionality in the messaging system.
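At the low level, the minimal capability is exchanging framed messages with peers over some transport. The sketch below uses raw TCP sockets with newline-delimited JSON framing; both the framing and the message contents are illustrative choices, not any standard protocol.

```python
import json
import socket
import threading

def serve_once(server_sock, received):
    """Accept one connection and record the message it carries."""
    conn, _ = server_sock.accept()
    with conn:
        line = conn.makefile("rb").readline()   # newline-delimited framing
        received.append(json.loads(line))

# The "database node" listens for messages from peers.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
received = []
t = threading.Thread(target=serve_once, args=(server, received))
t.start()

# A peer sends one catalog-update message.
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall((json.dumps({"sku": "A-1", "price": 8.49}) + "\n").encode())

t.join()
server.close()
print(received)   # [{'sku': 'A-1', 'price': 8.49}]
```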
Message queuing and point-to-point communication provide an initial basis for database information sharing. However, the distributed systems research community has been working for years on other, more efficient data-dissemination mechanisms (such as multicast and content-based messaging systems).2,3
Just as for tables, access to message queues must be indexed for efficiency—for example, according to message priority. Because a queue index must deal with an unusually high rate of insertions and deletions, it can become unbalanced quickly. So, the database system must periodically rebalance and possibly reconfigure such an index. Researchers have done a considerable amount of work on self-tuning databases,4 but support for message queuing introduces new performance challenges.
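The priority-ordered dequeue such an index supports can be sketched with a binary heap; a production system would use a disk-based structure tuned for this churn, but the interface is the same. The class and message names below are illustrative.

```python
import heapq
import itertools

class PriorityQueueIndex:
    """Dequeue the highest-priority message first, FIFO among equal
    priorities. A DBMS would back this with a B-tree tuned for high
    insert/delete churn; a binary heap is the minimal in-memory analogue."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker preserving arrival order

    def enqueue(self, priority, message):
        # Lower number = higher priority.
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

q = PriorityQueueIndex()
q.enqueue(2, "restock notice")
q.enqueue(1, "order placed")
q.enqueue(1, "payment received")
print(q.dequeue(), "|", q.dequeue(), "|", q.dequeue())
# order placed | payment received | restock notice
```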
Providing information sharing at the database requires support for active conditional expressions to describe changes of interest in database content. Publishers could determine these expressions by defining what information is of interest and making it available for subscribers to query. Or, subscribers could specify them at a finer grain via direct subscriptions on the database. From the mid 1980s to the mid 1990s, researchers did a lot of important work in the area of active databases.5 Most mainstream database systems eventually included support for active technology, generally in the form of triggers that followed (to some degree) the SQL-99 standard. Active technology provides information sharing with the means to detect interesting situations at the database. However, information sharing brings a new set of requirements that call for extending trigger implementations. These include the following.
Flexible coupling modes. Current trigger implementations deal only with a single event that's detected after the execution of a data manipulation language operation and is processed within the transaction that issued the operation. In information sharing, the system raises events to indicate that a condition of interest has happened and then creates a relevant message. Although the trigger that enqueues the message must act within the triggering transaction to ensure consistent notification, the system should publish the message from the queue asynchronously and outside transaction boundaries.
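The required coupling, with the enqueue atomic inside the triggering transaction and publication deferred until after commit, can be sketched with SQLite. The table names and the `place_order` helper are hypothetical, standing in for an application transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE msg_queue (id INTEGER PRIMARY KEY, body TEXT);
""")
conn.execute("INSERT INTO stock VALUES ('A-1', 10)")
conn.commit()

def place_order(conn, sku, n):
    """Update state and enqueue the notification in ONE transaction."""
    with conn:   # commits on success, rolls back on exception
        conn.execute("UPDATE stock SET qty = qty - ? WHERE sku = ?", (n, sku))
        qty = conn.execute("SELECT qty FROM stock WHERE sku = ?", (sku,)).fetchone()[0]
        if qty < 0:
            raise ValueError("insufficient stock")
        conn.execute("INSERT INTO msg_queue (body) VALUES (?)",
                     (f"shipped {n} x {sku}",))

place_order(conn, "A-1", 3)            # succeeds: state and message commit together
try:
    place_order(conn, "A-1", 100)      # fails: rollback discards the message too
except ValueError:
    pass

# An asynchronous publisher later drains only committed messages.
print([b for (b,) in conn.execute("SELECT body FROM msg_queue")])
# ['shipped 3 x A-1']
```

The failed order leaves neither a stock change nor a stray message, which is exactly the consistent-notification guarantee the coupling mode must provide.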
Scalable trigger evaluation. A main reason that users are reluctant to use triggers is anxiety about performance. Trigger processors lack sophistication: trigger technology typically isn't scalable, and only a small number of triggers can be defined on a database without drastically affecting its transaction processing performance. Supporting information sharing requires evaluating a potentially large number of active conditional expressions specified on a database after an event occurs. These expressions might be simple predicates on the state transition data associated with the event, or they could involve complex conditions referring to multiple tables.
So, an important challenge is to monitor a large number of such expressions efficiently. Researchers haven't done much work along these lines; an exception is the work of Eric Hanson and his colleagues.6 However, the research community has recently given considerable attention to slightly modified versions of this problem in systems other than databases.7,8
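One family of techniques indexes the predicates themselves, so that matching an event probes a few hash buckets instead of scanning every subscription. A minimal sketch for conjunctions of equality predicates follows; the class and subscription names are illustrative, and real predicate-indexing schemes also handle inequality and range predicates.

```python
from collections import defaultdict

class PredicateIndex:
    """Hash-index equality predicates by (attribute, value) so that
    matching an event costs one bucket probe per event attribute,
    largely independent of the total number of subscriptions."""

    def __init__(self):
        self._index = defaultdict(set)   # (attr, value) -> subscription ids
        self._sizes = {}                 # subscription id -> predicate count

    def subscribe(self, sub_id, predicates):
        self._sizes[sub_id] = len(predicates)
        for attr, value in predicates.items():
            self._index[(attr, value)].add(sub_id)

    def match(self, event):
        hits = defaultdict(int)
        for attr, value in event.items():
            for sub_id in self._index.get((attr, value), ()):
                hits[sub_id] += 1
        # A subscription fires only if all its predicates were satisfied.
        return {s for s, n in hits.items() if n == self._sizes[s]}

idx = PredicateIndex()
idx.subscribe("warehouse-A", {"sku": "A-1"})
idx.subscribe("analytics", {"sku": "A-1", "event": "price_drop"})
print(idx.match({"sku": "A-1", "event": "restock"}))   # {'warehouse-A'}
```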
Composite event detection. Many applications must combine events from various sources into a single event to detect a situation of interest. Such events could be raised locally at the database or generated at other databases or applications and delivered as messages. Composite event detection (CED) requires evaluating streams of individual (primitive) events and relating them to each other while verifying whether two or more events satisfy a certain constraint, such as time or sequence. Rich event algebras and prototype implementations exist, such as Snoop in Sentinel.9 This work provides the basis for building extended models that consider the new requirements created by information sharing.
Because event sources can be highly distributed, CED must consider the lack of global time within the system. The distributed systems community has devised different ways to deal with this, such as using time stamps associated with an uncertainty level.10 In many scenarios, once the system detects a composite event, database content in that context will determine what actions to perform. So, we need support for active conditional expressions that seamlessly tie CED with the evaluation of conditions on database content. Lately, we've seen considerable research on supporting CED outside the database system—for example, over MOM.11 When considering how to integrate messaging and database systems, we'll need to decide whether to place CED functionality inside or outside the database; the choice could have major implications for operational simplicity and scalability.
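A sequence operator, the simplest building block of such an event algebra, can be sketched as follows. The event kinds and window parameter are illustrative, and a real detector would also account for timestamp uncertainty across distributed sources.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str
    ts: float   # timestamp; distributed sources would carry uncertainty too

class SequenceDetector:
    """Detects the composite event 'A followed by B within `window` time
    units', a minimal instance of a sequence operator in the spirit of
    event algebras such as Snoop."""

    def __init__(self, first, second, window):
        self.first, self.second, self.window = first, second, window
        self._pending = []   # timestamps of unmatched `first` events

    def feed(self, event):
        """Return (t_first, t_second) when the composite fires, else None."""
        if event.kind == self.first:
            self._pending.append(event.ts)
        elif event.kind == self.second:
            for t in self._pending:
                if 0 <= event.ts - t <= self.window:
                    self._pending.remove(t)
                    return (t, event.ts)
        return None

det = SequenceDetector("order_placed", "payment_received", window=60.0)
det.feed(Event("order_placed", 10.0))
print(det.feed(Event("payment_received", 35.0)))   # (10.0, 35.0)
```

Once the composite event fires, the detector's output would drive the condition evaluation on database content discussed above.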
Integrating messaging functionality into the database system represents a promising approach to cross-enterprise information sharing. On one hand, database systems already provide many features that a messaging-based architecture can exploit, such as reliable storage, transactions, and triggers. On the other hand, integrated messaging capabilities in the database make for information-sharing systems that are simpler to deploy and maintain. However, messaging and database technology have evolved independently, so designing and implementing database-messaging systems will require bringing together concepts and functionality from two separate worlds. Our goal, then, is to support functional, reliable, and scalable mechanisms for detecting, propagating, and consuming information of interest from the database. To reach this goal, we should consider the many relevant contributions from distributed systems as well as research done within the database community.
Cite this article: Luis Vargas, "Database Challenges in Enterprise Information Sharing," IEEE Distributed Systems Online, vol. 7, no. 10, 2006, art. no. 0610-ox002.
I thank Ken Moody for his valuable comments.