Future automated video surveillance systems demand sophisticated middleware support. This middleware approach supports large-scale, intelligent video surveillance by partitioning systems according to an activity topology.
Video surveillance networks are a class of sensor networks that serve several purposes, including protecting major facilities from terrorism and other threats. At the hardware level, standard IP networking devices and IP video cameras enable building thousand-camera networks at a reasonable cost. However, monitoring surveillance networks through human inspection is expensive and remarkably ineffective. Trained operators lose concentration and miss a high percentage of significant events after only 10 minutes. Consequently, surveillance users are turning to software for automated video surveillance. 1
Most research in this area concentrates on the computer vision algorithms required to detect and interpret activity in video. Such work is limited to networks of fewer than 100 cameras. We need to address the real-world issues raised by scaling to thousands of cameras and integrating a diverse, evolving collection of surveillance approaches into continuously operating surveillance networks.
Significant challenges arise in enabling a collection of video surveillance algorithms to cooperate in monitoring a large video surveillance network. Many of these challenges are generic to the video surveillance domain rather than specific to particular surveillance algorithms. Middleware can help address these generic challenges, including support for both communication and computation. We've structured our middleware around an activity topology—a graph describing the flow of observed activity through the network. This topology is both a generic tool facilitating multicamera algorithms and a way to partition distributed surveillance computations. Specifically, our middleware aligns computational partitions instantiating the blackboard architecture to subgraphs within the estimated topology. Blackboard partitions communicating through a service-oriented architecture (SOA) form a distributed blackboard, with communication requirements optimized (reduced) by aligning the partitions to topology subgraphs. (The "Related Work in Automated Video Surveillance Systems" sidebar discusses the field of automated video surveillance systems in more detail.)
Surveillance network characteristics
The high-level characteristics of video surveillance networks during normal operation (when monitoring a facility to detect threats, as opposed to during or after an incident) drive the design of our middleware.
Like early sensor networks, video surveillance networks aim to deliver information from many sources (sensors) to a few sinks (human operators). We therefore can view a surveillance network as a forest, in which data flows upward from sensors (leaves) to human operators (roots).
Unlike most other sensor networks, our system reports information to the sinks that describes anomalies detected by surveillance, rather than summaries that describe the network's overall state or complete information from sensors.
Many sensor networks are location centric, with sensors deployed to monitor locations of interest. In contrast, surveillance networks are ideally target centric, with sensors following targets as they move through the network. Signal quality constraints militate strongly against physically mobile cameras, so physically immobile cameras must provide this target-centric approach virtually.
During normal operation, the surveillance network aims to minimize workload on human operators. Therefore, data from cameras, rather than queries or instructions from operators, must drive control in surveillance network software.
Approximate and specific nature of surveillance algorithms
The complexity of video surveillance means that algorithms capture only an approximation of all activity of interest. Furthermore, although some generally useful algorithms exist (for example, background subtraction), many more are highly specific to a particular activity (for example, algorithms for detecting people climbing fences). Therefore, practical video surveillance systems must support many different approaches operating concurrently and in cooperation.
Abundance of power and capacity
Video surveillance networks are usually not highly constrained in terms of power and network bandwidth. Facilities such as airports typically invest many thousands of dollars per camera to provide the power, network, and processing infrastructure necessary so that sensors can continuously operate at full capacity. The cameras typically include computers running general-purpose operating systems such as Linux (in contrast to the special-purpose operating systems that many other sensor networks use).
Systems built using our middleware must meet not only surveillance application requirements but also various systemic (nonfunctional) requirements impinging on practical deployment of surveillance software. These include the following:
• Scalability. This requirement is particularly important in relation to computation; a large class of surveillance algorithms aren't inherently scalable. Middleware must provide tools suitable for scalable reimplementation of these algorithms.
• Availability. This requirement emerges from the network's scale. The large number of components results in a high probability that some components will be faulty at any given time. Consequently, the middleware must support fault tolerance sufficient to maintain acceptable availability levels.
• Evolvability. The surveillance network must accommodate change, including alterations to the hardware and software. Surveillance networks are large, long-running software systems, which must evolve to remain viable. System implementation needs to evolve to respond to changing system requirements.
• Integration. The network must integrate with external systems, including building-access systems, elevator control systems, threat databases, and separately developed surveillance systems. Surveillance networks are real-world systems, not academic prototypes operating in isolation. Middleware is the intermediary for communication with external systems.
• Security. An attacker's primary target is the facility that a surveillance network protects; attacking the network itself could be a means to this end. Middleware must provide facilities to counter such attacks.
• Manageability. During routine operation, automated surveillance software must be substantially autonomous to relieve human operators of active involvement. However, on detecting a threat, a phase change occurs, and human operators must exert control over those parts of the network where the threat is extant.
Many researchers identify real-time performance as a systemic requirement. This might be premature: the first challenge is to develop surveillance software that can accurately detect potential threats, bringing them to human operators' attention for response. Typical first-response times are on the order of 5 minutes. In this light, considering threat detection software to be real time is rather pointless. Software for automated management of ongoing incidents might well need to be real time, but that's a challenge to face after developing software that can accurately detect threats.
Surveillance network application requirements
Table 1 shows functional requirements typical of video surveillance systems. Signal-processing functions take place at the network edge to prepare video footage for higher-level processing. Distributed-video-interpretation functions occur within the network and make extensive demands on middleware, both for computation and communication. Distributed (scalable) processing of these functions is a major challenge in surveillance because extant systems centralize this processing. Command, control, and inspection functions let human operators examine an activity more closely during possible incidents. Finally, forensic functions let investigators extract detailed information after an incident has occurred.
Table 1. Video surveillance application requirements.
Surveillance middleware architecture
Figure 1 shows the architecture of surveillance systems using our middleware. Lower layers support signal-processing functions. Our current implementation uses dedicated compute servers at the network edge to run detection pipelines executing signal-processing functions. As camera-processing capacity improves, these pipelines can move onto the cameras. The higher layers of the architecture, built on top of a distributed blackboard model, support the distributed-video-interpretation functions. Finally, support for the forensic functions and the command, control, and inspection functions cuts through the majority of layers. A separate management and query subsystem supports the user interface that these functions require.
Figure 1. Surveillance system architecture.
The services layer provides active interfaces between the various layers. The camera management service includes facilities for registering cameras and allocating them to detection pipelines for load balancing and fault tolerance. In the future, this service will include automatic detection of disruptive changes in the camera signal (specifically, the signal's background model), like those changes that occur when a camera is moved. Such changes indicate that the existing state held for the camera is invalid.
The video-streaming service maintains an archive of video footage and makes it available when and where needed. Footage includes raw video saved from cameras as well as derived video produced by detection pipelines, such as foreground detected by background subtraction.
The activity topology estimation service is a key system component responsible for estimating relationships between cameras' fields of view. This service's outputs are essential to tractable implementation of higher-level functions such as intercamera tracking. Furthermore, identification of subgraphs within the topology allows partitioning the distributed blackboard used in higher layers, and thus enables those layers' scalability.
The activity topology is a graph describing the observed (past) behavior of target objects in the network. The (directed) edges in the activity topology describe the paths that surveillance targets take between segments within the cameras' fields of view (topology nodes). The edges can be labeled, typically with probabilities that targets leave a given node via the edge, thus providing a predictor for the next location of a target after it leaves a given node.
Statistical approaches for tracking a target within a camera's field of view fail when the target leaves that field of view. In such cases, a search is necessary to discover in which camera the target next appears. In the absence of an activity topology, tracking algorithms must search all other cameras' fields of view. Intercamera tracking for $t$ targets and $n$ cameras is thus $O(tn)$ and, in the likely case in which the number of targets is proportional to the number of cameras, approaches $O(n^2)$ for $n$ cameras. The activity topology restricts the cameras that algorithms must search to those adjacent to the current camera. The topology also enables prioritizing the search according to the likelihood of the target's next appearance.
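The difference between the two search strategies can be illustrated with a minimal sketch. It assumes the topology is held as a mapping from each node to its outgoing edges, labeled with transition probabilities; the node names, probabilities, and function names are invented for illustration and are not taken from our implementation.

```python
from typing import Dict, List

# Hypothetical activity topology: for each node (a segment of a camera's
# field of view), the probability that a departing target next appears at
# each adjacent node. All names and numbers are illustrative assumptions.
Topology = Dict[str, Dict[str, float]]

TOPOLOGY: Topology = {
    "lobby": {"corridor": 0.7, "stairs": 0.2, "exit": 0.1},
    "corridor": {"lobby": 0.5, "office": 0.5},
    "stairs": {"lobby": 1.0},
    "office": {"corridor": 1.0},
    "exit": {},
}


def exhaustive_search_order(topology: Topology, current: str) -> List[str]:
    """Without a topology, every other node is a candidate."""
    return sorted(node for node in topology if node != current)


def topology_search_order(topology: Topology, current: str) -> List[str]:
    """With a topology, search only adjacent nodes, most likely first."""
    edges = topology[current]
    return sorted(edges, key=lambda node: edges[node], reverse=True)


all_candidates = exhaustive_search_order(TOPOLOGY, "lobby")
likely_candidates = topology_search_order(TOPOLOGY, "lobby")
```

In this toy network the exhaustive search considers four nodes for a target leaving the lobby, whereas the topology-guided search considers three and tries the corridor first. The gap widens as the network grows, because the number of adjacent nodes typically stays small while the total number of nodes does not.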
The activity topology also enables partitioning computations in a surveillance network, thus providing a generic tool for achieving scalability. Our middleware partitions cameras in the network into zones within the activity topology. Zones are subgraphs of the topology such that there are many edges between nodes within a zone but few edges between zones. This partitioning enables a divide-and-conquer approach in which most processing occurs within zones, with a small amount of communication between zones (determined by the number of interzone edges).
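As a rough illustration of what makes a good zone partition, the following sketch counts intrazone and interzone edges for a given assignment of nodes to zones. The edge list, the zone labels, and the function name are hypothetical; the sketch evaluates a partition rather than computing one.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # directed edge (source node, destination node)


def count_edges_by_zone(edges: Iterable[Edge],
                        zone_of: Dict[str, str]) -> Tuple[int, int]:
    """Return (intrazone, interzone) edge counts for a zone assignment."""
    intra = inter = 0
    for src, dst in edges:
        if zone_of[src] == zone_of[dst]:
            intra += 1  # handled entirely within one zone
        else:
            inter += 1  # requires communication between zones
    return intra, inter


# Illustrative topology: two tightly connected groups joined by one edge.
EDGES = [("a1", "a2"), ("a2", "a1"), ("a2", "a3"), ("a3", "a1"),
         ("b1", "b2"), ("b2", "b1"), ("a3", "b1")]
ZONE_OF = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}

intra, inter = count_edges_by_zone(EDGES, ZONE_OF)
```

Here six of the seven edges stay within a zone, so only the single edge from a3 to b1 implies interzone communication.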
Our work has defined an activity topology estimation approach based on exclusion. 2 Conceptually, this approach begins with the assumption of a complete graph for the activity topology, then removes those edges contradicted by observed evidence. This negative-evidence approach contrasts radically with existing approaches, which seek to discover edges by attempting to follow targets. We have recently verified that exclusion can produce a useful topology estimate for an operational camera network of more than 100 cameras. 3 This is a significant result. Previous approaches haven't demonstrated an ability to scale beyond 100 cameras. Our exclusion-based topology estimator, on the other hand, provides the basis for middleware supporting systems with hundreds, or even thousands, of cameras. 4
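The start-complete-then-prune idea can be sketched as follows. The contradiction rule used here (an edge from a to b is removed when a is occupied while b is empty at the same instant) is a simplifying assumption chosen only to make the sketch concrete; it is not a description of the estimator in the cited work, and the function and variable names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterable, List, Set, Tuple

# One observation records, for a single instant, whether each node
# (field-of-view segment) was occupied.
Observation = Dict[str, bool]
Edge = Tuple[str, str]


def estimate_by_exclusion(nodes: List[str],
                          observations: Iterable[Observation]) -> Set[Edge]:
    """Start from the complete directed graph and remove contradicted edges."""
    edges: Set[Edge] = set(permutations(nodes, 2))
    for obs in observations:
        for a, b in list(edges):
            # ASSUMED rule, for illustration only: occupancy at `a` with
            # simultaneous emptiness at `b` contradicts the edge a -> b.
            if obs[a] and not obs[b]:
                edges.discard((a, b))
    return edges


NODES = ["a", "b", "c"]
OBSERVATIONS = [
    {"a": True, "b": True, "c": False},
    {"a": False, "b": False, "c": True},
]
remaining = estimate_by_exclusion(NODES, OBSERVATIONS)
```

With these two observations, every edge touching c is excluded and only the edges between a and b survive. Note that the estimate never needs to follow an individual target; it only accumulates evidence against edges.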
The conceptual model for computation in higher layers of our middleware is the blackboard paradigm. 5-7 A blackboard is both a repository of information and a vehicle for computation. The paradigm is one of multiple workers collaborating on some shared task, each contributing to the overall solution by processing work deposited by others. Data flows from low-level inputs, through intermediate (working) results stored in the data space, and eventually to high-level results reported to the blackboard's users. This data flow aligns well with the surveillance problem's hierarchical structure. In contrast to a database, control in a blackboard is data driven rather than user driven, which aligns well with the data-driven characteristic of surveillance. This paradigm is prevalent across AI application areas, particularly those concerned with complex signal processing. In addition, software architecture researchers have studied blackboards, 8 providing an architectural basis for the construction of middleware for distributed video surveillance. 9
As figure 2 shows, a blackboard includes a data space containing data units, and a knowledge-source space containing programs. The controller element runs an enable-select-execute loop, conceptually forever. The enable phase considers events raised by previous execution and the external environment to determine which knowledge sources can run. The select phase chooses one of the enabled knowledge sources, and the execute phase runs it. Each knowledge source encapsulates an algorithm that contributes to the surveillance network's operation in some way. Typical computation for a knowledge source is the derivation of new results (hypotheses) from existing results. The knowledge source then adds the new result to the information base within the blackboard. The appearance of these results might in turn trigger the execution of another knowledge source, leading to a forward-chaining style of processing.
Figure 2. Blackboard architecture.
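The enable-select-execute loop and the resulting forward chaining can be sketched in a few lines. This is a toy model under stated assumptions, not our toolkit: events are (kind, payload) pairs, each knowledge source is enabled by a single event kind, and selection uses a fixed priority. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Event = Tuple[str, Any]  # (kind, payload)


@dataclass
class KnowledgeSource:
    name: str
    trigger: str                         # event kind that enables this source
    run: Callable[[Any], List[Event]]    # derives new results from a payload
    priority: int = 0


@dataclass
class Blackboard:
    sources: List[KnowledgeSource]
    data_space: List[Event] = field(default_factory=list)
    new_events: List[Event] = field(default_factory=list)
    agenda: List[Tuple[KnowledgeSource, Event]] = field(default_factory=list)

    def post(self, kind: str, payload: Any) -> None:
        """Add a result to the data space and raise an event for it."""
        event = (kind, payload)
        self.data_space.append(event)
        self.new_events.append(event)

    def step(self) -> bool:
        """Run one enable-select-execute iteration; False if nothing can run."""
        # Enable: match new events against knowledge-source triggers.
        for event in self.new_events:
            for ks in self.sources:
                if ks.trigger == event[0]:
                    self.agenda.append((ks, event))
        self.new_events.clear()
        if not self.agenda:
            return False
        # Select: pick the enabled activation with the highest priority.
        choice = max(self.agenda, key=lambda act: act[0].priority)
        self.agenda.remove(choice)
        # Execute: run it and post whatever it derives.
        ks, (_, payload) = choice
        for kind, derived in ks.run(payload):
            self.post(kind, derived)
        return True


def detect_person(blob: dict) -> List[Event]:
    return [("person", blob)] if blob["area"] > 100 else []


def flag_restricted(person: dict) -> List[Event]:
    return [("alert", person)] if person["zone"] == "restricted" else []


bb = Blackboard([KnowledgeSource("detect", "blob", detect_person),
                 KnowledgeSource("flag", "person", flag_restricted)])
bb.post("blob", {"area": 250, "zone": "restricted"})
while bb.step():  # a real controller would loop indefinitely
    pass
kinds = [kind for kind, _ in bb.data_space]
```

Posting a single blob result triggers the detection source, whose output in turn triggers the flagging source, so the data space ends up holding the blob, the derived person hypothesis, and the alert. Neither knowledge source refers to the other; they are coupled only through the results they publish.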
A second mode of operation involves goals. Knowledge sources might try to fulfill a goal by searching for proof from results in the data space. Knowledge sources might also delegate the proof process by creating subordinate goals, leading to a backward-chaining style of processing. To support this processing style, the controller accounts for goals as well as events when making scheduling decisions. You can think of a blackboard as a special case of multiagent systems in which (among other constraints) the agents (knowledge sources) have a predetermined way of sharing—namely, publication of results to the blackboard.
Distribution of the blackboard
The straightforward way to use a blackboard in a distributed system is to encapsulate it within a centralized server, which clients access via RPC. However, this approach doesn't scale. Therefore, our middleware instead instantiates multiple blackboard partitions, each wrapped within an HTTP shell (providing both client and server facilities), as figure 2 shows. A specialized knowledge source called the input driver adds data sent from other partitions into the data space. Another set of specialized knowledge sources called output filters evaluates results in the data space according to significance (the result is somehow noteworthy) and confidence (the result is likely to be correct). The output filters then forward highly rated results to other partitions. Notice that the flow of data into and out of knowledge sources follows a publish-subscribe model. This model, widely used in middleware, also provides the basis for communication between partitions.
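An output filter of the kind described above might look like the following sketch. It assumes that significance and confidence are scores in the range 0 to 1 and that a result is forwarded only when both meet a threshold; the text does not specify how ratings are computed or combined, so these choices, along with all names, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    description: str
    significance: float  # assumed scale: 0 (routine) to 1 (noteworthy)
    confidence: float    # assumed scale: 0 (doubtful) to 1 (near certain)


class OutputFilter:
    """Forwards highly rated results to subscribed partitions."""

    def __init__(self, min_significance: float, min_confidence: float) -> None:
        self.min_significance = min_significance
        self.min_confidence = min_confidence
        self.subscribers: List[Callable[[Result], None]] = []

    def subscribe(self, callback: Callable[[Result], None]) -> None:
        self.subscribers.append(callback)

    def publish(self, result: Result) -> bool:
        """Forward the result if it passes both thresholds."""
        if (result.significance >= self.min_significance
                and result.confidence >= self.min_confidence):
            for callback in self.subscribers:
                callback(result)
            return True
        return False  # the result stays local to this partition


received: List[Result] = []
out = OutputFilter(min_significance=0.8, min_confidence=0.6)
out.subscribe(received.append)  # stands in for another partition's input driver

forwarded = out.publish(Result("person climbing fence", 0.9, 0.7))
suppressed = out.publish(Result("leaf moving in wind", 0.1, 0.9))
```

In a deployment, the subscriber callback would be replaced by an HTTP request to the receiving partition, whose input driver adds the result to its own data space.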
The middleware divides processing into three layers: low-level detection, medium-level activity, and high-level reasoning. The detection layer is responsible for the signal-processing functions identified in table 1. It processes video footage from individual cameras and produces results (again corresponding to individual cameras) that are more amenable to subsequent processing. The activity layer combines these results into results containing information related to multiple cameras. Finally, the reasoning layer interprets results of lower layers to determine what to report to the system's human operators, and restructures lower layers to optimize their operation. Thus, the activity and reasoning layers together are responsible for the distributed-video-interpretation and target-following functions in table 1.
These layers comprise a collection of blackboard partitions, each structured as shown in figure 2. It's also possible to implement the detection layer as a set of blackboard partitions. However, the main requirement at this level is efficiency rather than flexibility. Therefore, we replace blackboard partitions with pipelines, which apply a series of transformations to input data as it arrives from cameras and forward the results of various stages of processing to higher layers. The overall effect is similar to blackboard operation: producing higher-level results from low-level inputs. In this sense, we're using a pipes-and-filters architecture to provide an optimized implementation of blackboard partitions. Measurements of detection pipelines indicate that mainstream current CPUs (specifically, dual-core Opteron 2212s at 2.0 GHz) can process about seven camera streams in real time. 4
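A pipes-and-filters detection pipeline can be sketched with chained generators, each stage consuming the previous stage's output as it arrives. The sketch is deliberately simplified: a frame is a flat list of pixel intensities, the background model is a fixed frame, and only the final stage's output is forwarded, whereas the pipelines described above also forward intermediate results. Stage names and thresholds are assumptions for illustration.

```python
from typing import Iterable, Iterator, List

Frame = List[int]  # toy frame: a flat list of pixel intensities


def subtract_background(frames: Iterable[Frame], background: Frame,
                        threshold: int = 30) -> Iterator[List[bool]]:
    """Stage 1: mark pixels that differ enough from the background model."""
    for frame in frames:
        yield [abs(p - b) > threshold for p, b in zip(frame, background)]


def foreground_fraction(masks: Iterable[List[bool]]) -> Iterator[float]:
    """Stage 2: reduce each foreground mask to the fraction of pixels set."""
    for mask in masks:
        yield sum(mask) / len(mask)


def flag_activity(fractions: Iterable[float],
                  min_fraction: float = 0.25) -> Iterator[bool]:
    """Stage 3: report whether a frame contains notable foreground."""
    for fraction in fractions:
        yield fraction >= min_fraction


def detection_pipeline(frames: Iterable[Frame],
                       background: Frame) -> Iterator[bool]:
    """Compose the stages; each frame flows through as it arrives."""
    masks = subtract_background(frames, background)
    return flag_activity(foreground_fraction(masks))


BACKGROUND = [10, 10, 10, 10]
FRAMES = [[10, 12, 9, 10],      # essentially the empty scene
          [10, 200, 180, 10]]   # two pixels covered by a foreground object
flags = list(detection_pipeline(FRAMES, BACKGROUND))
```

The fixed stage order is what makes this cheaper than a blackboard partition: there is no enable or select phase, only execution.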
Communication and scalability
Each blackboard partition and detection pipeline executes on a different computing node, providing considerable potential for concurrency and hence computational scalability. However, realizing this potential requires minimizing coupling between nodes to avoid blocking the computations they're running. Communication between nodes takes three forms: reporting, direction, and dissemination.
Each node has a small set of superior nodes within the hierarchical structure. Each node reports the results it derives from its input data to its superior. For example, detection nodes report objects identified by background subtraction to activity nodes. For reasoning-layer nodes, the superiors are human operators.
Each node at the reasoning and activity layers has several subordinates, and it can issue goals to these subordinates. For example, when a reasoning node has detected a person moving along a path, it can send its subordinates a signature for that person and instructions to track him or her.
As nodes derive results, they proactively disseminate results that exceed importance thresholds to a small set of peers in the same layer. Dissemination replicates the important information in the blackboard, providing a weakened consistency model in which information rated highly by output filters will propagate quickly (and will thus be reasonably consistent). In contrast, information rated as having little importance might never leave its originating node.
Dissemination and reporting are essentially publication processes in which nodes share information with other nodes. The way in which publishers (that is, nodes) share information follows the SOA principle of description, not prescription. Thus, publishers provide information according to a defined schema, but the receiver determines how to process this information.
Both dissemination and direction tend to increase coupling between communicating nodes, thus reducing scalability. Reliance on these styles of communication delays computation until the arrival of information—through dissemination and in response to direction, respectively. In contrast, nodes process reported information as it arrives and never wait for future reports (with the exception of a small fraction of reports issued in response to directions). Hence, reporting doesn't increase coupling.
Therefore, a key factor in system scalability is the ability to minimize the proportion of communication that uses the dissemination and direction styles, thus increasing the use of reporting. Our middleware achieves this through the camera management service and the activity topology estimation service operating in concert. Specifically, the association of detection pipelines with activity-level blackboard partitions is aligned to the estimated activity topology, so that the set of pipelines reporting to a given partition corresponds to a zone within the activity topology. By definition, the majority of intercamera transitions from cameras within a zone are to cameras within that same zone. Algorithms can process these transitions within the activity-level partition for the zone, purely as information is delivered by reporting and without coupling to any other partition. Dissemination is required (and hence coupling arises) only when an activity transitions between zones. Direction serves mainly as a way for operators and reasoning-level partitions to control processing by the injection of new goals, and this limited use has little effect on scalability.
Our recent experience provides strong evidence that zones of 10 or fewer cameras are common in real surveillance networks. Thus, we can partition activity processing into such zones, thereby scaling this processing by a factor approaching the average zone size. We don't yet have enough evidence for the existence of larger zones (superzones, or zones of zones) to be completely confident of further scalability.
Implementation and evaluation
We recently demonstrated activity processing (tracking) for an operational network of 100 cameras. We used a dedicated cluster including 16 dual-core servers (detection processing), one eight-core server (activity topology estimation service and activity processing), and a file server (video-streaming service). The activity topology estimation service is a key component of our middleware, both in terms of facilitating multicamera activity functions such as tracking and in providing a basis for partitioning and thus system scalability. Our current implementation, based on a centralized topology estimation server, has demonstrated the capability of supporting networks of approximately 1,000 cameras. (A quantitative evaluation of this work is available elsewhere. 4) Ongoing work is investigating how to partition and distribute the activity topology estimation service.
We've implemented a blackboard toolkit in the reflective object-oriented language Ruby, 10 taking inspiration from the GBBopen system, the latest in a series of blackboard toolkits developed by Daniel Corkill. 5 This toolkit instantiates blackboard partitions at the activity and reasoning levels. Ruby's extensive support for Web services assists the implementation of an SOA, providing communication between blackboard partitions.
We've focused on scalability to reach our goal of building the first automated surveillance system for thousand-camera networks. However, our approach also supports the other systemic requirements: availability, evolvability, integration, security, and manageability.
Our approach to availability is simplified, because the intrinsically approximate nature of surveillance means there are no requirements for state consistency between replicas. Essentially, each producer reports a copy of each of its products to two or more consumers (in higher layers). The receivers then independently process this replicated information, redundantly but not necessarily in the same way. This redundancy provides increased availability for detecting anomalous behavior, but without making unnecessarily precise guarantees about how this detection is performed.
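The replication scheme can be sketched as a producer that sends an independent copy of each product to every one of its consumers and tolerates individual delivery failures. The consumer interface, the use of `ConnectionError` to model an unreachable consumer, and all names are assumptions made for this illustration.

```python
from typing import Callable, List, Sequence

Consumer = Callable[[dict], None]


def report_replicated(product: dict, consumers: Sequence[Consumer]) -> int:
    """Send a copy of the product to every consumer; return deliveries made."""
    if len(consumers) < 2:
        raise ValueError("replication needs at least two consumers")
    delivered = 0
    for consumer in consumers:
        try:
            consumer(dict(product))  # each consumer gets its own copy
            delivered += 1
        except ConnectionError:
            continue  # a faulty consumer does not block the others
    return delivered


received: List[dict] = []


def healthy_consumer(product: dict) -> None:
    received.append(product)


def faulty_consumer(product: dict) -> None:
    raise ConnectionError("partition unreachable")


delivered = report_replicated({"kind": "blob", "camera": "cam-17"},
                              [faulty_consumer, healthy_consumer])
```

Because consumers never need to agree on shared state, no coordination protocol is required between them; the product reaches the surviving consumer even though the other is down.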
The loosely coupled nature of blackboard computation inherently promotes software evolvability. Users can add new software by simply inserting new knowledge sources into a partition, by sending a direction message to that partition. These knowledge sources are then available for scheduling by the partition's controller. Knowledge sources that no longer provide utility within a partition are removed implicitly by the controller's neglecting to schedule them. (However, we don't currently have any algorithms to detect loss of utility and consequently initiate neglect.) We accommodate change in hardware through the activity topology's ongoing estimation process, which drives system structure changes in terms of the communication topology to suit the changed hardware environment.
SOAs serve to loosen coupling in distributed systems and thus permit integration between separately developed services. 11 We use the REST (representational state transfer) approach to SOA implementation, passing documents (typically XML) containing results and goals between blackboard partitions via HTTP. There's no reason why external systems shouldn't also produce and consume these documents, although we haven't conducted any experimentation to this end.
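The following sketch shows how a result might be encoded as an XML document and wrapped in an HTTP POST request using only the Python standard library. The element names, the attribute, and the URL are hypothetical; the actual document schema is not given here. The request is constructed but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def result_to_xml(kind: str, camera: str,
                  significance: float, confidence: float) -> bytes:
    """Encode a result as an XML document (hypothetical schema)."""
    root = ET.Element("result", attrib={"kind": kind})
    ET.SubElement(root, "camera").text = camera
    ET.SubElement(root, "significance").text = str(significance)
    ET.SubElement(root, "confidence").text = str(confidence)
    return ET.tostring(root, encoding="utf-8")


def build_post(url: str, body: bytes) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST carrying the document."""
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


body = result_to_xml("track", "cam-17", 0.9, 0.7)
# Hypothetical endpoint of a peer partition's HTTP shell.
request = build_post("http://partition-b.example/results", body)

# A receiver decides for itself how to interpret the document.
parsed = ET.fromstring(body)
camera = parsed.find("camera").text
```

Passing `request` to `urllib.request.urlopen` would deliver the document. Because the receiver only parses a document against a schema, an external system could produce or consume the same documents without sharing any code with the blackboard partitions.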
The security requirements of large-scale surveillance systems demand extensive further study. Threats are of two kinds: those targeting vulnerabilities in the implementation of the surveillance system, and those targeting vulnerabilities in surveillance algorithms. The former is perhaps a conventional security problem, but lack of physical security of cameras and other devices makes this problem more difficult. The latter is less well understood. In both cases, the changes in the adversary model implied by suicide terrorist attacks and other irrational behavior create complications that are yet to be fully understood.
During normal operation (that is, not during or in the aftermath of an incident), surveillance systems should operate as autonomic (self-managing) computing systems. One example of an autonomic approach is to use ongoing estimations from the activity topology as feedback driving changes in the communication topology so as to optimize system performance. Further work is needed to refine measurement and (especially) control in this process.
During an incident, a phase change needs to occur in the manageability regime so that operators can take control of at least some parts of the system in the vicinity of the incident. This requires more of a command-and-control style of management, in stark contrast to the autonomic style, and the integration of these two styles is a matter for future research.
We are building middleware to support construction of thousand-camera networks, operating over multiyear lifetimes. Present indications show progress toward meeting the scalability requirements inherent in that goal. Future work will develop the middleware further, to support evolvability, availability, and other systemic requirements.
is a lecturer in the School of Computer Science at the University of Adelaide, Adelaide, Australia; a member of the Jacaranda Research Group; and the leader of the Software Architectures for Distributed Video Processing project at the Australian Centre for Visual Technologies. His research interests include software architectures for large-scale distributed systems and service orientation. He received his PhD in computer science from the University of Adelaide, Adelaide, Australia. He's a member of the ACM. Contact him at the School of Computer Science, Univ. of Adelaide, Adelaide, South Australia, 5005, Australia; email@example.com
Anton van den Hengel is the director of the Australian Centre for Visual Technologies and an associate professor in the School of Computer Science at the University of Adelaide, Adelaide, Australia. His research interests include large-scale video surveillance and interactive 3D modeling from video. He received his PhD in computer vision from the University of Adelaide, Adelaide, Australia. Contact him at the School of Computer Science, Univ. of Adelaide, Adelaide, South Australia, 5005, Australia; firstname.lastname@example.org
is a lecturer at the University of Adelaide, Adelaide, Australia, and a research leader in the Australian Centre for Visual Technologies. His research interests focus on computer vision, especially 3D modeling from video, probabilistic methods for vision, and visual surveillance. He received his PhD in engineering from the University of Cambridge. Contact him at the School of Computer Science, Univ. of Adelaide, Adelaide, South Australia, 5005, Australia; email@example.com
is a lecturer in the School of Computer Science at the University of Adelaide, Adelaide, Australia, where she's a member of the Jacaranda Research Group. Her research interests include software architectures and programming-language design, particularly in distributed systems and systems requiring support for dynamic evolution. She received her PhD in computer science from the University of Adelaide, Adelaide, Australia. Contact her at the School of Computer Science, Univ. of Adelaide, Adelaide, South Australia, 5005, Australia; firstname.lastname@example.org
David S. Munro is an associate professor in the School of Computer Science at the University of Adelaide, Adelaide, Australia, where he leads the Jacaranda Research Group. His research interests include distributed- and persistent-object systems, distributed garbage collection, and compliant systems architectures. He received his PhD in computer science from the University of St Andrews. He's a member of the British Computer Society and a chartered engineer. Contact him at the School of Computer Science, Univ. of Adelaide, Adelaide, South Australia, 5005, Australia; email@example.com
is a professor of software engineering and chair of the School of Computer Science at the University of St Andrews, where he leads the Systems Architecture Group. He is also an adjunct professor in the School of Computer Science at the University of Adelaide, Adelaide, Australia. His research interests include programming-language design and implementation, distributed and persistent architectures, and dynamic systems evolution. He received his PhD in computer science from the University of St Andrews. He's a fellow of the British Computer Society and the Royal Society of Edinburgh. Contact him at the School of Computer Science, Jack Cole Building, North Haugh, St Andrews, KY16 9SX, UK; firstname.lastname@example.org