Joint Video and Text Parsing for Understanding Events and Answering Queries
Kewei Tu, ShanghaiTech University, China

Meng Meng, University of California, Los Angeles

Mun Wai Lee, Intelligent Automation

Tae Eun Choe, ObjectVideo

Song-Chun Zhu, University of California, Los Angeles
This article proposes a multimedia analysis framework to process video and text jointly for understanding events and answering user queries. The framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of the framework is based on a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and specifies the prior probabilistic distribution of the parse graphs. The authors present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on the probabilistic model, the authors propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has the following objectives: to provide deep semantic parsing of video and text that goes beyond the traditional bag-of-words approaches; to perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation; and to show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries in the forms of who, what, when, where, and why. The authors empirically evaluated the system based on comparison against ground-truth as well as accuracy of query answering and obtained satisfactory results.
Video understanding is an important aspect of multimedia content analysis that aims to automatically recognize the objects, scenes, and events in a video. It is a crucial component of systems that seek to improve human-computer interaction, such as automatic text generation from videos, which facilitates the search and retrieval of visual information on the Web, and surveillance systems that answer human queries about who, what, when, where, and why. However, understanding video content is challenging because of factors such as low resolution, deformation, and occlusion. In addition, some aspects of objects and events—for example, a person's intentions and goals—are hidden in video footage.
A significant portion of videos are accompanied by textual descriptions that provide supplementary information. Figure 1 shows three examples, with decreasing levels of overlap between the video and text content: movies and TV shows with their screenplays, surveillance videos with human intelligence descriptions, and news report footage with closed-caption narratives.
Figure 1. Examples of videos accompanied with texts. (a) Television show footage and its screenplay, (b) surveillance video and the human intelligence descriptions, and (c) news report footage and its closed captions.
In multimedia research, there has been growing interest in processing video and text jointly to integrate vision and language—the two major sensory modalities of human intelligence. On the one hand, vision provides rich sensory signals to ground language symbols denoting objects, scenes, and events; it supplements language with low-level details that are typically not mentioned in textual descriptions. On the other hand, language conveys semantic concepts and relations underlying the visual world, some of which may be hidden or obscured from the visual channel (such as names, certain attributes, functions, intents, and causality). Furthermore, because vision and language provide information about the same scene using two different channels, they could disambiguate and reinforce each other and produce a more accurate joint interpretation.
In contrast to the bag-of-words representation widely used in the video understanding and multimedia literature, 1– 4 we represent a joint interpretation of video and text in the form of a parse graph. (See the “Related Work in Multimedia Content Analysis” sidebar for prior research in this field.) A parse graph is a labeled directed graph that can be seen as an extension of the constituency-based parse tree used in natural language syntactic parsing. 5 Computer vision researchers have used it to model objects, 6 scenes, 7 , 8 and events. 9 , 10 A node in a parse graph represents an entity that can be an object, an event, or a status of an object (or fluent). An edge in a parse graph represents a relation between two entities. Together they represent the compositional structures of spatial information (for objects and scenes), temporal information (for actions and events), and causal information (causalities between events and object statuses). Such an explicit semantic representation facilitates subsequent applications such as text generation and query answering. We specify the prior probabilistic distribution of parse graphs using a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts.
We present a probabilistic generative model for joint parsing that captures the relations between the input video and text, their corresponding parse graphs, and the joint parse graph. Based on the probabilistic model, we propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module produces a joint parse graph by performing matching, deduction, and revision of the video and text parse graphs. We empirically evaluated our system by comparing its output against ground truth and by measuring query-answering accuracy, and we obtained satisfactory results.
Overview of Proposed Framework
Our work focuses on joint parsing of surveillance videos and narrative text descriptions, but the proposed framework can be employed or extended to process other types of video-text data as well. Figure 2 shows our probabilistic generative model. Let vid and txt denote the input video and text, respectively; $pg_{vid}$ the parse graph that interprets the input video; $pg_{txt}$ the parse graph that interprets the input text; and $pg_{jnt}$ the joint parse graph for both the video and text. Our goal can be formulated as optimizing the posterior probability of the three parse graphs $pg_{jnt}$, $pg_{vid}$, and $pg_{txt}$ given the input video vid and text txt:
$$P\left({pg_{jnt}, pg_{vid}, pg_{txt} \mid vid, txt}\right) \propto P\left({pg_{jnt}}\right) P\left({pg_{vid}, pg_{txt} \mid pg_{jnt}}\right) P\left({vid \mid pg_{vid}}\right) P\left({txt \mid pg_{txt}}\right) \tag{1}$$
Figure 2. The graphical model for joint parsing of video and text.
Our system for optimizing this posterior probability consists of three modules. First, a video parser generates candidate video parse graphs $pg_{vid}$ from the input video vid. The video parser encodes the conditional probability P(vid | $pg_{vid}$). It is also guided by a spatial-temporal-causal AND-OR graph (S/T/C-AOG), 6 , 7 , 9 , 10 which models the compositional structures of objects, scenes, and events. Second, a text semantic parser parses the text descriptions txt into the text parse graph $pg_{txt}$. We build our semantic parser on top of the Stanford Lexicalized Parser ( http://nlp.stanford.edu/software/lex-parser.shtml), focusing on parsing narrative event descriptions to extract information about objects, events, fluent changes, and the relations between these entities. The text parser specifies the conditional probability P(txt | $pg_{txt}$). Third, a joint inference module produces a joint parse graph $pg_{jnt}$ based on the video and text parse graphs produced by the first two modules. The joint inference takes into account the relationships between the joint parse graph and the video and text parse graphs, which are encoded in the conditional probability P($pg_{vid}$, $pg_{txt}$ | $pg_{jnt}$), as well as the compatibility of the joint parse graph with the background knowledge represented in the S/T/C-AOG, as encoded in the prior probability P($pg_{jnt}$).
To handle different levels of overlap in the video and text, joint parsing must overcome several challenges. First, when an overlap exists between video and text, we need to solve the coreference problem by identifying the correspondence between the entities and relations observed in the video and those described in the text. For example, we may need to determine that a person detected in the video and a name mentioned in the text correspond to the same person. This problem can be complicated when multiple entities or relations of the same type appear in the same scene. Second, when little overlap exists between the video and text, we need to fill in additional information to construct a coherent and comprehensive joint interpretation. For example, if the video shows a person waiting at a food truck and the text mentions a person buying food, we need to infer that waiting at a food truck is typically followed by buying food in order to connect the video and text. Lastly, the content detected from the video may occasionally conflict with the content of the text because of either erroneous parsing or inaccurate input. We need to resolve the conflicts and provide a consistent joint interpretation.
In our joint inference module, we use three types of operators to construct the joint parse graph from the video and text parse graphs:
  • solving the coreference problem by matching the video and text parse graphs,
  • filling in the potential gap between video and text by deduction, and
  • resolving possible conflicts via revisions to the video and text parse graphs.
Figures 3 and 4 show two examples of joint parsing from videos and texts with and without overlap, respectively. In each of these parse graphs, we use different shapes to represent objects, events, and fluents and use different edge colors to represent different relations. The label on a node or edge indicates the semantic type of the denoted entity or relation. For each event node, the Agent relation connects it to the action initiator, and the Patient relation connects it to the action target. Event nodes are laid out horizontally according to the time they occur and vertically according to their compositional complexity. For a fluent node, we use levels to indicate its value at each time point.
Figure 3. An example surveillance video and the text descriptions of a road scene. (a) Input video and text. (b) The video parse graph, text parse graph, and joint parse graph. The video and text parses have significant overlap. In the joint parse graph, the green shadow denotes the subgraph from the video, and the blue shadow denotes the subgraph from the text.
Figure 4. An example surveillance video and the text descriptions of a courtyard scene. (a) Input video and text. (b) The video parse graph, text parse graph, and joint parse graph. The video and text parses have no overlap. In the joint parse graph, the green shadow denotes the subgraph from the video, and the blue shadow denotes the subgraph from the text.
Figure 3a shows a surveillance video and the human intelligence description of a road scene. In this case, the video and text cover the same event and have significant overlap. Figure 3b shows the results of video parsing, text parsing, and joint inference. The video and text parses reinforce each other in the joint parse via overlapping (such as the stopping event and building), and they also supplement each other by providing additional information (for example, the picture-taking event is too subtle to recognize from the video but is provided in the text).
Figure 4a shows a surveillance video and text description of a courtyard scene. The video shows a few people waiting and taking turns, but the video parser does not recognize the target of their action; the text only mentions a food truck. Thus, the video and text parses have no direct overlap in terms of object entities, but these objects are involved in a common food purchase event. Figure 4b shows the results of video parsing, text parsing, and joint inference. The joint parse graph identifies the food truck as the target of people's actions and infers the types of events as well as the fluent changes resulting from the events, thus providing a more coherent and comprehensive scene interpretation.
We encode the prior knowledge of parse graphs in the S/T/C-AOG and use it to guide both video parsing and joint inference, which is another contribution of this work. An AOG is an extension of a constituency grammar used in natural language parsing 5 to represent hierarchical compositions of objects, scenes, and events. An AOG has a hierarchical structure with alternating layers of AND-nodes and OR-nodes. An AND-node represents the configuration of a set of subentities that form a composite entity (for example, waiting followed by ordering composes a buying event). An OR-node represents the set of alternative compositional configurations of an entity (for example, a buying event can be either “waiting and ordering” or “ordering without waiting”).
  • A spatial AND-OR graph (S-AOG) models the spatial decompositions of objects and scenes. 6 , 7
  • A temporal AND-OR graph (T-AOG) models the temporal decompositions of events to subevents and atomic actions. 9
  • A causal AND-OR graph (C-AOG) models the causal decompositions of events and fluent changes. 10
These three types of AOGs are interconnected, and together they form a joint S/T/C-AOG. A parse graph is an instance of an AOG that selects only one of the alternative configurations at each OR-node of the AOG. The prior probability of a parse graph is specified by the energy of the configuration selections made at the AOG's OR-nodes as well as the energy of the compositional configurations at the AOG's AND-nodes contained in the parse graph.
Representations
Ontology
An ontology is a “formal, explicit specification of a shared conceptualization.” 11 We use an ontology to specify the types of entities and relations that may appear in the parse graphs and therefore define the scope of our study. This ontology can be manually constructed using an ontology editor, 12– 14 adapted from an existing ontology, such as WordNet ( http://wordnet.princeton.edu) or Cyc, 15 or automatically or semiautomatically learned from data. 16– 18
As is typical for an ontology, we organize the types of entities (concepts) into a taxonomy. A taxonomy is a directed acyclic graph in which a directed edge between two concepts denotes a supertype-subtype relationship. As we will show later, we rely on the taxonomy to compute semantic distances between concepts when doing joint inference. We divide the concepts into three main categories at the top level of the taxonomy.
  • object, which represents any physical object, including humans;
  • event, which includes any basic actions (such as sitting and walking) as well as any activities consisting of a sequence of actions (such as buying); and
  • fluent, which is used in the AI community to refer to an attribute of a physical object that can change over time, 19 such as the openness attribute of a door and the hunger attribute of a human.
Note that the change of fluents helps detect or explain human actions in a scene and thus is important in joint parsing.
Figure 5 shows an example taxonomy for the courtyard scene.
Figure 5. The taxonomy in an example ontology for the courtyard scene. The symbol at the root node represents the universal type, which is a supertype of every other type.
We also specify a set of relations in the ontology. These relations can be organized into a relation taxonomy, allowing the computation of semantic distances between relations. A small set of core relations is introduced in the next section.
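As a concrete illustration, the following Python sketch (our own, not the authors' implementation) stores a toy taxonomy as a subtype-to-supertype map and computes a simple path-length semantic distance of the kind used later during joint inference; the type names are illustrative only.

# Minimal sketch: a taxonomy as a child -> parent (subtype -> supertype) map.
# Type names are illustrative; the real ontology is richer (see Figure 5).
TAXONOMY = {
    "Human": "Object", "Vehicle": "Object", "Food_Truck": "Vehicle",
    "Buying": "Event", "Waiting": "Event", "Ordering": "Event",
    "Has_Food": "Fluent",
    "Object": "Any", "Event": "Any", "Fluent": "Any",
}

def ancestors(t):
    """Return the chain of supertypes of t, starting with t itself."""
    chain = [t]
    while t in TAXONOMY:
        t = TAXONOMY[t]
        chain.append(t)
    return chain

def semantic_distance(a, b):
    """Path length between two types through their lowest common supertype."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    for i, t in enumerate(anc_a):
        if t in anc_b:
            return i + anc_b.index(t)   # hops from a plus hops from b
    return float("inf")                 # disjoint types: effectively infinite

print(semantic_distance("Food_Truck", "Vehicle"))  # 1: direct subtype
print(semantic_distance("Human", "Buying"))        # 4: related only via the universal type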
Knowledge Representation by AND-OR Graphs
We represent the compositional structures of objects, scenes, and events using an S/T/C-AOG. (We will define probabilistic models on the AOG to address the stochasticity of the compositional structures later.) An AOG is an extension of a constituency grammar used in natural language parsing. 5 Similar to a constituency grammar, an AOG represents possible hierarchical compositions of a set of entities. However, an AOG differs from a constituency grammar in that it contains additional annotations and relations over the entities.
More specifically, an AOG consists of a hierarchical structure with alternating layers of AND-nodes and OR-nodes. An AND-node represents the decomposition of an entity of a specific type into a set of subentities. A Part-Of relation is established between each subentity and the composite entity. The AND-node also specifies a set of relations between the subentities that describes how these subentities form the composite entity. An OR-node represents the alternative configurations of an entity of a specific type.
AOGs have been used in computer vision to model the hierarchical decompositions of objects, 6 scenes, 7 , 8 and events. 9 , 10 In this work, we use an AOG to model objects, scenes, events, and perceptual causal effects in the joint interpretation of video and text.
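As an illustration of this structure, here is a minimal Python sketch (ours, not the paper's code) of AND-nodes, OR-nodes, and leaf nodes, and of how a parse is obtained by committing to one alternative at every OR-node; the node names follow the buying example.

# Minimal sketch of an AND-OR graph and of realizing a parse by selecting
# one alternative at each OR-node (names follow the buying example).
class Node:
    def __init__(self, label, kind, children=None):
        self.label = label          # semantic type, e.g. "Buying"
        self.kind = kind            # "AND", "OR", or "LEAF"
        self.children = children or []

def realize(node, choose):
    """Expand node into a parse tree; 'choose' picks one child at OR-nodes."""
    if node.kind == "LEAF":
        return node.label
    if node.kind == "AND":              # keep all children (a composition)
        return (node.label, [realize(c, choose) for c in node.children])
    picked = choose(node)               # OR-node: commit to one configuration
    return (node.label, [realize(picked, choose)])

# A toy T-AOG fragment: Buying -> "wait then order" OR "order without waiting".
order = Node("Ordering", "LEAF")
wait = Node("Waiting", "LEAF")
with_wait = Node("WaitThenOrder", "AND", [wait, order])
no_wait = Node("OrderOnly", "AND", [order])
buying = Node("Buying", "OR", [with_wait, no_wait])

print(realize(buying, choose=lambda n: n.children[0]))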
Spatial AND-OR Graph.
An S-AOG models the spatial decompositions of objects and scenes. 6 , 7 Figure 6a shows an example S-AOG for indoor scenes. Each node in the S-AOG represents a type of scene, object, or their parts. An AND-node in the S-AOG represents a scene or object that is the spatial composition of a set of parts. For example, a table can be the spatial composition of a tabletop above four legs, and the background of an indoor scene can be the spatial composition of multiple 2D planes (that is, the walls, floor, and ceiling) that are hinged. An OR-node in the S-AOG represents alternative configurations of a scene or object. For example, a table can have many different styles, and an indoor scene may have a few different typical viewpoints. A leaf node in the S-AOG represents an atomic object that cannot be further decomposed.
Figure 6. Spatial AND-OR graph (S-AOG) example. (a) In this example S-AOG for indoor scenes, each node represents a type of scene, object, or their parts, an AND-node represents the spatial composition of its child nodes, an OR-node represents a selection among its child nodes, and a leaf node represents an atomic object. (b) An example spatial parse graph from earlier work 7 that is a realization of the S-AOG by making selections at OR-nodes.
Temporal AND-OR Graph.
A T-AOG models the hierarchical decompositions from events to subevents and then to atomic actions. 9 Figure 7a shows an example T-AOG. Each node in the T-AOG represents a type of event or action that occurs in a time interval. An AND-node in the T-AOG represents an event that is the temporal composition of a set of subevents. For example, a buying event can be the temporal composition of waiting followed by ordering. An OR-node in the T-AOG represents alternative configurations of an event—for example, a buying event can be either “waiting and ordering” or “ordering without waiting.” A leaf node in the T-AOG represents an atomic action that is defined by properties of the action initiator(s) and the action target(s) (such as a human's pose or an object's state) as well as the interactions between the initiator(s) and target(s). Agent denotes the relation between an action and the action initiator, and Patient denotes the relation between an action and the action target.
Figure 7. Temporal AND-OR graph (T-AOG) example. (a) In this example T-AOG, each node represents a type of event or action, an AND-node represents the temporal composition of its child nodes, an OR-node represents a selection among its child nodes, and a leaf node represents an atomic action defined by the properties of and interactions between the action initiator(s) and target(s). (b) An example temporal parse graph, which is a realization of the T-AOG by making selections at OR-nodes.
Causal AND-OR Graph.
A C-AOG captures the knowledge of the Causal relation between events and fluent changes. 10 Figure 8a shows an example C-AOG. An AND-node in a C-AOG represents a composition of conditions and events that can cause a fluent change. For example, a door may open because it is unlocked and someone pushes it. An OR-node in a C-AOG represents alternative causes that can result in a fluent change. For example, a door may open because either “someone unlocked and pushed the door” or “the door is automatic and someone walked close by.” A leaf node in a C-AOG is either an event or a fluent.
Figure 8. Causal AND-OR graph (C-AOG) example. (a) In this example C-AOG, an AND-node represents the composition of its child nodes (conditions and subevents), an OR-node represents a selection among its child nodes, and a leaf node represents an event or a fluent. (b) An example causal parse graph, which is a realization of the C-AOG by making selections at OR-nodes.
Joint S/T/C-AOG.
Although researchers have proposed and investigated S-AOGs, T-AOGs, and C-AOGs separately before, in this work we show for the first time that the three types of AOGs are interrelated and form a joint S/T/C-AOG, as Figure 9 shows. An action represented by a leaf node in the T-AOG is connected via the Agent, Patient, and Location relations to the action initiator, target, and location, which are objects modeled in the S-AOG. An event that has a Causal relation with a fluent change in the C-AOG is itself modeled by the T-AOG. There may also be a Trigger relation between a fluent in the C-AOG and an event in the T-AOG. Finally, there are HasFluent relations between objects in the S-AOG and fluents in the C-AOG.
Figure 9. An example of the relations between S-AOG, T-AOG, and C-AOG.
The connections across the three types of AOGs enable the propagation of information between them during parsing, which improves parsing accuracy. Connecting the three types of AOGs also leads to a more coherent and comprehensive joint interpretation, which facilitates subsequent applications such as answering who, what, when, where, and why queries.
Parse Graphs
We represent an interpretation of the input video and/or text with a graphical representation called a parse graph, which is a realization of the S/T/C-AOG obtained by selecting one of the alternative configurations at each OR-node. For example, Figures 6b, 7b, and 8b show a valid spatial, temporal, and causal parse graph, respectively, as realizations of the S/T/C-AOGs given in Figures 6a, 7a, and 8a. A parse graph can be represented as a labeled directed graph. Each node in a parse graph represents an entity with a label that indicates that entity's semantic type. A node may also have a set of spatial-temporal annotations indicating the spatial-temporal coordinate or range of the entity that the node represents. A parse graph may also contain a special type of node representing attribute values in the form of strings or numbers. A directed edge between two nodes in a parse graph represents a relation between the two entities denoted by the two nodes. The label on the edge indicates the type of relation. We adopt the open world assumption, 20 which states that a parse graph may not be a complete interpretation of the scene and that any statement not included or implied by the parse graph can be either true or false.
Figures 3b and 4b show the parse graphs that represent the interpretations of the videos and texts shown in Figures 3a and 4a. For better visualization of the parse graphs, we use different node shapes to denote different types of entities and different edge colors to denote different types of relations. Nodes with temporal annotations (such as events) are laid out in accordance with the timeline. For a fluent whose value may change over time, we use levels to indicate its value at each time point.
We infer parse graphs from video and text by maximizing the posterior probability, which we shall discuss in the following few sections. As an explicit semantic representation of the video and text, our parse graphs facilitate subsequent applications such as text generation and query answering. In addition, our parse graph can be naturally represented as a Resource Description Framework (RDF) data model, one of the standard knowledge representations in the Semantic Web ( www.w3.org/TR/REC-rdf-syntax). This allows easy distribution of the parse graph data in the Semantic Web and enables the use of a multitude of approaches and tools for data editing, storage, inference, and retrieval.
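As an illustration of this mapping, the following sketch uses the rdflib Python library to encode a tiny fragment of the buying example as RDF triples; the namespace and identifiers are hypothetical, and the system's actual RDF schema may differ.

# Sketch: encoding a tiny parse graph fragment as RDF triples with rdflib.
# The namespace and node identifiers are hypothetical.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/parsegraph#")
g = Graph()
g.bind("ex", EX)

g.add((EX.event1, RDF.type, EX.Buying))        # node: a buying event
g.add((EX.person1, RDF.type, EX.Human))        # node: a human
g.add((EX.person1, EX.name, Literal("Tom")))   # attribute value node
g.add((EX.event1, EX.agent, EX.person1))       # edge: Agent relation

print(g.serialize(format="turtle"))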
Probabilistic Modeling
In this section, we define a probabilistic distribution over parse graphs to account for their stochasticity. Because the valid parse graphs are specified by an AOG, we assign energy terms to the AOG to make it a probabilistic model of parse graphs (so an AOG embodies a stochastic grammar).
First, at each OR-node of the AOG, we specify an energy that indicates how likely each alternative configuration under the OR-node is to be selected in a parse graph. Second, at each AND-node of the AOG, we specify an energy for each relation between the child nodes of the AND-node that captures the uncertainty of the relation. These energy terms can be either manually specified by domain experts or learned from data. 10 , 21 , 22 A parse graph's energy is thus determined by the energy of the configuration selections made at the OR-nodes and the energy of the relations between the child nodes of the AND-nodes contained in the parse graph. Let $V^{or}(pg)$ be the set of OR-nodes in pg, $E_{or}(v)$ be the energy associated with the configuration selected at the OR-node v, R(pg) be the set of relations specified at the AND-nodes in pg, and $E_R(r)$ be the energy associated with the relation r. The energy of a parse graph is defined as follows:
$$E\left({pg}\right) = \sum\limits_{v \in V^{or}\left({pg}\right)} {E_{or}\left(v\right)} + \sum\limits_{r \in R\left({pg}\right)} {E_R\left(r\right)} \tag{2}$$
We also allow part of a parse graph to be missing with a penalty to accommodate incomplete observation or description.
As we explained earlier, we use a joint S/T/C-AOG to define the valid parse graphs. In our framework, a parse graph's energy is therefore the summation of four energy terms:
$$E_{{\rm STC}}\left({pg}\right) = E_{\rm S}\left({pg}\right) + E_{\rm T}\left({pg}\right) + E_{\rm C}\left({pg}\right) + \sum\limits_{r \in R^*\left({pg}\right)} {E_R\left(r\right)} \tag{3}$$
where $E_{\rm S}(pg)$, $E_{\rm T}(pg)$, and $E_{\rm C}(pg)$ are the energy terms defined by the S-AOG, T-AOG, and C-AOG, respectively, according to Equation 2; $R^*(pg)$ is the set of relations that connect the three types of AOGs (as we discussed earlier) and are not contained in any of the AOGs; and $E_R(r)$ is the energy associated with the relation r.
The prior probability of a parse graph pg is then defined as follows:
$$P\left({pg}\right) = {1 \over Z} e^{-E_{{\rm STC}}\left({pg}\right)} \tag{4}$$
where Z is the normalization factor. Equation 4 defines one of the four factors in the posterior probability of the parse graphs given the input video and text (Equation 1). Three additional factors are involved in video parsing, text parsing, and joint inference, respectively, and will be introduced later.
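The following Python sketch (ours, with made-up energy tables) shows how Equations 2 through 4 fit together: OR-node selection energies and relation energies are summed into E(pg), which is then exponentiated into an unnormalized prior.

import math

# Sketch of Equations 2-4 with made-up energy tables; a parse graph is
# abstracted as its OR-node selections and its instantiated relations.
E_or = {("Buying", "WaitThenOrder"): 0.2, ("Buying", "OrderOnly"): 1.5}
E_rel = {"Agent": 0.1, "Patient": 0.1, "Causal": 0.5}

def energy(pg):
    """E(pg) = sum of OR-selection energies + sum of relation energies (Eq. 2)."""
    return (sum(E_or[sel] for sel in pg["or_selections"]) +
            sum(E_rel[r] for r in pg["relations"]))

def unnormalized_prior(pg):
    """P(pg) is proportional to exp(-E(pg)) (Eq. 4); Z is omitted in this sketch."""
    return math.exp(-energy(pg))

pg = {"or_selections": [("Buying", "WaitThenOrder")],
      "relations": ["Agent", "Patient"]}
print(energy(pg), unnormalized_prior(pg))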
Video Parsing
In video parsing, we aim to detect objects, scenes, and events that form an interpretation of the input video. Because our prior knowledge of objects, scenes, and events is encoded in the S/T/C-AOG, we regard the S/T/C-AOG as the prior of the video parse graph $pg_{vid}$ and want to optimize the posterior of $pg_{vid}$. Our objective energy function for video parsing is
$$E_v(pg_{vid}) = E_{{\rm STC}}(pg_{vid}) - \log p(vid \mid pg_{vid}) \tag{5}$$
where $E_{{\rm STC}}$ is defined in Equation 3 and p(vid | $pg_{vid}$) is the probability of the input video vid given the objects, scenes, and events specified in the parse graph $pg_{vid}$. Because video parsing often contains a significant level of uncertainty, we output a set of candidate video parse graphs with low energy instead of a single best parse graph. The use of the objective energy function $E_v$ as defined in Equation 5 can be justified as follows. If we assume that the set of nodes and edges in the joint parse graph $pg_{jnt}$ is a superset of that in the video parse graph $pg_{vid}$, then it can be shown that $\exp\left({-E_v\left({pg_{vid}}\right)}\right)$ is a factor of the joint posterior probability P($pg_{jnt}$, $pg_{vid}$, $pg_{txt}$ | vid, txt) (Equation 1). Therefore, a video parse graph that has high energy according to Equation 5 would likely lead to a low joint posterior probability, so we use Equation 5 to prune such video parse graphs.
Spatial Parsing
We first perform spatial parsing on each video frame following an approach called hierarchical cluster sampling. 7 The approach consists of two stages. In the first stage, the approach performs bottom-up clustering in which lower-level visual entities (such as line segments) of a video frame are composed into possible higher-level objects according to the S-AOG. To keep the number of candidate objects tractable, compositions with high energy are pruned.
In the second stage, the parser applies the Metropolis-Hastings algorithm in the space of spatial parse graphs. The sampler performs two types of operations to change the current spatial parse graph. The first is to add a candidate object composed in the first stage into the parse graph, where the proposal probability is defined based on the energy of the candidate object as well as the compatibility of the object with the rest of the parse graph. The second type of operation is to randomly prune the current parse graph. At convergence, the approach outputs the optimal spatial parse graph.
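A highly simplified sketch of this second stage is given below; the candidate set, proposal moves, and toy energy function are placeholders, and the actual parser scores proposals with the S-AOG energies rather than the toy energy used here.

import math, random

def mh_spatial_parse(candidates, energy, iters=2000, temp=1.0):
    """Toy Metropolis-Hastings over sets of candidate objects: each move either
    adds a candidate object to the current parse or prunes one already in it."""
    parse = set()
    for _ in range(iters):
        proposal = set(parse)
        if parse and random.random() < 0.5:
            proposal.discard(random.choice(sorted(parse)))   # prune move
        else:
            proposal.add(random.choice(candidates))          # add move
        accept = math.exp((energy(parse) - energy(proposal)) / temp)
        if random.random() < min(1.0, accept):               # MH acceptance test
            parse = proposal
    return parse

def toy_energy(parse):
    # Prefer keeping "tabletop" and "legs" together and penalize extra clutter.
    return 2.0 - 1.5 * ("tabletop" in parse and "legs" in parse) + 0.1 * len(parse)

print(mh_spatial_parse(["tabletop", "legs", "clutter"], toy_energy))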
Temporal Parsing
We perform temporal parsing following the approach proposed by Mingtao Pei and colleagues, 9 which is based on the Earley parser. 23 The input video is divided into a sequence of frames. We use the spatial parser and special detectors to identify the agents, objects, and fluents in each frame. The temporal parser reads in frames sequentially and, at each frame, maintains a state set that contains pending derivations of AND-nodes that are consistent with the input video up to the current frame. (That is, for each AND-node in the state set, only a subset of its child nodes has been detected from the video up to the current frame.)
At the beginning of parsing, the AND-nodes at the top level of the T-AOG are added into the state set of frame 0, with all their child nodes pending. With each new frame read in, three basic operations are iteratively executed on the state set of the new frame. The prediction operation adds into the state set new AND-nodes that are expected according to the existing pending derivations in the state set. The scanning operation checks the detected agents, objects, and fluents in the new frame and advances the pending derivations in the state set accordingly. The completion operation identifies the AND-nodes with complete derivations and advances the pending derivations of their parent AND-nodes.
During this process, we prune derivations that have high energies to make the sizes of state sets tractable. After all the frames are processed, a set of candidate parses of the whole video can be constructed from the state sets.
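The following sketch mimics the control flow of this frame-by-frame parsing on a single AND-node grammar; it is a deliberate simplification (no OR-nodes, no energies, no pruning) meant only to show how state sets and the prediction, scanning, and completion operations interact.

# Simplified frame-by-frame parser in the spirit of Earley parsing:
# each AND-node is a fixed sequence of atomic actions (no OR-nodes, no energies).
GRAMMAR = {"Buying": ["Waiting", "Ordering", "Leaving"]}

def parse_frames(frames):
    # A state is (event, start_frame, num_children_matched).
    states = [(ev, 0, 0) for ev in GRAMMAR]            # prediction at frame 0
    finished = []
    for t, detections in enumerate(frames):            # detections per frame
        new_states = []
        for ev, start, k in states:
            expected = GRAMMAR[ev][k]
            if expected in detections:                 # scanning
                k += 1
            if k == len(GRAMMAR[ev]):                  # completion
                finished.append((ev, start, t))
            else:
                new_states.append((ev, start, k))
        states = new_states
    return finished

frames = [{"Waiting"}, {"Waiting"}, {"Ordering"}, {"Leaving"}]
print(parse_frames(frames))   # [('Buying', 0, 3)]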
Causal Parsing
After all the events are detected in temporal parsing, we perform causal parsing. For each fluent change detected in the video using special detectors, we collect the set of events that occur within a temporal window right before the fluent change and run the Earley parser again based on the subgraph of the C-AOG rooted at the detected fluent change.
After spatial, temporal, and causal parsing is done, we combine the resulting spatial, temporal, and causal parse graphs into a complete video parse graph. Figure 10 shows a parse graph for the input video in Figure 4a.
Figure 10. A candidate parse graph for the input video shown in Figure 4a.
Text Parsing
In parallel to video parsing, we perform text semantic parsing to convert the input text descriptions into the text parse graph $pg_{txt}$$pg_{txt}$. Our semantic parsing approach is similar to many existing information extraction approaches, but our objective is distinctive in two aspects. First, we focus on the domain of narrative event descriptions to extract information about objects, events, fluent changes, and the relations between these entities. Important relations in this domain include spatial, temporal, functional, and causal relations. Second, we express information extracted from the text using the same set of semantic types and parse graph representation that are also used in video parsing. This lets us connect the two parse graphs and perform joint inference.
We assume that text descriptions are unambiguous and therefore only a single text parse graph is produced from the input text. However, our approach can also handle text ambiguity and produce multiple candidate text parse graphs with different probabilities.
Our semantic parsing of text is a processing pipeline that consists of four main steps: text filtering, part-of-speech (POS) tagging and dependency inference, dependency filtering, and parse graph generation. In the case where the text descriptions are accompanied by time stamps or durations, we add temporal annotations to the corresponding event nodes in the resulting parse graph. We will use the parsing process of the sentence “Tom is buying food” as a running example in the rest of this section.
Text Filtering
This is a preprocessing step that locates and labels certain categories of words or phrases in the input text to facilitate the subsequent text parsing steps. First, it performs named entity recognition (NER) to identify text elements related to time and the names of people, organizations, and locations. Existing NER tools such as Stanford NER 24 can be used. Second, it recognizes certain compound noun phrases that refer to a single entity (such as “food truck” and “vending machine”). The words in a compound noun phrase are then grouped together into a single word. Currently, this is done using the chunker tool in Apache OpenNLP ( http://opennlp.apache.org). Our example sentence, “Tom is buying food,” is not changed in this step.
Part-of-Speech Tagging and Dependency Inference
This step performs POS tagging and syntactic dependency parsing using the Stanford Lexicalized Parser. During POS tagging, each word is assigned a grammatical category such as noun, verb, adjective, adverb, article, conjunct, or pronoun. The tagger uses the Penn Treebank tag set, which has 45 tags. The parser then performs dependency analysis to identify relations between the words in the sentence. It generates a dependency graph in which each node is a word and each directed edge represents a basic grammatical relation between the words. The Stanford Dependencies include more than 50 types of grammatical relations, such as subject, modifier, complement, and direct object. 25 For example, the dependency nsubj(buying,Tom) identifies Tom as the subject of the buying event. Together with the POS labels, these dependency relations let us extract information from the text more easily. Figure 11 shows the result of this step for our example sentence.
Figure 11. An example dependency tree from step 2 of text parsing for the example sentence, “Tom is buying food.”
Dependency Filtering
In this step, we map each word in the sentence to an entity type defined in the ontology or to a literal type. Table 1 shows examples of such mappings. This step is currently implemented using a lookup table, but it can also be automated using the WordNet dictionary ( http://wordnet.princeton.edu).

Table 1. Example mappings between words and entity or literal types.

Entity type or literal type    Words in the input text
Human                          person | individual | somebody | …
Beverage                       drink | beverage | …
Buying                         buying | purchasing | …
Vending machine                vending_machine | slot_machine | coin_machine | …
Number                         one | two | three | …
Color                          red | blue | …
Aux                            is | are

In some cases, it is important to check the POS tag of a word to determine the correct entity type. For example, consider the following two sentences:
  • The door is open(/JJ).
  • They open(/VBP) the door.
The word “open” in the first sentence refers to a fluent state, whereas in the second sentence, it refers to an action. The POS labels “JJ” and “VBP” let us map the words to the correct entity types. Because we focus on a small set of entity types as specified in our ontology, words that can denote multiple entity types under the same POS tags are rare.
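A minimal sketch of this lookup, including the POS check, might look as follows; the table entries are illustrative rather than the system's full table.

# Illustrative (word, POS) -> entity/literal type lookup with a POS-sensitive
# entry for "open"; the real table covers the full ontology.
WORD_TABLE = {
    ("person", "NN"): "Human", ("somebody", "NN"): "Human",
    ("buying", "VBG"): "Buying", ("purchasing", "VBG"): "Buying",
    ("open", "JJ"): "Door_Open_Fluent",   # "The door is open."
    ("open", "VBP"): "Open_Action",       # "They open the door."
    ("is", "VBZ"): "Aux", ("are", "VBP"): "Aux",
}

def entity_type(word, pos):
    return WORD_TABLE.get((word.lower(), pos), "Unknown")

print(entity_type("open", "JJ"), entity_type("open", "VBP"))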
For our example sentence, here is the dependency tree resulting from this step:
 
nsubj(Buying, Human_name)
aux(Buying, Aux)
root(ROOT-0, Buying)
dobj(Buying, Food)
We further replace each entity type with its top-level supertype specified in the ontology. In our example, “Buying” is replaced with “Event” and “Food” is replaced with “Object.”
Compared with the original dependency tree, which describes the relations between words, we have now added the entity types of these words. This provides additional cues to infer more precise semantic relations between the entities denoted by the words. For instance, from this example, the dependency nsubj(Event, Human_name) implies that the person (Tom) is the agent of the event (buying). Without the entity types, the dependency relation nsubj could be ambiguous—for example, nsubj(teacher, Tom) from the sentence “Tom is a teacher” refers to an occupation attribute and not an agent-event relation.
Parse Graph Generation
To extract semantic relations from dependencies more accurately, we need to consider not only individual dependencies but also their contexts (the semantic types of the words and the neighboring dependencies). Therefore, we design an attribute grammar to parse the set of dependencies and infer the semantic relations. The final text parse graph is then generated from the semantic relations inferred in this step as well as the entity types produced from the previous step.
In the attribute grammar, the terminal symbols are the mapped dependencies from the previous step. A set of production rules describes how the terminal symbols can be generated from ROOT, the start symbol. The Earley parsing algorithm 23 , 26 is used to parse the dependencies using the attribute grammar, which employs top-down dynamic programming to search for a valid set of production rules that generate the dependencies. Figure 12 shows the parse tree from the attribute grammar for our example sentence.
Figure 12. An illustration of steps 3 and 4 of text parsing for the example sentence, “Tom is buying food.” The bottom half of the figure shows dependency filtering from step 3, and the top half shows attribute grammar parsing from step 4.
In each node of the parse tree, information from the input sentence is organized into an attribute-value structure, which may include attributes such as the event or action type, person's name, and so forth. Such information structures are propagated from the lower-level nodes to the higher-level nodes of the parse tree via the attribute functions associated with the production rules of the attribute grammar. For example, in the following production rule (which is used in the left-most branch of the parse tree in Figure 12 ),
 
nsubj(Event1, Object)
 → nsubj(Event2, Human_name)
the associated attribute functions are
 
Event1.* := Event2.*
Object.* := Human_name.*
Object.type := Human
where * is a wildcard character that stands for any attribute name. After all the information from the input sentence is propagated to the top-level nodes in the parse tree of the attribute grammar, the final text parse graph is generated based on these top-level nodes. For example, the top-level node Agent(Event,Object) in Figure 12 contains the following information:
 
Event.id = “buying-3”
Event.type = Buy
Object.id = “Tom-1”
Object.type = Human
Object.name = “Tom”
from which we can construct a part of the text parse graph containing the Agent relation between the buying event and the human as well as the human's name attribute. Figure 13 shows the final text parse graph for the example sentence.
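To make the attribute functions concrete, here is a small Python sketch of how such wildcard assignments can be executed over attribute-value dictionaries; the rule encoding is our own simplification.

# Sketch: executing attribute functions such as
#   Event1.* := Event2.*, Object.* := Human_name.*, Object.type := Human
# where nodes carry attribute-value dictionaries.
def apply_rule(child_nodes):
    event2, human = child_nodes["Event2"], child_nodes["Human_name"]
    parent = {"Event": dict(event2),          # Event1.* := Event2.*
              "Object": dict(human)}          # Object.* := Human_name.*
    parent["Object"]["type"] = "Human"        # Object.type := Human
    return parent

children = {"Event2": {"id": "buying-3", "type": "Buy"},
            "Human_name": {"id": "Tom-1", "name": "Tom"}}
print(apply_rule(children))
# {'Event': {'id': 'buying-3', 'type': 'Buy'},
#  'Object': {'id': 'Tom-1', 'name': 'Tom', 'type': 'Human'}}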
Figure 13. The parse graph for the example input text, “Tom is buying food.”
In addition to extracting the Agent and Patient relations, as shown in the running example, we use the attribute grammar to extract spatial, temporal, and causal relations as well as functional properties of objects based on the function words in the input sentence. Table 2 shows examples of such relations and the function words used to extract them. Some function words (such as “at,” “from,” and “to”) have multiple meanings, but their semantic context in the input sentence allows the correct relations to be inferred during parsing.

Table 2. Examples of relations extracted from text and the corresponding function words.

Category      Relations                                          Function words
Spatial       location, locationFrom, locationTo                 at, on, in, from, to, into
Temporal      timeAt, timeFrom, timeTo, timeBefore, timeAfter    at, from, to, before, after
Causal        causal, trigger                                    therefore, because
Functional    occupation, religion, race                         is a

Joint Inference
In joint inference, we construct the joint parse graph from the video and text parse graphs produced by the first two modules. Although researchers have studied spatial-temporal-causal parsing of videos and semantic parsing of text separately, this is the first approach to integrate the two.
Challenges in Joint Inference
The joint parse graph cannot be a simple union of the video parse graph and the text parse graph. We need to consider three main difficulties: coreference, missing information, and conflicts.
First, the entities and relations covered by the two input parse graphs may be overlapping (coreference). Figure 14 shows such an example. We need to identify and merge the overlapping entities and relations. One challenge is that the same entity or relation might be annotated with different semantic types in the two parse graphs. For example, in Figure 14 the food truck entity is annotated with “Vehicle” in the video parse graph and “Food_Truck” in the text parse graph. In addition, the spatial-temporal annotations of the same entity may not be exactly the same in the two parse graphs. As another challenge, there may be multiple candidate matching targets for each entity. For example, a person mentioned in the text may be matched to a few different people in the video.
Figure 14. An example of overlapping video and text parse graphs (coreference). The ovals and bidirectional arrows mark the subgraphs that represent the same entities and relations.
Second, some information may not be explicitly presented in the input video and text but can be deduced using the prior knowledge encoded in the S/T/C-AOG (missing information). Such information is necessary to fill in the gap between the input video and text or is useful in downstream tasks such as query answering. Thus, we need to include such information in the joint parse graph. Figure 15 shows an example in which the fluent “Has_Food” is neither observable in the video nor stated in the text but can be deduced from the event of buying food.
Figure 15. An example of information that is not presented in the input but can be deduced from the parse graph (missing information). The dashed shapes and arrows represent inferred entities and relations.
Lastly, the parse graphs from the input video and text may contain conflicting information. Such conflicts manifest themselves once overlapping entities and relations from the two parse graphs are merged. Figure 16 shows an example parse graph that contains conflicting information. We need to detect such conflicts using the prior knowledge and correct them in the joint parse graph.
Figure 16. An example of conflicting information in the parse graph. Here the conflict is between “buy book” and “buy from food truck.”
Figure 17 shows a schematic diagram of how the joint parse graph is constructed from the video and text parse graphs, demonstrating the three difficulties in joint inference. There are two scenarios. In the first scenario ( Figure 17a ), where there is significant overlap between the video and text parse graphs, we first need to identify the overlap and solve the coreference problem. Then, we can deduce missing information and correct conflicts to obtain a more accurate and comprehensive joint parse graph. In the second scenario ( Figure 17b ), where there is no overlap between the video and text parse graphs, at first no coreference can be identified to connect the two parse graphs. Only after the two parse graphs are expanded by deducing implicit information do they overlap, and then we can connect them by solving the coreference and produce a coherent joint parse graph.
Figure 17. A schematic diagram of how the joint parse graph $pg_{jnt}$ is constructed from the video parse graph $pg_{vid}$ and the text parse graph $pg_{txt}$, when (a) there is overlap between the video and text parse graphs, and when (b) the video and text parse graphs do not overlap until after the two parse graphs are expanded by deducing implicit information.
To solve the three difficulties (coreference, missing information, and conflicts), we propose three types of operators (which will be defined in the “Joint Inference Algorithm” section): matching, which matches and merges subgraphs of the video and text parse graphs to solve the coreference problem; deduction, which inserts implicit information to fill in the potential gap between video and text and makes the joint parse graph more comprehensive; and revision, which resolves possible conflicts between video and text.
Relating the Three Parse Graphs
To relate the video parse graph $pg_{vid}$, the text parse graph $pg_{txt}$, and the joint parse graph $pg_{jnt}$, we define the conditional probability P($pg_{vid}$, $pg_{txt}$ | $pg_{jnt}$). By multiplying it with the prior probability of the joint parse graph P($pg_{jnt}$) (based on Equation 4), we get the joint probability of the three parse graphs, which is used to guide our joint inference. Intuitively, the conditional probability P($pg_{vid}$, $pg_{txt}$ | $pg_{jnt}$) models the generation of the video and text parse graphs given the joint parse graph by considering how likely it is that different types of entities and relations in the joint parse graph will be included in the video and text parse graphs, while penalizing the changes of the semantic, temporal, and spatial annotations of these entities and relations in the video and text parse graphs. We do not assume that the video and text parse graphs are generated independently because in many scenarios they are in fact generated jointly. For example, in TV news reporting, the news editor chooses the footage and narrative that cover the same aspect of the whole news story. For tractability, we assume that the inclusion and change of each entity or relation is independent of that of any other entity or relation in the parse graph, so we can factorize the conditional probability as follows:
$$P\left({pg_{vid}, pg_{txt} \mid pg_{jnt}}\right) = \prod\limits_{x \in pg_{jnt}} {1 \over Z_x} e^{-E(x)} \times {1 \over Z_\psi} e^{-\alpha_\psi \left\|{pg_{vid} \cup pg_{txt} \backslash pg_{jnt}}\right\|} \tag{6}$$
where x ranges over all the entities and relations in the joint parse graph $pg_{jnt}$; $Z_x$ and $Z_\psi$ are normalization factors; $\alpha_\psi$ is a constant representing the penalty of having an entity or relation in the video or text parse graph that is not contained in the joint parse graph; and E(x) is defined as follows:
$$E(x) = \begin{cases} \alpha_v(x) + d(x_j, x_v) & {\rm if}\ x \in pg_{vid} \backslash pg_{txt} \\ \alpha_t(x) + d(x_j, x_t) & {\rm if}\ x \in pg_{txt} \backslash pg_{vid} \\ \alpha_{vt}(x) + d(x_j, x_v) + d(x_j, x_t) & {\rm if}\ x \in pg_{vid} \cap pg_{txt} \\ \alpha_\phi(x) & {\rm if}\ x \notin pg_{vid} \cup pg_{txt} \end{cases}$$
We enforce the constraint that a relation in the joint parse graph can be contained in the video or text parse graph only if the two entities connected by the relation are also contained in the video or text parse graph. We use the following notations in the energy function:
  • $\alpha_v(x)$ is the energy that x is contained in the video parse graph $pg_{vid}$ but not in the text parse graph $pg_{txt}$. Ideally, this energy shall be set to a low value for elements that represent visible low-level details of objects, scenes, and events—for example, the atomic actions of a drinking event (such as reaching for, lifting, holding, and putting down a cup of water), which are seldom mentioned in the text description.
  • $\alpha_t(x)$ is the energy that x is contained in the text parse graph $pg_{txt}$ but not in the video parse graph $pg_{vid}$. Ideally, this energy shall be set to a low value for elements representing objects and events or their attributes that are either not visible in the video (such as a person's name or whether that person is hungry) or hard to detect by the video parser (for example, gender can be hard to identify in a low-resolution surveillance video).
  • $\alpha_{vt}(x)$ is the energy that x is contained in both the video parse graph $pg_{vid}$ and the text parse graph $pg_{txt}$. Ideally, this energy shall be set to a low value for elements representing visible high-level objects, scenes, and events.
  • $\alpha_\phi(x)$ is the energy that x is contained in neither the video parse graph $pg_{vid}$ nor the text parse graph $pg_{txt}$. Ideally, this energy shall be set to a low value for elements representing low-level details of objects, scenes, and events that are invisible or hard to detect in the video.
  • $x_j$, $x_v$, and $x_t$ are the corresponding entities or relations of x in the joint parse graph $pg_{jnt}$, video parse graph $pg_{vid}$, and text parse graph $pg_{txt}$, respectively.
  • d is a distance measure that combines the semantic, temporal, and spatial distances.
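As a minimal sketch, the energy of a single element could be computed as below, assuming parse graphs are represented as dictionaries from element identifiers to their annotated copies and that the per-type constants and the distance function d are supplied by the caller; the names (compute_energy, alpha_v, and so on) are illustrative and not part of the described system.

def compute_energy(x, pg_vid, pg_txt, d,
                   alpha_v, alpha_t, alpha_vt, alpha_phi):
    """Energy E(x) of one joint-parse-graph element x (illustrative sketch).

    pg_vid, pg_txt: dicts mapping element ids to their annotated copies
                    in the video and text parse graphs.
    d: distance function combining semantic, temporal, and spatial terms.
    alpha_*: per-element energy constants, indexed by the element's type.
    """
    in_vid = x.id in pg_vid
    in_txt = x.id in pg_txt
    if in_vid and not in_txt:
        return alpha_v[x.type] + d(x, pg_vid[x.id])
    if in_txt and not in_vid:
        return alpha_t[x.type] + d(x, pg_txt[x.id])
    if in_vid and in_txt:
        return alpha_vt[x.type] + d(x, pg_vid[x.id]) + d(x, pg_txt[x.id])
    return alpha_phi[x.type]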
The distance measure models the difference in the annotations of the entity or relation $x$ in the joint parse graph and the video or text parse graph:

$$d(x, y) = \alpha_o d_o(x, y) + \alpha_t d_t(x, y) + \alpha_s d_s(x, y) \tag{7}$$
where $\alpha_o$, $\alpha_t$, and $\alpha_s$ are constant weights, and we define the three distance measures $d_o(x, y)$, $d_t(x, y)$, and $d_s(x, y)$ as follows (a brief sketch of the combined distance follows the list):
  • The semantic distance $d_o(x, y)$ is the distance between the semantic types of $x$ and $y$ in the ontology. Several approaches can be used to measure the semantic distance given the taxonomy in an ontology.27–29 We require the distance to be very large if the two semantic types are disjoint (that is, they have no common instance).
  • The temporal distance $d_t(x, y)$ is the time difference between the temporal annotations of $x$ and $y$ if both have temporal annotations.
  • The spatial distance $d_s(x, y)$ is the Euclidean distance between the spatial annotations of $x$ and $y$ if both have spatial annotations.
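A minimal sketch of the combined distance in Equation 7 follows, assuming semantic, temporal, and spatial distance functions are already available; the weight values shown are placeholders, not the values used by the authors.

import math

# Placeholder weights; the article sets these heuristically.
ALPHA_O, ALPHA_T, ALPHA_S = 1.0, 0.5, 0.5

def combined_distance(x, y, d_semantic, d_temporal, d_spatial):
    """Weighted combination of semantic, temporal, and spatial distances (Eq. 7)."""
    d = ALPHA_O * d_semantic(x.type, y.type)
    # Temporal and spatial terms apply only when both elements are annotated.
    if x.time is not None and y.time is not None:
        d += ALPHA_T * d_temporal(x.time, y.time)
    if x.location is not None and y.location is not None:
        d += ALPHA_S * d_spatial(x.location, y.location)
    return d

def euclidean(p, q):
    """Euclidean distance between two spatial annotations given as (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])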
Currently, we set the values of all the constants in the definition heuristically, but it is possible to optimize them by learning from annotated training data, as in earlier work. 30
Joint Inference Algorithm
We start with the simple union of the video and text parse graphs and then iteratively apply three types of operators that make changes to the joint parse graph: matching, deduction, and revision.
In a matching operator, we match a node a from the video parse graph with a node b from the text parse graph and merge them into a single node c, which inherits all the edges attached to either node a or node b. To reduce the search space, we only match nodes that have a small distance between them (as defined in Equation 7). The semantic type of the new node c is set to the more specific of the semantic types of nodes a and b. For example, if node a is of semantic type “Vehicle” and node b is of semantic type “Food_Truck,” then the semantic type of node c is set to “Food_Truck” because “Food_Truck” is a subtype of “Vehicle.” The temporal and spatial annotations of nodes a and b are averaged before being assigned to the new node c. After the merging, if two edges connected to node c have the same direction and a small distance between them (as defined in Equation 7), then the two edges are also merged.
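The following sketch illustrates one way the node-merging step of the matching operator could look, assuming an ontology helper that tests the subtype relation and nodes that carry semantic, temporal, and spatial annotations; the class and function names are illustrative, not taken from the authors' implementation.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    sem_type: str                                    # semantic type from the ontology
    time: Optional[float] = None                     # temporal annotation (seconds)
    location: Optional[Tuple[float, float]] = None   # spatial annotation (x, y)
    edges: List = field(default_factory=list)

def merge_nodes(a: Node, b: Node, is_subtype) -> Node:
    """Merge matched nodes a (from video) and b (from text) into a new node c.

    is_subtype(s, t) returns True if semantic type s is a subtype of t in the ontology.
    """
    # Keep the more specific semantic type, e.g., "Food_Truck" over "Vehicle".
    sem_type = a.sem_type if is_subtype(a.sem_type, b.sem_type) else b.sem_type

    # Average the temporal and spatial annotations when both nodes carry them;
    # otherwise keep whichever annotation is available.
    if a.time is not None and b.time is not None:
        time = (a.time + b.time) / 2.0
    else:
        time = a.time if a.time is not None else b.time
    if a.location is not None and b.location is not None:
        location = tuple((p + q) / 2.0 for p, q in zip(a.location, b.location))
    else:
        location = a.location if a.location is not None else b.location

    # The merged node inherits all edges attached to either input node.
    return Node(sem_type, time, location, edges=a.edges + b.edges)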
In a deduction operator, we insert a new subgraph into the joint parse graph, such that the prior probability of the joint parse graph as defined by the S/T/C-AOG is increased. Specifically, for each entity in the joint parse graph, we find the type of this entity in the S/T/C-AOG and insert an instantiation of its immediate surrounding subgraph in the AOG into the joint parse graph. Because the energy defined in the last section actually penalizes the size of the joint parse graph, we require the insertion to be minimal—that is, each inserted node or edge does not duplicate any existing node or edge. This can be achieved by following the deduction operator with a series of matching operators to match and merge the inserted nodes and edges with the existing parse graph.
In a revision operator, we either remove a subgraph from the parse graph, or change the annotation of a node or edge in the parse graph, to solve a conflict as defined in the S/T/C-AOG (for example, an instantiation of zero probability or instantiation of more than one child of an OR-node). Because the energy defined in the last section penalizes such revisions, we require the removal or change to be minimal—that is, no extra node or edge is removed or changed that is unnecessary to solve the conflict.
Our objective function is the joint posterior probability defined in Equation 1. Because both the prior probability P($pg_{jnt}$$pg_{jnt}$) (Equation 4) and the conditional probability P($pg_{vid}$$pg_{vid}$, $pg_{txt}$$pg_{txt}$ | $pg_{jnt}$$pg_{jnt}$) (Equation 6) can be factorized by the entities and relations in the parse graph, we can compute the change to the joint posterior probability caused by each operator.
For the surveillance data that we focus on in this article, a typical parse graph contains a few tens of nodes and edges, so we use depth-first search with pruning to optimize the joint parse graph. For data of larger scales, it is straightforward to extend our algorithm with beam search. Given the initial joint parse graph made of a simple union of the video and text parse graphs, we first apply the matching operators until no further matching is possible. We then apply the revision operators to solve conflicts and finally apply the deduction operators (with the follow-up matching operators) to deduce implicit information. Each operator typically admits a number of possibilities (for example, multiple nodes in the text parse graph can be matched with a specific node in the video parse graph), and we examine all the possibilities in a depth-first manner. To reduce the search space, we prune the possibilities that have significantly higher energy than the others.
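As a rough illustration of this search procedure, the sketch below enumerates candidate operator applications depth-first and prunes those whose energy is far above the best alternative; the margin and helper functions are hypothetical placeholders, not the authors' actual code.

PRUNE_MARGIN = 5.0  # hypothetical energy margin used to prune alternatives

def search(pg_jnt, candidate_ops, energy, best):
    """Depth-first search over operator applications with energy-based pruning.

    candidate_ops(pg): enumerates (operator, resulting_graph) pairs, applied in the
                       order matching, then revision, then deduction.
    energy(pg): negative log joint posterior of a candidate joint parse graph.
    best: dict holding the lowest-energy graph found so far.
    """
    if energy(pg_jnt) < best["energy"]:
        best["graph"], best["energy"] = pg_jnt, energy(pg_jnt)

    options = list(candidate_ops(pg_jnt))
    if not options:
        return
    min_e = min(energy(pg) for _, pg in options)
    for _, next_pg in options:
        # Prune alternatives whose energy is much higher than the best option.
        if energy(next_pg) - min_e <= PRUNE_MARGIN:
            search(next_pg, candidate_ops, energy, best)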
When applying the deduction operator, in some cases we may deduce multiple candidate subgraphs that have similar energy and are mutually exclusive (for example, different children of an OR-node). In other words, there is significant uncertainty in the deduction as measured by information entropy. For example, we may observe a man in the video but cannot see what he is doing because of low resolution or occlusion, and based on the background knowledge, we may infer a number of actions that are equally possible. Because we adopt the open-world assumption, it is reasonable to remain agnostic here and not add any new information. Specifically, we cancel the deduction operator if the entropy of the deduced subgraph is higher than a threshold:

$$H\left(pg_{de} \mid pg_{jnt}\right) = -\sum_{i=1}^{N} P\left(pg_{de}^i \mid pg_{jnt}\right) \log P\left(pg_{de}^i \mid pg_{jnt}\right) > \frac{\log N}{c}$$
where $pg_{jnt}$ is the joint parse graph before applying a deduction operator, $pg_{de}$ denotes the subgraph inserted by the deduction operator, $N$ is the number of possible candidate subgraphs that can be deduced by the operator, and $c > 1$ is a prespecified constant.
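A minimal sketch of this entropy test, assuming the conditional probabilities of the N candidate subgraphs have already been computed and normalized; the function name is illustrative.

import math

def should_cancel_deduction(candidate_probs, c=2.0):
    """Cancel a deduction if the entropy over its candidate subgraphs is too high.

    candidate_probs: normalized probabilities P(pg_de^i | pg_jnt) of the N candidates.
    c: prespecified constant > 1 controlling the threshold log(N) / c.
    """
    n = len(candidate_probs)
    if n <= 1:
        return False  # a single candidate carries no uncertainty
    entropy = -sum(p * math.log(p) for p in candidate_probs if p > 0)
    return entropy > math.log(n) / c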
Note that in video parsing we produce multiple candidate video parse graphs. Thus, for each video parse graph, we run the joint inference algorithm to find a joint parse graph, and then we output the parse graph triple 〈$pg_{vid}$, $pg_{txt}$, $pg_{jnt}$〉 with the highest joint posterior probability. Although the algorithm described here only outputs a single optimal joint parse graph for each pair of input video and text parse graphs, it is easy to adapt the algorithm to output multiple candidate joint parse graphs along with their joint posterior probabilities.
Answering Natural Language Queries
The joint parse graph is a semantic representation of the objects, scenes, and events contained in the input video and text, which is useful in many applications. In this section, we show how the joint parse graph can be used in semantic queries to answer who, what, when, where, and why questions and to provide a summary of scenes or events (such as how many people are in a scene).
Text Query Parsing
Given a simple plain text question about the objects, scenes, and events in the input video and text, we parse the question into a formal semantic query representation. Because our joint parse graph can be represented in RDF, we represent a query using SPARQL ( www.w3.org/TR/rdf-sparql-query/), the standard query language for RDF.
The text query parsing steps are identical to those in the text parsing module we introduced earlier. The attribute grammar for analyzing the dependencies is extended to include interrogative wh-words, such as who, what, when, where, and why. These wh-words indicate the query's objective, and the rest of the text query defines the query conditions. For example, the question “Who has a cap?” is parsed into the following dependencies and attributes:
 
nsubj(Event, who)
root(ROOT-0, Event);
Event.type = Possess
det(Object, Det)
dobj(Event, Object);
Object.type = Cap
The first dependency indicates that the query is asking for the agent of an event. The rest of the dependencies specify the event type (“Possess”) and the event patient (“Cap”). These query conditions are then converted into the SPARQL format:
 
SELECT ?name ?who1
WHERE {
?who1 hasName ?name.
?cap4 rdf:type Cap.
?has2 rdf:type Possess.
?has2 hasAgent ?who1.
?has2 hasPatient ?cap4.
}
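To make the conversion step concrete, here is a hypothetical sketch of how query conditions extracted from the dependencies (an event type and a patient type) might be turned into a SPARQL string; the predicate names hasAgent, hasPatient, and hasName mirror those in the query above, but the function itself is illustrative only.

def build_who_query(event_type, patient_type):
    """Build a SPARQL query asking for the agent of an event (illustrative)."""
    return f"""
SELECT ?name ?who
WHERE {{
  ?who  hasName    ?name .
  ?obj  rdf:type   {patient_type} .
  ?evt  rdf:type   {event_type} .
  ?evt  hasAgent   ?who .
  ?evt  hasPatient ?obj .
}}"""

# Example: "Who has a cap?" -> event type "Possess", patient type "Cap".
print(build_who_query("Possess", "Cap"))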
Queries that require aggregates can also be expressed in SPARQL via aggregate functions (COUNT, MAX, and so on) and grouping. For instance, the question “How many people are there in the scene?” is parsed into the following SPARQL query:
 
SELECT (COUNT(DISTINCT ?agent) as ?count)
WHERE {
?agent rdf:type Human
}
After the SPARQL query is generated, it is evaluated by a SPARQL query engine that essentially performs graph matching of the patterns in the query condition with the joint parse graph stored in the RDF knowledge base. One implementation of the SPARQL query engine is included in Apache Jena ( http://jena.apache.org). The values obtained from the query engine are then processed to produce the query answers.
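The article uses Apache Jena as the query engine; as an alternative illustration, the snippet below runs the same kind of query with Python's rdflib against a joint parse graph serialized to RDF (the file name and the namespace prefix are assumptions).

from rdflib import Graph

# Load the joint parse graph, assumed here to be serialized as Turtle.
g = Graph()
g.parse("joint_parse_graph.ttl", format="turtle")

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/ontology#>
SELECT ?name ?who
WHERE {
  ?who ex:hasName    ?name .
  ?cap rdf:type      ex:Cap .
  ?evt rdf:type      ex:Possess .
  ?evt ex:hasAgent   ?who .
  ?evt ex:hasPatient ?cap .
}
"""

# The query engine performs graph matching of the query patterns against the RDF store.
for row in g.query(query):
    print(row["name"], row["who"])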
Computing Answer Probabilities from Multiple Interpretations
Our system can produce multiple joint parse graphs, each associated with its posterior probability. These joint parse graphs correspond to different interpretations of the input video and text. To answer a query accurately, we execute the query on all the interpretations and then compare the collected answers. We associate the answers from different interpretations by matching their semantic types and spatial-temporal annotations. For matched answers, their probabilities are combined. Formally, the probability of an answer a is computed as

$$P(a) = \sum_{pg} P(pg)\, \mathbf{1}(pg \models a)$$
where $P(pg)$ denotes the posterior probability of the joint parse graph $pg$ and $\mathbf{1}(pg \models a)$ is the indicator function of whether parse graph $pg$ entails the answer a. In this way, different possible answers can be ranked based on their marginal probabilities. Figure 18 gives a schematic diagram of this process.
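A minimal sketch of this marginalization, assuming the query has already been parsed and that an answer-matching predicate is available; the function names are illustrative.

def answer_probabilities(interpretations, run_query, same_answer):
    """Marginal probability of each answer over multiple interpretations.

    interpretations: list of (parse_graph, posterior_probability) pairs.
    run_query(pg): returns the answers the query yields on parse graph pg.
    same_answer(a, b): True if two answers match by semantic type and
                       spatial-temporal annotations.
    """
    scored = []  # list of [representative_answer, accumulated_probability]
    for pg, prob in interpretations:
        for ans in run_query(pg):
            for entry in scored:
                if same_answer(entry[0], ans):
                    entry[1] += prob  # combine probabilities of matched answers
                    break
            else:
                scored.append([ans, prob])
    # Rank answers by their marginal probabilities.
    return sorted(scored, key=lambda e: e[1], reverse=True)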
Figure 18. The schematic diagram of querying multiple interpretations.
Text Query GUI Tool
We developed a GUI tool for text query answering; see Figure 19 for a screenshot. With this tool, the user types a query in natural language text, and the system parses the query and retrieves the result, which is shown on the left of the GUI. If the answer is an event, the text description of the event is also presented in the bottom right of the GUI. (The text descriptions are generated automatically from the joint parse graphs, which we will discuss in the following “Text Generation” section.) The corresponding video segment of the retrieved answer is shown on the top right with overlays surrounding the event's agent and patient. With this tool, the user can efficiently query about the events occurring in the video and review the relevant video segments. In our experiment, each query response typically took a few seconds.
Figure 19. Text query GUI. See www.stat.ucla.edu/∼tukw/JointParsing/demo.html#demo2 for a demonstration video.
Evaluation and Experiments
Datasets
We collected two datasets of surveillance videos and text descriptions in indoor (hallway) and outdoor (courtyard) scenes. Compared with other types of datasets (such as movies with screenplays and news report footage with closed captions), surveillance videos contain a well-defined set of objects, scenes, and events that is amenable to controlled experiments. The text accompanying the surveillance videos provides descriptions of the scenes, whereas other types of datasets may contain text that is not descriptive of the scenes, such as dialog and commentary. In addition, using surveillance videos we can control the level of detail in the text descriptions and study its impact on the performance of joint parsing.
For each of the two datasets, we first recorded the video of the scene and divided the video into 15 clips. Each video clip contains one or more events and lasts from a few seconds to more than one minute. Based on the objects and events captured in the video, we manually constructed the ontology and the S/T/C-AOG for each dataset.
We then asked five human subjects from different countries and backgrounds to provide text descriptions after watching the video clips. The human subjects were asked to only describe the events and the involved agents, objects, and fluents in the scenes using plain English. We provided the ontology to the human subjects to restrict the scope of their descriptions.
The human subjects were instructed to provide descriptions at three levels of detail so we could study how the number of text descriptions and their degrees of overlap with the video content impact performance. At the first level, the human subjects were told to give the simplest descriptions, consisting of one or two sentences. At the second level, they were asked to give more details (typically less than 10 sentences). And at the third level, they were asked to provide information that is not directly observable in the video but can be inferred from it (typically zero to two sentences). Figure 20 shows two examples from our datasets.
Figure 20. Examples from the (a) hallway and (b) courtyard datasets. The text descriptions shown on the right side are divided into the three levels of detail.
Qualitative Results
We ran our system to produce video, text, and joint parses from the datasets. On a typical video-text pair, video parsing took a few minutes to finish when running on a 12-node computer cluster, whereas text parsing and joint parsing finished within a few seconds on a desktop computer. We compared the information contained in the three types of parse graphs and made the following observations.
  • Prominent objects and events are typically contained in both the video and the text parse graphs (such as “a person drinking at the water fountain”).
  • Low-level visible details of objects, events, and fluents are often contained in the video parse graphs but not in the text parse graphs (such as “a person approaching the water fountain before drinking and the door closing after a person enters”).
  • Two types of information are typically contained in the text parse graphs but not in the video parse graphs. The first is information that is invisible in the video but can be inferred (such as “a person is thirsty because he is drinking water”). The second type of information is visible in the video but is not correctly detected by the video parser.
  • The joint parse graphs may contain information that can be found in neither the video nor the text parse graphs; this includes low-level details of objects and events that are not detected by the video parser or invisible information that is not mentioned in the input text. Such information is deduced in joint inference from other information in the video and text parse graphs.
Text Generation
We extend the text generator described in earlier work 31 to generate text descriptions from our joint parse graphs. The generated text presents the content of parse graphs in a user-friendly manner and helps nonexpert users to understand the joint parsing results.
The text generation process goes through two modules: a sentence planner and a sentence realizer. The sentence planner selects the content to be expressed, specifies hard sentence boundaries, and provides information about the content. During this process, the information from the parse graph is converted to a functional description, which specifies the sentence elements, such as the event, subjects, and objects, and their functional properties.32 Using a functional description as input, the sentence realizer generates the sentence by determining a grammatical form and performing word substitution. A head-driven phrase structure grammar (HPSG)33 of English syntax is used to generate the text sentences.
Table 3 shows some examples of the generated text. By comparing the input and generated texts, we can see that, when the input text is simple, the majority of the information in the joint parse graph comes from video parsing as well as deduction in joint inference. In turn, with more input text, an increasing part of the joint parse graph comes from text parsing.

Table 3. Example generated text from joint parse graph.*

First input video:

Input text (level 1): The person drinks water.
Generated text from the joint parse graph: A person is thirsty. He approaches a fountain. He drinks water at a fountain. He leaves a fountain at the bottom of the image between 0:40 and 0:45.

Input text (levels 1–3): The person drinks water. He walks to a fountain. He bends down to drink water and then leaves. He is thirsty.
Generated text from the joint parse graph: A person is thirsty. He approaches a fountain. He bends down. He drinks water at a fountain. He bends down because he wants to drink water. He leaves a fountain at the bottom of the image between 0:40 and 0:45.

Second input video:

Input text (level 1): 0:00–1:02 The person waits for food.
Generated text from the joint parse graph: A person is hungry. He waits for food between 0:00 and 1:02. He talks at a food truck between 0:00 and 0:38. He walks to a food truck between 0:39 and 0:44. He walks from a food truck between 0:52 and 1:02.

Input text (levels 1–3): 0:00–1:02 The person waits for food. 0:00–0:08 He stands and talks. 0:09–0:16 He bends down. 0:17–0:38 He squats. 0:39–0:45 He walks to the food truck. 0:46–0:51 He receives food. 0:51–0:54 He walks. 0:55–1:02 He walks away from the food truck. 0:00–1:02 He is hungry.
Generated text from the joint parse graph: A person is hungry. He waits for food between 0:00 and 1:02. He stands between 0:00 and 0:08. He talks at a food truck between 0:00 and 0:08. He bends down between 0:09 and 0:16. He squats between 0:17 and 0:38. He walks to a food truck between 0:39 and 0:45. He receives food between 0:46 and 0:51. He walks away from a food truck between 0:51 and 1:02.

* For each input video, we show the results produced from joint parsing using level 1 text and using text from all three levels.

We also evaluated the generated text based on the joint parse graph against the merged text descriptions from human subjects using quantitative metrics such as Bleu 34 and Meteor, 35 which were developed for evaluating machine translation. We find the resulting scores uninformative about the quality of joint parsing for two reasons: the automatic text generator often chooses different words and expressions than a human (for example, “bend because he wants to drink” versus “bend to drink”) and the generated text often contains information that is not included in the human text descriptions, such as the information from video parsing and deduction. These differences are heavily penalized by the machine translation evaluation metrics, although they either are unimportant or should be rewarded for evaluating joint parse graphs. Therefore, we adopted two alternative quantitative evaluation approaches.
Quantitative Evaluation
For each video clip in the datasets and each human subject, we used that subject's text descriptions of the clip as the text input to produce three joint parse graphs. The first joint parse graph is based on the level 1 text descriptions, the second on the combined text from levels 1 and 2, and the third on the text from all three levels. We consider four baseline parse graphs: the video parse graph and the three text parse graphs based on level 1 text, level 1+2 text, and the text from all three levels. We evaluated the joint parse graphs as well as the baseline parse graphs using two different evaluation approaches.
Comparison against Ground Truth.
For each video clip, we constructed the ground truth parse graph by merging the parses of all the text descriptions from all the human subjects except the one whose text description was used as the input in the current run of experiments. The merged parse graph was manually checked to ensure correctness.
We compared the parse graphs to be evaluated against the ground truth parse graph and measured the precision, recall, and F-score. Let $pg$ represent the parse graph to be evaluated, $pg^*$ the ground truth parse graph, $pg \cap pg^*$ the overlap between the two parse graphs, and $\|pg\|$ the size of parse graph $pg$:

$$\text{Precision} = \frac{\|pg \cap pg^*\|}{\|pg\|}, \qquad \text{Recall} = \frac{\|pg \cap pg^*\|}{\|pg^*\|}$$
F-score is defined as the harmonic mean of precision and recall. When computing precision and recall, to estimate the overlap between the parse graph to be evaluated and the ground truth parse graph, we ran a depth-first search using the matching operator described earlier to match the nodes and edges in the two parse graphs. Because there might be information that is correct but missing from our ground truth parse graph constructed from merged human text descriptions, we also manually checked the parse graph to be evaluated when computing its precision.
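A minimal sketch of these measures, assuming the overlap between the evaluated and ground-truth parse graphs has already been established (for example, via the matching operator) and that graphs are represented as sets of nodes and edges; the names are illustrative.

def precision_recall_f(pg, pg_star, overlap_size):
    """Precision, recall, and F-score of a parse graph against ground truth.

    pg, pg_star: sets of nodes and edges of the evaluated and ground-truth graphs.
    overlap_size: number of matched nodes and edges between the two graphs.
    """
    precision = overlap_size / len(pg) if pg else 0.0
    recall = overlap_size / len(pg_star) if pg_star else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score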
Figure 21 shows the precision, recall, and F-score of the baseline parse graphs and joint parse graphs averaged over all the video clips and human subjects of each dataset. It can be seen that all the precision scores are close to 1, suggesting the correctness of the information contained in all types of parse graphs. On the other hand, the recall and F-score of different types of parse graphs vary significantly. Each level of joint parse graphs has higher recall and hence higher F-score than the video parse graphs and the text parse graphs of the same level, which demonstrates the advantage of joint parsing. In particular, the joint parse graphs based on level 1 text have significantly higher recall and F-score than both the video parse graphs and the level 1 text parse graphs, implying that the information from the video and level 1 text is highly complementary; it also shows that providing a simple text description (such as level 1 text) to a video already boosts video understanding performance close to the level achieved with a full description of the video (all three levels of text combined). Adding more text (levels 2 and 3) into joint parsing further improves the joint parse graphs, but with diminishing returns.
Figure 21. The average precision, recall, and F-score of the video, text, and joint parse graphs compared against the ground truth parse graphs constructed from the merged human text descriptions. (a) Hallway, (b) courtyard.
Comparing the results of the two datasets, we can see that the recall scores of the text and joint parse graphs on the courtyard dataset are much lower than those on the hallway dataset. This is because the courtyard dataset contains more complicated events and therefore the text descriptions from the five human subjects differ more significantly. Here is an example of such a difference for the courtyard video clip:
  1. The person walks.
  2. The person walks and talks.
  3. The person walks out of a building.
  4. The person walks across the courtyard.
Because none of the text descriptions cover all the event details, the average recall of the text parse graphs is low. The joint parse graphs are produced with the text parse graphs as input, so their average recall is also lower on the courtyard dataset.
To compensate for this phenomenon in computing recall, we changed the ground truth parse graphs by keeping only the entities and relations that are mentioned by at least a given number of human subjects. The number of human subjects that mention an entity or relation indicates the degree of consensus among the human subjects regarding the existence and importance of the entity or relation. The maximum degree of consensus is 4/4 in our experiments because we used the text from four of the five human subjects to construct the ground truth parse graphs (excluding the provider of the input text of joint parsing).
Figure 22 shows the average recall scores of all types of parse graphs evaluated against the reconstructed ground truth parse graphs based on different degrees of consensus. The figure shows that all the recall scores are improved with the increase of the ground truth degree of consensus. Most of the improvements are large, implying that the text descriptions from different human subjects indeed differ, leading to significant changes of the ground truth parse graphs at different degrees of consensus. At the highest degree of consensus, the recall scores of most of the parse graphs are more than 0.9, suggesting that these parse graphs cover the most important entities and relations in the scene. Among all types of parse graphs, only the video parse graphs on the hallway dataset do not have large improvement in recall, suggesting the limitation of the video parser on the hallway dataset.
Figure 22. The average recall of the video, text, and joint parse graphs evaluated against the ground truth parse graphs reconstructed based on different degrees of consensus. (a) Hallway, (b) courtyard.
We further studied the influence of the degree of deduction on joint parsing. On a subset of the courtyard dataset that contains complex events, we adjusted the value of $\alpha_\phi$ to control the degree of deduction during joint inference and then measured the precision and recall of the resulting joint parse graphs. Figure 23 shows the results, which indicate that, with an increasing degree of deduction, at first the recall is greatly improved without decreasing the precision, suggesting that the deduced information is mostly correct. However, with further deduction the recall improves only modestly while the precision drops dramatically, which is caused by the increase of erroneous information being deduced. Eventually, the recall stops improving while the precision continues to drop, implying that all the additional information being deduced is false.
Figure 23. The influence of the degree of deduction on the precision and recall of joint parse graphs.
Query Answering Evaluation.
Earlier, we introduced how the joint parse graph produced by our system can be used in query answering. Here we evaluate the quality of the joint parse graph by measuring the accuracy of query answering. This approach to evaluating video understanding directly measures the utility of the system in human-computer interaction, going beyond conventional evaluation frameworks such as those based on classification or detection.
For each video clip in our datasets, we constructed a natural language query set that mimics the scenario in which a human user continuously queries the computer in order to acquire information about the objects, scenes, and events captured in the video clip. The natural language queries can take the forms of who, what, when, where, and why. We also asked a human subject to provide the correct answers for each query set. Table 4 shows an example query set and the correct answers. We then entered the queries into our system and retrieved the answers. (See www.stat.ucla.edu/∼tukw/JointParsing/demo.html#demo2 for a demonstration video of a user continuously asking queries using our GUI tool.)

Table 4. An example sequence of natural language queries and the correct answers for one video clip.

Query: Who is in the designated area?  Answer: A human.
Query: What is he doing?  Answer: Buying food; waiting; walking; picking up food.
Query: Where does he buy food?  Answer: At the food truck.
Query: Why does he buy food?  Answer: He is hungry.
Query: When does he pick up food?  Answer: 0:46–0:51.

We compared the answers produced by the system against the correct answers provided by humans. For each who/what/where/why query, we manually matched the system answers with the human answers and computed the precision (the percentage of the system answers that are correct) and recall (the percentage of the human answers that are returned by the system). For each when query, we checked the overlap between the time range returned by the system and that provided by the human and then computed the precision (the percentage of the time range returned by the system that is correct) and recall (the percentage of the correct time range that is covered by the system answer). We then computed the F-score.
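For the when queries, a minimal sketch of the time-range overlap computation, assuming each answer is a single (start, end) interval in seconds; the function name and example values are illustrative.

def when_precision_recall(system_range, human_range):
    """Precision and recall of a when answer based on time-range overlap.

    system_range, human_range: (start, end) intervals in seconds.
    """
    s0, s1 = system_range
    h0, h1 = human_range
    overlap = max(0.0, min(s1, h1) - max(s0, h0))
    precision = overlap / (s1 - s0) if s1 > s0 else 0.0
    recall = overlap / (h1 - h0) if h1 > h0 else 0.0
    return precision, recall

# Example: the system answers 0:44-0:52 while the correct range is 0:46-0:51.
print(when_precision_recall((44, 52), (46, 51)))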
Figure 24 shows the average F-score of the what, where, when, and why queries and of all the queries based on the video, text, and joint parse graphs. We do not show the results of the who queries because the video clips used in the experiments either contain or designate a single person and therefore the who queries become trivial to answer.
Figure 24. The average F-score of the what, where, when, and why queries and of all the queries based on the video, text, and joint parse graphs in the query answering experiments. (a) Hallway, (b) courtyard.
From the results, we can see that query answering based on the joint parse graphs clearly outperforms that based on the video and text parse graphs. Compared with the experimental results from the first evaluation approach ( Figure 21 ), in the query answering experiments the joint parse graphs have a significantly larger advantage over the video and text parse graphs. This is because answering a query involves perfect matching of the conditions of a query with one or more subgraphs in the parse graph. Therefore, any small error in the parse graph will be magnified in the query answering evaluation.
Among the four types of queries shown in Figure 24, the joint parse graphs have the largest advantage for the why queries over the video and text parse graphs because the causal relations required for answering the why queries are typically not detected in video parsing and not always mentioned in the input text, but they can be inferred by the deduction operator in joint inference. The advantage of the joint parse graphs for the where queries is also large because many events being queried are not detected in video parsing and the location information is sometimes skipped in the text (as the examples in Figure 20 show), whereas additional location information can be deduced in joint parsing. For the when queries, the average F-score of the text parse graphs is well below 1.0 although the time annotations provided by human subjects in the input text are accurate. This is because some of the events being queried are not described in the input text, so the corresponding when queries cannot be answered based on the text parse graphs.
Discussion
In future work, we will further improve both video parsing and text parsing in terms of parsing quality and efficiency. Currently, we manually construct the S/T/C-AOG to model objects, scenes, and events; in the future, we want to investigate automatic approaches to learning the AOG from data, which would facilitate the application of our approach to video-text data from novel domains. We also plan to test our system on video and text data that is more challenging than surveillance videos with human intelligence descriptions. One example is news report data, which contains more diverse content, exhibits larger gaps between video and text, and requires more sophisticated background knowledge in joint parsing.
This work is supported by DARPA grant FA 8650-11-1-7149, Office of Naval Research (ONR) grants N00014-10-1-0933 and N00014-11-C-0308, and US National Science Foundation Cyber-Enabled Discovery and Innovation (CDI) grant Computer and Network Systems (CNS) 1028381. We thank Mingtian Zhao, Yibiao Zhao, Ping Wei, Amy Morrow, Mohamed R. Amer, Dan Xie, and Sinisa Todorovic for their help in automatic video parsing.
1. S.L. Feng, R. Manmatha, and V. Lavrenko, “Multiple Bernoulli Relevance Models for Image and Video Annotation,” Proc. 2004 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), vol. 2, 2004, pp. 1002–1009.
2. K. Barnard et al., “Matching Words and Pictures,” J. Machine Learning Research, vol. 3, Mar. 2003, pp. 1107–1135.
3. F. Monay and D. Gatica-Perez, “Modeling Semantic Aspects for Cross-Media Image Indexing,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 10, 2007, pp. 1802–1817.
4. C. Wang, D. Blei, and F.-F. Li, “Simultaneous Image Classification and Annotation,” Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1903–1910.
5. C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
6. S.-C. Zhu and D. Mumford, “A Stochastic Grammar of Images,” J. Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 4, 2006, pp. 259–362.
7. Y. Zhao and S.-C. Zhu, “Image Parsing with Stochastic Scene Grammar,” Proc. 24th Ann. Advances in Neural Information Processing Systems (NIPS), 2011, pp. 73–81.
8. Y. Zhao and S.-C. Zhu, “Scene Parsing by Integrating Function, Geometry and Appearance Models,” Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3119–3126.
9. M. Pei, Y. Jia, and S.-C. Zhu, “Parsing Video Events with Goal Inference and Intent Prediction,” Proc. 2011 IEEE Int'l Conf. Computer Vision (ICCV), 2011, pp. 487–494.
10. A. Fire and S.-C. Zhu, “Using Causal Induction in Humans to Learn and Infer Causality from Video,” Proc. Ann. Meeting of the Cognitive Science Soc. (CogSci), 2013, pp. 2297–2309; http://mindmodeling.org/cogsci2013/papers/0419/paper0419.pdf.
11. T.R. Gruber, “A Translation Approach to Portable Ontology Specifications,” Knowledge Acquisition, vol. 5, no. 2, 1993, pp. 199–220.
12. J.H. Gennari et al., “The Evolution of Protégé: An Environment for Knowledge-Based Systems Development,” Int'l J. Human-Computer Studies, vol. 58, no. 1, 2003, pp. 89–123.
13. O. Corcho et al., “Webode: An Integrated Workbench for Ontology Representation, Reasoning, and Exchange,” Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web (EKAW), LNCS 2473, Springer, 2002, pp. 138–153.
14. L. Zhang et al., “Orient: Integrate Ontology Engineering into Industry Tooling Environment,” The Semantic Web – ISWC, LNCS 3298, Springer, 2004, pp. 823–837.
15. D. Foxvog, “Cyc,” Theory and Applications of Ontology: Computer Applications, Springer, 2010, pp. 259–278.
16. A. Maedche and S. Staab, “Ontology Learning for the Semantic Web,” IEEE Intelligent Systems, vol. 16, no. 2, 2001, pp. 72–79.
17. P. Buitelaar and B. Magnini, “Ontology Learning from Text: An Overview,” Ontology Learning from Text: Methods, Applications and Evaluation, P. Buitelaar, P. Cimiano, and B. Magnini, eds., IOS Press, 2005, pp. 3–12.
18. P. Cimiano, Ontology Learning and Population from Text: Algorithms, Evaluation and Applications, Springer, 2006.
19. E.T. Mueller, Commonsense Reasoning, Morgan Kaufmann, 2006.
20. S.J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., Prentice Hall, 2002.
21. Z. Si and S. Zhu, “Learning AND-OR Templates for Object Modeling and Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 9, 2012, pp. 2189–2205.
22. Z. Si et al., “Unsupervised Learning of Event AND-OR Grammar and Semantics from Video,” Proc. 2011 IEEE Int'l Conf. Computer Vision (ICCV), 2011, pp. 41–48.
23. J. Earley, “An Efficient Context-Free Parsing Algorithm,” Comm. ACM, vol. 26, no. 1, 1983, pp. 57–61.
24. J.R. Finkel, T. Grenager, and C. Manning, “Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling,” Proc. 43rd Ann. Meeting Assoc. Computational Linguistics (ACL), 2005, pp. 363–370.
25. M.-C. de Marneffe and C.D. Manning, “Stanford Typed Dependencies Manual,” 2008; http://nlp.stanford.edu/software/dependencies_manual.pdf.
26. A. Hakeem et al., “Semantic Video Search Using Natural Language Queries,” Proc. 17th ACM Int'l Conf. Multimedia (MM), 2009, pp. 605–608.
27. W.-N. Lee et al., “Comparison of Ontology-Based Semantic-Similarity Measures,” AMIA Ann. Symp. Proc., vol. 2008, 2008, pp. 384–388.
28. R. Thiagarajan, G. Manjunath, and M. Stumptner, “Computing Semantic Similarity Using Ontologies,” HP Labs, tech. report, 2008.
29. C. Pesquita et al., “Semantic Similarity in Biomedical Ontologies,” PLoS Computational Biology, vol. 5, no. 7, 2009, doi:10.1371/journal.pcbi.1000443.
30. J. Dodge et al., “Detecting Visual Text,” Proc. 2012 Conf. North Am. Chapter of the Assoc. Computational Linguistics: Human Language Technologies (NAACL), 2012, pp. 762–772.
31. B. Yao et al., “I2T: Image Parsing to Text Description,” Proc. IEEE, vol. 98, no. 8, 2010, pp. 1485–1508.
32. I. Langkilde and K. Knight, “Generation That Exploits Corpus-Based Statistical Knowledge,” Proc. 17th Int'l Conf. Computational Linguistics (ACL-COLING), vol. 1, 1998, pp. 704–710.
33. C. Pollard and I.A. Sag, Head-Driven Phrase Structure Grammar, Univ. of Chicago Press, 1994.
34. K. Papineni et al., “Bleu: A Method for Automatic Evaluation of Machine Translation,” Proc. 40th Ann. Meeting Assoc. Computational Linguistics (ACL), 2002, pp. 311–318.
35. S. Banerjee and A. Lavie, “Meteor: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments,” Proc. ACL Workshop Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
Kewei Tu is an assistant professor with the School of Information Science and Technology at ShanghaiTech University, China. His research interests include machine learning, natural language processing, and computer vision. Tu has a PhD in computer science from Iowa State University. Contact him at tukw@shanghaitech.edu.cn.
Meng Meng is a doctoral candidate in the Department of Statistics at the University of California, Los Angeles. Her research interests include communicative learning in scene understanding, analogical learning, digital visual art rendering, and human perception of likeness. Meng has an MS in computer science from Beijing Institute of Technology, China. Contact her at joycemeng@ucla.edu.
Mun Wai Lee is a lead scientist at Intelligent Automation, Inc. (IAI). His research interests include computer vision, machine learning, pattern recognition, statistical modeling, artificial intelligence, semantic web, and robotic vision. Lee has a PhD from the Department of Computer Science at the University of Southern California. Contact him at mlee@i-a-i.com.
Tae Eun Choe is a senior research scientist at ObjectVideo. His research interests include computer vision, video surveillance, scene understanding, and medical imaging. Choe has a PhD in computer science from the University of Southern California. Contact him at tchoe@objectvideo.com.
Song-Chun Zhu is a professor of statistics and computer science and the director of the Center for Vision, Learning, Cognition and Arts at the University of California, Los Angeles. His research interests include computer vision, statistical modeling and learning, cognition, and visual arts. Zhu has a PhD in computer science from Harvard University. He is a fellow of IEEE Computer Society. Contact him at sczhu@stat.ucla.edu.