
Browsing within Lecture Videos Based on the Chain Index of Speech Transcription

Stephan Repp
Christoph Meinel, IEEE

Pages: pp. 145-156

Abstract—The number of digital lecture video recordings has increased dramatically since recording technology became easier to use. The accessibility and ability to search within this large archive is limited and difficult. Additionally, detailed browsing in videos is not supported due to the lack of an explicit annotation. Manual annotation and segmentation is time-consuming and therefore useless. A promising approach is based on using the audio layer of a lecture recording to obtain information about the lecture's contents. In this paper, we present an indexing method for computer science courses based on their existing recorded videos. The transcriptions from a speech-recognition engine (SRE) are sufficient to create a chain index for detailed browsing inside a lecture video. The index structure and the evaluation of the supplied keywords are presented. The user interface for dynamic browsing of the e-learning contents concludes this paper.

Index Terms—Multimedia retrieval, speech recognition, index, semantic, user interface, browsing, recorded lecture videos.


In terms of streaming media, audiovisual recordings are used more and more frequently for correspondence-course institutions where, independent of time and place, learners can access libraries of recorded lectures. Fig. 1 illustrates an example of such a system 1 that delivers three main parts of the lecture: the desktop-recording (the captured desktop of the presenter's laptop) on the right side, a manual annotation in the lower left corner, and a video of the speaker in the upper left corner. In fact, such recorded lecture videos are ideal for correspondence courses and could be regarded as a complement to traditional classroom courses [ 1]. However, the accessibility and traceability of the content from large lecture archives for further use is rather limited.


Fig. 1. Information resources in recorded lecture videos.

Two major challenges arise while preparing recorded lectures for content-based retrieval: the automated indexing of multimedia videos and the retrieval of semantically appropriate information from a lecture knowledge base. It is evident that the rapid growth of multimedia data available in e-learning systems requires more efficient methods for content-based browsing and retrieval of video data. The requested information is often covered by only a few minutes of the lecture recording and is therefore hidden within a full 90-minute recording stored among thousands of others. It is often not the problem to find the proper lecture in the archive but rather to find the proper position inside the video stream. It is not practical for learners to watch the whole video to get the desired information inside the lecture video. The problem becomes how to retrieve the appropriate information in a large lecture-video database more efficiently. Segmentation of video lectures into smaller units, each segment related to a specific topic, is a highly necessary approach to finding the desired piece of information.

Traditional video retrieval based on feature extraction cannot be efficiently applied to lecture recordings. Lecture recordings are characterized by a homogeneous scene composition. Most of the time, the lecturer is in focus, presenting a topic which is not visible. Thus, image analysis of lecture videos fails even if the producer tries to loosen the scene with creative camera trajectories.

A promising approach is based on using the audio layer of a lecture recording to gain information about the lecture's contents (e.g., [ 2], [ 3], [ 4], [ 5], [ 6]). Transcriptions of lectures recorded live consist of unscripted and spontaneous speech. Thus, lecture data has much in common with casual or natural speech data, including false starts, extraneous filler words, and nonlexically filled pauses [ 7]. Furthermore, a transcript is a stream of words without punctuation marks. Of course, there is also great variation between one tutor's speech and another's. For example, one may speak accurately while the next speaks quite differently, with many grammatical errors. One can also easily observe that the colloquial nature of the data is dramatically different in style from the presentation of the same material in a textbook. In other words, the textual format is typically more concise and better organized. Furthermore, speech-recognition software produces error-prone output: approximately 20 to 30 percent of the detected words are incorrect.

The not-in-the-vocabulary problem is a dilemma of the recognition software as well [ 8]. The software needs a database of all words used by the lecturer for the transcription process. If a word that occurs in the speech of the lecturer is not in the vocabulary, the wrong word is transcribed by the engine. Also, changing the language in a presentation may lead the software to unsolvable problems. A lecture presented in German may sometimes use English terminology. This causes false words to be included in the transcript. Another issue is that the analysis of the audio layer of a lecture video recording provides two types of information: the spoken words (transliteration) and prosody. Prosodic cues such as pitch, duration, pauses, and energy distribution mark topic changes and points of interest [ 9]. A state-of-the-art speech-recognition engine (SRE) does not support this prosodic information and, further, most lecture recordings do not provide optimal sound quality. It has been shown in [ 8], however, that a keyword-based search in an imperfect transcript yields reliable results. Some universities 2 offer a keyword-based search for their lecture-video archive. However, such solutions fail if word-sense disambiguation is required, i.e., in the case of words with multiple uses or meanings.

Our focus is on recorded lectures from computer science. These kinds of lectures have a typical internal structure. The tutor talks about section topics. The section topics are parts of the main topic the tutor is talking about. For example, the tutor talks successively about the section topics "star-topology," "bus-topology," and "tree-topology." The main topic is "topology" in this case. The lecture structure in other courses (e.g., in social science, history of art, etc.) and the lecture structure between lecturers could vary and is not analyzed in our research. Further, the spoken language is a monologue and not a dialogue as it is in a seminar or in classroom courses.

In this article, we present our efforts at putting together results from different fields and projects in order to create a user interface for browsing educational lecture videos based on a chain index. Our proposed algorithm requires neither semantically annotated lecture videos nor an external knowledge source such as WordNet [ 10]. The automatically supplied keywords help to resolve word-sense ambiguity.

The structure of this article is as follows: Section 2 analyzes the current state of the indexing of lecture videos and shows that a new dynamic indexing technique is required. Section 3 describes our indexing technique and the advantages of such an index. To obtain an estimate of the accuracy of the retrieval, an exploratory study is presented in Section 4. A browsing system based on the index created, as well as the appropriate interface, is presented in Section 5.

2 Status of Lecture Video Indexing

The process of establishing access points to facilitate the retrieval of information is called indexing [ 11]. Indexing of a recorded lecture video requires two main steps. First, the video is split into coherent segments (topic segmentation), i.e., each segment represents a topic and stands for a specific assertion. Second, the topic of each segment has to be specified with descriptors, keywords, a summary, or a semantic description. Indexing is part of an Information Retrieval system. Information Retrieval (IR) deals with the representation, storage, and organization of, and access to, information items [ 12].

IR in lecture videos is a very active, multidisciplinary research area. Spoken language, any kind of action (e.g., demonstrations, experiments, writing on a blackboard, mouse clicks, etc.), the gestures of the speaker, social tagging, manual annotations, desktop-recordings (e.g., GUI of a demo program, slides, formulas, etc.), or the presented slides in document form could act as a source for better access to the contents of the lecture. Fig. 1 gives an overview of the resources in a recorded lecture video. Clearly, not all resources are available in every recorded lecture.

In this article, we do not consider systems that store the presentation in proprietary formats and have additional information about the activity in the lecture, such as [ 13] or [ 14], [ 15], [ 16]. For instance, Chu and Chen have already presented an approach for a cross-media synchronization [ 13]. They match audio recordings with text transliterations of these recordings based on dynamic programming methods [ 17]. Chu and Chen make use of explicitly encoded events for synchronization instead of implicitly automated annotation.

However, most lecture recordings that are accessible on the Web ordinarily consist of spoken speech and slides that are stored in a common video format (mpg, rm). The indexing of these lecture videos is highly important for further access.

There are four main indexing options for this kind of lecture recording. The first is that the audiovisual lecture recording is indexed manually after the lecture. A different approach is that the users of the videos do the indexing (annotation) themselves with tags, which is called tagging. The slides can also be used as a source for automatic indexing; the assumption behind this method is that a slide is a summary of the presenter's speech and thus a good source of data for the search. A fourth option is that the speech can be used as a resource for automatic indexing.

2.1 Manual

Here, both the segmentation and the annotation are done manually. Typically, a professional who is familiar with the subject does the segmentation and marks each segment with descriptors or other descriptions. This is done after the lecture in a time-consuming and expensive process. This may be useful for a small lecture archive, but it is not realistic for an existing large archive. It is clear that this solution obtains the best results, depending on the accuracy and the diligence of the professional.

As a complex example that is limited to a manageable number of video segments, we present a system for a semantic search based on Description Logics (DL) [ 18]. It has been recognized that a digital library benefits from having its contents understandable and available in a machine-processable form, and it is widely agreed that ontologies will play a key role in providing much of the enabling infrastructure to achieve this goal. A fundamental part of the system is a common domain ontology. The ontology is basically composed of a hierarchy of concepts (taxonomy) and a language. For the taxonomy, a list of semantically relevant words is created with regard to a domain, e.g., internetworking, and organized hierarchically. For the language, Description Logics are used to formalize the semantic annotations. Description Logics [ 19] are a family of knowledge representation formalisms that allow the knowledge of an application domain to be represented in a structured way and to be reasoned about. In DL, the conceptual knowledge of an application domain is represented in terms of concepts such as ${\rm IPAddress}$ and roles such as $\exists {\rm composedOf}$. Concepts denote sets of individuals and roles denote binary relations between individuals. Complex descriptions are built inductively using concept constructors which rely on basic concepts and role names. Concept descriptions are used to specify terminologies that define the intensional knowledge of an application domain. For example, the following states that a router is a network component that uses at least one IP address: ${\rm Router} \sqsubseteq {\rm NetComp} \sqcap \exists {\rm uses}.{\rm IPAddress}$. Definitions allow meaningful names to be given to concept descriptions such as ${\rm LO}_1 \equiv {\rm IPAddress} \sqcap \exists {\rm composedOf}.{\rm HostID}$.

Fig. 2. Example of terminology concerning learning objects.

To index a video, the video is split up into segments. One segment stands for one Learning Object (LO) where the lecturer speaks about one topic. The semantic annotation of the four LOs is shown in Fig. 2, describing the following contents:

  • ${\rm LO}_1$ : general explanation of IP addresses,
  • ${\rm LO}_2$ : explanation that IP addresses are used in the protocol TCP/IP,
  • ${\rm LO}_3$ : explanation that an IP-address is composed of a host identifier, and
  • ${\rm LO}_4$ : explanation that an IP-address is composed of a network identifier.

The use of DL has several advantages. First, DL terminologies can be serialized as OWL (the Web Ontology Language of the Semantic Web) [ 20], a machine-readable and standardized format for semantically annotating resources. Second, DL allows the definition of detailed semantic descriptions about resources and logical inference from these descriptions [ 19]. Finally, the link between DL and natural language (NL) has already been shown [ 21].

The query of the user is represented as an OWL description as well. The way the NL processing works is described in detail in [ 22], [ 18]. The DL query and all LO descriptions, both in OWL, are the input of the semantic search engine, which checks the LOs and determines the suitable ones. In tests, an ${\rm MRR}_5$ value 3 (Mean Reciprocal Rank of the answers [ 23]) of 75 percent is reached [ 24]. This system has been successfully used in a school in Luxembourg [ 25] for the domain of fraction arithmetic.
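As a rough illustration of how a reciprocal-rank score with a cutoff of five can be computed, consider the following minimal Python sketch. It assumes the common definition (the reciprocal of the rank of the first correct answer, counted as zero if no correct answer appears within the cutoff, averaged over all queries). The function name and the sample ranks are hypothetical and are not taken from [ 23] or [ 24].

```python
def mean_reciprocal_rank(first_correct_ranks, cutoff=5):
    """Average of 1/rank of the first correct answer per query.

    A rank of None, or a rank beyond the cutoff, contributes 0.
    """
    if not first_correct_ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in first_correct_ranks:
        if rank is not None and 1 <= rank <= cutoff:
            total += 1.0 / rank
    return total / len(first_correct_ranks)


# Hypothetical ranks of the first suitable learning object for four queries.
example_ranks = [1, 2, None, 1]
example_score = mean_reciprocal_rank(example_ranks)  # (1 + 0.5 + 0 + 1) / 4
```

Under this definition, a score of 75 percent means that, on average, the first suitable answer appears at or very near the top of the result list.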

When using DL, four types of manual tasks have to be done for the manual indexing:

  • Creating the taxonomy manually or using an existing one.
  • Creating dictionaries which map NL words to the concepts and roles.
  • Segmenting the videos into parts (LO).
  • Annotating the video segments with concepts and roles.

In contrast, purely manual indexing is based only on the segmentation of the videos into parts (LO) and the enrichment of the parts with their basic contents. Certainly, the manual process is very time-consuming and not really practical for the indexing of a large video library.

2.2 Tagging

The recent emergence and success of folksonomies and tagging have shown the great potential of this simple approach to collecting metadata about resources. Unlike traditional categorization systems, the process of tagging is nothing more than annotating documents with an unstructured list of keywords. Although the amount of research on tagging is still comparatively low, several studies have already analyzed the semantic aspects of this process and why it is so popular in practice [ 26], [ 27], [ 28]. Tagging strikes a balance between the individual and the community: The cost of participation is low for the individual and tagging a document benefits both the individual and the community.

An approach to video annotation is described in [ 29]. There the user is involved in the annotation process by deploying collaborative tagging for the generation and enrichment of video metadata annotation to support content-based video retrieval. When a user bookmarks a position in a lecture video, he/she stores it for later use. Users only bookmark documents that are valuable or relevant to them. When a document is of no interest to a specific user, it is unlikely that he/she would bookmark it, and when users actually do store bookmarks in order to find them again later, they have an incentive to add meaningful metadata to them.

Finding word associations for describing a document in the form of tags is a subjective user task. For the annotation of lecture videos, the timeline is a further dimension not present in text documents. The words in a text document are ordered too, but the exact time position of each word is lost. For the annotation to be useful and exact, the user has to annotate all the segments in the lecture video, not just the one(s) relevant to him/her. The quality and the quantity of tagging grow with the square of the number of users (Metcalfe's law 4), but it is very doubtful that enough learners tag any one lesson for the quality of the annotation to be adequate, given that the lecture videos of many universities cover similar topics.

To the best of our knowledge, no research exists on the quality of the annotation of lecture videos, nor do studies exist on the quality of the tagging information of lecture videos compared with a reference data set. The main disadvantage is that this annotation takes a lot of time and depends on many learners who have to watch the videos and actually make use of the ability to annotate.

2.3 Slides

Slides (e.g., PowerPoint slides) represent the main information in video segments, particularly in computer science [ 8], [ 30]. Hürst evaluated that the slides carry the most information for a keyword search. He evaluated a retrieval performance based on slides of a 40 percent precision value and a 63 percent recall value compared to a precision of 33 percent and a recall of 54 percent based on corrected transcriptions. The recall and the precision value are standard evaluation measures in Information Retrieval [ 12].
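For readers unfamiliar with the two measures, the following minimal Python sketch shows how precision and recall are commonly computed for a single query. The function name and the segment identifiers are hypothetical; the sketch illustrates the standard definitions only and does not reproduce the evaluation setup of the cited study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved items / all retrieved items
    recall    = relevant retrieved items / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: five segments returned, three segments actually relevant.
example_precision, example_recall = precision_recall(
    retrieved=["s1", "s2", "s3", "s4", "s5"],
    relevant=["s2", "s5", "s9"],
)
```

Here two of the five returned segments are relevant (precision 2/5), and two of the three relevant segments are found (recall 2/3).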

One problem is the synchronization of the slides and the video stream. An approach is described in [ 31] for synchronizing presentation slides by maintaining a log file during the presentation that keeps track of slide changes. Sack and Waitelonis [ 31] mention optical character recognition (OCR) for the identification and synchronization of the presentation slide currently being shown within a desktop-recording. If the log files and the slides are given, then these annotations yield good keyword-based retrieval results. However, OCR is error-prone and tailored to the specific video format and to the PowerPoint format. Another solution is to synchronize the existing PowerPoint slides with the speech of the lecture recordings [ 32]. Repp et al. synchronized the speech transcription with the slides to within plus or minus one slide.

If slides exist for the lectures and the time stamps of each slide are available, then this source is one of the best for access to the lecture contents [ 30]. However, most lecture recordings available neither support desktop recordings nor maintain a dedicated log file with the appropriate slides. So, the speech itself remains the only reliable source of information despite positive results from these methods.

2.4 Transcripts of Speech Recognition Engines

The spoken language is one of the main information resources in lectures. The lecturer speaks about a topic with a characteristic vocabulary. Each transition of the vocabulary (or other features in the spoken language, e.g., a pause) could mark a segment boundary, and so the video could be divided into video sections. The task is to partition the text into a sequence of topically coherent segments and thereby induce a content structure (called topic segmentation). For each video segment, an index could be created with one of several standard indexing methods [ 11], [ 12]; e.g., the distribution of words in the video segments is used for the automatic indexing process.

Topic segmentation has been extensively used in text information retrieval and text summarization [ 33], [ 34], [ 9], [ 35], [ 36]. The user would prefer a document passage in which the occurrence of the word or topic is concentrated in one or two passages. The development of text segmentation algorithms is a central concern in natural language processing. This can be stated as the issue of detecting the boundaries with standard text segmentation algorithms.

For the first step in automatically indexing the lecture video based on the transcripts, a small test is arranged. Standard text segmentation algorithms are implemented to show whether it is possible to recognize the segments based on spoken language. A more detailed study about this research is presented in [ 37].

2.4.1 Test Setup

After a preprocessing step which deletes all the stop words (i.e., words without semantic relevancy) and transforms all words into their canonical form (stemming) [ 38], the following algorithms are implemented:

  • Linear. A linear distribution of the topics during the presentation time is implemented. The algorithm assumes the number of topics is given.
  • Pause. The duration of silence (pause in speech) is used as a feature for a segment boundary. The time stamps of the longest silences are used as the boundaries. The algorithm assumes the number of topics is given.
  • C99 and LCseg. The number of boundaries is given and a sentence length of 10 words is defined. A segment boundary is assumed to be a text boundary and can be detected by the C99 [ 39] and the LCseg [ 40] algorithms. The description and the implementations of the algorithms can be downloaded from the appropriate author pages [ 39], [ 40]. In this way the number of segments is provided to the algorithms. A stop-word list and a stemmer, both for the German language, are adapted to C99 and LCseg.
  • SlidingWindow ( SW). The SlidingWindow is based on the TextTiling [ 41] algorithm and on the research of [ 42]. A sliding window system is implemented. This window (120 words) moves across the text stream of the transcript over a certain interval (20 words) and compares the neighboring windows with each other using the cosine-measure. After postprocessing, the points with the lowest similarity become a boundary.
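The SlidingWindow idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the tests: it assumes the transcript is already a list of stemmed, stop-word-free tokens, it compares the window to the left and to the right of each candidate position by the cosine of their term-frequency vectors, and the postprocessing (here, skipping candidates that lie within one window of an already chosen boundary) is an assumption, since the text does not specify it. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sliding_window_boundaries(tokens, n_boundaries, window=120, step=20):
    """Return word positions of the n_boundaries least similar window gaps."""
    scores = []
    for pos in range(window, len(tokens) - window + 1, step):
        left = Counter(tokens[pos - window:pos])
        right = Counter(tokens[pos:pos + window])
        scores.append((cosine(left, right), pos))
    chosen = []
    # Assumed postprocessing: take the lowest similarities first and skip
    # candidates within one window of an already chosen boundary.
    for _, pos in sorted(scores):
        if all(abs(pos - c) >= window for c in chosen):
            chosen.append(pos)
        if len(chosen) == n_boundaries:
            break
    return sorted(chosen)


# Toy transcript: the vocabulary changes completely at word position 40.
toy_tokens = ["net"] * 40 + ["web"] * 40
toy_boundaries = sliding_window_boundaries(toy_tokens, 1, window=20, step=10)
```

To turn the returned word positions into boundary times, they would have to be mapped to the time stamps that the SRE attaches to each word.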

2.4.2 Metrics

The WindowDiff [ 43] measurement ( WinDiff) is used as a standard evaluation measure for text segmentation. The implementation of WindowDiff is taken from [ 44]. 5 The WindowDiff measurement does not take into account the time delay between the calculated boundary time and the real boundary time [ 32]. For this reason, the mean error rate and the standard deviation are used as well. The mean error rate ( $\bar{x}$ ) is calculated from the differences between the point in time of the reference and the point in time achieved by the algorithm for each topic boundary in the course. The mean of these signed differences is the average offset of the time shift. Additionally, we calculate the mean of all absolute differences ( $\bar{y}$ ). Further, the standard deviation ( SD) indicates how much the data are spread out from the mean. If the data are close to the mean, the standard deviation is small and the algorithm matches the boundaries very well.
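The three time-based measures can be sketched as follows. This is a minimal Python illustration under stated assumptions: reference and calculated boundaries are paired in chronological order (possible because the number of boundaries is given), the sign convention is calculated time minus reference time, and the population standard deviation is used. None of these choices is specified in the text, and the function name and sample values are hypothetical.

```python
import statistics


def boundary_time_errors(reference, hypothesis):
    """Return (mean offset, mean absolute offset, standard deviation) in seconds."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("need two non-empty boundary lists of equal length")
    # Assumption: the i-th calculated boundary belongs to the i-th reference boundary.
    diffs = [h - r for r, h in zip(sorted(reference), sorted(hypothesis))]
    x_bar = statistics.fmean(diffs)                    # signed offset of the time shift
    y_bar = statistics.fmean([abs(d) for d in diffs])  # mean absolute difference
    sd = statistics.pstdev(diffs)                      # spread around the mean offset
    return x_bar, y_bar, sd


# Hypothetical boundary times in seconds.
example_x, example_y, example_sd = boundary_time_errors(
    reference=[100, 200, 300],
    hypothesis=[110, 190, 330],
)
```

A small signed mean combined with a large absolute mean indicates that early and late boundaries cancel each other out, which is why both values are reported.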

2.4.3 First Test, Segment Boundaries Are the Slide Transitions

The segment boundaries are assumed to be the slide transitions. The question posed is how well standard text segmentation algorithms can segment the erroneous transcript into coherent segments without any additional resources like the PowerPoint slides. The count of segment boundaries (slide numbers) is given for each lecture. Table 1 shows a summary of the data set's contents. The data set for the first test includes two different speakers, two different languages (German and English), and three different topics (WWW, Semantic, and Security). For the first two minutes of each lecture, the word accuracy is determined manually as depicted in [ 45].

Table 1. Summary of the Lecture Series Archive

Table 2 shows the results of the first test as the mean, the standard deviation, and the WindowDiff. "SlidingWindow" has the best mean value. Table 2 further shows that the "Linear" segmentation yields better time results ( $\bar{y}$ and the standard deviation) than the "C99," "LCseg," "SlidingWindow," and "Pause" algorithms for the data set. For the WindowDiff measurement, the simple "Pause" algorithm seems to be the best one.

Table 2. Results (in Seconds) of the First Test Based on the Data Set of Table 1, Segment Boundaries Are the Slide Transitions

2.4.4 Second Test, on Real Segment Boundaries

The problem with relying primarily on the slides for creating segments is that these segments will always be partially wrong because speech is dynamic. The tutor uses his freedom to discuss topics not classified by the slides. Taking this into consideration, three people (a lecturer and two PhD students) were asked to agree on a "gold standard" of boundaries. One lecture is randomly selected from Mr. Meinel's WWW course and a perfect transcript (corrected by one person) is generated. This lecture of approximately 100 minutes in duration consists of 12,608 words, and its word error rate is nearly 0 percent. Test scenario A is based on the 62 slide transitions as boundaries and test scenario B is based on 42 "real" boundaries.

Table 3 shows the mean, the standard deviation, and the WindowDiff of this second test. It is surprising that the simple "Pause" algorithm outperforms the other algorithms on the WindowDiff measure for the data set whose boundaries were defined by people. Most algorithms achieve better results for the data set with the "real" boundaries than for the slide transitions. Hence, the results show that slide transitions are not a good segmentation for topic boundaries.

Table 3. Results (in Seconds) of the Second Test Based on an Improved Transcript with a WER $\approx$ 0 Percent

2.4.5 Conclusion

The overall results are disappointing: they show that detecting the topic segments (slide transitions or real boundaries) is practically impossible. A time shift of +/- 2 to 3 minutes (SD of the C99 algorithm, second test) occurs. If the segment count is not given to the algorithms, the results are much worse because the algorithms additionally have to determine the boundary count. Further, the definition of a topic segment is not clear. Malioutov and Barzilay [ 44] show that the three lecturers chosen to segment the lecture did so differently. They defined between 8.9 and 13 topic segments per lecture. This discrepancy in the segmentation results is highly critical for information retrieval.

2.5 Current Approaches versus the Ideal Approach

All current approaches, such as manual indexing, using only the slide, tagging, or using the speech transcripts in an ordinary information search, have serious disadvantages.

It is clear that manual indexing achieves the best retrieval results, but it is too time-consuming and too expensive for the libraries. Further, although the slides are a good resource, the problem of segmentation within the lecture is not resolved (see the second test). The speaker usually stays on the same topic for one or more slides, and so one slide is not a topic, but rather a topic consists of more than two slides [ 44]. Additionally, the question of what a topic consists of is highly debatable. Moreover, most lecture recordings available neither support desktop recordings nor maintain a dedicated log file, which poses further problems. Tagging, on the other hand, does not ensure a consistent annotation of the video streams if the videos are only watched by a few learners. Learners add some tags, but neither objectively nor automatically. Finally, classical retrieval on the transcription (segmenting, then extracting keywords) fails at the first step, the segmentation: the topic boundaries (or slide transitions) in the lecture video cannot be found reliably.

Thus, we need a better method, without an explicit definition of a segment, in order to give users the freedom to browse the lecture and find the desired point in the lecture.


In this section, we describe our solution for the indexing of lecture videos based on the speech transcript. First, we explain our indexing technique, called "chaining." Second, we explain why this enhances the search within lecture videos.

3.1 Chaining

Clustering is used to detect cohesive areas—we call them chains—in the transcript. Linguistic research has shown that word repetition in a text is a hint for creating thematic cohesion. A change in the lexical distributions is usually a signal for topic transitions [ 46], [ 47].

The word stream of the raw data can contain all parts of speech, such as a noun, verb, number, etc. A term is any stemmed word within the lecture. From that stream, the distinct terms $T$ are stored in a term list $L$ . In other words, the list $L$ consists of all distinct word stems detected by the SRE of the lecture (without stop words). $n$ is the count of the distinct terms.



A chain is constructed to consist of all repetitions ranging from the first to the last appearance of the term in the lecture. The chain is divided into subparts when there is a long break (time distance d) between the terms. The chain is a segment of accumulated appearances of equal terms. The process works as follows:

  1. take the term $T_1$ from the term-list $L$ ,
  2. build clusters—we call them chains—so that the distance between two adjacent terms $T_i$ is not more than a distance $d$ , count the occurrences (TN), and set the start time and end time for the chain,
  3. store the chain data in the database as a "chain index" for the video, and
  4. take the next distinct term $T_{i+1}$ from the term-list $L$ and go to 2).

The chains can overlap because they have been separately generated for each term $T_i$ used in the course. For all chains that have been identified for the terms, a weighting scheme is used. Chains containing more repeated terms receive higher scores. The term number TN is the ranking value for the chain; the higher the value of TN, the higher the relevance of the chain. In a preliminary experiment, we had a precision of 88 percent for the first relevant chain [ 48].
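The chaining steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the experiments: the input is assumed to be a list of (stemmed term, time in seconds) pairs with stop words already removed, the distance $d$ is given in seconds, a single occurrence is treated as a chain with TN = 1, and the result is returned as a list instead of being stored in a database. The names `Chain` and `build_chain_index` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chain:
    term: str
    start: float  # time of the first occurrence in the chain (seconds)
    end: float    # time of the last occurrence in the chain (seconds)
    tn: int       # number of occurrences of the term in the chain (TN)


def build_chain_index(occurrences, d):
    """Build chains from (term, time_in_seconds) pairs.

    Adjacent occurrences of the same term stay in one chain as long as
    they are at most d seconds apart; a longer break starts a new chain.
    """
    times_by_term = defaultdict(list)
    for term, time in occurrences:
        times_by_term[term].append(time)

    chains = []
    for term, times in times_by_term.items():
        times.sort()
        start = prev = times[0]
        count = 1
        for t in times[1:]:
            if t - prev > d:
                # Break longer than d: close the current chain, open a new one.
                chains.append(Chain(term, start, prev, count))
                start, count = t, 0
            prev = t
            count += 1
        chains.append(Chain(term, start, prev, count))

    # Rank by TN: chains with more repetitions are considered more relevant.
    chains.sort(key=lambda c: c.tn, reverse=True)
    return chains


# Hypothetical word stream of one lecture.
example_stream = [
    ("topology", 1600), ("topology", 1700), ("ring", 1720),
    ("topology", 1750), ("topology", 2600), ("topology", 2650),
]
example_index = build_chain_index(example_stream, d=300)
```

Because chains are built per term, chains of different terms may overlap in time, which is what later allows one chain to lie inside another.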

Table 4 and Fig. 3 show an example of the chain index. For example, in the first chain "Topology," the stem "topology" occurred 10 times and the chain has a start time of 1600s and an end time of 2400s in the specific lecture. Further, there is a second chain "Topology" that starts at 2600s and ends at 3300s, but this chain has only four word repetitions, so it is ranked lower than the first chain.

Table 4. Example of the Chain Index

Fig. 3. Chains in a chronology sequence.

3.2 Resolving Word Disambiguation

The chain index helps to resolve word-sense ambiguity. In Fig. 4, the term topology is used in the context of network and the term ring is used in the context of topology, so the sense of each word is clear from the context in which it is used. Moreover, the index supports a keyword register for each video segment. In Fig. 4, the search query of the user is IP, and the chain with the highest TN and its inside chains ( Address, UDP, Suffix, Header, TCP) are returned. The inside chains represent the content of the video segment IP. In fact, a chain has a before, after, and superordinate area, too. These areas (before, inside, after, and superordinate) themselves consist of areas. The user has the opportunity to browse through these areas to find the semantically proper position in the video.

Fig. 4. The Inside, Before, After, and Superordinate Area for the Query IP.

Furthermore, the problem of composite words can be solved with the help of the chain index. The query IP Address leads to the intersection of the chains IP and Address being returned to the user.
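How inside chains, superordinate chains, and the intersection for a composite query might be derived from the chain index is sketched below in Python. The interval rules are assumptions made for illustration: a chain is treated as inside the query chain if its time span lies completely within it, as superordinate if it completely encloses it, and the intersection of two chains is their shared time span. The text does not define these areas formally, and all names and sample values are hypothetical.

```python
from typing import NamedTuple


class Chain(NamedTuple):
    term: str
    start: float  # seconds
    end: float    # seconds
    tn: int       # number of term occurrences (TN)


def inside_chains(query, chains):
    """Chains of other terms that lie completely within the query chain."""
    return [c for c in chains
            if c.term != query.term
            and query.start <= c.start and c.end <= query.end]


def superordinate_chains(query, chains):
    """Chains of other terms that completely enclose the query chain."""
    return [c for c in chains
            if c.term != query.term
            and c.start <= query.start and query.end <= c.end]


def intersect(a, b):
    """Time span shared by two chains, or None if they do not overlap."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return (start, end) if start <= end else None


# Hypothetical chain index of one lecture.
ip = Chain("ip", 1000, 2000, 12)
address = Chain("address", 1200, 1500, 5)
network = Chain("network", 500, 3000, 20)
udp = Chain("udp", 1900, 2100, 3)
all_chains = [ip, address, network, udp]

ip_inside = inside_chains(ip, all_chains)
ip_super = superordinate_chains(ip, all_chains)
ip_address_span = intersect(ip, address)
```

With these sample values, the composite query IP Address would point the user to the span from 1200s to 1500s, where both terms are in use.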


In this section, we describe our evaluation of the index. First, we explain our experimental setup. Second, we evaluate the accuracy of the chaining in two tests.

4.1 Experimental Setup

The video database consists of the course "Einführung in das WWW," held in German in the first semester of 2006 at the Hasso Plattner Institut in Potsdam. The second row of Table 1 summarizes the features of the data set. The data set consists of 24 lectures; each lecture is approximately 90 minutes in length. The video corpus has an overall length of approximately 1,860 minutes and is stored as RealMedia files. The SRE needs a training phase (10 minutes) for adapting the microphone to the SRE. The dictionary of the SRE is supplemented with an existing domain lexicon or, if they exist, the keywords from the slides. The domain words (in our case, the keywords from the PowerPoint slides) are trained with a standard tool in 20 minutes, so the training phase for the SRE takes approximately 30 minutes in total. Note that this training phase is done only once for the lecture series. Our purpose is neither to enhance an existing SRE nor to develop a new speaker-independent SRE; our purpose is to use an existing SRE in practice. If no training is done by the lecturer, the accuracy is lower. Hürst analyzed the dependency between training phase and word accuracy [ 31].

4.2 The Distance $d$

To find a proper break distance $d$ for the chaining, we vary $d$ between 0.5 and 10 minutes and measure the accuracy for 10 keywords. The 10 keywords are randomly selected from the domain lexicon. Fig. 5 shows the results of this test. In this figure, C(orrect) means that the retrieved video segment is correct and is within +/- 30 seconds. It is debatable whether this interval (between 2 minutes and 8 minutes) is an acceptable value, in part because only 10 words were tested and the measurement involved only one speaker, but it gives a first indication of that value. Also, the question of whether there is a new segment or not does not arise because, in the topic returned, the keyword is definitely used by the speaker. The second test provides more details.
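The effect of $d$ can be illustrated with a small sketch. It assumes that a chain is broken whenever two consecutive occurrences of a term are more than $d$ seconds apart; the actual chaining procedure is the one given in Section 3.1, and the function name and occurrence times here are invented.

```python
def build_chains(times, d):
    """Group the occurrence times (seconds) of one term into chains.

    Assumption: a gap larger than d seconds between consecutive
    occurrences starts a new chain. Returns (start, end, TN) tuples,
    where TN counts the occurrences inside the chain.
    """
    chains = []
    start = prev = None
    count = 0
    for t in sorted(times):
        if prev is not None and t - prev > d:
            chains.append((start, prev, count))
            start, count = t, 0
        if start is None:
            start = t
        prev = t
        count += 1
    if prev is not None:
        chains.append((start, prev, count))
    return chains


occurrences = [100, 160, 400, 430, 2000]  # invented time stamps
for d in (30, 180, 600):
    print(d, build_chains(occurrences, d))
```

In this toy input, a small $d$ splits the occurrences into many short chains with low TN, while a large $d$ merges them into fewer, longer chains. This is the trade-off the test above explores.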

Figure    Fig. 5. Accuracy of the results as a function of the distance factor $d$.

4.3 Accuracy of the Chaining

One hundred fifty-three keywords (topic words) are chosen for the evaluation (e.g., XML, DSL, ISDN, ATM, Token, SGML, topology, mpeg, security, exam...). These words occur at least once in the transcript of the course and differ from those in the first test. A detailed study of keyword searches in speech transcripts of lecture recordings is presented in [ 30]. We evaluated whether our generated chain represents the topic for the search word. For this, we took only the chain with the highest term number TN and judged whether it represents the subject accurately, whether it is the best section in the whole course, and how large the time shift is between the start of the calculated chain and the start of the topic in the lecture.

We evaluated whether each hit in the result set:

  1. is correct and is within +/- 30 seconds (C),
  2. is correct and is around the area of +/- 120 seconds (CA),
  3. is similar to the area of the topic and is within +/- 30 seconds (S),
  4. is similar to the area of the topic and is around the area of +/- 120 seconds (SA), and
  5. is not correct and not in the area of the beginning of the chain (W).

Fig. 6 shows this classification in detail. "Similar to the area of the topic" means that the lecturer speaks about a similar or related topic. For example, if the search keyword is topology, several positions typically exist in the lecture series where it is discussed. The correct position is the description of topology itself and not a similar area such as ATM topology or anything else related to it. Deciding whether a position is correct is admittedly a subjective task, but the special nature of this lecture series makes it less so. The lecture is a basic tutorial on internetworking, so each lesson introduces several new terms and definitions, and each is presented only once in the whole lecture series. The lecturer mentions the terms on several occasions, but the decision concerns only whether a position is the one where the lecturer explains the term or definition. A few retrieved positions are not easy to decide, for example, Address. The speaker discusses Address in the context of Mobile Address, Internet Address, Mailing Address, etc. In this case, we decided that all of these positions are correct.
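The five classes can be written down as a small decision function. This is a sketch under two assumptions: the judgment "correct," "similar," or "wrong" is supplied by a human assessor, and any hit that is more than 120 seconds away from the start of the topic is counted as W. The function name is hypothetical.

```python
def classify_hit(relation, offset_seconds):
    """Map a judged hit to one of the classes C, CA, S, SA, or W.

    relation: 'correct', 'similar', or 'wrong' (human judgment).
    offset_seconds: start of the chain minus start of the topic.
    Assumption: offsets beyond 120 seconds always yield W.
    """
    off = abs(offset_seconds)
    if relation == "correct":
        if off <= 30:
            return "C"
        if off <= 120:
            return "CA"
    elif relation == "similar":
        if off <= 30:
            return "S"
        if off <= 120:
            return "SA"
    return "W"


print(classify_hit("correct", -20))  # C
print(classify_hit("similar", 95))   # SA
```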

Figure    Fig. 6. Classification of the results.

Table 5 depicts the results of the evaluation: 59.4 percent of all hits are correct and only 13.1 percent are completely wrong. If we add the similar hits (similar to the topic) to the correct hits, then 86.8 percent of all hits are correct or similar.

Table 5. Results of the Accuracy of the Chaining with $d=180$ Seconds

Comparing these results with the results of the test from the section Transcriptions of Speech-Recognition Engine is complex. In that section, the segmentation problem leads to very imprecise results (about +/- 2.5 minutes for given segment boundaries; this value is higher when segment boundaries are not given). Additionally, the retrieval within these segments and the keywords representing each segment are not perfect, and one error leads to further errors. Finding the right time position in the video stream could not be done better than with our method, although our retrieval performance for the best chain depends on the accuracy of the SRE. This dependency is evaluated in [ 30].


Browsing is a subjective selective process for filtering a large object set for navigation by the user [ 49]. It is the activity of engaging in a series of glimpses, each of which exposes the browser to objects of potential interest. Depending on the interest of the user, the browser may or may not examine more closely one or more of the objects. Depending on the interest, this may or may not lead the browser to acquire the objects [ 50]. In contrast, retrieval is the process of recalling and finding information and sending it to the user. The above system based on the chain index is able to retrieve topic words from the text stream for identifying relevant positions in the lecture video. The user has to select the appropriate information he/she is searching for from the list provided and, further, has the opportunity to browse within the objects.

In this section, we present a schematic overview of our components for the browsing system illustrated by Fig. 7. The system is organized in three basic functional components, described in the following sections.


Figure    Fig. 7. Components of the system.

5.1 Speech Component

Lectures are recorded in a multimedia form, for example, as RealMedia files (.rm) in our implementation. The conversion of the audio data into text and the preprocessing of that text is the task of this component. An out-of-the-box SRE is used to generate text data with a time-stamp on each word [ 6], [ 32]. The SRE needs a training phase for adapting the microphone to it. The dictionary of the SRE is supplemented with an existing domain lexicon or, if they exist, the keywords from the slides. The training phase's usefulness for obtaining accurate transcripts certainly depends on the user's motivation.

The transcript consists of a list of words with the corresponding point in time when the word was spotted in the speaker's flow of words. A preprocessing step deletes all the stop words (i.e., words without semantic relevancy) and transforms all words into their canonical form (stemming) [ 38]. Then, the resulting words with their time stamp—we call them raw data—are stored in a database.
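The preprocessing step can be sketched as follows. The stop-word list and the suffix rule are deliberately tiny stand-ins (a real system would use a full stop-word list and a proper stemmer such as the one cited above), English words are used only for readability although the course is held in German, and the names `STOP_WORDS`, `toy_stem`, and `preprocess` are hypothetical.

```python
STOP_WORDS = {"the", "a", "of", "is", "and", "in"}  # illustrative only


def toy_stem(word):
    """Crude plural stripping; a stand-in for a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(transcript):
    """Turn (word, time) pairs from the SRE into raw data.

    Stop words are dropped; the remaining words are lowercased and
    stemmed, and each keeps its time stamp (seconds).
    """
    raw = []
    for word, t in transcript:
        w = word.lower()
        if w in STOP_WORDS:
            continue
        raw.append((toy_stem(w), t))
    return raw


transcript = [
    ("The", 12.0), ("topology", 12.4), ("of", 12.9),
    ("networks", 13.1), ("is", 13.6), ("a", 13.8), ("ring", 14.0),
]
raw_data = preprocess(transcript)
print(raw_data)
```

The resulting list of (stem, time) pairs corresponds to the raw data that is stored in the database and later passed to the chaining component.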

5.2 Chaining Component

The automatic clustering is the task of this component. The input is the raw data from the speech component and the output data is an index of the term chains. This component is presented in Section 3.1.

5.3 Web Access

This well-known component consists of a Web server. The Web server uses the chain index to answer requests from Web browsers. The system is implemented with the Django framework. 6 Django implements the Model-View-Controller concept. The model represents the data of the application (chain index), the view manages the elements of the user interface, and the controller communicates user actions to the model. The out-of-the-box SRE needs approximately 3.5 hours (dual core 2.4 GHz, 2 GB) for the transcription process of a lecture (100 minutes, 12,000 words) and less than 3 seconds for the chaining. The calculation is done once after the lesson, and the later search process is not time critical. The search process only accesses the chain index, which is based on integer values.

Figure    Fig. 8. User interface with time information.

The user interface ( Fig. 8) consists of three regions (see also [ 51], [ 52]):

  1. The first region is set for the input of the search query and, additionally, if the information is available, the restriction of the language and the related lecture.
  2. The second region is the result of the first search. It illustrates the results with the time information and exemplifies the results for the most relevant chains for a lecture.
  3. The third region shows the summary of the before, after, and inside areas and their chains.

After a click on an adequate chain, an external player plays the video from the starting-point and, additionally, the third region of our site is expanded. The third region consists of the new before, after, and inside areas of the played chain.


It is clear that the results produced by such a search tool depend on the accuracy of the SRE. Some incorrectly detected words and incorrect compound words are not a problem for our system and user interface. A term occurs very often in its relevant chains, so a high redundancy exists. Thus, some wrongly detected words have no influence on the generated and final results [ 8]. The main problem, however, is that even a state-of-the-art SRE cannot recognize words from different languages in the transcription process. Lectures in German, especially lectures in informatics, contain some English words and phrases. An example of this problem is the English word "Source," for which the SRE detects the German word "Soße." To address this, an editor tool was developed for the correction of the main chains in the areas before, inside, after, and superordinate.

The editor tool ( Fig. 9) has an input box for selecting the keyword. After the selection, the chains are expanded to an initial timeline and the working area ("Working Frame"). In some cases, the video does not start at the beginning of the sought-after video segment. The user can watch the appropriate area and can then change the start and end time of the chain. The user can also change the word of the chain if it was incorrectly detected by the SRE, and can browse inside the chain to have a detailed look at it. After the editing process, the user can store the changes. The saved state ("Approved Frame") is then depicted under the working area. In addition, when a chain (or a word) is corrected, all corresponding terms in the raw data are corrected and the indexing (chaining) starts again. The intention is not to correct the whole transcript, which would be too time-consuming, but to correct the main and fundamental chains and words. This editor can be used as a multifunctional annotation tool as well.


Figure    Fig. 9. Editor tool for the correction of the chains.

If the speaker-independent SRE had a better performance, then the training phase, the additional domain words, and the editing process would not be necessary, so our system would not necessarily need any additional resources.

Conclusion and Further Works

In this paper, we have evaluated a system that allows browsing in sections of videos from a multimedia knowledge base. The results show that it is possible to add data to the result set that supports the learners with the helpful information they are searching for.

Regardless of the issues presented by an imperfect SRE, this user interface allows exact, easy, and fast navigation in the video archive. It also allows the disambiguation of words. The results of the tests demonstrate that the browsing and word disambiguation are effective and that learners are indeed able to retrieve the results they are searching for.

Additionally, we are planning a usability study with learners to support the results we have already obtained. The searching process and the user interface will be evaluated during this test. Furthermore, we are working on improving the ranking. The part of speech (noun, verb, number, etc.), the time-duration of the chain, and the sum of chains for a term and the occurrence of the term in different lectures could be helpful parameters for a new ranking algorithm.

We are also planning to embed an MPEG-7 annotation in an MPEG-4 video container [ 53]. With MPEG-4 Binary Format for Scenes (BIFS), it is possible to create a scene description with navigational elements and special search facilities which require annotation. This is a compact representation of audiovisual information in a single container with all information needed to compose, consume, share, and distribute this data. This offers a wide range of applications in video and audio retrieval, such as semantic search and content classification.

The application of our algorithm is not limited to indexing university lectures or presentations in general. All activity applications, e.g., newscasts, theater plays, video material, or any kind of speech complemented by textual data, could be analyzed and annotated with the help of the proposed algorithm.


This project was developed in the context of the Web University project, 7 which aims to explore novel Internet and IT technologies in order to enhance university teaching and research. Special thanks go to Carol Ebbert for correcting the language and to the student Johannes Köhler for the implementation of the interface components.


About the Authors

Stephan Repp received the telecommunication engineering degree from the University of Applied Sciences Trier, Germany, in 1998. He received the master's degree in system design from the "Hochschule Darmstadt," Germany, in 2000. He worked as an IT project manager in the data warehouse project of the "Deutsche Post AG." He received the state examination from the "Staatliches Studienseminar Trier" in teaching informatics and electrical engineering in 2004. He is currently a PhD student at the Hasso-Plattner-Institute (HPI) for IT-Systems Engineering at the University of Potsdam and a teacher for informatics at the "Berufsbildende Schule für Gewerbe und Technik" in Trier. His current research interests revolve around information retrieval from recorded audiovisual lecture videos. He is involved in the Web-University project of the HPI.
Andreas Groß received the computer science engineering degree from the University Rostock, Germany, in 2000. He is a scientific coworker and PhD student at the Hasso-Plattner-Institute for Software Systems Engineering in Potsdam, Germany. His current research interests revolve around information retrieval from recorded audiovisual lecture videos. He is working as the chair for Internet technologies and systems with Dr. Christoph Meinel and is involved in the development of the Web-University project space enhancement of the tele-TASK video recording system.
Christoph Meinel studied mathematics and computer science at Humboldt University in Berlin. He received the doctorate degree in 1981 and was habilitated in 1988. After visiting positions at the University of Paderborn and the Max-Planck-Institute for computer science in Saarbrücken, he became a full professor of computer science at the University of Trier. He is now the president and CEO of the Hasso-Plattner-Institute for IT-Systems Engineering at the University of Potsdam. He is a full professor of computer science with a chair in Internet technology and systems. His research focuses on IT-security engineering, teleteaching, and telemedicine. He is the author of more than 300 peer-reviewed scientific papers, the chief editor of ECCC—Electronic Colloquium on Computational Complexity and IT-Gipfelblog, the chairman of the German IPv6 council, and a member of various scientific boards and program committees. In that time, his research focus was on complexity theory and on BDD-based data structures for VLSI design. Later, he became interested in Internet research, particularly in Internet and information security, as well as in innovative forms of teleteaching. From 1998 to 2002, he was the founding director of the Institut für Telematik, e.V. in Trier. He is a member of the IEEE.