3.3.1 Coreference Cohesion This type of cohesion occurs when a noun, pronoun, or noun-phrase refers to another constituent in the text. For example, consider the following two sentences: 1) Bob decided to clean his carpets, 2) so Bob went into the store to purchase a vacuum cleaner. In this example, the word Bob in the first sentence is a coreferent to the word Bob in the second sentence. This is an example of noun overlap. Coreference cohesion can also be measured by morphological stem overlap. The word cleaner in sentence 2 shares the same morphological stem (i.e., “clean”) as the word clean in sentence 1, although one is a noun and the other a verb. Coh-Metrix computes the proportion of adjacent sentences with noun or stem overlap; it also computes overlap within an information window size of two sentences (see C1-C4 in Table 2).
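The adjacent-sentence overlap measure can be illustrated with a minimal sketch. It is not the Coh-Metrix implementation: the suffix-stripping `crude_stem` function, the `STOPWORDS` set, and the function names are assumptions made for illustration, and the sketch treats every non-stopword as a candidate for overlap instead of restricting the match to nouns.

```python
import re

# Toy suffix list (an assumption); a real system would use a proper stemmer.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")
# Small hypothetical stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "to", "so", "his", "into", "of", "and", "he", "she", "it"}


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(sentence):
    """Return the set of crude stems of the non-stopword tokens in a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {crude_stem(w) for w in words if w not in STOPWORDS}


def adjacent_overlap(sentences):
    """Proportion of adjacent sentence pairs that share at least one stem."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if stems(a) & stems(b))
    return hits / len(pairs)


example = [
    "Bob decided to clean his carpets",
    "so Bob went into the store to purchase a vacuum cleaner",
]
print(adjacent_overlap(example))  # the pair shares "bob" and the stem "clean"
```

In the two example sentences, both the repeated noun Bob and the shared stem of clean and cleaner produce an overlap, so the single adjacent pair counts as cohesive.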
3.3.2 Pronoun Referential Cohesion Pronoun referential cohesion occurs when pronouns in a text have a definite referent [ 38]. For example, consider the following sentences: 1) Jim has had a hard day at work; 2) so he winds down with a beer. The pronoun he in sentence 2 refers to the noun Jim in sentence 1. Binding pronouns to previously defined entities in the text plays a significant role in grounding the discourse. Unreferenced pronouns have a negative effect on the cohesiveness and thereby text comprehension. Pronoun resolution is a difficult and open computational linguistics problem [ 38], so Coh-Metrix measures pronoun referential cohesion by computing the proportion of pronouns in the current sentence that have at least one grounded referent in a previous sentence (C5).
3.3.3 Causal Cohesion Causal cohesion occurs when actions and events in the text are connected by causal connectives and other linking word particles [ 11]. Events and actions have main verbs that are designated as intentional or causal (e.g., “kill,” “impact”), as determined by categories in the WordNet lexicon [ 39]. Causal particles connect these events and actions with connectives, adverbs, and other word categories that link ideas (e.g., “because,” “consequently,” “hence”). Coh-Metrix provides measures on the incidence of causal verb categories (occurrences per 1,000 words) (C6). The most significant measure of causal cohesion is the causal ratio that specifies the ratio of causal particles to events and actions (C7). A high causal ratio indicates that there are many connectives and other particles that stitch together the explicit actions and events in the text.
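The incidence score and the causal ratio can be sketched as follows. The two word sets are tiny hypothetical stand-ins built from the examples in the text (Coh-Metrix derives its verb categories from WordNet), and returning `None` when a text contains no causal verbs is an assumption of this sketch, not a documented behavior of the tool.

```python
import re

# Hypothetical mini-lexicons for illustration only.
CAUSAL_VERBS = {"kill", "impact"}
CAUSAL_PARTICLES = {"because", "consequently", "hence"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def incidence_per_1000(count, n_words):
    """Occurrences per 1,000 words."""
    return 1000.0 * count / n_words if n_words else 0.0


def causal_measures(text):
    """Return the causal verb incidence and the particle-to-verb ratio."""
    tokens = tokenize(text)
    n_verbs = sum(t in CAUSAL_VERBS for t in tokens)
    n_particles = sum(t in CAUSAL_PARTICLES for t in tokens)
    return {
        "causal_verb_incidence": incidence_per_1000(n_verbs, len(tokens)),
        # Undefined when there are no causal verbs (assumed convention).
        "causal_ratio": n_particles / n_verbs if n_verbs else None,
    }


sample = ("The storm will impact the coast because the winds are strong; "
          "hence the waves may kill the crops.")
print(causal_measures(sample))
```

The sample contains two causal verbs and two causal particles, so every event in it is explicitly linked and the ratio is 1.0.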
3.3.4 Semantic Cohesion In addition to the coreference variables discussed earlier, Coh-Metrix assesses conceptual overlap between sentences by a statistical model of word meaning: latent semantic analysis [ 32]. LSA is a statistical technique for representing world knowledge, based on a large corpus of texts. The central intuition is that two words have similarity in meaning to the extent that they occur in similar contexts. For example, the word hammer will be highly associated with words of the same functional context, such as screwdriver, tool, and construction. LSA uses a statistical technique called singular value decomposition to condense a very large corpus of texts to 100-500 dimensions [ 32]. The conceptual similarity between any two excerpts of text (e.g., word, clause, sentence, text) is computed as the geometric cosine between the values and weighted dimensions of the two text excerpts. The value of the cosine typically varies from 0 to 1.
Coh-Metrix uses the LSA cosine scores between text segments to assess semantic cohesion. Adjacent sentences that have higher LSA overlap scores (i.e., higher semantic similarity) are more cohesive than adjacent sentences with low LSA scores. Both mean LSA scores and standard deviations in the scores are computed (see C9-C11). A semantic cohesion gap is expected to occur for adjacent sentences with low LSA scores and also when there is a high standard deviation (because of an occasional adjacency with a very low LSA score). The given information measure (C12) computes the extent to which an incoming sentence is redundant with (i.e., has high LSA overlap with) the previous sentences in the dialogue history for a particular problem.
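The cosine computation and the mean and standard deviation over adjacent sentences can be sketched as below. The sketch assumes that each sentence has already been mapped to a vector in the reduced LSA space; the singular value decomposition step is not shown, the three-dimensional vectors are invented for illustration, and the use of the population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


def adjacent_cohesion(vectors):
    """Cosines of adjacent sentence vectors, with their mean and std. deviation."""
    scores = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return scores, mean(scores), pstdev(scores)


# Hypothetical sentence vectors; real LSA spaces have 100-500 dimensions.
sentence_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
scores, avg, sd = adjacent_cohesion(sentence_vectors)
print(scores, avg, sd)
```

Here the first two sentences are close in the space and the third is not, so the second adjacency has a low cosine and inflates the standard deviation, which is the pattern the text describes as a cohesion gap.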
3.3.5 Connectives Connectives are words and phrases that signal cohesion relationships by explicitly linking ideas expressed in a text [ 11]. Coh-Metrix provides incidence scores on several types of connectives. In addition to all categories of connectives (C13), which include the causal and intentional connectives to assess causal cohesion, we segregated the temporal (e.g., “before,” “when”), additive (e.g., “also”), and conditional (e.g., “if,” “else”) connectives. Temporal and additive connectives have both negative (e.g., “however,” “in contrast”) and positive valences (e.g., “therefore,” “in addition”) (C13-C18).
3.3.6 Other Measures Coh-Metrix provides many measures of words and language in addition to the primary cohesion measures. We included a measure of the incidence of negations (C19). There were measures of the degree of abstraction of nouns and verbs in the text (C20, C21), obtained from the hypernym index in WordNet [ 39]; lower hypernym values indicate that the word is more abstract. The incidence of content words (e.g., nouns, main verbs, adverbs, adjectives) was also included (C22) as a measure of the amount of substantive content in the text. The type-token ratio for content words is an index of lexical diversity, with a value of 1.0 meaning that each word was used once in the text whereas values nearer to zero mean that words were frequently repeated (C23). The last two measures assessed reading ease (C24) and verbosity (C25). Reading ease was measured by the Flesch-Kincaid Reading Ease score [ 40], whereas verbosity was measured as the length of the longest sentence in the text.
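Two of these measures are simple enough to sketch directly. The function names are hypothetical, the content words are assumed to have been extracted already (no part-of-speech tagging is shown), and measuring sentence length in words is an assumption, since the text does not state the unit.

```python
import re


def type_token_ratio(content_words):
    """Number of distinct words divided by the total number of words."""
    if not content_words:
        return 0.0
    return len(set(content_words)) / len(content_words)


def longest_sentence_length(text):
    """Length, in words, of the longest sentence (assumed unit: words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return max((len(s.split()) for s in sentences), default=0)


print(type_token_ratio(["cpu", "ram", "cpu", "rom"]))  # 3 types / 4 tokens
print(longest_sentence_length("This file contains the operating system. Right."))
```

A response that repeats "cpu" yields a ratio of 0.75, whereas a response in which every content word is new yields 1.0.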
1. AUTOTUTOR. When you turn on the computer, how is the operating system first activated and loaded into RAM? [this is the main question].
2. STUDENT. When the computer is turned on a file is automatically booted up.
3. AUTOTUTOR. Anything else? [this is a pump for more information].
4. STUDENT. This file contains the operating system.
5. AUTOTUTOR. What about storage? [this is a hint].
6. STUDENT. The operating system is then read into RAM and activated.
7. AUTOTUTOR. Right. The CPU executes instructions in ROM, which helps the operating system to start. [after giving positive feedback, there is an assertion].
4.3.1 Interaction with AutoTutor Students interacted with AutoTutor for 32 minutes on one of three randomly assigned topics in computer literacy: hardware, Internet, or operating systems. Each of these topics had 12 questions that required about a paragraph of information (three to seven sentences) in an ideal answer. The participants never completed all 12 questions during the 32 minutes; the mean number of questions completed was .
A video of the participant's face and a video of their computer screen were recorded throughout the tutoring session. The video included the synthesized speech generated by the animated conversational agent.
4.3.2 Judging Affective States Similar to a cued-recall procedure [ 42], the judgments for a student's tutoring session proceeded by playing a video of the face along with the screen capture video of the interaction with AutoTutor on a dual-monitor computer system. The screen capture included the tutor's synthesized speech, printed text, the student's responses, the dialogue history, and images, in essence recreating the context of the tutorial interaction.
Judges were instructed to make judgments on what affective states were present at any moment during the tutoring session by manually pausing the videos (spontaneous judgments). They were also instructed to make judgments at each 20-second interval where the video automatically stopped (fixed judgments). Judges were provided with a checklist of seven states for them to mark along with definitions of the states (see [ 13] for definitions of the emotions). Hence, judgments were made on the basis of the student's facial expressions, contextual cues via the screen capture, the definitions of the cognitive-affective states, and recent memories of the interaction (for self-reports only as described below).
Four sets of judgments were made for the observed affective states of each AutoTutor session. First, for the self-judgments, the student watched his or her own session with the tutor immediately after having interacted with AutoTutor. Second, for the peer judgments, each student came back a week later to watch and judge another student's session. Finally, there were two trained judges: undergraduate research assistants who were trained extensively on AutoTutor's dialogue characteristics (i.e., the context) and how to detect facial action units according to Ekman's Facial Action Coding System [ 43]. The two trained judges scored all sessions separately.
Three points pertaining to the present affect judgment methodology are worth noting. First, this procedure was adopted because it affords monitoring participants' affective states at multiple points, with minimal task interference, and without participants knowing that these states were being monitored while completing the learning task. Second, this retrospective affect-judgment method has been previously validated [ 42], and analyses comparing these offline affect judgments with online measures encompassing self-reports and observations by judges have produced similar distributions of emotions (see [ 9] for a review). Third, the offline affect annotations obtained via this retrospective protocol correlate with online recordings of facial activity and gross body movements in expected directions [ 44]. Although no method is without its limitations, the present method appears to be a viable approach to track emotions at a relatively fine-grained temporal resolution.
4.4.1 Proportions of Affective States Experienced The affect judgment procedure yielded 2,967 self-judgments, 3,012 peer judgments, and 2,995 and 3,093 judgments for the two trained judges. We examined the proportion of judgments that were made for each of the affect categories, averaging over the 4 judges. A repeated measures ANOVA indicated that there was a statistically significant difference in the distribution of states, , , , partial eta-square . Bonferroni post-hoc tests revealed that the proportional occurrence of boredom ( ), confusion ( ), flow/engagement ( ), frustration ( ), and neutral ( ) were on par and significantly greater than delight ( ) and surprise ( ), which were equivalent to each other. Delight and surprise were excluded from the subsequent analyses because they were relatively rare.
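The proportion computation can be sketched as follows. The label lists are invented, the function names are hypothetical, and the sketch averages pooled proportions over judges; the analysis reported here presumably computed proportions per participant before the ANOVA, which is not reproduced.

```python
from collections import Counter

# The seven states on the judges' checklist.
STATES = ["boredom", "confusion", "delight", "flow",
          "frustration", "neutral", "surprise"]


def state_proportions(labels):
    """Proportion of one judge's judgments that fall in each state."""
    counts = Counter(labels)
    total = len(labels)
    return {s: (counts[s] / total if total else 0.0) for s in STATES}


def mean_proportions(judgments_by_judge):
    """Average the per-judge proportions over all judges."""
    per_judge = [state_proportions(v) for v in judgments_by_judge.values()]
    return {s: sum(p[s] for p in per_judge) / len(per_judge) for s in STATES}


# Hypothetical judgments from two of the four judges.
toy = {
    "self": ["boredom", "flow"],
    "peer": ["boredom", "boredom"],
}
print(mean_proportions(toy))
```

Averaging proportions, not raw counts, gives each judge equal weight even though the judges produced different numbers of judgments.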
4.4.2 Reliability between Judges We evaluated the reliability by which the affective states were rated by the four judges. Proportional agreement scores for the six judge pairs were: self-peer (0.279), self-judge1 (0.364), self-judge2 (0.330), peer-judge1 (0.394), peer-judge2 (0.368), and judge1-judge2 (0.520). These scores indicate that the trained judges had the highest agreement, the self-peer pair had the lowest agreement, and the other pairs of judges were in between. Another finding is that there are actor-observer differences in the agreement scores. The average actor-observer agreement was 0.324 (i.e., average of self-peer, self-judge1, and self-judge2), which is lower than the average observer-observer agreement score of 0.427 (i.e., average of peer-judge1, peer-judge2, judge1-judge2).
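The two averages can be checked directly from the six pairwise scores reported above. The `proportional_agreement` function is a sketch that assumes the two judges' labels are already aligned to the same judgment points; how the judgments were aligned is not described in this section.

```python
from statistics import mean


def proportional_agreement(labels_a, labels_b):
    """Fraction of aligned judgment points on which two judges agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be aligned and of equal length")
    if not labels_a:
        return 0.0
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Pairwise agreement scores as reported in the text.
AGREEMENT = {
    ("self", "peer"): 0.279,
    ("self", "judge1"): 0.364,
    ("self", "judge2"): 0.330,
    ("peer", "judge1"): 0.394,
    ("peer", "judge2"): 0.368,
    ("judge1", "judge2"): 0.520,
}

# Actor-observer pairs involve the self; observer-observer pairs do not.
actor_observer = mean(v for pair, v in AGREEMENT.items() if "self" in pair)
observer_observer = mean(v for pair, v in AGREEMENT.items() if "self" not in pair)
print(round(actor_observer, 3), round(observer_observer, 3))  # 0.324 0.427
```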
Although the agreement scores appear to be low, they are on par with data reported by other researchers who have assessed the problem of measuring complex psychological constructs, such as emotions [ 45], [ 46], [ 47], [ 48]. Agreement is low when emotions are not intentionally elicited, contextual factors play an important role, and the unit of analysis is on individual emotion events. It is unclear who provides the most accurate judgments of the learner's affective states [ 13]. Is it the self, the untrained peer, the trained judges, or physiological instrumentation? A neutral, but defensible position is to independently consider ratings of the different judges, thereby allowing us to examine patterns that generalize across judges as well as patterns that are sensitive to individual judges. This strategy was adopted in the current paper.
4.4.3 Extracting Transcripts of Tutorial Dialogues At the end of each student turn, AutoTutor maintained a log file that captured the student's response, a variety of assessments of the response, feedback provided, and tutor's next move. Transcripts of the tutorial dialogue between the student and the tutor were extracted for each problem that was collaboratively solved during the tutorial session. The tutorial sessions yielded 164 student-tutor dialogue transcripts. The transcripts contained 1,637 student and tutor turns.
Two sets of responses were obtained from the transcripts. The first, called student responses, was obtained by considering only the student turns in each transcript. The second, called tutor responses, consisted of the tutor's statements. The purpose of dividing the transcripts into separate student and tutor responses is to assess the impact of each response category in predicting the student's emotions.
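The split into student and tutor responses can be sketched as below. The sketch assumes turns formatted like the sample dialogue shown earlier (a turn number, a speaker label, the text, and an optional bracketed annotation); the actual AutoTutor log format is not described here, and the regular expressions and function name are hypothetical.

```python
import re

# Assumed turn format: "<number>. <SPEAKER>. <text> [optional annotation]."
TURN = re.compile(r"^\s*\d+\.\s+(AUTOTUTOR|STUDENT)\.\s+(.*)$")
ANNOTATION = re.compile(r"\s*\[[^\]]*\]\.?\s*$")


def split_responses(lines):
    """Separate a transcript into student responses and tutor responses."""
    responses = {"student": [], "tutor": []}
    for line in lines:
        match = TURN.match(line)
        if not match:
            continue  # skip lines that are not dialogue turns
        speaker, text = match.groups()
        text = ANNOTATION.sub("", text)  # drop the trailing bracketed note
        key = "tutor" if speaker == "AUTOTUTOR" else "student"
        responses[key].append(text)
    return responses


transcript = [
    "5. AUTOTUTOR. What about storage? [this is a hint].",
    "6. STUDENT. The operating system is then read into RAM and activated.",
]
print(split_responses(transcript))
```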
4.4.4 Computing LIWC and Coh-Metrix Features Psychological and linguistic features were computed for the student and tutor dialogues using LIWC 2007. This resulted in 60 features (30 for each response category). Similarly, 50 cohesion features (25 for each response category) were computed using Coh-Metrix 2.0. The text submitted to each computational tool consisted of the responses generated during the solution of an individual problem. An aggregated score for each predictor was derived for each subject by averaging the scores across problems. Hence, the unit of analysis for the subsequent analyses was an individual subject.
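The aggregation step, in which per-problem scores are averaged into one score per subject, can be sketched as follows. The row layout, the identifiers, and the feature name are hypothetical; the sketch only illustrates the averaging and does not call LIWC or Coh-Metrix.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_subject(rows):
    """Average each feature across problems to get one value per subject.

    rows: iterable of (subject_id, problem_id, {feature_name: score}).
    """
    collected = defaultdict(lambda: defaultdict(list))
    for subject, _problem, features in rows:
        for name, value in features.items():
            collected[subject][name].append(value)
    return {
        subject: {name: mean(values) for name, values in feats.items()}
        for subject, feats in collected.items()
    }


# Hypothetical per-problem scores for two subjects.
rows = [
    ("s1", "p1", {"student_negations": 10.0}),
    ("s1", "p2", {"student_negations": 20.0}),
    ("s2", "p1", {"student_negations": 5.0}),
]
print(aggregate_by_subject(rows))
```

Because subjects completed different numbers of problems, averaging within subject keeps each subject as a single observation in the later regression models.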
6.2.1 Linguistic Predictors None of the linguistic predictors correlated with frustration, so we constructed models for boredom, , , confusion, , , and flow, , . Negations were the only significant predictor of boredom. An examination of the tutorial dialogue of highly bored students indicated that these students used a large number of negations (e.g., “no,” “never”) as well as negative frozen expressions such as “I don't care” or incessantly repeating, “I don't know.” In contrast to this, engaged (flow) students used many impersonal pronouns, suggesting that they provided more material-focused responses.
Confusion was best predicted by a two-parameter model that included a lower use of second person pronouns by the student coupled with an increase in future tense words by the tutor. It is quite plausible that confusion was increased when the tutor used more future terms, ostensibly since these terms routinely accompany deep knowledge questions (e.g., “What would you say about X?”, “How should X affect Y?”, “What should you consider if ?”). These questions require students to think, reason, and problem solve, so they are expected to generate confusion in the students.
6.2.2 Cohesion Predictors Significant models were discovered for boredom, , flow , confusion, , and frustration , . Similar to the linguistic analysis, the cohesion analysis indicated that boredom was accompanied by an increased use of negations by the student. For flow, the data suggest that causally cohesive responses of the student were predictive of this state. The ability of students to produce such responses indicates that they were able to construct a causal situation model (sometimes called a mental model) that links events and actions [ 11]. The construction of a situation model is essential for learning at deeper levels of comprehension [ 49]. Engaged students use this mental representation to produce causally cohesive responses to the tutor's questions.
Confusion was marked by a breakdown in understanding the pronouns expressed by the tutor. This predictor measures the proportion of pronouns that have a grounded reference. Reading ease and comprehension are compromised when the students do not understand the referents of pronouns. Hence, it is no surprise that tutor responses that have a higher proportion of ungrounded pronouns are linked to heightened confusion.
Frustrated students provided responses with cohesion gaps (noun overlap across adjacent sentences). This may have occurred because it is difficult for them to compose a cohesive message or because they were allocating most of their cognitive resources to managing their frustration.
6.2.3 Comparing the Predictive Power of Linguistic and Cohesion Features Although the same predictor (i.e., the incidence of negatives) was diagnostic of boredom in both models, the cohesion model explained somewhat more variance ( ) than the linguistic model ( ). A one-parameter model was the most effective in predicting flow for both models. The linguistic model yielded an impressive adj. R² of 0.527, which is quantitatively superior to the adj. R² of 0.318 obtained from the cohesion model. Hence, flow is best predicted by the linguistic features.
The best linguistic model for confusion was a two-parameter model that yielded an adj. R² of 0.347. A quantitatively lower adj. R² of 0.223 was obtained from the best cohesion model, which was a one-parameter model. Although it is tempting to conclude that confusion is best predicted by the linguistic features, this would be an erroneous conclusion since the number of predictors differs across models. Therefore, we conclude that both models are equivalent in their ability to predict confusion. The best cohesion model yielded an adj. R² of 0.258 for frustration, while the linguistic features were unable to predict this affective state.
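Assuming the "adj." values reported above are adjusted R² statistics, the way the number of predictors enters the statistic can be illustrated with the standard formula, adj. R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sample size and the raw R² in the example are hypothetical, since neither is reported in this section.

```python
def adjusted_r2(r2, n, p):
    """Standard adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: the same raw R-squared with one versus two predictors.
n_subjects = 21
print(adjusted_r2(0.5, n_subjects, 1))
print(adjusted_r2(0.5, n_subjects, 2))
```

With the same raw R², the two-predictor model receives the lower adjusted value, which shows why model size has to be kept in mind when comparing a one-parameter model against a two-parameter model.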
In summary, both models have their associated strengths and weaknesses that render them on par with each other. The linguistic models have a somewhat higher precision than the cohesion models (average adj. R² for linguistic and 0.248 for cohesion). However, they have lower recall because only three of the emotions could be detected for the linguistic models, while all four could be detected for the cohesion models.
6.5.1 Boredom The MLR analyses resulted in a significant model for boredom, , adj. R² of 0.363. It appears that self-reported boredom is signaled by tutor responses that lack discrepant terms (e.g., “should,” “would,” “could”) ( ) and are highly cohesive due to a high incidence of connectives ( ). The lack of discrepant terms in the tutor responses indicates that the tutor is directly asserting information to the student with well-crafted cohesive summaries of topics. These direct tutor moves that apparently elicit boredom can be contrasted with more indirect prompts and hints that are linked to confusion [ 7].
6.5.2 Confusion The MLR analyses resulted in a highly robust model for confusion, , with an adj. R² of 0.472. The predictors were students' responses that were lacking in connectives ( ) but with an increase in inhibitory terms ( ). It is informative to note that confused students provide responses that are rife with inhibitory terms such as “block,” “constrain,” and “stop.” Although confused students do not directly express their confusion, their responses inevitably convey their confusion via words that imply that they feel blocked, stopped, and constrained. Furthermore, the responses of these confused students were not sufficiently cohesive as they lack connectives to bind together substantive ideas.
6.5.3 Frustration The analyses revealed some interesting insights into self-reported episodes of frustration. Frustration was predicted by a two-parameter model , with an adj. R² of 0.414. It appears that frustrated students provide responses that are verbose ( ) and the tutor responds with words that lack certainty ( ), ostensibly because it cannot fully comprehend the students' responses.
6.5.4 Psychological, Linguistic, or Cohesion Features Although models that compared the effectiveness of each of the predictor types were not considered in this set of analyses, an examination of the coefficients of the models can provide some insights. Each affective state was predicted by one psychological feature and one cohesion feature. There was no linguistic feature that was sufficiently diagnostic of self-reported affective states. Let us refer to this set of features as the Self-Feature Set. The fact that psychological predictors play a critical role in the Self-Feature Set invalidates our earlier conclusions pertaining to the lack of diagnosticity of the psychological features. These psychological features were significant predictors in the Self-Feature Set but not the Observer Feature Set. This suggests that psychological cues are on the radar of the students themselves but are overlooked by the other judges. Unlike the linguistic features that were included in the Observer Feature Set but not the Self-Feature Set, the cohesion features were important predictors in both feature sets. Therefore, a deeper analysis of textual dialogues that the cohesion features provide yields the most generalizable models.
S.K. D'Mello is with the Departments of Computer Science and Psychology, University of Notre Dame, 384 Fitzpatrick, Notre Dame, IN 46556. E-mail: email@example.com.
A. Graesser is with the Department of Psychology and the Institute for Intelligent Systems, University of Memphis, 202 Psychology Building, Memphis, TN 38152. E-mail: firstname.lastname@example.org.
Manuscript received 23 Aug. 2011; revised 8 Dec. 2011; accepted 12 Apr. 2012; published online 18 Apr. 2012.
Digital Object Identifier no. 10.1109/TLT.2012.10.
Sidney K. D'Mello received the PhD degree in computer science from the University of Memphis in 2009. He is an assistant professor in the Departments of Computer Science and Psychology at the University of Notre Dame. His research interests include emotional processing, affective computing, artificial intelligence in education, human-computer interaction, speech recognition and natural language understanding, and computational models of human cognition. He has published more than 100 journal papers, book chapters, and conference proceedings in these areas. He has edited two books on affective computing. He is an associate editor for the IEEE Transactions on Affective Computing and serves as an advisory editor for the Journal of Educational Psychology.
Arthur Graesser received the PhD degree in psychology from the University of California at San Diego. He is a professor of psychology, adjunct professor of computer science, and codirector of the Institute for Intelligent Systems at the University of Memphis. His specific interests include knowledge representation, question asking and answering, tutoring, text comprehension, inference generation, conversation, reading, education, memory, expert systems, artificial intelligence, and human-computer interaction. Currently, he is the editor of the Journal of Educational Psychology. In addition to publishing more than 400 articles in journals, books, and conference proceedings, he has written two books and edited nine books.