In the News

Pages: pp. 5-8

AI Essay Graders Seek High Marks for Speed and Accuracy

Mark Ingebretsen

On the face of it, scoring student essays would seem to push AI capabilities to their limits. After all, students express themselves through their writing in vastly different ways. Furthermore, they might misunderstand the essay questions they've been asked to write about, or drift off the topic in the course of writing.

Even so, for decades now, researchers have known ways to automatically evaluate student writing. What's been lacking, explains Tom Landauer, executive vice president at the Knowledge Technologies group of Pearson, is the computer processing power to make grading practical. Processing power has now become a nonissue, of course. At the same time, Landauer says, "The algorithms used to analyze student writing have gotten more complex."

A one-time faculty member at Harvard, Dartmouth, Stanford, Princeton, and most recently the University of Colorado, Landauer is named on five patents associated with latent semantic analysis (LSA), a statistical technique used in natural language processing. LSA is just one of many techniques that researchers are employing to advance the state of automated essay grading.

Classroom companion

Essay-grading applications overall have proven accurate enough that thousands of school systems are using them. "Researchers were surprised to find how well they can do as compared to teachers at evaluating essays," says David Williamson with Educational Testing Service (ETS), the not-for-profit group best known for its work with the College Board on the SAT product line.

As a director of research for the Automated Scoring and Natural Language Processing Group, Williamson focuses on ETS's Criterion Online Writing Evaluation product. The Criterion service is basically an online interface that lets teachers create writing assignments. Teachers can either create essay topics themselves or choose from a standardized library. Students log onto Criterion via an Internet-connected computer, retrieve assignments, and access tools that help them plan, organize, and correct their writing.

Under the Criterion service's hood, the student assignments are pored over by the e-rater evaluation engine, whose development is spearheaded by Williamson's team at ETS. E-rater consists of "a large number of components, each charged with identifying a particular aspect of writing," he says. For example, one component might flag run-on sentences, while another checks sentence endings for correct punctuation.

The most basic components in e-rater are rule-based, Williamson says. These consist of lookup tables and other small applications similar to those in word processors. As part of the scoring process, e-rater also draws on corpuses of text that have been parsed by its algorithms. E-rater flags as anomalies those word combinations in a student's essay that appear unusual compared to those in the corpuses.

E-rater uses regression analysis and other statistical techniques to find patterns in these anomalies. The engine expresses these patterns as classes of errors. Those classes encompass common writing-evaluation areas such as grammar, usage, mechanics, style, and organization. Then, in what's basically e-rater's final step, Williamson explains, e-rater inserts the error classes as independent variables of a regression equation. It uses those variables to predict the score a teacher would give the essay.
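The final regression step can be pictured with a small sketch. It fits an ordinary least-squares equation that maps per-class error counts to teacher scores, then predicts the score for a new essay. The three error classes, the training counts, the scores, and every function name here are made up for illustration; the article doesn't describe e-rater's actual features or weights.

```python
from typing import List, Sequence

def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_least_squares(features: Sequence[Sequence[float]],
                      scores: Sequence[float]) -> List[float]:
    """Return [intercept, w1, w2, ...] from the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_score(weights: Sequence[float], error_counts: Sequence[float]) -> float:
    """Apply the fitted regression equation to one essay's error counts."""
    return weights[0] + sum(w * e for w, e in zip(weights[1:], error_counts))

# Hypothetical data: (grammar, mechanics, organization) error counts per essay
# and the score a teacher gave each one.
TRAINING_ERRORS = [(0, 0, 0), (2, 0, 0), (0, 4, 0), (0, 0, 1), (2, 4, 1), (4, 4, 2)]
TEACHER_SCORES = [6.0, 5.0, 5.0, 5.0, 3.0, 1.0]

weights = fit_least_squares(TRAINING_ERRORS, TEACHER_SCORES)
predicted = predict_score(weights, (2, 4, 0))  # a new, unscored essay
print(round(predicted, 2))
```

In this toy setup each error class simply lowers the predicted score by a fixed amount; a production engine would presumably use many more features and far more training essays.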

The inner meaning

E-rater evaluates a student's writing ability on the basis of grammar, usage, mechanics, style, and organization and development. The developers of Pearson's Intelligent Essay Assessor and related WriteToLearn essay- and summary-scoring programs claim their software can also measure students' actual knowledge of a subject. Like the Criterion service, WriteToLearn is a Web interface that students access in their classrooms. Teachers choose an essay topic from the application's library. The program presents the topic to students in the form of a question, such as "Should 18-year-olds be required to fulfill one year of national service?" Additionally, teachers can select a lesson such as Aztec history. In such instances, the students are asked to write a summary of the lesson.

Behind WriteToLearn is the Knowledge Analysis Technologies (KAT) engine, which is based on LSA applications that Landauer developed. According to Landauer, these LSA applications are all based on a simple rule: "The meaning of a passage of text is the sum combination of the meaning of the words of which it's composed."

KAT requires considerable, although entirely automatic, training before it can accurately determine a passage's meaning. Suppose, for example, that it must assess students' knowledge of health science. As a first step, developers input a corpus that encompasses what students might have read on the subject by their first year in college. The inputted text could total 100,000 or more paragraphs. Then, the program flags words and gives them a value based on their relationship to other words in a virtual multidimensional vector space.

The result is that KAT can deduce that two sentences mean essentially the same thing, even when they're worded differently. For instance, as Landauer explains, if one student wrote that "heart surgery is no longer dangerous" and another student wrote "cardiac operations aren't hazardous anymore," KAT would determine that the combination of words in each of the two sentences had roughly the same meaning.
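Landauer's example can be sketched in a few lines, assuming each word already has a vector. Here the vectors are hand-assigned and four-dimensional purely for illustration; in LSA they would be derived statistically from the training corpus and would have many more dimensions. A passage's vector is the sum of its word vectors, and two passages are compared by the cosine of the angle between them. All names and numbers below are hypothetical.

```python
import math
import re
from typing import Dict, List, Sequence, Tuple

# Hypothetical word vectors; real LSA vectors come from corpus statistics.
WORD_VECTORS: Dict[str, Tuple[float, ...]] = {
    "heart": (0.9, 0.1, 0.0, 0.0),
    "cardiac": (0.8, 0.2, 0.0, 0.0),
    "surgery": (0.1, 0.9, 0.1, 0.0),
    "operations": (0.0, 0.8, 0.1, 0.1),
    "dangerous": (0.0, 0.1, 0.9, 0.0),
    "hazardous": (0.0, 0.0, 0.8, 0.1),
    "aztec": (0.0, 0.0, 0.0, 0.9),
    "empire": (0.0, 0.0, 0.1, 0.8),
    "temples": (0.0, 0.1, 0.0, 0.7),
}
DIMENSIONS = 4

def passage_vector(text: str) -> List[float]:
    """Sum the vectors of all known words; unknown words are ignored."""
    total = [0.0] * DIMENSIONS
    for word in re.findall(r"[a-z']+", text.lower()):
        for i, value in enumerate(WORD_VECTORS.get(word, ())):
            total[i] += value
    return total

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

first = passage_vector("Heart surgery is no longer dangerous")
second = passage_vector("Cardiac operations aren't hazardous anymore")
unrelated = passage_vector("The Aztec empire built large temples")

similar_pair = cosine(first, second)       # high: same meaning, different words
different_pair = cosine(first, unrelated)  # low: different subject
print(round(similar_pair, 2), round(different_pair, 2))
```

The two medical sentences share no content words, yet their summed vectors point in nearly the same direction, which is the effect the example describes.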

KAT's training doesn't end there. Some 200 experts in the field each score a sample set of essays. When KAT looks at a new student essay, it searches its database for the set of human-expert-scored essays that most closely resemble the student essay. From that, it predicts what score the experts would have given the student essay.
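A rough sketch of that lookup step, assuming each essay has already been reduced to a vector: find the expert-scored essays most similar to the new one and average their scores, weighting closer matches more heavily. The two-dimensional vectors, the scores, the choice of two neighbors, and the function names are all illustrative assumptions; the article doesn't say how KAT combines the matched essays.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_from_neighbors(essay: Sequence[float],
                           scored: List[Tuple[Sequence[float], float]],
                           k: int = 2) -> float:
    """Similarity-weighted mean score of the k most similar scored essays."""
    ranked = sorted(((cosine(essay, vec), score) for vec, score in scored),
                    reverse=True)[:k]
    total_weight = sum(sim for sim, _ in ranked)
    if total_weight <= 0:  # no usable similarity: fall back to a plain mean
        return sum(score for _, score in ranked) / len(ranked)
    return sum(sim * score for sim, score in ranked) / total_weight

# Hypothetical essay vectors paired with expert scores.
EXPERT_SCORED = [
    ((1.0, 0.0), 5.0),
    ((0.9, 0.1), 4.0),
    ((0.0, 1.0), 1.0),
]

new_essay = (1.0, 0.05)
predicted = predict_from_neighbors(new_essay, EXPERT_SCORED)
print(round(predicted, 2))
```

The new essay sits close to the two high-scoring examples and far from the low-scoring one, so its predicted score lands between 4 and 5.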

Besides LSA, KAT draws on other statistical techniques to determine "how well the words follow each other, whether the best words have been used to express what's needed, and how well one sentence follows the next one," says Landauer.

According to Landauer, LSA, if properly trained, could score essays in any language. Landauer and his colleagues have investigated using LSA with Arabic, Chinese, Hindi, and Swahili. Chinese essays might seem particularly daunting. However, Landauer says that LSA considers a passage as consisting of parts and doesn't care whether those parts are Chinese characters or words expressed in a phonetic alphabet. "As long as the parts add up," Landauer says, LSA can accurately score the essay.

A bundle of nerves

Researchers are also using neural networks to score essays. Sargur Srihari, a University at Buffalo computer science professor and founder of the university's Center of Excellence for Document Analysis and Recognition (CEDAR), first used neural networks to decipher handwriting. He was the principal investigator of the team that developed the US Postal Service's first handwriting-recognition software, a project that consumed more than 10 years.

"Grading handwritten essays seemed the next logical step," he says, particularly because handwritten essays are among the most difficult and time-consuming assignments for teachers to grade. The time lag in grading them can impair a student's ability to learn. Neural networks embodying handwriting analysis offer the chance to give students nearly instantaneous feedback.

Another layer of Srihari's application uses a neural network to determine the meaning of a passage of text. In Srihari's network, which he calls a first effort, 150 human-graded essays on a single subject were input. Another 150 essays on the same subject, which hadn't been human graded, were used to test whether the net could yield similar results on its own.

The research, Srihari says, won't result in a product students can use any time soon. Unlike LSA and other statistical techniques, neural networks don't reveal how they arrive at a conclusion. However, Srihari's preliminary research shows that neural networks can produce essay scores that correlate more closely to those of human scorers than LSA-derived scores do (for more information, see "Automatic Scoring of Short Handwritten Essays in Reading Comprehension Tests," Artificial Intelligence, vol. 172, nos. 2–3, 2008, pp. 300–324).

Regardless of the AI methodology teachers use, they appear to give high marks to Criterion and WriteToLearn. Eric Henry, an English teacher at Scott High School in Huntsville, Tennessee, notes that reading scores measured by the state rose from 3.6 percent to 4.2 percent between 2003 and 2004 after the Criterion service was introduced. The instant feedback given by the product helped students better assimilate writing skills, and his high-school-aged students preferred typing the essays to writing them by hand, he says. Mae Guerra, a fourth-grade teacher at Valverde Elementary School in Denver, while admitting her younger pupils might lack typing skills, says her class is motivated by the fact that WriteToLearn resembles the video games they enjoy at home. She adds that the time she saves hand-grading essays could now be used to help slower learners in her class.

Notwithstanding that praise, AI essay-grading developers will continue tweaking their products. Landauer, for example, foresees applications able to accurately score essays on the basis of data sets no larger than the number of students in a typical class, as opposed to results that require processing an entire school system's worth of essays. Similarly, he sees improvements in essay-grading applications' ability to score short-answer quizzes, where the application has only a small number of words to analyze.

A long-term goal, says Srihari, would be for computers to be able to respond to text they've received in the same way students are expected to. "One of the grand challenges in AI is to have a computer read a chapter in a physics textbook and then answer questions at the end of the chapter," he says. Seems only fair.

Machine Learning Takes On the Brain

Keri Schreiner

In Apprentices of Wonder: Inside the Neural Network Revolution (Bantam Books, 1989), William F. Allman called the human brain "a monstrous, beautiful mess" and noted that its billions of neurons "lie in a tangled web that displays cognitive powers far exceeding any of the silicon machines we have built to mimic it." In an effort that's helping to make sense of this neurological morass and illuminate a path forward in AI, Carnegie Mellon University researchers are using brain imaging and machine learning technologies to study human information processing.

Aided by a recently announced US$1.1 million W.M. Keck Foundation grant, Tom Mitchell, a computer science professor at CMU and head of the university's Machine Learning Department, and Marcel Just, a cognitive-neuroscience professor and the director of CMU's Center for Cognitive Brain Imaging, aim to develop a theory for how humans neurologically represent words in English. Their research is already producing impressive results and could one day offer invaluable insights into how the brain activity of people with conditions such as autism or depression deviates from typical processing patterns. In exploring this territory, Mitchell, Just, and their CMU team are hoping to not only better map brain activity but also guide other researchers seeking subtle patterns in massive data sets.

Project overview

Early efforts to understand how humans process information were limited because researchers were unable to track a single thought's signals, which occur in several brain regions at once. Over the past decade, the development of new technologies, including functional magnetic resonance imaging (fMRI) and various machine learning approaches, has made it possible for researchers to begin identifying these multivariate patterns of voxels (volume elements) and their characteristic activation levels.

In one study, the CMU team used fMRI to analyze participants' brain activity as they viewed words or line drawings of concrete nouns from two categories: tools and dwellings. The team trained machine learning classifiers to analyze the brain activation patterns in the resulting images (see the figure). As described in a recent paper, the classifiers successfully identified the object and object category participants were viewing. The CMU team also discovered a common neural pattern across participants, a finding that "hugely surprised" Mitchell.
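To make the idea concrete, here is a toy sketch of a classifier that reads a category from an activation pattern, trained on one participant and tested on another. It uses a nearest-centroid rule, chosen only because it's short; the article doesn't say which classifiers the CMU team used. The three-voxel patterns are invented numbers, and real scans contain thousands of voxels.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Scan = Tuple[Sequence[float], str]  # (voxel activation pattern, category label)

def train_centroids(scans: Sequence[Scan]) -> Dict[str, List[float]]:
    """Average the activation patterns seen for each category."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for pattern, label in scans:
        grouped[label].append(pattern)
    return {label: [sum(col) / len(patterns) for col in zip(*patterns)]
            for label, patterns in grouped.items()}

def classify(centroids: Dict[str, List[float]], pattern: Sequence[float]) -> str:
    """Return the category whose average pattern is closest to this scan."""
    def squared_distance(center: Sequence[float]) -> float:
        return sum((p - c) ** 2 for p, c in zip(pattern, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical scans: participant A supplies training data, B is held out.
PARTICIPANT_A: List[Scan] = [
    ((0.9, 0.1, 0.2), "tool"),
    ((0.8, 0.2, 0.1), "tool"),
    ((0.1, 0.9, 0.8), "dwelling"),
    ((0.2, 0.8, 0.9), "dwelling"),
]
PARTICIPANT_B: List[Scan] = [
    ((0.7, 0.3, 0.2), "tool"),
    ((0.3, 0.7, 0.9), "dwelling"),
]

centroids = train_centroids(PARTICIPANT_A)
correct = sum(classify(centroids, pattern) == label
              for pattern, label in PARTICIPANT_B)
accuracy = correct / len(PARTICIPANT_B)
print(accuracy)
```

The held-out participant's scans are classified correctly only because the invented patterns for each category resemble each other across the two people, which is the kind of shared structure the team reports finding.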

Figure. Brain regions where neural activation, as measured by functional magnetic resonance imaging, encodes word meanings for three different people. As the images show, multiple brain regions are involved, yet the most predictive regions (red and yellow voxels) are in similar locations in all three individuals. (Figure courtesy of Carnegie Mellon University's Brain Image Analysis Research Group)

"Anyone interested in the question of how brains work has to address the question of whether brains of different people work differently," says Mitchell. "In fact, we've found that we can train a classifier on your brain and use it on my brain. There's a striking similarity in how we process information, which means there's a possibility of developing one theory of how people process information."

In their latest studies, the CMU researchers are investigating abstract concepts such as "democracy" and "love." To do this, they'll use the same process as in their earlier studies, showing participants the word representing an abstract concept and asking them to think about it. They'll then analyze the resulting images using the machine learning techniques honed in the project's first phase. They also plan to experiment with presenting adjectives with various concrete nouns, such as "hungry rabbit" or "fast rabbit," to understand how we process such modifiers. One question, says Mitchell, is whether their studies will simply show "a summation of the two concepts (fast and rabbit) or whether something else will be going on." Mitchell says he expects to find "subtle things, such as velocity adjectives causing subtle patterns of their own."

Implications for AI research

As Mitchell notes, applying machine learning to brain imaging offers a general case study in at least three key areas. First, machine learning traditionally uses a pool of examples that's far larger than the set of features being studied. However, with large, complex data sets—such as those obtained through fMRIs—that goal is practically impossible. One way the CMU team is addressing this is to present each word or picture to study participants more than once to check repeatability and locate standard (versus random) variations.

Second, like many other phenomena, individual brains are both similar and unique. The CMU project therefore uses both pooled data (from many different brains) and specific data (from an individual brain). To train the classifier on the two different data types, the team is exploring hierarchical Bayesian methods.
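The basic intuition behind combining the two data types can be shown with the simplest building block of such methods, a normal-normal model. An individual's estimate is pulled toward the pooled mean, and the pull weakens as more data from that individual accumulates. This is a generic textbook sketch, not the CMU team's model, which the article doesn't describe; the variances and observations are invented.

```python
from typing import Sequence

def shrunken_estimate(individual_obs: Sequence[float],
                      pooled_mean: float,
                      between_var: float,
                      within_var: float) -> float:
    """Posterior mean of one person's true value under a normal-normal model.

    pooled_mean and between_var describe how true values vary across people;
    within_var is the measurement noise in a single observation.
    """
    n = len(individual_obs)
    if n == 0:
        return pooled_mean  # no individual data: rely on the pool alone
    individual_mean = sum(individual_obs) / n
    data_precision = n / within_var
    prior_precision = 1.0 / between_var
    return ((data_precision * individual_mean + prior_precision * pooled_mean)
            / (data_precision + prior_precision))

# Hypothetical activation measurements for one voxel in one participant.
few_scans = [2.0, 2.0]
many_scans = [2.0] * 200

with_few = shrunken_estimate(few_scans, pooled_mean=0.0,
                             between_var=1.0, within_var=2.0)
with_many = shrunken_estimate(many_scans, pooled_mean=0.0,
                              between_var=1.0, within_var=2.0)
print(with_few, round(with_many, 3))
```

With two scans the estimate sits halfway between the pooled value and the individual's own average; with 200 scans it is almost entirely the individual's.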

The third issue is how to address any hidden mental processes that might occur during experiments. To mitigate the impact of such processes, the team is investigating timing issues—that is, when and how these covert processes might occur. For example, researchers might first show participants the sentence, "There's a square above a triangle." Next, they would show them pictures of various geometric shapes, one of which would be of a square above a triangle. Given this, the team could then estimate the timing on three distinct processes: when participants comprehended the sentence, when they comprehended the picture, and when they decided whether the two match. The first two processes activate immediately, but the timing of the third is uncertain. So, the team is working to develop an algorithm—similar to a hidden Markov model—that will account for timing differences.

At a higher level, the CMU study and other efforts like it, including the UC Berkeley Gallant Lab's project using quantitative receptive-field models to characterize the relationship between visual stimuli and fMRI activity, are offering new possibilities in our quest to understand and replicate human intelligence. When Mitchell started his career, he briefly considered studying psychology to gain insights into intelligence, "but I looked into it and I thought it was just a loser because they didn't even have an oscilloscope; it was all behavioral, all observation," he says. He then turned to AI, considering it the best way to study intelligence throughout the '80s and '90s.

"But now times have changed—now we have that oscilloscope—and we're starting to see how the brain is organized," says Mitchell. "Ten years ago, I would have said that the best bet for studying intelligence was to build an autonomous robot. Now, I say that some people should do that, some should study the brain, and a lot of people—which we don't have now—should do the cross-fertilization between the two."
