^{33}] with many companies

^{1} providing such services. Some teachers make extensive use of practice tests and released test items to help identify learning deficits for students. However, such tests not only require great effort and dedication, but they also take valuable time away from instruction. The limited classroom time available in middle school mathematics classes compels teachers to choose between time spent assisting students' development and time spent assessing students' abilities. A solution must involve a way for students to take an assessment and learn simultaneously. Yet, traditionally, the two areas of testing (i.e., psychometrics) and instruction (i.e., math educational research and instructional technology research) have been separate fields of research with their own goals. Statisticians have not done a great deal of work to enable assessment of students while they are learning.

^{2}To help resolve this dilemma, the US Department of Education funded Heffernan and Koedinger to build a Web-based tutoring system ("the ASSISTment System"

^{3}) that would also do assessment at the same time.

*original question*and a list of

*scaffolding questions*. The original question usually has the same text as found in the MCAS test, while the scaffolding questions were created by our content experts, who broke the original question down into its individual steps. A student is initially presented with a question that usually requires several skills to solve correctly. If the student gets the question correct, he gets credit for all the associated skills and moves on to the next question; otherwise, he is forced to go through a sequence of scaffolding questions (or scaffolds). Students work through the scaffolding questions, possibly with hints and buggy messages, until they eventually get the problem solved. An ASSISTment question that was built for Item 19 of the 2003 MCAS is shown in Fig. 1. We see that the student typed "23," a wrong answer, for the original question that involved understanding algebra, perimeter, and congruence. Once the student gets the first scaffolding question correct (by typing "

*AC*"), the second scaffolding question appears, focusing on the concept of perimeter. After he gets this question right, he is given a question on equation-solving. Buggy messages show up if the student types in a wrong answer. So, if a student got the original question wrong, what skills have they not mastered? A fine-grained skill model helps determine which of the skills needed to solve this problem the student has not mastered.

^{17}], [

^{22}] showed that more accurate assessment can be achieved by using not only the overall correctness of student answers but also interaction data, such as response speed and help-seeking behavior, on the effort required for students to solve a test item with instructional assistance. It has been shown that students learn from working in the ASSISTment System [

^{21}], [

^{39}], [

^{40}]. Additionally, randomized controlled experiments have been conducted to determine the effectiveness of different types of interventions [

^{39}], [

^{40}].

**Issues for practitioners.** Most large standardized tests are "unidimensional" in that they are analyzed as if all the questions are tapping a single underlying skill. However, cognitive scientists such as Anderson and Lebiere [

^{3}] believe that students are learning individual skills. Among the reasons that psychometricians analyze large-scale tests in a unidimensional manner is that students' performance on different skills is usually highly correlated, even if there is no necessary prerequisite relationship between these skills. Another reason is that students usually do a small number of items in a given setting (for instance, 39 items for the eighth grade math MCAS test), which makes it hard to acquire identifiability for each single skill, especially when the number of skills that need to be mastered is larger than the number of items in the test. Such tests work well at telling you which students are performing well but are not good at

*informing educators* about which skills are causing difficulty and how to help students.

*they* need to fill. And that is the problem!" Another teacher followed up with "It does affect reports because then the state sends reports that say that your kids got this problem wrong so they are bad in geometry, and you have no idea, well you do not know what it really is, whether it is algebra, measurement/perimeter, or geometry." Thus, a teacher cannot trust that putting more effort into a particular low-scoring area will indeed pay off in the next round of testing. It was reported that instead of having performance reports that break math knowledge into only a few components, teachers want more fine-grained diagnostic reports to accommodate their everyday classroom practice. These reports are referred to as "assessment for learning" (e.g., [

^{29}], [

^{45}]).

**Needs of intelligent tutoring systems**. One key component of creating an intelligent tutoring system (ITS) is forming the model that monitors student behavior. An ITS requires complex models to represent the skills that students are using and their knowledge states. As students work through the program, the model tracks their progress and chooses what problems will be displayed next. By using a better skill model, a system should be able to do a better job of predicting, in real time, which items students will get correct. That means that the system can do a better job of selecting the next best item for students to work on. For instance, one criterion for the next "best" item could be the one that has the largest ratio of expected test score gain to expected time to complete the problem. Expected test score gain is a function that depends upon both the expected rise in skills from doing that item at that time and the weight of those skills on the test (i.e., the MCAS). A better model would also help in addressing the issues that we mentioned above by helping teachers adjust their instruction in a data-driven manner. Such a model will allow a teacher who has one week before the MCAS to know what topics to review to maximize the class average. We can make a calculation averaging over the whole class to suggest what will give the teacher the biggest "bang for the buck." An example of a useful report [

^{19}] that teachers can get using the ASSISTment system is shown in Fig. 3. Teachers can see how their students are doing on each skill and can determine where they need to spend the most time.
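The "next best item" criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `CandidateItem` class, the field names, and the numeric values are hypothetical and do not come from the ASSISTment System; the expected gain is taken to be the sum, over an item's skills, of the expected rise in each skill times that skill's weight on the test.

```python
from dataclasses import dataclass


@dataclass
class CandidateItem:
    name: str
    expected_gain: float      # hypothetical: expected test-score gain from doing the item
    expected_minutes: float   # hypothetical: expected time to complete the item


def expected_gain(skill_rise, skill_weight):
    """Sum, over the item's skills, of expected skill rise times the skill's test weight."""
    return sum(rise * skill_weight.get(skill, 0.0) for skill, rise in skill_rise.items())


def next_best_item(items):
    """Pick the item with the largest ratio of expected gain to expected time."""
    usable = [item for item in items if item.expected_minutes > 0]
    if not usable:
        raise ValueError("no candidate item has a positive expected time")
    return max(usable, key=lambda item: item.expected_gain / item.expected_minutes)


# Illustrative values only.
items = [
    CandidateItem("item_a", expected_gain=0.3, expected_minutes=2.0),
    CandidateItem("item_b", expected_gain=0.5, expected_minutes=5.0),
]
best = next_best_item(items)
```

Here `item_a` is preferred even though `item_b` has the larger absolute gain, because its gain per minute is higher.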

^{24}] proposed two directions for future research of cognitive assessment, of which one is to increase understanding of how to specify an appropriate grain size or level of analysis with a cognitive diagnostic assessment [

^{32}]. In this paper, we consider four skill models with different granularity, including a unidimensional model and a fine-grained model developed at WPI with 78 skills. The four models are structured with an increasing degree of specificity as the number of skills increases. The measure of model performance is the accuracy of the predicted MCAS test scores based on the assessed skills.

^{12}], [

^{15}]). Corbett and his colleagues employed a very detailed model of skills, but their system did not have questions tagged with more than one production rule [

^{2}]. Our collaborators [

^{5}] were engaged in trying to allow multimapping

^{4} using a version of the fine-grained model but reported that their Linear Logistic Test Model (LLTM) did not fit well. Unlike our approach, the model they applied does not track student performance over time. Almond et al. [

^{1}]

*examine the application of Bayesian networks to Item Response Theory-based cognitive diagnostic modeling.* Bayesian networks have also been used to investigate the results of skill hierarchies using real-world data in intelligent tutoring systems (e.g., [

^{23}]) and simulated users (e.g., [

^{10}], [

^{14}]). Others (e.g., [

^{8}]), in the psychometrics field, have developed multidimensional Item Response Theory (IRT) models but these models do not allow multimapping.

^{6}] and psychometricians [

^{43}], Croteau et al. (2004) called it a "transfer model," while Cen et al. [

^{9}] and Gierl et al. [

^{24}] used the term "cognitive model." In all cases, a skill model is a matrix that relates questions to the skills needed to solve them. Such a model provides an interpretative framework to guide test development and psychometric analyses, so that test performance can be linked to specific cognitive inferences about the examinees. Researchers in the machine learning area have been using automatic/semiautomatic techniques to search for skill models, including the rule space method [

^{43}], the Q-matrix method [

^{6}], and Learning Factors Analysis (Cen et al. [

^{9}]). Though it addresses the same problem, our work is different in that we hand-coded the skill models and built the connection between skills and questions. This is similar to what Ferguson et al. [

^{23}] did in their work as they also associated problems with skills by hand, but they employed a different methodology.

^{46}] who developed an alternative curriculum framework. Their confirmatory factor analysis showed that the alternative framework fits the data better, suggesting that the state's learning standards are subject to improvement.

^{30}] described six steps in model-based reasoning in science. These steps, including model formation, elaboration, use, evaluation, revisions, and model-based inquiry, provide a framework for considering our progress in developing and refining cognitive models. Following these steps, the rest of the paper is organized as follows: In Section 2, we describe how the fine-grained model was developed and how it is currently being used in the ASSISTment System. In Section 3, we evaluate the models by answering two research questions. Finally, we conclude in Section 4 and bring up the issue of model refinement and model-based inquiry as part of our future work.

1. "Patterns, Relations, and Algebra."

2. "Geometry."

3. "Data Analysis, Statistics, and Probability."

4. "Number Sense and Operations."

5. "Measurement."

*equation-solving* is associated with problems involving setting up an equation and solving it, while

*equation-concept* is related to problems involving equations that students do not actually have to solve. In the second column, we see how the two skills in WPI-78 are nested inside of "Patterns, Relations, and Algebra," which itself is one of the five skills that comprise the WPI-5 skill model.

Table 1. Hierarchical Relationship among Skill Models

^{26}] provides technology support for authors to tag skills for the ASSISTment System questions they build. This tool, shown in Fig. 4, provides a means to link certain skills to problems and to specify that solving the problem requires knowledge of that skill. The skills are organized in a hierarchical structure. The authors are allowed to browse the skills within each model and map the ones they select to a problem.

^{5}students, who used our system from 17 September 2004 to 16 May 2005 for, on average, 7.3 days (one period per day).

^{6}All these students worked on the system for at least six days (one session per day). We excluded data from the students' first day of using the system because they were learning how to use the system at that time. The item-level state test report was available for all these 447 students so that we were able to construct our predictive models on these students' data and evaluate the accuracy on state test score prediction. The original data set, corresponding to students' raw performance (before applying any "credit-and-blame" strategies as described below and not inflated due to the encoding used for different skill models), contained about 138,000 data points, among which around 43,000 come from original questions. On average, each student answered 87 MCAS (original) questions. We will refer to this data set as DATA-2005.

*fixed effects*, parameters corresponding to an entire population or repeatable levels of factors, and

*random effects*, parameters corresponding to individual subjects drawn randomly from a population. For dichotomous (binary in our case) response data, several approaches have been developed. These approaches use either a logistic regression model or a probit regression model and various methods for incorporating and estimating the influence of the random effects on individuals. Since we want to track each individual student's development of skills over time and make predictions, we chose a mixed-effects logistic regression model because it takes into account the fact that the responses of one student to multiple items are correlated; moreover, the random effects allow the model to learn parameters for individual students separately. Hedeker and Gibbons [

^{25}] described mixed-effects models for binary data that accommodate multiple random effects. As these sources indicate, the mixed-effects logistic regression model is a very popular and widely accepted choice for analysis of dichotomous data.

*logit*of the probabilities, namely:

*logit* is called the link function because it maps the (0, 1) range of probabilities onto the (−∞, +∞) range of linear predictors. As a result, the logistic regression model is linear in terms of the logit, though not in terms of the probabilities.
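For reference, the logit link and its inverse take the following standard form, where $p$ denotes the probability of a correct response and $\eta$ the linear predictor (both symbol names are chosen here for illustration):

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \eta,
\qquad
p = \frac{1}{1 + e^{-\eta}}
```

As $p$ approaches 0 or 1, $\eta$ goes to $-\infty$ or $+\infty$, respectively, which is the mapping described above.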

*skill*can be introduced as a factor in the model in a similar way). The two-level representation of the model in terms of

*logit*can be written as:

^{41}] since TIME is introduced as a predictor of the response variable, which allows us to investigate change over time. The models were fitted in R [ ^{36}] using the *lmer()* function in the *lme4* package [ ^{7}], with "logit" as the link function. In this model, we introduced skills as a fixed-effect factor and TIME (*monthElapsed*) as both a fixed effect and a random effect in order to learn both the average learning rate per month for the whole group of students and the variation across individual students. We also included the interaction between *skills* and *monthElapsed*, which told the model to learn students' average learning rate separately for each skill. Notice that we did not include skills as a random effect, which meant the model assumed that a student's learning rate did not vary across skills.

^{7}

*monthElapsed*covariate, four coefficients for the

*skills*, one for each skill in the WPI-5 model, and four coefficients for the interaction term, and the random effects for each student (i.e., ), including an intercept indicating a student's incoming knowledge and a slope (coefficient for

*monthElapsed*as a random effect) indicating the student's overall learning rate per month, were extracted. Then, the two learning parameters "intercept" and "slope" (i.e., and in the model above) were calculated for each individual student and each skill. Given this, we can apply the model on the items in the state test to estimate students' response to each of them.
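The way the extracted coefficients combine into per-student, per-skill learning parameters can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary layout, the key names, and the numeric values are hypothetical, and it assumes the usual treatment coding in which the reference skill has no coefficient of its own (treated as 0 here).

```python
import math


def inv_logit(x):
    """Numerically stable inverse of the logit link."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def student_skill_parameters(fixed, random_effects, skill):
    """Combine fixed effects and one student's random effects for one skill.

    intercept = baseline + skill coefficient + student's random intercept
    slope     = monthElapsed coefficient + skill-by-month interaction
                + student's random slope
    """
    intercept = (fixed["intercept"]
                 + fixed["skill"].get(skill, 0.0)          # 0.0 for the reference skill
                 + random_effects["intercept"])
    slope = (fixed["month"]
             + fixed["skill_by_month"].get(skill, 0.0)     # 0.0 for the reference skill
             + random_effects["month"])
    return intercept, slope


def probability_correct(intercept, slope, months_elapsed):
    """Probability of a correct response at a given time, on the probability scale."""
    return inv_logit(intercept + slope * months_elapsed)


# Illustrative values only.
fixed = {
    "intercept": 0.0,
    "month": 0.1,
    "skill": {"Geometry": -0.5},
    "skill_by_month": {"Geometry": 0.05},
}
student = {"intercept": 0.5, "month": -0.05}
intercept, slope = student_skill_parameters(fixed, student, "Geometry")
```

With these numbers the student starts at a logit of 0 on Geometry (probability 0.5) and gains 0.1 logits per month.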

^{8}To predict a student's test score when a particular skill model is adopted, we will first find the fractional score the student can get on each individual item, and then, sum the "item-score" up to acquire a total score for the test. So, how did we predict their state test item score?

^{9} for that student (the hardest skill for the student). Thus, we obtained the probability of a positive response to any particular item in the state test. In our approach, a student's probability of a correct response for an item was used directly as the fractional score to be awarded on that item for the student. We summed the item scores to produce the total points awarded on the test. For example, if the probability of a correct response on an item marked with Geometry is 0.6, then 0.6 points were added to the sum. The sum of these points was what we used as the predicted state test score.
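The scoring procedure just described can be sketched as below. It assumes, based on the phrase "the hardest skill for the student," that an item tagged with several skills is scored using the skill on which the student has the lowest predicted probability; the function names and example values are hypothetical.

```python
def item_probability(skill_probs, item_skills):
    """Fractional score for one item: the student's weakest tagged skill decides."""
    return min(skill_probs[skill] for skill in item_skills)


def predicted_test_score(skill_probs, test_items):
    """Sum the fractional item scores over all items on the test."""
    return sum(item_probability(skill_probs, skills) for skills in test_items)


# Illustrative per-skill probabilities for one student.
skill_probs = {"Geometry": 0.6, "Algebra": 0.8, "Measurement": 0.4}
test_items = [
    ["Geometry"],                 # contributes 0.6
    ["Algebra", "Measurement"],   # contributes 0.4 (weakest skill)
    ["Algebra"],                  # contributes 0.8
]
score = predicted_test_score(skill_probs, test_items)
```

For this three-item example the predicted score is the sum of 0.6, 0.4, and 0.8.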

^{10}

^{27}]:

*MCAS_i* is the actual MCAS score of the *i*th student, and

*prediction_i* is the predicted score from the prediction function being evaluated. For every model, we subtracted each student's real test score from his predicted score, took the absolute value of the difference, and averaged these values to get the MAD. We also calculated a normalized metric named % Error by dividing the MAD by the full score:

*MaxRawScore* is the maximum raw score possible with the MCAS questions used. The MCAS state test consists of five open-response, four short-answer, and 30 multiple-choice questions. The maximum score is 54 points if all 39 MCAS questions are considered, since some are scored right/wrong and some are scored with partial credit. In our case, only the multiple-choice and short-answer questions are used, because open-response questions are not currently supported in our system. This makes a full score of 34 points, with one point earned for a correct response on an item. For the students in our 2005 data set, the mean score out of 34 points was 17.9 (standard deviation ). For the students in the 2006 data set, the mean score was 18.8 (standard deviation ).
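The two evaluation metrics follow directly from the verbal description: MAD is the mean over students of |prediction_i − MCAS_i|, and % Error is MAD divided by MaxRawScore, expressed as a percentage. A minimal sketch, with hypothetical function names and made-up scores:

```python
def mean_absolute_difference(actual, predicted):
    """MAD: average absolute gap between predicted and actual test scores."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def percent_error(actual, predicted, max_raw_score=34):
    """% Error: MAD normalized by the full score, as a percentage."""
    return 100.0 * mean_absolute_difference(actual, predicted) / max_raw_score


# Illustrative scores for three students (out of 34 points).
actual = [20, 15, 30]
predicted = [18.5, 17.0, 29.0]
mad = mean_absolute_difference(actual, predicted)
err = percent_error(actual, predicted)
```

The default of 34 matches the full score used in this paper; a different denominator can be passed when some test items are excluded.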

**Research Question 1**(

**RQ1**). Would adding response data to scaffolding questions help us to do a better job of tracking students' knowledge and more accurately predicting state test scores, compared to only using the original questions? Because the scaffolding questions break the test question down into a series of simpler tasks that directly assess fewer knowledge components, we believe that the ASSISTment System can assess students more accurately. This hierarchical breakdown of knowledge provides a much finer grained analysis than is currently available. We think that getting an answer to RQ1 would help us properly evaluate the second and more important research question described in Section 3.5.

**3.4.1 Scaffolding Credit and Partial Blame** We started our work by examining only students' responses to original questions. We then raised RQ1, asking ourselves whether we could improve our models by including students' responses to the scaffolding questions. As discussed in Section 1, adding in scaffolding responses creates a good chance for us to detect exactly which skills are the real obstacles that prevent students from correctly answering the original questions. This would be especially useful when we utilize a finer grained model.

Since the scaffolding questions show up only if the students answer the original question incorrectly, their responses to the scaffolding questions are explicitly logged. However, if a student gets an original question correct, he is only credited for that one question in the raw data. To deal with this "selection effect," we introduced the compensation strategy of "scaffolding-credit": Scaffolding questions are also marked correct if the student gets the original question correct.
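The "scaffolding-credit" rule can be sketched as a small preprocessing step. The function name and the dictionary-based data layout are hypothetical; only the rule itself (mark every scaffold correct when the original question was answered correctly) comes from the text.

```python
def apply_scaffolding_credit(original_correct, scaffold_responses, scaffold_ids):
    """Return the correctness to record for each scaffolding question of one problem.

    original_correct   -- whether the student answered the original question correctly
    scaffold_responses -- logged scaffold answers, {scaffold_id: bool}
    scaffold_ids       -- all scaffolding questions attached to the problem
    """
    if original_correct:
        # The scaffolds were never shown, so credit all of them.
        return {sid: True for sid in scaffold_ids}
    # Otherwise keep the responses that were actually logged.
    return {sid: scaffold_responses[sid] for sid in scaffold_ids if sid in scaffold_responses}


credited = apply_scaffolding_credit(True, {}, ["s1", "s2"])
logged = apply_scaffolding_credit(False, {"s1": True, "s2": False}, ["s1", "s2", "s3"])
```

In the second call, `s3` is omitted because the student never reached it, so no response is invented for it.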

An important thing we need to determine when using a multimapping model (in which one item is allowed to be tagged with more than one skill) is which skills to blame when a student answered an item tagged with multiple skills incorrectly. Intuitively, the tutor may want to blame all the skills involved; however, this would be unfair to those relatively easy skills when they are tagged to some compound, hard items. To avoid this problem, we applied the "partial blame" strategy: If a student got such an item wrong, the skills in that item will be sorted according to the overall performance of that student on those skills and only the skill on which that particular student showed the worst performance will be blamed.
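A minimal sketch of the "partial blame" strategy follows. It assumes that a student's "overall performance" on a skill is summarized as a proportion correct and that every tagged skill already has such a summary; the function name and the example values are hypothetical.

```python
def encode_response(item_skills, correct, performance):
    """Turn one response to a multi-skill item into per-skill (skill, correct) rows.

    A correct answer credits every tagged skill. An incorrect answer blames only
    the tagged skill on which this student has performed worst so far.
    """
    if correct:
        return [(skill, True) for skill in item_skills]
    weakest = min(item_skills, key=lambda skill: performance[skill])
    return [(weakest, False)]


# Illustrative proportion-correct values for one student.
performance = {"algebra": 0.7, "perimeter": 0.4, "congruence": 0.9}
skills = ["algebra", "perimeter", "congruence"]
wrong_rows = encode_response(skills, False, performance)
right_rows = encode_response(skills, True, performance)
```

Only `perimeter` is blamed for the wrong answer here, which keeps the easier skills from being penalized by a compound, hard item.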

When evaluating a student's skill levels, both original questions and scaffold responses are used in an equal manner and they have the same weight in evaluation.

**3.4.2 Results** Recall that RQ1 asked whether adding response data to scaffolding questions can help us to do a better job of tracking students' knowledge and more accurately predicting state test scores. To answer RQ1, we first trained mixed-effects logistic regression models using the data set that only includes responses to original questions, with one regression model for each skill model. Then, we replicated the training process but used the data set that was constructed by including responses to scaffolding questions and applying the "credit-and-blame" strategy described above. Again, models were trained for all four skill models.

It turns out that better-fitting models, as measured by % Error on the state test, can always be obtained by using scaffolding questions. In particular, when using the WPI-1 on DATA-2005, the mean decrease of % Error is 1.91 percent after scaffolding questions were introduced; for the WPI-5, the decrease is 1.21 percent; for the WPI-39, 2.88 percent; and for the WPI-78, 5.79 percent, which is the biggest improvement. We then did paired t-tests between the % Error terms for the 447 students and found that the improvements are statistically significant in all four cases, as summarized in Table 3. We noticed the same effect in DATA-2006. As shown in Table 3, the improvement in % Error is statistically reliable for all four models. (Please read across the columns for an answer to RQ1. Reading across the rows gives the answer to RQ2, which we describe in the next section.)

Table 3. The Effect of Using Scaffolding Questions on DATA-2005 and DATA-2006

This drop in % Error (and in MAD) makes sense for two reasons. One is that by using the response data to scaffolding questions, we are using more of the data we collected. A second reason is that the scaffolding questions help us to do a better job of dealing with credit-and-blame problems. To get more "identifiability" per skill, in the next section, we use the "full" response data (with scaffolding question responses added in) to try to answer the question of whether finer grained models predict better.

Sharp readers may have noticed that the MAD of the WPI-39 model for DATA-2006 is lower than that of the WPI-78, yet the % Error of the WPI-39 model is higher than the % Error of the WPI-78 model. This is because two multiple-choice items in the 2006 MCAS test, item 13 and item 26, were tagged with the skills "N.6.8-understanding-absolute-value" and "P.9.8-modeling-covariation," respectively, yet none of the ASSISTment System items were tagged with these two skills, which means that we do not have training data to track student knowledge on them. Therefore, we ignored the two items when predicting students' total score on the 2006 MCAS test using the WPI-39 model. This reduces the total number of MCAS items for the WPI-39 to 32. The % Error of the WPI-39 model is calculated as MAD/32, while the % Error of the other models is calculated as MAD/34.
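The reversal between MAD and % Error follows from the different denominators. Writing $\mathrm{MAD}_{39}$ and $\mathrm{MAD}_{78}$ for the two models' MADs (notation introduced here for illustration), the WPI-39 has the higher % Error exactly when:

```latex
\frac{\mathrm{MAD}_{39}}{32} > \frac{\mathrm{MAD}_{78}}{34}
\iff
\mathrm{MAD}_{39} > \frac{32}{34}\,\mathrm{MAD}_{78} \approx 0.941\,\mathrm{MAD}_{78}
```

So whenever $\mathrm{MAD}_{39}$ falls between roughly 94.1 percent and 100 percent of $\mathrm{MAD}_{78}$, the WPI-39 has the lower MAD but the higher % Error.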

Does an error rate of 12.09 percent on the WPI-78 seem impressive or poor? What is a reasonable goal to shoot for? Zero percent error? For comparison, we created a baseline estimate of students' MCAS test scores by first computing students' overall percent correct on original questions and then multiplying the percent correct by the full score. Under this "dumb" approach, the % Error was 17.26 percent for DATA-2005 and 21.47 percent for DATA-2006. In [15], we reported on a simple simulation of how well one MCAS test predicts another MCAS test. We did not have access to data for a group of students who took two different versions of the MCAS test to measure this, so we estimated it by taking students' item-level scores on the MCAS, randomly splitting the 34 multiple-choice items in the test into two halves, and then using their scores on the first half to predict the second half. This process was repeated five times, and, on average, the % Error was 11 percent, suggesting that a 12 percent error rate is fairly impressive.
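Both comparison points can be sketched in code. The baseline follows the description directly. The split-half routine is only one plausible reading of the simulation: it assumes that the second-half score is predicted by scaling the first-half percent correct, and that the error is normalized by the number of items in the second half. All names and data below are hypothetical.

```python
import random


def baseline_score(num_correct, num_attempted, full_score=34):
    """'Dumb' baseline: overall percent correct on original questions times the full score."""
    return full_score * num_correct / num_attempted


def split_half_percent_error(item_scores, seed=0):
    """Predict each student's second-half score from the first-half percent correct.

    item_scores -- one list of 0/1 item scores per student, all of equal length
    """
    if not item_scores:
        raise ValueError("need at least one student")
    n_items = len(item_scores[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)          # random split of the items
    first, second = order[: n_items // 2], order[n_items // 2:]
    total_abs_diff = 0.0
    for scores in item_scores:
        rate = sum(scores[i] for i in first) / len(first)
        predicted = rate * len(second)
        actual = sum(scores[i] for i in second)
        total_abs_diff += abs(predicted - actual)
    mad = total_abs_diff / len(item_scores)
    return 100.0 * mad / len(second)


# Two extreme illustrative students: all correct and all wrong.
perfect_split = split_half_percent_error([[1] * 34, [0] * 34])
```

Repeating `split_half_percent_error` with several seeds and averaging the results mirrors the repeated random splits described above.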

**Research Question 2**(

**RQ2**). How does the finer grained skill model (WPI-78) do on estimating external test scores compared to the other skill models?

**3.5.1 Does WPI-78 Fit Better than the Coarser-Grained Models?** To answer RQ2, we compared the four mixed-effects regression models (trained on the "full" data set with scaffolding questions used) fitted using the four different skill models. As shown in Table 4, the WPI-78 had the best result, followed by the WPI-39, WPI-5, and WPI-1. % Error dropped when a finer grained model was used, from WPI-1 to WPI-5, and then from WPI-39 to WPI-78.

Table 4. Evaluating the Accuracy of Skill Models

To see if the % Error was statistically significantly different for the models, we compared each model with every other model. We did paired t-tests between the % Error terms for the 447 students in DATA-2005 and also the 474 students in DATA-2006. We found out that in DATA-2005, the WPI-78 did as well as the WPI-39 ( ), and they both predicted MCAS score reliably better than the WPI-5 and WPI-1. In DATA-2006, the WPI-78 model is statistically reliably better than the WPI-39, WPI-5, and WPI-1 ( in all cases), and WPI-1 is statistically reliably worse on predicting MCAS scores than the other models ( ). This suggested that finer grained skill models were helpful in tracking students' knowledge over time.
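The paired comparison can be illustrated with the textbook paired t statistic, computed on per-student differences between two models' % Error terms. This sketch uses only the standard library and stops at the statistic and its degrees of freedom; turning it into a p-value requires a t distribution, which is not included here. The function name and the error values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for two models' per-student errors."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length lists with at least two students")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    sd = stdev(diffs)                     # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Illustrative per-student % Error values for two models.
t_value, dof = paired_t_statistic([10, 12, 14, 16], [9, 10, 13, 13])
```

A positive statistic here means the first model's errors are larger on average than the second model's for the same students.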

We want to point out that our results on student performance prediction are by no means the best. As a matter of fact, as a control, we trained an Item Response Theory [ ^{42}] model of the kind widely used by psychometricians in the traditional testing area. We fit the simplest model, the Rasch model, which models a student's dichotomous response ( ) to a problem as a logistic function of the difference between student proficiency ( ) and problem difficulty ( ), on our online data. The fitted model gave us an estimate of math proficiency for every individual student, which allows us to compute the predicted MCAS score assuming that every item in the MCAS has an average difficulty ( ). In Table 4, *IRT-2005* refers to the IRT modeling condition for DATA-2005, and *IRT-2006* refers to the IRT modeling condition for DATA-2006. As we can see, the % Error of the Rasch model for DATA-2005 is 12.82 percent, marginally higher than that of the WPI-78, 12.09 percent ( ). Yet, the Rasch model did better in the next year, where its % Error (13.70 percent) is reliably lower ( ) than that of the WPI-78 (14.70 percent). Other than the IRT model, we have also contrasted our result on DATA-2005 with the result produced by a Bayesian network approach that dealt with the skills associated with one item conjunctively using an "AND" gate [ ^{34}]. The "AND" gate signifies that all the skills must be known in order for the question to be answered correctly. Pardos et al. [ ^{35}] confirm the "conjunctive" hypothesis. During the comparison process, we found that our approach did better than the Bayesian network approach when the WPI-1 and WPI-5 models were used, and the two approaches are comparable when the WPI-39 and WPI-78 were used. Specifically, for the WPI-39 model, the % Error of the Bayes approach is 12.05 percent, lower than what we got (12.41 percent); yet for the WPI-106 model, the % Error of the Bayes approach is 13.75 percent, higher than our result of 12.09 percent.
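The Rasch-based prediction described above can be sketched as follows. The symbols `theta` (student proficiency) and `beta` (problem difficulty) are the conventional names, assumed here for illustration, and the default `average_difficulty` of 0.0 is a placeholder rather than a value from the fitted model.

```python
import math


def rasch_probability(theta, beta):
    """Rasch model: P(correct) is a logistic function of proficiency minus difficulty."""
    x = theta - beta
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predicted_score(theta, n_items=34, average_difficulty=0.0):
    """Predicted test score when every item is assumed to have the same average difficulty."""
    return n_items * rasch_probability(theta, average_difficulty)


# A student whose proficiency equals the item difficulty has a 50 percent chance per item.
even_odds = rasch_probability(0.0, 0.0)
score_at_average = predicted_score(0.0)
```

Because every item shares one difficulty, the predicted score depends only on the student's single proficiency estimate, which is what makes this a unidimensional control for the skill-based models.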

As a measure of internal fit, we calculated the average absolute residual for each model fitted on the data. For the data of both years, the WPI-78 fits best. Since the WPI-78 model contains far more skills than the other models, one might think the model won simply because of the large number of parameters. Therefore, as a sanity check, we generated a Random-WPI-78 model in which items are randomly mapped to skills from the WPI-78 model. It turned out that the random model did reliably worse than the WPI-78 model (and also the WPI-39), both in MCAS score prediction and in internal fit. ^{11} Readers may have noticed in Table 3 that when only response data on original questions were used, the order changed for DATA-2005: The WPI-5 still did better than the WPI-1. However, the prediction error gets worse when the WPI-39 or WPI-78 models were used. Our interpretation of this is that when only original responses were used, individual skills do not get as much identifiability; it only makes sense to build fine-grained skill models if you have questions that can be tagged with just a single skill. Another reason why finer grained models might not fit the data as well is that the finer grained model has fewer data points per skill, so there is a trade-off between the number of skills you would like and the precision of the estimates.

Comparing the results that we got using DATA-2005 and those using DATA-2006, we noticed three things changed. First, the order of prediction accuracy differs when only original questions were used. The finer grained models still track student knowledge better than coarser-grained models when DATA-2006 was used; yet this is not the case when DATA-2005 was used. Second, the prediction error was much higher in the year 2005-2006 than in the previous year. Third, the effectiveness of the IRT model decreased in the year 2006. One possible reason is that we have fewer training data points for each student in the year 2005-2006 (5.5 sessions and 51 problems done versus 7.3 sessions and 87 problems done). Additionally, the problem sets administered to students in the two years are not the same.

**3.5.2 How Well Does the Model WPI-78 Fit the Data?** When using logistic regression, statistical packages allow the user to analyze which of the parameters seem to have good fitting values. We now turn to a little more analysis of the WPI-78 to see how good our model is. In our model, each skill gets one coefficient indicating the skill's "intercept" and one for the skill's "slope." The first of these, the intercept, allows us to model that students start the year knowing some skills better than others, while the slope allows for the fact that some skills are learned more quickly than others. Our model shows that for students who used the system in the school year 2004-2005, the easiest skills are "Subtraction," "Division," and "Simple-Calculation," while the skill that had the hardest incoming difficulty was "Qualitative-Graph-Interpretation" (as shown in Fig. 5). We also looked at the fits on the slopes for each skill. The skill that showed the steepest rate of learning during the course of the year was "Sum-of-Interior-Angles-Triangle" (e.g., "what is the sum of the angles inside of a triangle?"). It seems quite plausible that students learned a good amount related to this skill, as we noticed in a classroom a poster that said "The sum of the interior angles in a triangle is 180," clearly indicating that this was a skill that teachers were focused on teaching. The skill that showed the least learning was called "Equation-Concept" (as shown in Fig. 6). Out of the 78 skills, seven coefficients predicted "unlearning" (i.e., the slopes are negative), which is presumably a sign of overfitting or of the tagging of the skills in the skill model being not quite right. In the future, we will investigate automating the process of removing such skills from the model and refitting the data.

Considering the accuracy of fit, we noticed that the model obtained high accuracy in predicting student responses on items tagged with the simple skills (e.g., Division, Subtraction), yet was not as good at tracking student knowledge on the skills "Of-Means-Multiply," "Interpreting-Linear-Equations," or "Inequality-Solving." We speculated that skills that had less data would be more likely to be poorly fit. We computed a correlation to see whether the skills that were poorly fit were the same skills that had a relatively small number of items, but surprisingly the correlation was very weak. Another reason that a skill might have a poorly fit slope is that we tagged items with the same skill name because they share some superficial similarity, even though they do not have the same learning rates. This analysis suggests some future work in refining the WPI-78 model; for instance, one possible refinement is to merge "equation-concept" with "equation-solving" (i.e., delete the "equation-concept" skill from the model and map all items tagged with "equation-concept" to "equation-solving"). Computational techniques such as Learning Factors Analysis [ ^{9}] provide a way to manipulate the skill model and thus may substantially improve the model's fit to the data.

All in all, we make no claim that the fine-grained model we created represented the best fitting model possible. Nevertheless, we stand by the claim that this model, taken in total, is good enough that it can produce good fit to the data and make good predictions of the MCAS scores, indicating that the model is useful, even given the flaws that might exist in it.

^{22}], we reported that the ASSISTment System can be a better assessor after accounting for information such as the amount of assistance students required and their help-seeking behavior. The results presented in this paper further showed that not only can reliable assessment and instructional assistance be effectively blended in a tutoring system, but also, more importantly, such a system can provide teachers with useful fine-grained student-level knowledge that they can reflect on to adjust their pedagogy. Recently, in an interview with US News & World Report [ ^{38}], Secretary of Education Arne Duncan weighed in on the NCLB Act and called for continuous assessment. He mentioned that he is concerned about overtesting and feels that fewer, better tests would be more effective. He wants to develop better data management systems that will help teachers track individual student progress in real time so that teachers and parents can assess and monitor student strengths and weaknesses. Our studies imply that it is possible for states to develop a system similar to the ASSISTment System that does all three of these things at the same time: 1) accurately assesses students; 2) gives fine-grained feedback that is more cognitively diagnostic; and 3) saves classroom instruction time by assessing students while they are getting tutoring.

^{28}], requiring the time of experts to create, and then test these models on students. The first model is the best guess and should be iteratively refined after usage in intelligent tutoring systems. The expert-built models are subject to the risk of "expert blind spot" [ ^{31}]. We are happy to see that our first cognitive model fits student performance data well. Nevertheless, we still feel that we can probably refine the fine-grained model to be more accurate. We have found that maintaining a cognitive model is difficult in a system where new questions are being added every day. For future work, we plan to improve the model iteratively and use student performance data to evaluate the fit of the models in each cycle, focusing on the less well-fitted skills.

# Acknowledgments

• *M. Feng is with the Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, WPI#696, Worcester, MA 01609. E-mail: mfeng@cs.wpi.edu.*

• *N.T. Heffernan is with the Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609. E-mail: nth@wpi.edu.*

• *C. Heffernan and M. Mani are with the Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609. E-mail: {ch, mmani}@cs.wpi.edu.*

*Manuscript received 26 Dec. 2008; revised 24 Mar. 2009; accepted 3 Apr. 2009; published online 8 Apr. 2009.*

*For information on obtaining reprints of this article, please send e-mail to: lt@computer.org, and reference IEEECS Log Number TLTSI-2008-12-0122.*

*Digital Object Identifier no. 10.1109/TLT.2009.17.*

1. Including assessment systems from Northwest Evaluation Association (http://nwea.org/assessments/), Measured Progress (http://measuredprogress.org), Pearson (http://www.pearsonassessments.com/), and the Center for Data-Driven Reform in Education (http://www.cddre.org/Services/4Sight.cfm).

2. Standard psychometric models assume that the amount of learning that happens during a test is limited. Some work has been done to measure growth and change (e.g., [ ^{42}], [ ^{16}]), but it is not based on testing data where students are actively learning the material.

3. The term "ASSISTment" was coined by Kenneth Koedinger and blends instructional **assist**ance and assess**ment**.

4. A "multimapping" skill model, in contrast to a "single-mapping" model, allows one item to be tagged with more than one skill.

5. The amount of data is limited by the maximum memory allowed by the open-source statistical package we used.

6. Given that the state test was given on 17 May 2005, it would be inappropriate to use data after that day for the purpose of predicting state scores. Therefore, those data were not included in our data set.

7. This is just a simplifying assumption. Of course, in reality, it is possible that a student might learn one skill (e.g., *perimeter*) faster than another one (e.g., *congruence*).

8. All the tagging was done after the MCAS items were released without any reference to the modeling process described in this paper.

9. We admit that there are other approaches to dealing with multimapped items. For instance, using Bayesian Networks is a reasonable way to deal with this situation. Pardos et al. [ ^{34}] used this approach and obtained similar results, namely that fine-grained models enable better predictive models.

10. We think that it might be useful to discuss our model from a more qualitative point of view. If an item is tagged with more skills, does our model predict that the item is harder? The answer is no, in the sense that tagging a set of items with an additional easy skill (i.e., one easier than the skills the items are currently tagged with) would not change our model's prediction at all. This makes qualitative sense, in that we believe the probability of getting a question correct is given by the probability of getting correct the most difficult skill associated with that question.

11. It is common to report the value of a model by using a metric that balances model fit and model complexity, such as the Bayesian Information Criterion (BIC). For instance, Cen et al. [ ^{9}] and Ferguson et al. [ ^{23}] both used BIC to compare different models. However, the size of our data sets differed across models, because the finer grained models add additional rows for all questions that are tagged with more than one skill, and BIC only makes sense when the data sets are exactly the same size. For the same reason, we did not conduct ANOVA on the results.
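The qualitative point in footnote 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each skill tagged on an item is represented by a hypothetical log-odds value, and the item's predicted probability is taken to be that of its hardest skill. The skill names are borrowed from examples in the text, while the numeric values and the function names are invented.

```python
import math


def logistic(log_odds):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def item_p_correct(skill_log_odds):
    """Predicted probability for an item: that of its hardest tagged skill."""
    if not skill_log_odds:
        raise ValueError("an item needs at least one skill tag")
    return min(logistic(value) for value in skill_log_odds.values())


# Hypothetical log-odds values; not estimates from the paper.
base = {"perimeter": -0.5}
with_easier = {"perimeter": -0.5, "subtraction": 1.5}
with_harder = {"perimeter": -0.5, "congruence": -1.2}
```

Here `item_p_correct(with_easier)` equals `item_p_correct(base)`, since the added skill is easier than the existing one, whereas `item_p_correct(with_harder)` is lower, since the new tag becomes the hardest skill on the item.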

#### References

**Mingyu Feng** received the BS and MS degrees in computer science from Tianjin University, China. She is currently working toward the PhD degree in computer science at Worcester Polytechnic Institute. Her primary interests lie in the area of intelligent tutoring systems, particularly student modeling and educational data mining. She has also worked in the areas of cognitive modeling and psychometrics. Her research has contributed to the design and evaluation of educational software, developed computing techniques to address problems in user learning, and produced basic results on tracking student learning of mathematical skills. Her paper summarizes the current state of her thesis work on cognitive skill assessment in the ASSISTment project, on which she has been working since 2004.

**Neil T. Heffernan** received the summa cum laude degree in history and computer science from Amherst College, and the PhD degree in 2001 from Carnegie Mellon University. He is an associate professor of computer science at Worcester Polytechnic Institute and a creator of WPI's new Learning Sciences & Technology PhD program. For his dissertation, he built the first intelligent tutoring system that incorporated a model of tutorial dialog, which was also one of the first intelligent tutoring systems on the Web. He does multidisciplinary research in intelligent tutoring systems, artificial intelligence, psychometrics, and cognitive science. He has more than 25 peer-reviewed publications and has received about $9 million in funding on more than a dozen different grants.

**Cristina Heffernan** received the BS degree in mathematics from Lewis and Clark College, the MS degree in mathematics education from The University of Pittsburgh, and the MA degree in teaching from Towson State University. She serves as a cocreator of the ASSISTment system and the chief subject matter expert. Prior to that, she was a math coach and has seven years of experience teaching middle school math.

**Murali Mani** received the MS and PhD degrees from the University of California, Los Angeles, and the BTech degree from the Indian Institute of Technology, Madras. He is an assistant professor in the Department of Computer Science at Worcester Polytechnic Institute, which he joined in 2003. His research interests are in database systems, where he focuses on data integration, Web and XML systems, and data management for health informatics.
