# Automatic Detection of Off-Task Behaviors in Intelligent Tutoring Systems with Machine Learning Techniques

Suleyman Cetintas
Luo Si
Yan Ping Xin
Casey Hord

Pages: pp. 228-236

Abstract—Identifying off-task behaviors in intelligent tutoring systems is a practical and challenging research topic. This paper proposes a machine learning model that automatically detects students' off-task behaviors using only the data available from the log files that record students' actions within the system. The model utilizes a set of time features, performance features, and mouse movement features, and is compared to 1) a model that utilizes only time features and 2) a model that uses time and performance features. Because different students exhibit different types of behaviors, a personalized version of the proposed model is constructed and compared to the corresponding nonpersonalized version. To address the data sparseness problem, a robust Ridge Regression algorithm is utilized to estimate model parameters. An extensive set of experimental results demonstrates the power of using multiple types of evidence, the personalized model, and the robust Ridge Regression algorithm.

Index Terms—Computer uses in education, adaptive and intelligent educational systems.

## Introduction

The increasing use of computers for teaching has led to the development of many intelligent tutoring systems (ITSs). ITSs have been shown to increase students' involvement and effort in the classroom [26] as well as improve students' learning [16]. However, students' misuse of an ITS, or lack of motivation toward it, can reverse its positive effect. Therefore, there have been considerable efforts to model and understand students' behaviors while they use such systems [3], [4], [7], [27]. The vast majority of prior work has focused on the interaction between students and the tutoring environment within the software, which is called "gaming the system." Gaming behavior is the systematic exploitation of properties, hints, or regularities in a system to complete a task and finish the curriculum without thinking about the material; by definition, it occurs only while a student is working with the system. However, students' behavior outside the system may also affect the learning opportunities that ITSs provide.

Off-task behavior occurs when students' attention drifts and they engage in activities that neither involve the tutoring system nor serve any learning aim. Surfing the web, devoting time to off-topic reading, talking with other students without any learning aim [3], and disrupting other students' learning [28] are typical examples of off-task behavior.

Although off-task and gaming-the-system behaviors are quite different in nature, Baker has noted that off-task behaviors are associated with deep motivational problems that also lead to gaming-the-system behaviors. Baker also suggested that these behaviors should be carefully studied together during system design, since decreasing off-task behaviors can increase gaming behaviors, especially when a student is immediately warned to cease an off-task behavior [2]. Long-term solutions, rather than immediate warnings (e.g., students' self-monitoring), have been shown to decrease off-task behaviors in traditional classrooms [11]. For intelligent tutoring systems, increasing the challenge and rewarding students for quickly solving problems without exhibiting gaming behavior have been suggested as ways to decrease off-task behaviors [2].

Off-task behaviors occur not only in educational systems but also in various types of interactive systems that require a user's continued attention and engagement. As noted in [2], driving a car is among such technology-supported tasks. Detecting when users of such systems are not paying the necessary attention to their tasks could make these systems more effective and improve safety, labor quality, and so on.

Detecting students' off-task behavior in environments where it is not practical to use equipment such as microphones or gaze trackers is a challenging task [2]. Such equipment would provide instructional systems with audio and video data (e.g., facial cues and postures) [12], and there has been some research incorporating this information into instructional systems [19], [21], [23]. This type of data would make tasks such as off-task detection relatively easier; however, since most K-12 schools are not equipped to collect such data, systems must detect off-task behaviors using only the data from students' actions within the tutoring software. This brings its own challenges, since understanding students' intentions only from their actions within a system has been found to be difficult [25]. Yet off-task behavior detectors, especially personalized detectors that consider interuser behavior variability (e.g., using more or less time to solve problems, having difficulties with particular types of questions and/or problems, different styles of mouse usage, etc.), have the potential to improve students' learning experiences.

To the best of our knowledge, there is very limited prior work on the automatic detection of off-task behavior utilizing mouse movement data [10], [13], and there is no prior work on the personalized detection of off-task behaviors with multifeature models. Prior works mainly focus on detecting gaming behaviors [3], [4], [7], [27], which is a quite different task from detecting off-task behaviors, and none of them utilized mouse movement data. Other prior work that focused on off-task behavior detection analyzed only time and performance features (which can be extracted from the logs of user-system interaction) and did not utilize mouse movement data [2]. One of the works using mouse movement data was done by De Vicente and Pain, who had human participants use mouse movement data as well as data from student-tutor interactions for motivation diagnosis, but did not use it for automated detection of off-task behaviors [13]. Cetintas et al. tracked and analyzed mouse movement data along with performance and time data to automatically detect off-task behaviors; however, they did not consider personalization [10]. In another work, Cetintas et al. also used mouse movement data along with performance, problem, and time features to predict the correctness of problem solving, which is a very different task from off-task detection [9].

Although time and performance features are useful for improving the effectiveness of off-task behavior detection, the vast majority of these works ignore 1) an important type of data, namely mouse tracking data, which can be easily stored in and retrieved from user-system logs. Furthermore, all prior works ignore 2) personalization, which can capture the different characteristics of students that lead to different behavior types; such characteristics are hard to identify with nonpersonalized off-task detectors, since those models cannot recognize interuser variability across behavior types.

This paper proposes a machine learning model that automatically identifies a student's off-task behaviors by utilizing multiple types of evidence, including time, performance, and mouse movement features, and by utilizing personalization to capture interuser behavior variability. To address the data sparseness problem, the proposed model utilizes a robust Ridge Regression technique to estimate model parameters. The proposed model is compared to 1) a model that utilizes only time features and 2) a model that uses time and performance features. Furthermore, all models are compared to each other 1) when personalization is not used and 2) when the data sparseness problem is not addressed (i.e., when model parameters are not learned with Ridge Regression). We show that utilizing multiple types of evidence, personalization, and the robust Ridge Regression technique improves the effectiveness of off-task detection.

The rest of the paper is structured as follows: Section 2 introduces the data set used in this work. Section 3 describes the Least-Squares and Ridge Regression techniques. Section 4 proposes several approaches for modeling off-task behaviors as well as the personalization of those modeling approaches. Section 5 discusses the experimental methodology. Section 6 presents the experiment results, and finally, Section 7 concludes this work.

## Data

Data from a study conducted in 2008 in an elementary school were used in this work. The study was conducted in mathematics classrooms using math tutoring software developed by the authors. The software taught problem-solving skills for Equal Group (EG) and Multiplicative Compare (MC) problems. These two problem types are a subset of the most important mathematical word problem types, representing about 78 percent of the problems in fourth-grade mathematics textbooks [20]. In the tutoring system, students first study a conceptual instruction session, followed by problem-solving sessions that test their understanding. Both the conceptual instruction and the problem-solving parts require students to work one-on-one with the software, and if students fail to pass a problem-solving session, they must repeat the corresponding conceptual instruction and problem-solving session. The software has a total of four conceptual instruction sessions and 11 problem-solving sessions of 12 questions each: four sessions of Equal Group worksheets, four sessions of Multiplicative Compare worksheets, and three sessions of Mixed worksheets, each of which includes six EG and six MC problems. The software is supported with animations, audio (more than 500 audio files), instructional hints, exercises, etc.

The study included 12 students, among them four students with learning disabilities, one student with an emotional disorder, and one student with an emotional disorder combined with a mild intellectual disability. Students used the tutor for several sessions lasting about 30 minutes each, with an average of 18.25 sessions per student (standard deviation 3.3878). Evidence about students' on-task and off-task behaviors was collected through outside observation during these sessions. Self-report was not used to assess students' on-task and off-task behaviors, due to the concern that this measure might itself influence students' off-task behaviors and learning outcomes. A specific student is coded as either on-task or off-task by the observer, as in past observational studies of on-task and off-task behaviors [15], [17], [18].

Table 2. Details about the Training and Test Splits of the Observed Data for On-Task and Off-Task Behaviors

## Methods: Least Squares and Ridge Regression

This section first describes the Least-Squares technique, then introduces the problem of overfitting, and finally describes the Ridge Regression technique.

The linear model has been a very important method in statistics for the past 30 years and is still one of the most powerful prediction techniques. The simplest linear model for regression is a linear combination of the input variables:

$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D = \mathbf{w}^T \mathbf{x},$

(1)

where $\mathbf{x}=(1,x_1,\ldots,x_D)^T$ is an instance of training data with $D+1$ dimensions and $\mathbf{w}=(w_0,w_1,\ldots,w_D)^T$ are the model coefficients ($w_0$ is the bias or intercept). The method of Least Squares is one of the most popular techniques for fitting a linear model to a set of training data. It has been noted in prior work that for machine learning agents that act as a "black box" when making predictions, the exact mechanism used by the agent is secondary (i.e., any machine learning method that performs function approximation will work, and the determination of the model's inputs/outputs is the critical issue) [5], [6]. Since this paper uses a machine learning agent acting as a "black box" for off-task prediction, Least Squares is used as the machine learning agent.

The Least-Squares method determines the values of the model coefficients $\mathbf{w}$ by minimizing the sum-of-squares error between the prediction $y(\mathbf{x}_n, \mathbf{w})$ for each data point $\mathbf{x}_n$ and the corresponding target value $t_n$. The sum-of-squares error is defined as follows:

$E_D (\mathbf{w}) = \sum_{n = 1}^N \{ t_n - y(\mathbf{x}_n, \mathbf{w})\}^2 = \sum_{n = 1}^N \{ t_n - \mathbf{w}^T \mathbf{x}_n \}^2,$

(2)

whose minimization yields the maximum likelihood (Least-Squares) solution of the model parameters:

$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T \mathbf{t},$

(3)

where $\boldsymbol{\Phi}$ is an $N \times (D+1)$ design matrix whose elements are given by $\Phi_{nj} = x_{nj}$ (i.e., the $j$th dimension of the $n$th training instance) and $\mathbf{t}=(t_1,\ldots,t_N)^T$ is the vector of target values.
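As a minimal sketch of Eqs. (1)-(3) with NumPy, assuming a tiny hypothetical feature matrix (the features and labels below are illustrative, not the paper's data):

```python
import numpy as np

# Hypothetical data: each row is one problem-solving action, columns are
# illustrative time/performance features; t holds 1 = off-task, 0 = on-task.
X = np.array([[1.0, 0.2],
              [2.0, 0.8],
              [3.0, 0.5],
              [4.0, 0.9]])
t = np.array([0.0, 0.0, 1.0, 1.0])

# Build the N x (D+1) design matrix Phi by prepending a bias column of ones.
Phi = np.column_stack([np.ones(len(X)), X])

# w_ML = (Phi^T Phi)^{-1} Phi^T t; lstsq solves the same problem in a
# numerically stable way, avoiding the explicit matrix inverse.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Predictions for the training points are y = Phi w_ML.
y_hat = Phi @ w_ml
```

In practice the explicit inverse in Eq. (3) is rarely computed directly; a least-squares or linear-system solver gives the same $\mathbf{w}_{ML}$ with better numerical behavior.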

An important and common problem in statistical learning is overfitting: as the name implies, a model that fits the training data extremely well may nevertheless be a poor predictor of future test data [8], [14]. Overfitting especially occurs under data sparseness, i.e., when limited training data are used to learn the parameters of a model. Regularization controls overfitting by constraining the model parameters, discouraging them from reaching the large values that lead to overfitting [8], [14]. Ridge Regression is a technique that controls overfitting by adding a quadratic regularization penalty $E_W(\mathbf{w})=\frac{1}{2}\mathbf{w}^T \mathbf{w}$ to the data-dependent error $E_D(\mathbf{w})$. After the addition of $E_W(\mathbf{w})$, the total error function that Ridge Regression aims to minimize becomes

$E_{TOTAL} = E_D (\mathbf{w}) + \lambda E_W (\mathbf{w}) = \sum_{n = 1}^N \big\{ t_n - \mathbf{w}^T \mathbf{x}_n \big\}^2 + \frac{\lambda}{2}\,\mathbf{w}^T \mathbf{w},$

(4)

where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D(\mathbf{w})$ and the regularization term $E_W(\mathbf{w})$. In this work, the regularization coefficient is learned with twofold cross validation in the training phase. The exact minimizer of the total error function can be found in closed form:

$\mathbf{w}_{RIDGE} = (\lambda I + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T \mathbf{t},$

(5)

which is the Ridge Regression solution for the parameters of the model.
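A minimal sketch of Eqs. (4)-(5) with NumPy, assuming the design matrix already contains the bias column; the twofold cross-validation loop for $\lambda$ mirrors the procedure described above, but the candidate grid and the fold split are our own illustrative choices:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form Ridge solution of Eq. (5): (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

def choose_lambda(Phi, t, candidates):
    """Pick lambda by twofold cross validation, as in the training phase."""
    n = len(Phi) // 2
    folds = [(slice(0, n), slice(n, None)), (slice(n, None), slice(0, n))]

    def cv_error(lam):
        # Sum of squared prediction errors over both held-out halves.
        err = 0.0
        for tr, te in folds:
            w = ridge_fit(Phi[tr], t[tr], lam)
            err += float(np.sum((t[te] - Phi[te] @ w) ** 2))
        return err

    return min(candidates, key=cv_error)
```

Larger $\lambda$ shrinks the coefficient vector toward zero, which is exactly how the regularization term trades training fit for robustness on sparse data.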

## Modeling Approaches and Personalization

This section describes several modeling approaches for off-task behavior detection as well as the personalized versions of the models.

### 4.1 Several Modeling Approaches

This section describes the models that are used for evaluation: 1) a model that only considers time features (i.e., Time Only Modeling), 2) another modeling approach that considers performance features as well as time features (i.e., Time and Performance-Based Modeling), and finally, 3) a more advanced model that incorporates mouse movement features with time and performance-related features (i.e., Time, Performance, and Mouse-Tracking-Based Modeling).

#### 4.1.1 Time Only Modeling ( ${\rm TimeOnly}\_{\rm Mod}$ )

Modeling students' off-task behaviors solely by considering the time taken on an action has been a useful approach in prior work [4], [7], [22]. This modeling approach considers only time-related features as discriminators of on-task and off-task behaviors. Setting a cutoff on how much time an action/problem should take, and flagging all actions that last longer than that cutoff, is one of the simplest and most intuitive methods; it has previously been applied to determine whether a student is reading hints carefully [22], to determine whether a student is guessing to solve a problem [7], and by Baker as a baseline for his multifeature off-task detection model [2]. Baker uses an absolute time feature, the time that the action of interest takes the user, as well as a relative time feature, expressed as the number of standard deviations by which the action's time was faster or slower than the mean time the action takes across all other students. The idea of relative time features is quite intuitive, since some actions/problems take more time for most students while others take relatively much less, depending on factors such as difficulty level and familiarity.

In this work, both an absolute time feature and a relative time feature are used for time-only modeling. The relative time feature is defined as the time spent by a student minus the average time spent on the same problem by all other students.
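The relative time feature above can be sketched as follows (a minimal illustration assuming the logs are available as a dictionary keyed by student and problem; the data structure is ours, not the paper's):

```python
def relative_time(student, problem, times):
    """Relative time feature: this student's time on a problem minus the
    average time all *other* students spent on the same problem.

    times : dict mapping (student, problem) -> time in seconds.
    """
    own = times[(student, problem)]
    others = [v for (s, p), v in times.items() if p == problem and s != student]
    return own - sum(others) / len(others)
```

A positive value means the student is slower than peers on this problem; a large positive value is the kind of signal a time-based off-task detector keys on.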

Time-only modeling serves as the baseline for all other models in this work and will be referred to as ${\rm TimeOnly}\_{\rm Mod}$ .

#### 4.1.2 Time and Performance-Based Modeling ( ${\rm TimePerf}\_{\rm Mod}$ )

Time-related features are useful in many situations; however, many other signals can be good indicators of off-task behavior, such as the probability that the student possesses the prior knowledge to answer the given question correctly. The percentage of correct answers across all previous problems for a student has recently been used as an indicator of this for gaming behavior detection [27], and a similar measure has been used in Baker's recent off-task behavior detection work [2].
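The running-correctness measure above reduces to a simple fraction over the student's history; a minimal sketch (the cold-start default for an empty history is our assumption, as the text does not specify it):

```python
def prior_correctness(history):
    """Percentage of correct answers across all previous problems.

    history : list of 0/1 outcomes on the student's previous problems.
    """
    if not history:
        # Cold-start default; the paper does not specify this case.
        return 0.0
    return sum(history) / len(history)
```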

The approach that uses time and performance features will be referred as ${\rm TimePerf}\_{\rm Mod}$ .

#### 4.1.3 Time, Performance, and Mouse-Tracking-Based Modeling ( ${\rm TimePerfMouseT}\_{\rm Mod}$ )

Incorporating performance features into time-only modeling is an effective way to improve an off-task behavior detector; however, there is still room for improvement. Both time-only modeling and time-and-performance-based modeling ignore an important data source: mouse movement. Students are almost always interacting with the mouse when using tutoring systems. As far as we know, there is very limited prior research on the detection of gaming-the-system or off-task behaviors that utilizes mouse tracking data [10], [13]. More details about prior work on this modeling approach, as well as on utilizing mouse tracking data, can be found in Section 1.

In addition to the two time features and eight performance-related features mentioned above, this modeling approach incorporates six more features as mouse tracking data: three main features, each used in both an absolute and a relative version. The first feature is the maximum mouse-off time in a problem, i.e., the longest time interval (in seconds) during which the mouse is not used in the current problem. The second and third mouse tracking features are the average x movement and the average y movement, respectively, which measure the average number of pixels the mouse moves along the x- and y-axes in 0.2 second intervals.
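The three absolute mouse features can be sketched from sampled cursor positions as follows. This assumes positions are sampled every 0.2 seconds (matching the intervals in the text) and treats "mouse not used" as "position unchanged since the last sample," which is our reading, not a definition the paper gives:

```python
def mouse_features(samples, dt=0.2):
    """Absolute mouse-tracking features from (x, y) positions sampled every dt s.

    Returns (max_mouse_off_time, avg_x_movement, avg_y_movement).
    """
    # Per-interval movement along each axis, in pixels.
    dxs = [abs(b[0] - a[0]) for a, b in zip(samples, samples[1:])]
    dys = [abs(b[1] - a[1]) for a, b in zip(samples, samples[1:])]

    # Longest run of consecutive no-movement intervals, in seconds.
    max_off, run = 0.0, 0.0
    for dx, dy in zip(dxs, dys):
        run = run + dt if dx == 0 and dy == 0 else 0.0
        max_off = max(max_off, run)

    n = len(dxs)
    return max_off, sum(dxs) / n, sum(dys) / n
```

The relative versions would then subtract the corresponding per-problem averages over other students, analogous to the relative time feature.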

The approach that uses time, performance, and mouse tracking features will be referred to as ${\rm TimePerfMouseT}\_{\rm Mod}$ .

### 4.2 Personalization

This section describes the approach of personalizing all the models explained in the previous section.

Utilizing absolute versions of features along with relative versions (which assess the value of a particular feature with respect to other students) has been shown by Baker to be effective for the off-task detection task [2]. Although incorporating a student's information relative to other students is an intuitive way to improve the accuracy of off-task behavior detection, there is another important issue to consider: a student's current performance with respect to her/his own past performance. Different students exhibit different types of behaviors (e.g., using more or less time to solve problems, having difficulties with particular types of questions and/or problems, different mouse usage styles, etc.). Therefore, introducing personalized versions of each feature makes off-task detection models more flexible and adaptive to different student types (i.e., makes them personalized).

In addition to the absolute and relative versions of each model's features described in the previous section, the personalized approach also considers a personal version of each feature, defined as the absolute value of the feature minus the average value of that feature on the same problem by the same student so far. That is, data from a student's past trials on a particular problem are used to generate the personal version of each feature when predicting his/her behavior on the current trial of the same problem. This approach becomes problematic when there are few or no past trials, in which case personalized features will not be a good representation of a student's past behavior trend. However, the statistics in Table 3 show that students usually repeat problems, so personalization is practical in general.

Table 3. Details about the Mean Average Number of Relative and Personal Data per Student

In this work, we use a weighted combination of the relative and personal versions of each feature such that if very little personal data are available, the relative version dominates the combined value, and if enough personal data are available, the personal version dominates. Specifically, the weighted combination is as follows:

$RelPersComb_{p_i} = \left( \frac{Num\_Rel\_Data_p/C}{(Num\_Rel\_Data_p/C) + Num\_Pers\_Data_p} \right) \ast Rel_{p_i} + \left( \frac{Num\_Pers\_Data_p}{(Num\_Rel\_Data_p/C) + Num\_Pers\_Data_p} \right) \ast Pers_{p_i},$

(6)

where $RelPersComb_{p_i}$ is the weighted (linear) combination of the relative version ( $Rel_{p_i}$ ) and the personal version ( $Pers_{p_i}$ ) of the ith feature of the pth problem. $Num\_Rel\_Data_p$ is the number of training data instances available for the current (pth) problem (i.e., the number of relative data instances available from the training data), and $Num\_Pers\_Data_p$ is the number of personal data instances available for the pth problem (i.e., data from the student's past trials on that problem). $C$ is a constant that is set to 20. Statistics on the mean average values of $Num\_Rel\_Data$ and $Num\_Pers\_Data$ are shown in Table 3.
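Eq. (6) is a convex combination whose weights depend on the data counts; a direct transcription (variable names are ours, $C = 20$ as in the text):

```python
def rel_pers_comb(rel, pers, num_rel, num_pers, C=20):
    """Weighted combination of relative and personal feature versions, Eq. (6).

    With little personal data the relative version dominates; as num_pers
    grows, the personal version takes over.
    """
    denom = (num_rel / C) + num_pers
    w_rel = (num_rel / C) / denom
    w_pers = num_pers / denom
    return w_rel * rel + w_pers * pers
```

Note that the two weights always sum to 1, so the combined value stays on the same scale as the two feature versions it blends.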

The personalized version of each modeling approach uses this combined relative-personal version of each feature along with the absolute version of each feature.

## Experimental Methodology: Evaluation Method

To evaluate the effectiveness of off-task behavior detection, we use the common $F_1$ measure, the harmonic mean of precision and recall [1], [24]. Precision (p) is the number of correct off-task categorizations made by a model divided by the total number of off-task categorizations made by that model. Recall (r) is the number of correct off-task categorizations made by a model divided by the total number of actual off-task behaviors. A high $F_1$ value indicates both high precision and high recall.
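As a concrete reading of these definitions, a minimal sketch for binary off-task labels (the variable names and zero-division conventions are ours):

```python
def f1_score(predicted, actual):
    """F1 for off-task detection: harmonic mean of precision and recall.

    predicted, actual : equal-length lists of 0/1 labels (1 = off-task).
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```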

## Experiment Results

This section presents the experimental results of the models that are presented in Sections 3 and 4. All the models were evaluated on the data set described in Section 2.

An extensive set of experiments is conducted to address the following questions:

1. How effective are the following three models compared to each other: 1) ${\rm TimeOnly}\_{\rm Mod}$ model that utilizes time features, 2) ${\rm TimePerf}\_{\rm Mod}$ model that utilizes time and performance features, and 3)  ${\rm TimePerfMouseT}\_{\rm Mod}$ model that utilizes time, performance, and mouse tracking features?
2. How effective is the approach of utilizing the Ridge Regression technique to estimate the model parameters?
3. How effective is the approach of utilizing personalization?

### 6.1 The Performance of Several Modeling Approaches

The first set of experiments was conducted to measure the effect of including the performance features in the ${\rm TimeOnly}\_{\rm Mod}$ model as well as including the mouse tracking data in the ${\rm TimePerf}\_{\rm Mod}$ model. The details about this approach are given in Section 4.1.

More specifically, the ${\rm TimePerf}\_{\rm Mod}$ model is compared with the ${\rm TimeOnly}\_{\rm Mod}$ model on the off-task behavior detection task. The performance of the ${\rm TimePerf}\_{\rm Mod}$ model is shown in comparison to ${\rm TimeOnly}\_{\rm Mod}$ in Table 4 for the nonpersonalized versions of these models and in Table 5 for the personalized versions. Both tables show that the ${\rm TimePerf}\_{\rm Mod}$ model outperforms the ${\rm TimeOnly}\_{\rm Mod}$ model in both the personalized and nonpersonalized versions. The lesson from this set of experiments is that performance-related features are very helpful when combined with time features for off-task behavior detection, which explicitly demonstrates the power of incorporating performance-related features into time-only modeling.

Table 4. Results of the Nonpersonalized Version of ${\rm TimePerfMouseT}\_{\rm Mod}$ Model in Comparison to Nonpersonalized Versions of ${\rm TimeOnly}\_{\rm Mod}$ and ${\rm TimePerf}\_{\rm Mod}$ Models

Table 5. Results of the Personalized Version of ${\rm TimePerfMouseT}\_{\rm Mod}$ Model in Comparison to Personalized Versions of ${\rm TimeOnly}\_{\rm Mod}$ and ${\rm TimePerf}\_{\rm Mod}$ Models

In the same way, the ${\rm TimePerfMouseT}\_{\rm Mod}$ model is compared to the ${\rm TimeOnly}\_{\rm Mod}$ and ${\rm TimePerf}\_{\rm Mod}$ models. Its performance is shown in comparison to these models in Table 4 for the nonpersonalized versions and in Table 5 for the personalized versions. Both tables show that the ${\rm TimePerfMouseT}\_{\rm Mod}$ model substantially outperforms both the ${\rm TimePerf}\_{\rm Mod}$ and ${\rm TimeOnly}\_{\rm Mod}$ models in both the nonpersonalized and personalized versions. Paired t-tests were applied for this set of experiments, and statistical significance (p < 0.05) was achieved in favor of using mouse movements (in different configurations). These experiments show that mouse movement features are very helpful when combined with time and performance features for off-task behavior detection, explicitly demonstrating the power of incorporating mouse tracking features into time-and-performance-based modeling.

### 6.2 The Performance of Utilizing the Ridge Regression Technique

The second set of experiments was conducted to measure the effect of utilizing the technique of Ridge Regression for learning the model parameters for each of the models. The details about this approach are given in Section 3.

More specifically, Ridge Regression learned models are compared to Least-Squares learned models for both the nonpersonalized and personalized versions. The performance of the Ridge Regression learned version of each model is shown in comparison to the Least-Squares learned version in Table 4 for the nonpersonalized versions and in Table 5 for the personalized versions. It can be seen that the Ridge Regression learned version of each model, with its regularization framework, outperforms the Least-Squares learned version for both nonpersonalized and personalized models. Paired t-tests were applied for this set of experiments, and statistical significance (p < 0.05) was achieved in favor of Ridge Regression over Least-Squares regression with mouse movements.

### 6.3 The Performance of Utilizing the Approach of Personalization

The last set of experiments was conducted to measure the effect of utilizing the approach of personalization to better capture different behavior types of different students. The details about this approach are given in Section 4.2.

More specifically, the personalized version of each model is compared to its corresponding nonpersonalized version. The performance of the personalized version of each model is shown in comparison to its nonpersonalized version in Table 6 for the Least-Squares learned versions and in Table 7 for the Ridge Regression learned versions. It can be seen that for both the Least-Squares and Ridge Regression learned versions of each model, the personalized version outperforms the nonpersonalized version in most cases, owing to its ability to better capture the different usage styles of different students. Paired t-tests were applied for this set of experiments. Although the results were not statistically significant (p < 0.05), the personalized approaches consistently outperform their nonpersonalized counterparts across different configurations, which demonstrates the robustness and effectiveness of personalization. This explicitly demonstrates the power of personalized modeling for off-task behavior detection in intelligent tutoring systems.

Table 6. Results of the Least-Squares Version of the Personalized Version of All Models in Comparison to Nonpersonalized Versions of All Models

Table 7. Results of the Ridge Regression Version of the Personalized and Nonpersonalized Versions of All Models in Comparison to the Least-Squares Version of the Nonpersonalized Version of All Models

## Conclusion and Future Work

This paper proposes a novel machine learning model to identify students' off-task behaviors (behaviors that involve neither the system nor a learning task) while they use an intelligent tutoring system. Only the data available from the log files of students' actions within the software are used to construct the model; therefore, the model does not need sophisticated instrumentation (e.g., microphones, gaze trackers, etc.) that is unavailable in most school computer labs. The proposed model makes use of several types of evidence, including time, performance, and mouse movement features, and is compared to 1) a model that utilizes only time features and 2) a model that uses time and performance features together. Because different students exhibit different types of behaviors, personalized versions of each model are constructed and compared to their corresponding nonpersonalized versions. To address the data sparseness problem, the proposed model utilizes a robust Ridge Regression technique to estimate model parameters.

An extensive set of empirical results shows that the proposed off-task behavior detection model substantially outperforms both the model that uses only time features and the model that utilizes time and performance features together. The experimental results also show that the personalized version of each model outperforms the corresponding nonpersonalized version, indicating that personalization helps to improve the effectiveness of off-task detection. Furthermore, the empirical results show that the proposed models attain better performance by utilizing Ridge Regression rather than the standard Least-Squares technique.

There are several possibilities for extending this research. For example, features that explicitly model the difficulty levels of the available problems could help the system better identify students' behavior. Future research will be conducted mainly in this direction.

## Acknowledgments

This research was partially supported by US National Science Foundation grant nos. IIS-0749462, IIS-0746830, and DRL-0822296. Any opinions, findings, conclusions, or recommendations expressed in this paper are the authors' and do not necessarily reflect those of the sponsor.

## References

• 1. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, pp. 75-82. Addison Wesley, 1999.
• 2. R.S. Baker, “Modeling and Understanding Students' Off-Task Behavior in Intelligent Tutoring Systems,” Proc. SIGCHI Conf. Human Factors in Computing Systems, pp. 1059-1068, 2007.
• 3. R.S. Baker, A.T. Corbett, K.R. Koedinger, and A.Z. Wagner, “Off-Task Behavior in the Cognitive Tutor Classroom: When Students ‘Game the System,’” Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI '04), pp. 383-390, 2004.
• 4. R.S. Baker, I. Roll, A.T. Corbett, and K.R. Koedinger, “Do Performance Goals Lead Students to Game the System?,” Proc. 12th Int'l Conf. Artificial Intelligence and Education (AIED '05), pp. 57-64, 2005.
• 5. C. Beal, B.P. Woolf, J. Beck, I. Arroyo, K. Schultz, and D.M. Hart, “Gaining Confidence in Mathematics: Instructional Technology for Girls,” Proc. Int'l Conf. Math./Science Education and Technology, 2000.
• 6. J.E. Beck and B.P. Woolf, “High Level Student Modeling with Machine Learning,” Proc. Intelligent Tutoring Systems Conf., pp. 584-593, 2000.
• 7. J. Beck, “Engagement Tracing: Using Response Times to Model Student Disengagement,” Proc. 12th Int'l Conf. Artificial Intelligence in Education (AIED '05), pp. 88-95, 2005.
• 8. C.M. Bishop, Pattern Recognition and Machine Learning, pp. 6-10, 144-145. Springer, 2006.
• 9. S. Cetintas, L. Si, Y.P. Xin, and C. Hord, “Predicting Correctness of Problem Solving from Low-Level Log Data in Intelligent Tutoring Systems,” Proc. Second Int'l Conf. Educational Data Mining (EDM '09), pp. 230-239, 2009.
• 10. S. Cetintas, L. Si, Y.P. Xin, C. Hord, and D. Zhang, “Learning to Identify Students' Off-Task Behavior in Intelligent Tutoring Systems,” Proc. 14th Int'l Conf. Artificial Intelligence and Education (AIED '09), pp. 701-703, 2009.
• 11. T. Dalton, R.C. Martella, and N.E. Marchand-Martella, “The Effects of a Self-Management Program in Reducing Off-Task Behavior,” J. Behavioral Education, vol. 9, nos. 3/4, pp. 157-176, 1999.
• 12. F. Davis, La Comunicación No Verbal, vol. 616, El Libro de Bolsillo, Alianza ed., translated by L. Mourglier from Inside Intuition— What We Knew about Non-Verbal Communication, McGraw-Hill Book Company, 1976.
• 13. A. De Vicente and H. Pain, “Informing the Detection of the Students' Motivational State: An Empirical Study,” Proc. Intelligent Tutoring Systems, pp. 933-943, 2002.
• 14. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, pp. 61-68. Springer, 2001.
• 15. N. Karweit and R.E. Slavin, “Time-on-Task: Issues of Timing, Sampling, and Definition,” J. Experimental Psychology, vol. 74, no. 6, pp. 844-851, 1982.
• 16. K.R. Koedinger and J.R. Anderson, “Intelligent Tutoring Goes to School in the Big City,” Int'l J. Artificial Intelligence in Education, vol. 8, pp. 30-43, 1997.
• 17. H.M. Lahaderne, “Attitudinal and Intellectual Correlates of Attention: A Study of Four Sixth-Grade Classrooms,” J. Educational Psychology, vol. 59, no. 5, pp. 320-324, 1968.
• 18. S.W. Lee, K.E. Kelly, and J.E. Nyre, “Preliminary Report on the Relation of Students' On-Task Behavior with Completion of School Work,” Psychological Reports, vol. 84, pp. 267-272, 1999.
• 19. D.J. Litman and K. Forbes-Riley, “Predicting Student Emotions in Computer-Human Tutoring Dialogues,” Proc. 42nd Ann. Meeting of the Assoc. for Computational Linguistics, 2004.
• 20. E.M. Maletsky, et al., Harcourt Math, Indiana ed. Harcourt, 2004.
• 21. C. Merten, and C. Conati, “Eye-Tracking to Model and Adapt to User Meta-Cognition in Intelligent Learning Environments,” Proc. 11th Int'l Conf. Intelligent User Interfaces, pp. 39-46, 2006.
• 22. R.C. Murray, and K. VanLehn, “Effects of Dissuading Unnecessary Help Requests While Providing Proactive Help,” Proc. 12th Int'l Conf. Artificial Intelligence in Education (AIED '05), pp. 887-889, 2005.
• 23. N. Person, B. Klettke, K. Link, and R. Kreuz, “The Integration of Affective Responses into Autotutor,” Proc. Int'l Workshop Affect in Interactions. Towards a New Generation of Interfaces, 1999.
• 24. C.J. van Rijsbergen, Information Retrieval, second ed. Univ. of Glasgow, 1979.
• 25. J.W. Schofield, Computers and Classroom Culture. Cambridge Univ. Press, 1995.
• 26. L. Suchman, Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge Univ. Press, 1987.
• 27. J.A. Walonoski, and N.T. Heffernan, “Prevention of Off-Task Gaming Behavior in Intelligent Tutoring Systems,” Intelligent Tutoring Systems, pp. 722-724, Springer, 2006.
• 28. T.R. Ziemek, “Two-D or Not Two-D: Gender Implications of Visual Cognition in Electronic Games,” Proc. Symp. Interactive 3D Graphics and Games, pp. 183-190, 2006.