## Large-Scale Multiobjective Static Test Generation for Web-Based Testing with Integer Programming

[1] being developed to support web-based education. Such websites aim to bring free education to the world by providing online content, exercises, and quizzes, such as Khan Academy,^{1} or online classes, such as Coursera^{2} and Udacity.^{3} Large data sets of online materials have been created and have evolved over time. Unlike passive course archives such as MIT OpenCourseWare,^{4} the online classes are interactive and can assess learners automatically on what they have learned. The main benefit is that learners can take classes at their own pace and get immediate feedback on their proficiency, unlike traditional classes.

[2], [3]. However, there is a problem with conducting self-assessment in an online class. As there may be many students^{5} with different proficiency levels in an online class [4], it is difficult to fulfill the students' different assessment requirements using tests composed from a small question pool [5]. To overcome this problem, pedagogical practitioners have suggested composing tests from a large question pool with different question properties [6]. This in turn requires a large question data set and considerable human effort to compose the tests that assess students' proficiency.

[7]. Fig. 1 shows a typical workflow of a web-based testing environment with automatic assessment. In this environment, STG is the core component. It aims to find an optimal subset of questions from a question database to form a test paper automatically based on multiple assessment criteria such as total time, topic distribution, difficulty degree, and discrimination degree. The generated test paper can then be attempted over the web by students for assessment purposes, as in a traditional pen-and-pencil test. Finally, the students' answers are checked automatically for proficiency evaluation.

[8]. However, it is a challenging problem, especially with a large number of questions [9]. Manually browsing and composing test papers is ineffective because the number of feasible combinations of questions is exponential. In essence, STG is an optimal subset selection problem, known as a multidimensional knapsack problem (KP) [10], [11], which is NP-hard [9]. Formally, it is a 0-1 integer linear programming (ILP) problem, which optimizes multiobjective constraints. Moreover, STG should also be solved efficiently to meet online requirements. Currently, the quality of generated test papers is often unsatisfactory [12], [13], [14] with respect to users' test paper specifications.

[15], [16] was proposed to solve STG on very small question data sets. Popular up-to-date commercial optimization software packages such as CPLEX [17] and GUROBI [18] are inefficient for the 0-1 ILP of STG because of the large number of variables in its formulation [19]. Recently, many heuristic-based intelligent techniques such as tabu search (TS) [13], biologically inspired algorithms [14], [20], swarm optimization [12], [21], [22], and divide and conquer (DAC) [23] have been proposed in the research community for automatic test paper generation. Although these heuristic-based techniques are straightforward to implement, they suffer from some drawbacks. They are mainly based on traditional weighting parameters for multiobjective constraint optimization, and they tend to get stuck in a local optimum, especially in the huge search space of large-scale question data sets. As a result, these techniques generally offer no performance guarantee on either test paper quality or runtime efficiency.

[15], [16] by taking advantage of recent advances in optimization techniques. Specifically, we make the following two contributions in this paper:

We propose an effective and efficient ILP approach for STG, which generates high-quality test papers in a huge search space of large question data sets efficiently. This was not possible in the past. Our proposed BAC-STG approach is able to support web-based testing on large question data sets for online learning environments. Our performance results on various data sets have shown that the proposed BAC-STG approach has outperformed the current STG techniques in terms of paper quality and runtime efficiency.

We propose a novel framework for web-based testing with automatic assessment, in particular for mathematics testing. The proposed framework integrates the proposed BAC-STG approach for automatic test paper generation, automatic mathematics solution checking, and automatic question calibration. It is able to generate test papers automatically and provide students with immediate feedback on their performance.

[7] and computerized adaptive testing (CAT) [24]. STG generates full test papers automatically based on multiple assessment criteria, whereas CAT generates question-by-question tests in a dynamic and sequential manner according to the student's ability and item response theory (IRT). STG is basically a multiobjective combinatorial optimization problem, whereas CAT is a sequential optimization problem [25]. In this section, we focus only on reviewing related work on STG, which can be categorized into two main groups: linear programming-based integer programming and heuristic-based methods.

[15], [16], used the LANDO program to solve the 0-1 ILP of STG. It is similar to our proposed approach in its use of linear programming (LP) and branch and bound. In [26], [27], Boekkooi-Timminga attempted to combine ILP with heuristics to improve runtime performance for multiple test paper generation. Although these approaches have rigorous mathematical foundations in optimization, they can only solve STG for very small data sets of about 300-600 questions due to the limitations of the state-of-the-art optimization methods at that time. An in-depth review of LP-based IP for STG can be found in [28].

[29] used a heuristic based on the characteristics of the question item information function to optimize the objective function. Later, Luecht [30] proposed an efficient heuristic to solve STG on a data set with 3,000 questions. However, these heuristic-based methods were proposed to solve STG for small data sets and are ineffective for larger data sets.

In [9], TS was proposed to construct test papers by defining an objective function based on multicriteria constraints and weighting parameters for test paper quality. TS optimizes test paper quality by evaluating the objective function. In [13], a genetic algorithm (GA) was proposed to generate quality test papers by optimizing a fitness ranking function based on the principle of population evolution. In [14], differential evolution (DE) was proposed for test paper generation. DE is similar in spirit to GA, with some modifications to the solution representation, the fitness ranking function, and the crossover and mutation operations to improve performance. In [20], an artificial immune system was proposed that uses the clonal selection principle to deal with highly similar antibodies for elitist selection, maintaining the best test papers across generations. In [21], particle swarm optimization (PSO) was proposed to generate multiple test papers by optimizing a fitness function defined on multicriteria constraints. In [12], ant colony optimization (ACO) was proposed to generate quality test papers by optimizing an objective function that is based on the simulation of the foraging behavior of real ants. Apart from these techniques for STG, an efficient DAC approach [23] was proposed for online STG, which is based on the principle of dimensionality reduction for multiobjective constraint optimization.

[10], [11] has been extensively studied for solving various real-world problems such as the traveling salesman problem, the quadratic assignment problem, the maximum satisfiability problem (MAX-SAT), the KP, and so on. Specifically, the 0-1 ILP is a mathematical optimization program in which all of the variables are restricted to be binary:
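For reference, a 0-1 ILP is commonly written in the generic form below. The symbols $c$, $A$, $b$, and $n$ are assumed notation for this sketch and may differ from the notation of the original formulation.

```latex
\begin{aligned}
\text{maximize}\quad & c^{\top} x \\
\text{subject to}\quad & A x \le b, \\
& x \in \{0,1\}^{n}.
\end{aligned}
```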

[11], there are four main methods for solving 0-1 ILP: heuristic algorithms, the cutting planes method, branch-and-bound, and branch-and-cut (BAC). As mentioned earlier, although heuristic algorithms can be applied quite straightforwardly to many 0-1 ILP problems, they do not have any performance guarantee. The remaining three methods are global methods, which can find the exact optimal solution of 0-1 ILP problems based on LP.

[31] have remarkably improved integer programming techniques [32]. The runtime performance has been improved significantly. Currently, a large ILP of about 18,000 variables can be solved in less than 3 minutes. However, the LP-based ILP is still not efficient in runtime, especially for large-scale 0-1 ILP problems. In particular, the methods implemented in popular commercial optimization software such as CPLEX [17] and GUROBI [18] are ineffective at handling 0-1 ILPs with more than twenty thousand variables [19].

[19] is the most efficient, as it is able to solve and prove optimality for a larger set of instances than the others. BAC is a global optimization method based on the branch-and-bound method and a cutting planes method such as Gomory or Fenchel cutting planes [11]. The main idea of the *cutting planes* method is to add extra constraints to reduce the feasible region and find the integral optimal solution. For 0-1 ILP problems with the sparse matrix property, lifted cover cutting is an effective method for enhancing runtime performance. However, the BAC method suffers from several drawbacks when solving large 0-1 ILP problems. It is difficult to approximate the integral optimal solution from the fractional optimal solution of the 0-1 ILP problem. In addition, the simplex algorithm used to solve the LP relaxation is not very efficient on large ILP problems. As BAC is an exact algorithm, the size of the branch-and-bound search tree may explode combinatorially with the number of variables. Hence, BAC generally suffers from poor runtime performance on large ILP problems. Moreover, finding lifted cover cutting planes efficiently is challenging, as the problem is NP-hard [31].

[7], [28] or the average discrimination degree [12], [13]. Although they are different, the discrimination degree is easier to calibrate and is thus preferred in practice by researchers over the information function. However, this difference is not important because the STG problem can be solved either way using our proposed approach. Second, heuristic techniques are ineffective for large-scale STG, as they generally do not guarantee high-quality solutions. Third, although the current LP-based ILP approach [15], [16] has a quality guarantee for STG, popular optimization software such as CPLEX and GUROBI is unable to solve large-scale 0-1 ILP problems efficiently [33]. In this paper, we propose an efficient integer programming approach for solving the large-scale 0-1 ILP of the STG problem by exploiting the sparse matrix property.

*Question*. It is used to store the question identity.

*Content o*. It is used to store the content of a question.

*Answer a*. It is used to store the answer of a question.

*Discrimination degree*. It is used to indicate how well the question distinguishes user proficiency. It is an integer ranging from 1 to 7.

*Question time*. It is used to indicate the average time needed to answer a question. It is measured in minutes.

*Difficulty degree*. It is used to indicate how difficult the question is to answer correctly. It is an integer ranging from 1 to 10.

*Related topic*. It is used to store a set of related topics of a question.

*Question type*. It is used to indicate the type of a question. There are mainly three question types, namely fill-in-the-blank, multiple choice, and long question.

Table 1. An Example of Math Data Set

^{6} accumulatively. Moreover, it can also be constructed by gathering freely available questions from online educational websites such as Khan Academy or Question Answering (Q&A) websites such as The Art of Problem Solving Portal.^{7}

[33], it might be feasible to automatically label all the attributes of each question with little human effort in the future. Automatic text categorization techniques such as support vector machines can be used for automatic topic classification of questions [34]. However, human labeling of topics is still needed for the training questions. To calibrate the other attributes, we can use the historical correct/incorrect response information from students. This response information, as well as other important information such as question time, can be gathered automatically through the students' question answering activities [35] over a period of time. However, it is more difficult to calibrate the discrimination degree and difficulty degree attributes due to missing user responses on certain questions. To overcome this, it is possible to apply collaborative filtering to predict missing user responses and use the IRT model to calibrate the two attributes automatically [36]. Moreover, [36] also proposes an effective method to calibrate new questions, which do not have any student response information. As such, automatic labeling of question attributes for large-scale question data sets can be achieved.

*static test specification* is a tuple of five attributes, which are defined based on the attributes of the selected questions as follows:

*Number of questions.* It is an input representing the number of questions specified for the paper.

*Total time.* It is the total time specified for the paper.

*Average difficulty degree.* It specifies the average difficulty degree of all questions in the paper.

*Topic distribution.* It specifies the proportion of topics. The user can enter either the proportion or the number of questions for each topic. If the number of questions is entered, it is converted into the corresponding proportion.

*Question type distribution.* It specifies the proportion of question types. The user can enter either the proportion or the number of questions for each question type. Similarly, if the number of questions is entered, it is converted into the corresponding proportion.
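The five specification attributes, together with the count-to-proportion conversion described above, can be sketched as a small data structure. This is only an illustrative sketch: the class name `PaperSpec`, its field names, and the helper `to_proportions` are hypothetical and do not come from the original system.

```python
from dataclasses import dataclass
from typing import Dict


def to_proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category question counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one question is required")
    return {name: n / total for name, n in counts.items()}


@dataclass(frozen=True)
class PaperSpec:
    """Hypothetical container for the five specification attributes."""
    num_questions: int             # number of questions
    total_time: float              # total time, in minutes
    avg_difficulty: float          # average difficulty degree
    topic_dist: Dict[str, float]   # topic -> proportion
    type_dist: Dict[str, float]    # question type -> proportion


# The user entered counts, so they are converted to proportions first.
spec = PaperSpec(
    num_questions=4,
    total_time=30,
    avg_difficulty=5.0,
    topic_dist=to_proportions({"algebra": 3, "calculus": 1}),
    type_dist=to_proportions({"multiple choice": 2, "long question": 2}),
)
```

Storing proportions rather than counts keeps the specification independent of how the user chose to enter it.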

^{11}] as shown in Fig. 2. In Fig. 2, constraint (1) is the constraint on the number of questions, where is a binary variable associated with question , in the data set. Constraint (2) is the total time constraint. Constraint (3) is the average difficulty degree constraint. Constraint (4) is the topic distribution constraint. The relationship of a question , and a topic , is represented as such that if question relates to topic and otherwise. Constraint (5) is the question type distribution constraint. The relationship of a question , and a question type , is represented as such that if question is related to question type and if otherwise.
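For a very small question pool, the constraints described above can be checked by exhaustive enumeration. The sketch below is not the BAC-STG method. It rests on several assumptions made only for illustration: total time is treated as an upper limit, the average difficulty is accepted within a tolerance `tol`, topics are specified as counts, the question type constraint is omitted for brevity, and the tuple layout of `POOL` is hypothetical.

```python
from itertools import combinations

# Each question: (id, discrimination, time_minutes, difficulty, topic)
POOL = [
    ("q1", 6, 5, 4, "algebra"),
    ("q2", 3, 4, 6, "algebra"),
    ("q3", 7, 8, 5, "calculus"),
    ("q4", 5, 6, 7, "calculus"),
    ("q5", 4, 3, 3, "algebra"),
]


def generate_paper(pool, n, total_time, avg_difficulty, topic_counts, tol=0.5):
    """Return the feasible n-question subset with the largest total discrimination."""
    best, best_score = None, -1
    for subset in combinations(pool, n):                 # constraint (1): n questions
        if sum(q[2] for q in subset) > total_time:       # constraint (2): total time
            continue
        mean_difficulty = sum(q[3] for q in subset) / n
        if abs(mean_difficulty - avg_difficulty) > tol:  # constraint (3): difficulty
            continue
        counts = {}
        for q in subset:
            counts[q[4]] = counts.get(q[4], 0) + 1
        if counts != topic_counts:                       # constraint (4): topics
            continue
        score = sum(q[1] for q in subset)                # n is fixed, so sum ~ average
        if score > best_score:
            best, best_score = subset, score
    return best, best_score


paper, score = generate_paper(POOL, 3, 20, 5.0, {"algebra": 2, "calculus": 1})
```

Enumeration visits every subset of the requested size, which is exactly the exponential growth that makes this approach unusable on large pools and motivates an LP-based method.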

When the 0-1 ILP problem has the sparse matrix property, the proposed approach is able to approximate the binary optimal solution of the 0-1 ILP problem with the fractional optimal solution.

The proposed approach uses the primal-dual interior point (PDIP) method [37], which is the most efficient algorithm for solving the LP relaxation problem. In addition, the simplex method [11] is used to solve the LP relaxation problem efficiently in subsequent steps of the approach, when new cutting planes are added.

An effective branching strategy is proposed for reducing the size of the branch-and-bound search tree.

An efficient approach is proposed for finding effective lifted cover cutting planes.

*enumeration tree* that is constructed iteratively in a top-down manner with new nodes created by branching on an existing node in which the optimal solution of the LP relaxation is fractional. The problem at the root node of the tree is the original 0-1 ILP. When a new node is created, it contains the corresponding 0-1 ILP subproblem and is stored in the list , , of all unevaluated or leaf nodes. Let be the formulation of the feasible region of the problem at node . Let be the local upper bound at each node and be the current global lower bound of the 0-1 ILP solution.

**4.2.1 Finding Initial Fractional Optimal Solution** In this step, we find the fractional optimal solution of the original 0-1 ILP problem. This is done by relaxing the binary constraints on the variables. The 0-1 ILP formulation of the STG problem shown in Fig. 3 is transformed into a standard LP as follows:

or equivalently: , where denotes the constraint set of feasible regions of the original ILP problem.

The LP problem can then be solved by using the efficient PDIP algorithm [37]. PDIP solves the LP problem by solving the following logarithmic barrier optimization problem:

The optimal solution of PDIP is then used to construct the corresponding tableau of the simplex method [11]. This construction consists of two steps: initial tableau construction and simplex tableau construction.

In the *initial tableau construction*, the simplex algorithm works on inequalities of the form and the 0-1 ILP of the STG problem needs to satisfy the equality constraints given in (7)-(12) of the form . Thus, we replace each constraint of the form by the following two constraints: . So far, all the replaced constraints given in (7)-(12) are now in the form . By introducing new slack variables, we have the following *initial tableau*: , where is the vector of slack variables.

In the *simplex tableau construction*, we perform pivoting operations on the initial tableau such that all variables with are basic variables, whereas others are nonbasic variables. As a result, the optimal solution and its corresponding simplex tableau of the form are obtained.
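The initial tableau construction described above, in which each equality constraint is replaced by two inequalities and slack variables are then introduced, can be sketched as follows. The list-of-lists matrix layout and the function names are assumptions made for illustration, and the pivoting step of the simplex tableau construction is not shown.

```python
def equalities_to_inequalities(A, b):
    """Rewrite A x = b as the stacked system [A; -A] x <= [b; -b]."""
    A_le = [row[:] for row in A] + [[-v for v in row] for row in A]
    b_le = list(b) + [-v for v in b]
    return A_le, b_le


def add_slacks(A_le, b_le):
    """Append one slack column per row so that A_le x + s = b_le with s >= 0.

    Each returned row is [coefficients..., slack columns..., right-hand side].
    """
    m = len(A_le)
    tableau = []
    for i, row in enumerate(A_le):
        slack = [1 if j == i else 0 for j in range(m)]
        tableau.append(row + slack + [b_le[i]])
    return tableau


# Example: the single equality x1 + x2 + x3 = 2.
A_le, b_le = equalities_to_inequalities([[1, 1, 1]], [2])
tableau = add_slacks(A_le, b_le)
```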

**4.2.2 Root Node Initialization** This step first creates the root node of the enumeration tree that contains the original 0-1 ILP problem with its fractional optimal solution and simplex tableau . Next, it initializes the local upper bound at as , the global lower bound and the current best 0-1 solution . Then, the root node is stored in the list for further processing.

**4.2.3 Unevaluated Node Selection** This step selects an unevaluated node in the list for processing and solving. If there is no unevaluated node, the algorithm terminates. Otherwise, a node in the list is selected. Here, we use a greedy strategy, namely *best bound*, to choose the most promising node in the list, that is, the one with the largest local upper bound value:
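The best bound rule can be sketched with a priority queue that always returns the unevaluated node with the largest local upper bound. The class name `NodeList` and its methods are hypothetical; nodes are represented here by plain labels.

```python
import heapq
import itertools


class NodeList:
    """Priority list of unevaluated nodes, ordered by largest upper bound."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # tie-breaker keeps heap entries comparable

    def push(self, upper_bound, node):
        # heapq is a min-heap, so the bound is stored negated.
        heapq.heappush(self._heap, (-upper_bound, next(self._tie), node))

    def pop_best(self):
        """Remove and return (upper_bound, node) with the largest bound."""
        neg_bound, _, node = heapq.heappop(self._heap)
        return -neg_bound, node

    def __len__(self):
        return len(self._heap)


nodes = NodeList()
nodes.push(5.2, "left child")
nodes.push(6.1, "right child")
nodes.push(4.8, "other")
bound, node = nodes.pop_best()
```

A heap gives logarithmic-time insertion and selection, which matters when the list of unevaluated nodes grows large.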

**4.2.4 LP Relaxation** This step iteratively solves the subproblem of the selected unevaluated node based on the optimal solution and simplex tableau of its parent node's problem (except the root node). At the th iteration of processing a node , it solves the following LP problem: . If the returned result is infeasible (i.e., when ), it ignores this node and continues processing another node in the list . Otherwise, it goes to the next step on lifted cover cutting for adding cutting planes. Note that for efficiency, it adds the new constraints into the simplex tableau of its parent node and continues reoptimizing this tableau. After solving the LP relaxation at node , the fractional optimal solution and its corresponding simplex tableau are obtained.

**4.2.5 Lifted Cover Cutting** The main purpose of the lifted cover cutting is to add extra constraints, called *cutting planes*, to reduce the feasible region and approximate the binary optimal solution, which is nearest to . Based on the current fractional optimal solution and its corresponding simplex tableau, this step helps the LP relaxation to gradually approximate the binary optimal solution of the subproblem more closely. To achieve this, it adds some *lifted cover inequalities* to the current formulation such that a new formulation is formed. This new formulation then goes back to the LP Relaxation step for optimization. For efficiency, at most three cuts are added at each iteration according to an empirical study in [31]. The process repeats until no more cutting planes are found. The lifted cover cutting is discussed in Section 4.3.

**4.2.6 Pruning and Bounding** After processing a node , this step considers whether the node should be pruned. To determine this, it checks the obtained local upper bound of the LP relaxation at node (after the th iteration) and the global lower bound of the 0-1 ILP solution of the original 0-1 ILP:

If and the current fractional optimal solution is a binary solution, it updates the new global lower bound and the current best 0-1 ILP solution . Then, it prunes this node and all unevaluated nodes in the list whose upper bound is less than the new global lower bound , and processes another node in . If and is fractional, it goes to the branching step.
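The pruning and bounding rules can be sketched as a single decision function. This is a minimal sketch under assumptions: the function and parameter names are hypothetical, the list of unevaluated nodes is a plain list of `(upper_bound, node)` pairs, and a node whose upper bound does not exceed the global lower bound is pruned.

```python
def is_binary(x, eps=1e-9):
    """True when every component is within eps of 0 or 1."""
    return all(min(abs(v), abs(v - 1.0)) <= eps for v in x)


def process_bound(upper_bound, x, lower_bound, best, open_nodes):
    """Apply the pruning rules to one solved node.

    Returns (action, lower_bound, best, open_nodes), where action is
    "prune" or "branch".
    """
    if upper_bound <= lower_bound:
        # The node cannot improve on the incumbent.
        return "prune", lower_bound, best, open_nodes
    if is_binary(x):
        # New incumbent: raise the global lower bound and drop weaker nodes.
        lower_bound, best = upper_bound, [round(v) for v in x]
        open_nodes = [(ub, n) for ub, n in open_nodes if ub >= lower_bound]
        return "prune", lower_bound, best, open_nodes
    # Promising but fractional: hand over to the branching step.
    return "branch", lower_bound, best, open_nodes
```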

**4.2.7 Branching** If the solution of the LP relaxation in node is fractional, the branching step creates two child nodes of . First, it chooses the fractional variable in the current fractional optimal solution of the LP relaxation and performs the branching. We use a common choice, namely *most fractional variable* [11], to select the variable : , where . Then, the two child nodes are placed into the list for further processing:

The size of the search tree may grow exponentially if branching is not controlled properly. To effectively reduce the size of the tree, we use a heuristic based on the number of specified questions in the generated test paper. Consider a path from the root to a given unevaluated node of the tree. If the number of branching variables with value 1 along the path is larger than or equal to the number of questions specified for the paper, we stop branching at that node. The reason is that the generated test paper needs only that many questions.
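The branching step and the stopping heuristic can be sketched as follows. The sketch assumes the usual reading of "most fractional", namely the variable whose LP value is closest to 0.5. The function names and the dictionary `fixed`, which maps a variable index to the value it was fixed to along the path from the root, are hypothetical.

```python
def most_fractional(x, eps=1e-9):
    """Index of the variable whose value is closest to 0.5, or None if x is binary."""
    best_j, best_dist = None, None
    for j, v in enumerate(x):
        if min(v, 1.0 - v) <= eps:  # already (numerically) 0 or 1
            continue
        dist = abs(v - 0.5)
        if best_dist is None or dist < best_dist:
            best_j, best_dist = j, dist
    return best_j


def may_branch(fixed, num_questions):
    """Stop branching once the path already fixes num_questions variables to 1."""
    return sum(1 for value in fixed.values() if value == 1) < num_questions


def branch(fixed, j):
    """Create the two child subproblems with x_j = 0 and x_j = 1."""
    return {**fixed, j: 0}, {**fixed, j: 1}
```

For example, with the LP solution `[1.0, 0.3, 0.55, 0.0]` the rule selects index 2, and a path that has already fixed two variables to 1 is not branched further when two questions are requested.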

**4.2.8 Termination** This step returns the current best solution of the 0-1 ILP.

**4.2.9 An Example** Suppose that we need to generate a test paper from the Math data set given in Table 1 based on the specification . We associate each question with a binary variable , . However, we eliminate inappropriate variables , and because they cannot satisfy the specification . Here, we formulate the problem as a 0-1 fractional ILP with five binary variables, which is shown in Fig. 5a. The 0-1 fractional ILP problem is then transformed into a standard 0-1 ILP, which is shown in Fig. 5b.

Fig. 6 shows an example on the construction of the enumeration tree of subproblems during the BAC-STG process. Initially, the LP relaxation of this problem is solved by the PDIP algorithm to obtain the fractional optimal solution . Next, the fractional optimal solution is updated when new lifted cover cutting planes are added to obtain the new fractional optimal solution . Then, the local upper bound value at is set as ; the global lower bound is set as ; and the current best 0-1 solution is set as .

After that, the root node is branched on the variable to create two child nodes and during branching. Then, the unevaluated node is selected for processing, as it has the largest local upper bound. After processing and branching at , two child nodes and are created in which is a feasible binary solution with its objective value . Then, the global lower bound and best solution are updated as and , respectively. The node is then pruned because its local upper bound is less than the current global lower bound. Subsequently, branching at node will create and . Similar to , will then be pruned. Branching at node will create and , which are then pruned similarly to .

Finally, the best solution is obtained. It corresponds to the test paper for the specification . The generated test paper has the average discrimination degree and specification , .

**Definition 1 (Dominance).** *If and are two valid inequalities for , dominates if .*

**Definition 2 (Cover).** *Let be a set such that , then is a cover. A cover is minimal if is not a cover for any .*

*cover* definition:

**Proposition 1 (Cover inequality).** *Let be a cover for , the cover inequality is valid for , where is the cardinality of .*

**Proposition 2 (Extended cover inequality).** *Let be a cover for , the extended cover inequality: is valid for , where .*

**Definition 3 (Lifted cover inequality (LCI)).** *LCI is an extended cover inequality which is not dominated by any other extended cover inequalities.*
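The notions of cover, minimal cover, and extended cover can be made concrete for a single knapsack constraint. The sketch below follows the standard textbook definitions rather than the exact notation of this section: for a constraint with coefficient list `a` and right-hand side `b`, an index set `C` is a cover when its coefficients sum to more than `b`. All names are assumptions made for illustration.

```python
def is_cover(a, b, C):
    """C is a cover of sum(a[j] * x[j]) <= b when its coefficients exceed b."""
    return sum(a[j] for j in C) > b


def is_minimal_cover(a, b, C):
    """A cover is minimal if dropping any single element breaks the cover."""
    return is_cover(a, b, C) and all(not is_cover(a, b, C - {j}) for j in C)


def extended_cover(a, C):
    """Add every index whose coefficient is at least the largest one in C."""
    a_max = max(a[j] for j in C)
    return set(C) | {j for j in range(len(a)) if a[j] >= a_max}


def satisfies_cover_inequality(x, C):
    """Check that the sum of x over C is at most |C| - 1 for a 0-1 vector x."""
    return sum(x[j] for j in C) <= len(C) - 1


# Example constraint: 11x0 + 6x1 + 6x2 + 5x3 + 5x4 + 4x5 + x6 <= 19
a, b = [11, 6, 6, 5, 5, 4, 1], 19
C = {2, 3, 4, 5}
E = extended_cover(a, C)
```

In this example `C` is a minimal cover, and the extended cover `E` yields an inequality with the same right-hand side over a larger set of variables, so it dominates the plain cover inequality.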

^{32}]. In this research, we propose to generate LCIs efficiently as follows:

*minimal cover inequality* based on the most significant basic variables in the fractional optimal solution of the LP relaxation, where is the number of questions given in the test paper specification. Specifically, consider a row of the form in the tableau. The basic variables of the fractional optimal solution are then sorted in nonincreasing order. Let be a list of the first largest coefficients, and be the list of the remaining entries of the sorted list. If is the minimal number such that the sum of the coefficients w.r.t. the first basic variables of the list exceeds , i.e., , then the set is a *minimal cover*.
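The idea of sorting coefficients in nonincreasing order and taking the shortest prefix whose sum exceeds the right-hand side can be sketched as below. This is a simplified reading under assumptions: it works on one row with nonnegative coefficients `a` and right-hand side `b`, and it ignores the restriction to basic variables of the LP solution. The function name is hypothetical.

```python
def greedy_minimal_cover(a, b):
    """Take coefficients in nonincreasing order until their sum exceeds b.

    Returns the list of chosen indices, or None when no cover exists.
    """
    order = sorted(range(len(a)), key=lambda j: a[j], reverse=True)
    chosen, total = [], 0
    for j in order:
        chosen.append(j)
        total += a[j]
        if total > b:
            # Shortest prefix exceeding b: removing any element, each of which
            # is at least as large as the last one added, drops the sum to b
            # or below, so the cover is minimal.
            return chosen
    return None
```

For the row `[11, 6, 6, 5, 5, 4, 1]` with right-hand side 19, the sketch returns indices `[0, 1, 2]`, whose coefficients sum to 23.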

*extended cover inequalities*, we need to calculate the largest lifting value , for each variable . This can be done by using an incremental algorithm to calculate , then until in a step-by-step manner. Specifically, the algorithm starts by calculating ; the result of is then used to calculate , and so on. To obtain , , we need to solve the following 0-1 knapsack problem (0-1 KP):

*LCI*when .

[31] solved the 0-1 KP by using a dynamic programming algorithm that has a high computational complexity of . The experimental results have shown that its runtime performance is poor when handling test cases with a few thousand variables. This is not acceptable in STG, in which the 0-1 ILP may have tens of thousands of variables and must be solved efficiently. In this research, we apply an approximation algorithm from Martello and Toth [38] to efficiently solve the 0-1 KP, which only requires .
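The trade-off between an exact dynamic programming solution and a fast approximation of the 0-1 KP can be illustrated generically. In the sketch below, `knapsack_dp` is the textbook exact algorithm for integer weights, and `knapsack_greedy` is a generic ratio-based heuristic used only as a stand-in. It is not claimed to be the Martello and Toth algorithm applied in this work, and both functions assume positive weights.

```python
def knapsack_dp(values, weights, capacity):
    """Exact 0-1 knapsack value by dynamic programming over integer capacities."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):  # descending: each item used once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


def knapsack_greedy(values, weights, capacity):
    """Greedy lower bound: fill by value/weight ratio, then compare with the
    best single item that fits."""
    order = sorted(range(len(values)),
                   key=lambda j: values[j] / weights[j], reverse=True)
    total, room = 0, capacity
    for j in order:
        if weights[j] <= room:
            total += values[j]
            room -= weights[j]
    single = max((v for v, w in zip(values, weights) if w <= capacity), default=0)
    return max(total, single)
```

The exact version needs time proportional to the number of items times the capacity, whereas the greedy version is dominated by a single sort, at the cost of returning only a lower bound on the optimal value.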

[17]. From the CPLEX package, we use the PDIP and simplex methods. The performance of BAC-STG is measured and compared with other techniques including GA [39], PSO [21], DE [14], ACO [12], TS [9], DAC [23], and the conventional BAC method [31]. These techniques were reimplemented based on the published articles. We compare the BAC-STG approach with the conventional BAC technique, which has been shown to be more effective for large-scale 0-1 ILP with the sparse matrix property than the commercial software [31].

*Quality satisfaction*. The algorithm will terminate if a high-quality test paper is generated.

*Maximum number of evaluated nodes in which no better solution is found*. This parameter is experimentally set to 300 nodes for BAC and BAC-STG. Similarly, for other heuristic techniques, this parameter is set to the maximum number of iterations in which no better solution is found. It is generally set to 200 iterations.

**Definition 4 (Mean discrimination degree).** *Let be the generated test papers on a question data set w.r.t. different test paper specifications , . The mean discrimination degree is defined as*

*where is the average discrimination degree of .*

^{40}], which is used to measure the statistical differences of the topic and question type distributions between and ; and is a constant used to scale the value between 0 and 100.

**Definition 5 (Mean CV).** *The mean CV of generated test papers on a question data set w.r.t. test paper specifications , , is defined as*

*Data sets.* As there is no benchmark data set available, we generate four large-sized synthetic data sets, namely , and , for performance evaluation. Specifically, these four data sets , and contain 20,000, 30,000, 40,000, and 50,000 questions, respectively. There are mainly three question types in each data set, namely fill-in-the-blank, multiple choice, and long question. In the two data sets, and , the value of each attribute is generated according to a uniform distribution. However, in the other two data sets, and , the value of each attribute is generated according to a normal distribution. Our purpose is to measure the effectiveness and efficiency of the test paper generation process of each algorithm for both balanced data sets and , and imbalanced data sets and . Intuitively, it is more difficult to generate good quality test papers for the data sets and than the data sets and . Table 2 summarizes the four data sets.

*Experimental procedures.* To evaluate the performance of the BAC-STG approach, we have designed 12 test specifications in the experiments. We vary the parameters in order to have different test criteria in the test specifications. The number of topics is specified between 2 and 40. The total time is set between 20 and 240 minutes, and it is set proportional to the number of selected topics for each specification. The average difficulty degree is specified randomly between 3 and 9. We perform the experiments according to the 12 test specifications for each of the following eight algorithms: GA, PSO, DE, ACO, TS, BAC, DAC, and BAC-STG. We measure the runtime and quality of the generated test papers for each experiment.

*Performance on quality.* Fig. 7 shows the quality performance results of the eight techniques based on the mean discrimination degree and mean CV . As can be seen from Fig. 7a, BAC-STG has consistently achieved a higher mean discrimination degree than the conventional BAC and other heuristic techniques for the generated test papers. In particular, BAC-STG can generate test papers with quality close to the maximal achievable value of . In addition, we also observe that BAC-STG has consistently outperformed the other techniques on mean CV across the four data sets. The average CVs of BAC-STG tend to decrease, whereas the average CVs of the other techniques increase quite fast when the data set size or the number of specified constraints gets larger. In particular, BAC-STG can generate high-quality test papers with for all data sets. Also, BAC-STG is able to generate higher quality test papers on larger data sets, while the quality of the test papers generated by the other techniques generally degrades as the data set size gets larger.

*Performance on runtime.* Fig. 8 compares the runtime performance of the eight techniques based on the four data sets. Here, the 12 specifications are sorted in increasing order of the number of topics in the constraints. The results clearly show that the proposed BAC-STG approach outperforms the conventional BAC and other heuristic techniques, except DAC, in runtime for the different data sets. BAC-STG generally requires less than 2 minutes to complete the paper generation process. Moreover, the proposed BAC-STG approach is quite scalable in runtime on different data set sizes and distributions. In contrast, the other techniques (except DAC) are not efficient at generating high-quality test papers. In particular, the runtime performance of these techniques degrades quite badly as the data set size or the number of specified constraints gets larger, especially for imbalanced data sets and .

*Discussion.* The good performance of BAC-STG is due to three main reasons. First, as BAC-STG is based on LP relaxation, it can maximize the average discrimination degree effectively and efficiently while satisfying the multiple constraints without using weighting parameters. As such, BAC-STG achieves better paper quality and runtime efficiency than the other heuristic-based techniques. Second, BAC-STG has a more effective branching strategy than the conventional BAC, so it prunes the search space more effectively. This reduces runtime and focuses the search on promising unvisited subproblem nodes of the search tree, which improves paper quality. Third, BAC-STG uses an efficient algorithm to generate the *lifted cover inequalities*. Thus, BAC-STG is also more computationally efficient on large-scale data sets than the conventional BAC technique. Moreover, as larger data sets contain more questions with different attribute values and LP relaxation is effective for global optimization, BAC-STG is able to generate higher-quality test papers. Therefore, the BAC-STG approach is effective for STG in terms of both paper quality and runtime efficiency.
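To make the third point more concrete, the following is a minimal sketch of how a plain (unlifted) cover inequality can cut off a fractional LP solution for a single knapsack-type constraint, such as a limit on total test time. It is not the authors' algorithm: the greedy separation heuristic, the function name `greedy_cover`, and the example times are assumptions chosen for illustration, and the lifting step is omitted entirely.

```python
from typing import List, Optional, Sequence


def greedy_cover(weights: Sequence[float], capacity: float,
                 x_frac: Sequence[float], tol: float = 1e-9) -> Optional[List[int]]:
    """Greedy search for a cover inequality violated by the LP solution x_frac.

    Knapsack constraint: sum(weights[i] * x[i]) <= capacity, with x[i] in {0, 1}.
    A cover C is an index set whose total weight exceeds the capacity, so
    sum(x[i] for i in C) <= len(C) - 1 holds for every feasible 0-1 solution.
    Returns the cover's indices if x_frac violates that inequality, else None.
    """
    # Heuristic ordering (an assumption of this sketch): prefer items whose
    # LP value is close to 1 and whose weight is large.
    order = sorted(range(len(weights)),
                   key=lambda i: (1.0 - x_frac[i]) / weights[i])
    cover: List[int] = []
    total = 0.0
    for i in order:
        cover.append(i)
        total += weights[i]
        if total > capacity:
            break
    else:
        return None  # all items fit together, so no cover exists

    lhs = sum(x_frac[i] for i in cover)
    if lhs > len(cover) - 1 + tol:
        return sorted(cover)
    return None


if __name__ == "__main__":
    # Hypothetical question times (minutes) and a 90-minute limit.
    times = [30, 40, 50, 20]
    lp_solution = [1.0, 1.0, 0.4, 0.0]  # fractional, but satisfies the time limit
    print(greedy_cover(times, 90, lp_solution))
```

In this example the returned index set corresponds to the cut x0 + x1 + x2 <= 2, which the fractional point violates (its left-hand side is 2.4) even though it satisfies the original time constraint. Adding such cuts tightens the LP relaxation before branching; lifting would strengthen them further by assigning coefficients to items outside the cover.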

*Comparison between BAC-STG and DAC.* DAC [23] is an efficient STG approach for online test paper generation. Table 3 gives the performance comparison between BAC-STG and DAC. The results are averaged over the 12 test specifications for each data set. As shown in Fig. 7, the quality performance of BAC-STG is consistently better than that of DAC on the four data sets. However, the relative runtime performance of BAC-STG and DAC depends on the user specifications and data set distributions, as shown in Table 3 and Fig. 8.

Table 3. Performance Comparison between BAC-STG and DAC

*Data sets.* To gain further insight into the quality of the papers generated by the proposed BAC-STG approach, we have conducted a user evaluation. Here, we use two math data sets, namely A-Math and U-Math, which are constructed from G.C.E. A-Level Math and Undergraduate Math, respectively. For experimental purposes, the question type and topic attributes of each question are calibrated automatically, whereas the difficulty degree attribute is calibrated by 10 tutors of the first-year undergraduate mathematics subject CZ1800 Engineering Mathematics in the School of Computer Engineering, Nanyang Technological University. These tutors have good knowledge of the math contents contained in the two data sets. In the tutor calibration, we adopt a rating method [41] in which the tutors rate the difficulty degree of each question from 1 to 7, corresponding to the following seven discretized difficulty levels of IRT: extremely easy, very easy, easy, medium, hard, very hard, and extremely hard. In addition, according to the IRT Normal Ogive model [24], an initial value for the discrimination degree of a question can be computed from its relation formula with a fixed user proficiency value and the difficulty degree. Table 5 summarizes the two data sets.

*Experimental procedures.* In the experiments, we have designed 12 new specifications for the A-Math data set and 12 new specifications for the U-Math data set with different parameter values. We then generate test papers from these specifications using the eight techniques, giving a total of 96 papers for each data set. The same 10 tutors participated in the user evaluation. The tutors are asked to evaluate the quality of each test paper by comparing its attributes with its original specification. They compare each attribute of the generated test with the corresponding attribute in the specification using their own intuition, without knowing our defined measures. Based on the similarities, they are asked to evaluate the overall quality of the generated test and classify it into one of three categories: high, medium, or low. As a result, a total of 960 test paper evaluations are conducted for each data set.

*Expert calibration on quality results.* Fig. 9 shows the user evaluation results on the quality of the test papers generated by the eight techniques. As can be seen, BAC-STG achieves better quality than the other techniques. For the A-Math data set, out of a total of 120 paper evaluations, BAC-STG achieves 81 percent (97 papers), 12 percent (14 papers), and 7 percent (9 papers) for high-, medium-, and low-quality generated papers, respectively. Similarly, for the U-Math data set, BAC-STG achieves 91 percent (109 papers), 6 percent (7 papers), and 3 percent (4 papers) for high-, medium-, and low-quality generated papers, respectively. On average, BAC-STG achieves 86, 9, and 5 percent for high-, medium-, and low-quality generated papers, respectively. It can also be observed that the proposed BAC-STG approach generates higher-quality papers with a larger data set: it performs better on the U-Math data set than on the A-Math data set.
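The reported percentages can be checked directly from the paper counts. The short sketch below does so; the names `A_MATH`, `U_MATH`, and `quality_shares` are introduced here for illustration, and the reading of the 120 evaluations as 12 specifications times 10 tutors is inferred from the experimental procedure described above.

```python
# Counts of tutor evaluations for BAC-STG papers, taken from the text above.
A_MATH = {"high": 97, "medium": 14, "low": 9}
U_MATH = {"high": 109, "medium": 7, "low": 4}

# 12 specifications per data set, each generated paper rated by 10 tutors
# (an inference from the experimental procedure, not stated as a formula).
EVALUATIONS_PER_TECHNIQUE = 12 * 10


def quality_shares(counts):
    """Return each category's share of the evaluations, in percent (unrounded)."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    for name, counts in (("A-Math", A_MATH), ("U-Math", U_MATH)):
        assert sum(counts.values()) == EVALUATIONS_PER_TECHNIQUE
        shares = quality_shares(counts)
        print(name, {label: f"{value:.1f}%" for label, value in shares.items()})
```

The unrounded low-quality share for A-Math is exactly 7.5 percent, which the text reports as 7 percent, presumably so that the three rounded figures sum to 100.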

Table 4. Quality Analysis of the Generated Test Papers According to the Expert Calibration Results

The automatic question calibration component is currently under development based on the educational data mining techniques mentioned earlier in Section 3.1. These components are implemented in Java. The system can be accessed through a web browser.

The proposed algorithm has shown promising performance on different types of mathematical expressions such as multivariate polynomials, trigonometric functions, and so on. Compared with other web-based testing systems, which can only check multiple-choice answers, our proposed system has the additional advantage of supporting answer checking for advanced mathematical expressions.

*M.L. Nguyen and S.C. Hui are with the School of Computer Engineering, Nanyang Technological University, Block N4, B3c, DISCO Lab, 50 Nanyang Avenue, Singapore 639798.*

*E-mail: {NGUY0093, asschui}@ntu.edu.sg.*

*A.C.M. Fong is with the School of Computing and Math Sciences, Auckland University of Technology, New Zealand.*

*E-mail: acmfong@gmail.com.*

*Manuscript received 24 May 2012; revised 13 Sept. 2012; accepted 24 Nov. 2012; published online 29 Nov. 2012.*

*For information on obtaining reprints of this article, please send e-mail to: lt@computer.org, and reference IEEECS Log Number TLT-2012-05-0075.*

*Digital Object Identifier no. 10.1109/TLT.2012.22.*

1. http://www.khanacademy.org/.

2. https://www.coursera.org/.

3. http://www.udacity.com/.

4. http://ocw.mit.edu/index.htm.

5. http://wp.sigmod.org/?p=165.

6. http://www.ets.org.

7. http://www.artofproblemsolving.com/Forum/portal.php?ml=1.

#### References

**Minh Luan Nguyen** is currently working toward the PhD degree at the School of Computer Engineering, Nanyang Technological University, Singapore. He is a member of the IEEE.

**Siu Cheung Hui** received the BSc degree in mathematics in 1983 and the DPhil degree in computer science in 1987 from the University of Sussex, United Kingdom. He is an associate professor at the School of Computer Engineering, Nanyang Technological University, Singapore. He was with IBM China/Hong Kong Corporation as a system engineer from 1987 to 1990. His current research interests include data mining, web mining, Semantic Web, intelligent systems, information retrieval, intelligent tutoring systems, timetabling, and scheduling.

**Alvis C.M. Fong** is a professor at the Auckland University of Technology, New Zealand. Previously, he was an associate professor with Nanyang Technological University, Singapore. His research interests include information processing and management, multimedia, and communications. He is a senior member of the IEEE.
