Abstract—Web-based testing has become a ubiquitous self-assessment method for online learning. One useful feature missing from today's web-based testing systems is the reliable capability to fulfill the different assessment requirements of students based on a large-scale question data set. A promising approach for supporting large-scale web-based testing is static test generation (STG), which generates a test paper automatically according to a user specification based on multiple assessment criteria. The generated test paper can then be attempted over the web by users for assessment purposes. Generating high-quality test papers under multiobjective constraints is a challenging task. It is a 0-1 integer linear programming (ILP) problem that is not only NP-hard but also needs to be solved efficiently. Current popular optimization software and heuristic-based intelligent techniques are ineffective for STG, as they generally provide no guarantee of high-quality solutions when solving the large-scale 0-1 ILP of STG. To that end, we propose an efficient ILP approach for STG, called branch-and-cut for static test generation (BAC-STG). Our experimental study on various data sets and a user evaluation of generated test paper quality have shown that the BAC-STG approach is more effective and efficient than the current STG techniques.
With the rapid evolution of the web, web-based education has advanced significantly over the last 20 years and has become a ubiquitous learning platform in many institutions, providing students with online learning courses and materials. Currently, we are also seeing more freely accessible educational websites together with learning technologies [1] being developed to support web-based education. Such websites aim to bring free education to the world by providing online content, exercises, and quizzes, such as Khan Academy,^{1} or online classes, such as Coursera^{2} and Udacity.^{3} Large data sets of online materials have been created and have evolved over time. Unlike passive course archives such as MIT OpenCourseWare,^{4} online classes are interactive and can assess learners automatically on what they have learned. The main benefit is that learners can take classes at their own pace and get immediate feedback on their proficiency, unlike in traditional classes.
Web-based testing has been popularly used for automatic self-assessment, especially in distance educational learning environments [2], [3]. However, there is a problem in conducting self-assessment in an online class. As there may be many students^{5} with different proficiency levels in an online class [4], it is difficult to fulfill the different assessment requirements of students using tests composed from a small question pool [5]. To overcome this problem, pedagogical practitioners have suggested composing tests from a large question pool with different question properties [6]. This in turn requires the availability of a large question data set and a huge human effort to compose the tests that assess students' proficiency.
One promising approach to support large-scale web-based testing is static test generation (STG), which generates a test paper automatically according to a user specification based on multiple assessment criteria. Here, the term "static test" refers to a traditional test paper in psychometry [7].
Fig. 1 shows a typical workflow of a web-based testing environment with automatic assessment. In this environment, STG is the core component; it aims to find an optimal subset of questions from a question database to form a test paper automatically based on multiple assessment criteria such as total time, topic distribution, difficulty degree, discrimination degree, and so on. The generated test paper can then be attempted over the web by students for assessment purposes, as in a traditional pen-and-pencil test. Finally, the students' answers are checked automatically for proficiency evaluation.
Fig. 1. Web-based testing workflow with automatic assessment.
Generating high-quality test papers that satisfy the constraints and maximize the assessment objective is critical for formal tests and examinations [8]. However, it is a challenging problem, especially with a large number of questions [9]. Manually browsing and composing test papers is ineffective because of the exponential number of feasible combinations of questions. In essence, STG is an optimal subset selection problem, a multidimensional knapsack problem (KP) [10], [11], which is NP-hard [9]. Formally, it is a 0-1 integer linear programming (ILP) problem that optimizes multiobjective constraints. Moreover, STG should also be solved efficiently to meet online requirements. Currently, the quality of generated test papers is often unsatisfactory [12], [13], [14] with respect to users' test paper specifications.
One of the main issues of STG is the very large search space of possible candidates under multiobjective constraints. In the early 1980s, linear programming-based ILP [15], [16] was proposed to solve STG on very small question data sets. Popular up-to-date commercial optimization software packages such as CPLEX [17] and GUROBI [18] are inefficient for the 0-1 ILP of STG because of the large number of variables in the 0-1 ILP formulation [19]. Recently, many heuristic-based intelligent techniques, such as tabu search (TS) [13], biologically inspired algorithms [14], [20], swarm optimization [12], [21], [22], and divide and conquer (DAC) [23], have been proposed in the research community for automatic test paper generation. Although these heuristic-based techniques are straightforward to implement, they suffer from some drawbacks. They are mainly based on traditional weighting parameters for multiobjective constraint optimization, and they tend to get stuck in a locally optimal solution, especially in the huge search space of large-scale question data sets. As a result, these techniques generally have no performance guarantee on either test paper quality or runtime efficiency.
In this paper, we propose an efficient 0-1 ILP approach for high-quality STG, called branch-and-cut for STG (BAC-STG). Generally, there exist many topics (e.g., differentiation, integration, etc.) in a subject (e.g., mathematics). When the STG problem is formulated as a 0-1 ILP for a large question data set, it has the sparse matrix property. The proposed BAC-STG approach is based on the branch-and-bound method with the lifted cover cutting method for solving the 0-1 ILP by exploiting the sparse matrix property. As branch-and-bound is a global and parameter-free method for dealing with the multiple constraints of STG, the proposed approach avoids getting stuck in locally optimal solutions and thus achieves high-quality test papers, while also eliminating the need for the weighting parameters used in heuristic-based techniques. Our approach can be considered an extension of previous work [15], [16] that takes advantage of recent advances in optimization techniques. Specifically, we make the following two contributions in this paper:
We propose an effective and efficient ILP approach for STG that generates high-quality test papers in the huge search space of large question data sets efficiently, which was not possible in the past. Our proposed BAC-STG approach is able to support web-based testing on large question data sets for online learning environments. Our performance results on various data sets have shown that the proposed BAC-STG approach outperforms the current STG techniques in terms of paper quality and runtime efficiency.
We propose a novel framework for web-based testing with automatic assessment, in particular for mathematics testing. The proposed framework integrates the proposed BAC-STG approach for automatic test paper generation with automatic mathematics solution checking and automatic question calibration. It is able to generate test papers automatically and provide students with immediate feedback on their performance.
The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 describes the problem specification of STG. Section 4 presents the proposed BAC-STG approach. Section 5 shows the performance results of the BAC-STG approach and its comparison with other STG techniques. Section 6 presents the proposed web-based testing framework. Finally, Section 7 concludes the paper.
2.1 Automatic STG
There are two major paradigms for web-based testing: STG [7] and computerized adaptive testing (CAT) [24]. STG generates full test papers automatically based on multiple assessment criteria, whereas CAT generates question-by-question tests in a dynamic and sequential manner according to the student's ability and item response theory (IRT). STG is basically a multiobjective combinatorial optimization problem, whereas CAT is a sequential optimization problem [25]. In this section, we focus only on reviewing related work on STG, which can be categorized into two main groups: linear programming-based integer programming and heuristic-based methods.
LP-based IP, which was proposed in 1986 by Adema et al. [15], [16], used the LANDO program to solve the 0-1 ILP of STG. It is similar to our proposed approach in its use of linear programming (LP) and branch-and-bound. In [26], [27], Boekkooi-Timminga attempted to combine ILP with heuristics to improve the runtime performance of multiple test paper generation. Although these approaches have rigorous mathematical foundations in optimization, they can only solve STG for very small data sets of about 300-600 questions because of the limitations of the state-of-the-art optimization methods at that time. An in-depth review of LP-based IP for STG can be found in [28].
For heuristic-based methods, Theunissen [29] used a heuristic based on the characteristics of the question item information function to optimize the objective function. Later, Luecht [30] proposed an efficient heuristic to solve STG on a data set with 3,000 questions. However, these heuristic-based methods were designed to solve STG for small data sets and are ineffective for larger ones.
Since 2003, there has been revived interest in STG on larger data sets of about 3,000-20,000 questions using modern heuristic methods. In [9], TS was proposed to construct test papers by defining an objective function based on multicriteria constraints and weighting parameters for test paper quality; TS optimizes test paper quality by evaluating this objective function. In [13], a genetic algorithm (GA) was proposed to generate quality test papers by optimizing a fitness ranking function based on the principle of population evolution. In [14], differential evolution (DE) was proposed for test paper generation; DE is similar in spirit to GA, with some modifications to the solution representation, the fitness ranking function, and the crossover and mutation operations to improve performance. In [20], an artificial immune system was proposed that uses the clonal selection principle to deal with highly similar antibodies during elitist selection so as to maintain the best test papers across generations. In [21], particle swarm optimization (PSO) was proposed to generate multiple test papers by optimizing a fitness function defined on multicriteria constraints. In [12], ant colony optimization (ACO) was proposed to generate quality test papers by optimizing an objective function based on a simulation of the foraging behavior of real ants. Apart from these techniques, an efficient DAC approach [23] was proposed for online STG, based on the principle of dimensionality reduction for multiobjective constraint optimization.
To optimize the multiobjective criteria of test paper quality, the current STG techniques (except DAC) require weighting parameters and other parameters, such as population size and tabu length, for each test paper generation; these are not only difficult but also computationally expensive to determine. Hence, these techniques generally require long runtimes to generate good-quality test papers, especially for large question data sets.
2.2 0-1 Integer Programming
The 0-1 ILP [10], [11] has been extensively studied for solving various real-world problems such as the traveling salesman problem, quadratic assignment problem, maximum satisfiability problem (MAX-SAT), KP, and so on. Specifically, the 0-1 ILP is a mathematical optimization program in which all of the variables are restricted to be binary:

$$\max \{ c^T x \;:\; Ax \le b,\; x \in \{0,1\}^n \}$$

where $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $A$ is an $m \times n$ matrix with $a_{ij} \in \mathbb{R}$; $m$ is the number of constraints, and $n$ is the number of variables or dimensions.
Solving a general 0-1 ILP problem is NP-hard. Despite this fact, there are fast solvers available today that provide practical solutions for many 0-1 ILP problems. The performance depends on the dimensions $m \times n$ and the degree of sparsity of the constraint matrix $A$. According to [11], there are four main methods for solving 0-1 ILP: heuristic algorithms, the cutting planes method, branch-and-bound, and branch-and-cut (BAC). As mentioned earlier, although heuristic algorithms can be applied quite straightforwardly to solve many 0-1 ILP problems, they do not have any performance guarantee. The remaining three methods are global methods, which can find the exact optimal solution based on LP for 0-1 ILP problems.
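To make the 0-1 ILP form above concrete, here is a minimal sketch that solves a tiny instance by exhaustive enumeration; the matrix `A` and vectors `b` and `c` are illustrative values, not data from the paper. Global methods such as branch-and-bound compute the same optimum while pruning most of the $2^n$ candidates.

```python
from itertools import product

def solve_01_ilp_bruteforce(c, A, b):
    """Exhaustively solve max c^T x s.t. A x <= b, x in {0,1}^n.

    Only feasible for tiny n; it illustrates what exact 0-1 ILP
    solvers compute without enumerating all 2^n candidates.
    """
    n = len(c)
    best_x, best_val = None, float("-inf")
    for x in product([0, 1], repeat=n):
        # Check every constraint row: a_i . x <= b_i.
        if all(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
               for row, b_i in zip(A, b)):
            val = sum(c_j * x_j for c_j, x_j in zip(c, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

# Illustrative instance: 2 constraints, 4 binary variables.
c = [5, 4, 3, 7]
A = [[2, 3, 1, 4],
     [1, 1, 2, 1]]
b = [6, 3]
x, z = solve_01_ilp_bruteforce(c, A, b)
```

Here the enumeration visits all 16 binary vectors; a branch-and-bound solver would prune most of them using upper bounds.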
The performance of these global methods depends on the algorithms used for LP, preprocessing techniques, and the computational power of the computer hardware. In the early 1990s, there was not much improvement in the simplex algorithm for LP. Since the early 2000s, the development of the dual simplex algorithm and other techniques such as lifted cover cutting planes [31] has remarkably improved integer programming techniques [32], and runtime performance has improved significantly. Currently, a large ILP of about 18,000 variables can be solved in less than 3 minutes. However, LP-based ILP is still not efficient in runtime, especially for large-scale 0-1 ILP problems. In particular, the methods implemented in popular commercial optimization software such as CPLEX [17] and GUROBI [18] are ineffective for handling 0-1 ILP with more than twenty thousand variables [19].
Among the three methods, BAC [19] is the most efficient, as it is able to solve and prove optimality for a larger set of instances than the others. BAC is a global optimization method based on the branch-and-bound method and cutting planes methods such as the Gomory or Fenchel cutting planes [11]. The main idea of the cutting planes method is to add extra constraints to reduce the feasible region and find the integral optimal solution. For 0-1 ILP problems with the sparse matrix property, lifted cover cutting is an effective method for enhancing runtime performance. However, the BAC method suffers from several drawbacks when solving large-sized 0-1 ILP problems. It is difficult to approximate the integral optimal solution from the fractional optimal solution of the 0-1 ILP problem. In addition, the simplex algorithm used to solve the LP relaxation is not very efficient on large-sized ILP problems. As BAC is an exact algorithm, the size of the branch-and-bound search tree may explode combinatorially with the number of variables. Hence, BAC generally suffers from poor runtime performance on large-sized ILP problems. Moreover, finding lifted cover cutting planes efficiently is challenging, as it is NP-hard [31].
2.3 Discussion
From the above discussion, we make the following observations. First, the objective functions of the STG formulations in related studies may differ: they maximize either the test information function [7], [28] or the average discrimination degree [12], [13]. Although they are different, the discrimination degree is easier to calibrate and is thus preferred in practice by researchers over the information function. This difference is not crucial, however, because the STG problem can be solved either way using our proposed approach. Second, heuristic techniques are ineffective for large-scale STG, as they generally provide no guarantee of high-quality solutions. Third, although the current LP-based ILP approach [15], [16] has a quality guarantee for STG, popular optimization software such as CPLEX and GUROBI is unable to solve large-scale 0-1 ILP problems efficiently [33]. In this paper, we propose an efficient integer programming approach for solving the large-scale 0-1 ILP of the STG problem by exploiting the sparse matrix property.
3.1 Question Data Set
Let $Q = \{q_1, q_2, \ldots, q_n\}$ be a data set consisting of $n$ questions, $C = \{c_1, c_2, \ldots, c_M\}$ be a set of $M$ different topics, and $Y = \{y_1, y_2, \ldots, y_K\}$ be a set of $K$ different question types. Each question $q_i$, where $i \in \{1, \ldots, n\}$, has eight attributes defined as follows:
Note that the discrimination degree and difficulty degree attributes here refer to the classical IRT definitions.
Table 1 shows a sample Math question data set.
Table 1. An Example of Math Data Set
There are two possible ways to construct large-scale question data sets for web-based testing. One is to accumulate questions from past tests and examinations on subjects such as TOEFL and GRE.^{6} Another is to gather freely available questions from online educational websites such as Khan Academy or Question Answering (Q&A) websites such as The Art of Problem Solving Portal.^{7} A large pool of questions poses a great challenge for labeling all question attributes accurately and automatically. In this paper, we assume that question attributes are correctly calibrated. However, with the advancement of educational data mining techniques [33], it might become feasible to automatically label all the attributes of each question with little human effort. Automatic text categorization techniques such as the support vector machine can be used for automatic topic classification of questions [34], although human labeling of topics for training questions is still needed in the training phase. To calibrate the other attributes, we can use historical correct/incorrect response information from students. This response information, as well as other important information such as question time, can be gathered automatically through the students' question-answering activities [35] over a period of time. However, it is more difficult to calibrate the discrimination degree and difficulty degree attributes because of missing user responses on certain questions. To overcome this, it is possible to apply collaborative filtering to predict the missing user responses and use the IRT model to calibrate the two attributes automatically [36]. Moreover, [36] also proposed an effective method to calibrate new questions, which do not have any student response information. As such, automatic labeling of question attributes for large-scale question data sets is achievable.
3.2 Static Test Specification
A static test specification is a tuple of five attributes, which are defined based on the attributes of the selected questions as follows:
3.3 Optimal STG
Given a static test specification $\mathcal{S} = \langle N, T, D, P_C, P_Y \rangle$, where $N$ is the number of questions, $T$ is the total time, $D$ is the average difficulty degree, $P_C$ is the topic distribution, and $P_Y$ is the question type distribution, the STG process aims to find a subset of questions from a question data set $Q$ to form a test paper $P$ with specification $\mathcal{S}_P$ that maximizes the average discrimination degree and satisfies the static test specification such that $\mathcal{S}_P = \mathcal{S}$.
Based on the user test specification $\mathcal{S}$ and the question attributes, the STG problem can be formulated as a 0-1 fractional ILP problem [11], as shown in Fig. 2. In Fig. 2, constraint (1) is the constraint on the number of questions, where $x_i$ is a binary variable associated with question $q_i$, $i = 1, \ldots, n$, in the data set. Constraint (2) is the total time constraint. Constraint (3) is the average difficulty degree constraint. Constraint (4) is the topic distribution constraint; the relationship of a question $q_i$ and a topic $c_j$ is represented as $r_{ij}$ such that $r_{ij} = 1$ if question $q_i$ relates to topic $c_j$ and $r_{ij} = 0$ otherwise. Constraint (5) is the question type distribution constraint; the relationship of a question $q_i$ and a question type $y_k$ is represented as $s_{ik}$ such that $s_{ik} = 1$ if question $q_i$ is related to question type $y_k$ and $s_{ik} = 0$ otherwise.
Fig. 2. The 0-1 fractional ILP formulation of STG.
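The constraint structure of this formulation can be sketched as follows; the attribute names (`time`, `difficulty`, `topics`, `qtype`) and the sample data are hypothetical placeholders, not the paper's schema. Each topic row and question-type row has a nonzero entry only for the few questions that match it, which is the source of the sparse matrix property exploited later.

```python
def build_stg_constraints(questions, spec):
    """Assemble the equality constraints of the STG 0-1 ILP.

    `questions` is a list of dicts with hypothetical attribute names;
    `spec` holds the target numbers. Rows are returned as
    (coefficients, rhs) pairs; topic and type rows are mostly zero.
    """
    n = len(questions)
    rows = []
    # (1) number of questions: sum x_i = N
    rows.append(([1] * n, spec["N"]))
    # (2) total time: sum t_i x_i = T
    rows.append(([q["time"] for q in questions], spec["T"]))
    # (3) average difficulty, denominator N folded into the RHS: sum d_i x_i = D * N
    rows.append(([q["difficulty"] for q in questions], spec["D"] * spec["N"]))
    # (4) one row per topic: r_ij = 1 iff question i covers topic j
    for topic, count in spec["topics"].items():
        rows.append(([1 if topic in q["topics"] else 0 for q in questions], count))
    # (5) one row per question type: s_ik = 1 iff question i has type k
    for qtype, count in spec["types"].items():
        rows.append(([1 if q["qtype"] == qtype else 0 for q in questions], count))
    return rows

questions = [
    {"time": 5, "difficulty": 3, "topics": {"integration"}, "qtype": "MCQ"},
    {"time": 8, "difficulty": 5, "topics": {"differentiation"}, "qtype": "fill-in"},
    {"time": 6, "difficulty": 4, "topics": {"integration"}, "qtype": "MCQ"},
]
spec = {"N": 2, "T": 11, "D": 3.5,
        "topics": {"integration": 1, "differentiation": 1},
        "types": {"MCQ": 1, "fill-in": 1}}
rows = build_stg_constraints(questions, spec)
nonzeros = sum(sum(1 for a in coeffs if a) for coeffs, _ in rows)
```

Even in this three-question toy, the topic and type rows are already half zeros; with thousands of questions and dozens of topics, the matrix becomes extremely sparse.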
4. Proposed BAC-STG Approach
In this section, we propose an efficient 0-1 ILP approach for high-quality STG, called BAC-STG. When the STG problem is formulated as a 0-1 ILP, we observe that it has the sparse matrix property. By exploiting the sparse matrix property and the domain-specific properties of the STG problem, the proposed approach combines the branch-and-bound method with the lifted cover cutting method to efficiently solve the large-sized 0-1 ILP of the STG problem. The proposed BAC-STG approach has the following important characteristics:
When the 0-1 ILP problem has the sparse matrix property, the proposed approach is able to approximate the binary optimal solution of the 0-1 ILP problem from the fractional optimal solution.
The proposed approach uses the primal-dual interior point (PDIP) method [37], which is the most efficient algorithm for solving the LP relaxation problem. In addition, the simplex method [11] is also used for solving the LP relaxation problem efficiently in subsequent steps of the approach when new cutting planes are added.
An effective branching strategy is proposed for reducing the size of the branch-and-bound search tree.
An efficient approach is proposed for finding effective lifted cover cutting planes.
4.1 0-1 ILP Formulation
In the proposed BAC-STG approach, we first reformulate the 0-1 fractional ILP of the STG problem into a standard 0-1 ILP, which is given in Fig. 3. Note that as the number of questions $N$ is a constant, the denominator of the maximizing cost function can be eliminated from the fractional ILP during reformulation. In addition, as each question has only a few related topics and a single question type, most of the coefficients in the topic constraint and question type constraint are zero. Thus, for large-sized STG problems, the matrix $A$ of the 0-1 ILP is very sparse.
Fig. 3. The 0-1 ILP formulation of STG.
4.2 Branch and Bound
The branch-and-bound method is based on the DAC strategy, which iteratively partitions the original ILP problem into a series of subproblems. Each subproblem is then solved by LP relaxation to obtain an upper bound on its objective value. The key idea of branch-and-bound is that if the upper bound on the objective value of a given subproblem is less than the objective value of a known integer feasible solution, then that subproblem does not contain the optimal solution of the original ILP problem. Hence, the upper bounds of subproblems are used to construct a proof of optimality without exhaustive search.
Fig. 4 shows the main steps of the branch-and-bound method in the BAC-STG approach.
Fig. 4. Branch-and-bound flowchart of BAC-STG.
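The bound-and-prune loop described above can be illustrated with a toy branch-and-bound for a generic 0-1 ILP. To stay self-contained, this sketch uses a deliberately simple profit-sum bound in place of the LP-relaxation bound used in BAC-STG, so it demonstrates only the pruning logic; all data is illustrative.

```python
def branch_and_bound_01(c, A, b):
    """Toy branch-and-bound for max c^T x s.t. A x <= b, x in {0,1}^n.

    The upper bound is the fixed profit plus all remaining positive
    profits (not an LP relaxation); the pruning rule is the same as in
    the text: discard any node whose bound cannot beat the incumbent.
    Assumes A >= 0, so a violated partial assignment stays infeasible.
    """
    n = len(c)
    best = {"x": None, "z": float("-inf")}

    def node(fixed):
        k = len(fixed)
        bound = (sum(c[j] * fixed[j] for j in range(k))
                 + sum(cj for cj in c[k:] if cj > 0))
        if bound <= best["z"]:
            return  # prune by bound: cannot beat the incumbent
        # Prune partial assignments that already violate a constraint.
        if any(sum(A[i][j] * fixed[j] for j in range(k)) > b[i]
               for i in range(len(A))):
            return
        if k == n:  # leaf: the bound equals the exact objective value
            best["x"], best["z"] = tuple(fixed), bound
            return
        node(fixed + [1])  # branch x_k = 1
        node(fixed + [0])  # branch x_k = 0

    node([])
    return best["x"], best["z"]

# Tiny illustrative instance: 2 constraints, 4 binary variables.
x_opt, z_opt = branch_and_bound_01([5, 4, 3, 7],
                                   [[2, 3, 1, 4], [1, 1, 2, 1]], [6, 3])
```

The incumbent found early at one leaf lets the search discard whole subtrees whose bound is no better, which is exactly how the enumeration tree in Fig. 4 avoids exhaustive search.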
In BAC-STG, the subproblems are organized as an enumeration tree that is constructed iteratively in a top-down manner, with new nodes created by branching on an existing node in which the optimal solution of the LP relaxation is fractional. The problem at the root node of the tree is the original 0-1 ILP. When a new node $N_i$ is created, it contains the corresponding 0-1 ILP subproblem and is stored in the list $L$ of all unevaluated (leaf) nodes. Let $F_i$ be the formulation of the feasible region of the problem at node $N_i$. Let $\overline{z}_i$ be the local upper bound at each node $N_i$ and $\underline{z}$ be the current global lower bound of the 0-1 ILP solution.
4.2.1 Finding Initial Fractional Optimal Solution

In this step, we find the fractional optimal solution of the original 0-1 ILP problem by relaxing the constraint that the variables take binary values. The 0-1 ILP formulation of the STG problem shown in Fig. 3 is transformed into a standard LP as follows:

$$\max \{ c^T x \;:\; Ax \le b,\; 0 \le x_i \le 1 \}$$

or equivalently $\max \{ c^T x : x \in P \}$, where $P$ denotes the constraint set of feasible regions of the original ILP problem.

The LP problem can then be solved by using the most efficient PDIP algorithm [37]. PDIP solves the LP problem by solving the following logarithmic barrier optimization problem:

$$\max_{x} \; c^T x + \mu \sum_{i=1}^{n} \ln x_i \quad \text{subject to } Ax \le b$$

where $\mu > 0$ is a barrier parameter.
The optimal solution of PDIP will be used to construct the corresponding tableau of the simplex method [11]. This consists of two steps: initial tableau construction and simplex tableau construction.

In the initial tableau construction, note that the simplex algorithm works on inequalities of the form $a^T x \le b$, whereas the 0-1 ILP of the STG problem needs to satisfy the equality constraints given in (7)-(12), which are of the form $a^T x = b$. Thus, we replace each constraint of the form $a^T x = b$ by the following two constraints: $a^T x \le b$ and $-a^T x \le -b$. All the replaced constraints of (7)-(12) are then of the form $a^T x \le b$. By introducing new slack variables, we obtain the initial tableau $Ax + Is = b$, where $s$ is the vector of slack variables.

In the simplex tableau construction, we perform pivoting operations on the initial tableau such that all variables with $x_i > 0$ are basic variables, whereas the others are nonbasic variables. As a result, the optimal solution $x^*$ and its corresponding simplex tableau are obtained.
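The equality-to-inequality rewrite and slack augmentation described above can be sketched as follows; the constraint rows are illustrative stand-ins for the real rows of (7)-(12).

```python
def initial_tableau(A_eq, b_eq):
    """Build the initial simplex tableau for the LP relaxation.

    Each equality row a.x = b is replaced by a.x <= b and -a.x <= -b;
    then one slack variable per inequality yields the augmented system
    [A' | I] [x; s] = b'. Data here is illustrative.
    """
    rows, rhs = [], []
    for a, bv in zip(A_eq, b_eq):
        rows.append(list(a)); rhs.append(bv)           # a.x <= b
        rows.append([-v for v in a]); rhs.append(-bv)  # -a.x <= -b
    m = len(rows)
    # Append the identity block for the slack variables.
    tableau = [row + [1 if i == j else 0 for j in range(m)]
               for i, row in enumerate(rows)]
    return tableau, rhs

A_eq = [[1, 1, 1],    # e.g., a number-of-questions row: sum x_i = N
        [5, 8, 6]]    # e.g., a total-time row: sum t_i x_i = T
b_eq = [2, 13]
T, rhs = initial_tableau(A_eq, b_eq)
```

Two equality rows become four inequality rows plus four slack columns, so the working tableau here has 4 rows and 7 columns.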
4.2.2 Root Node Initialization

This step first creates the root node $N_0$ of the enumeration tree, which contains the original 0-1 ILP problem with its fractional optimal solution $x^0$ and simplex tableau. Next, it initializes the local upper bound at $N_0$ as $\overline{z}_0 = c^T x^0$, the global lower bound $\underline{z} = -\infty$, and the current best 0-1 solution $x^{best} = \emptyset$. Then, the root node is stored in the list $L$ for further processing.

4.2.3 Unevaluated Node Selection

This step selects an unevaluated node in the list $L$ for processing and solving. If there is no unevaluated node, the algorithm terminates. Otherwise, a node in the list $L$ is selected. Here, we use a greedy strategy, namely best bound, to choose the most promising node in $L$, that is, the node with the largest local upper bound value: $N_i = \arg\max_{N_j \in L} \overline{z}_j$.
4.2.4 LP Relaxation

This step iteratively solves the subproblem of the selected unevaluated node based on the optimal solution and simplex tableau of its parent node's problem (except for the root node). At the $k$th iteration of processing a node $N_i$, it solves the LP relaxation of the current formulation $F_i^k$ of the subproblem. If the returned result is infeasible, it ignores this node and continues processing another node in the list $L$. Otherwise, it goes to the next step on lifted cover cutting for adding cutting planes. Note that, for efficiency, it adds the new constraints into the simplex tableau of the parent node and continues reoptimizing this tableau. After solving the LP relaxation at node $N_i$, the fractional optimal solution $x^i$ and its corresponding simplex tableau are obtained.

4.2.5 Lifted Cover Cutting

The main purpose of lifted cover cutting is to add extra constraints, called cutting planes, to reduce the feasible region and approximate the binary optimal solution nearest to $x^i$. Based on the current fractional optimal solution and its corresponding simplex tableau, this step helps the LP relaxation gradually approximate the binary optimal solution of the subproblem more closely. It adds extra constraints, or cutting planes, into the current subproblem by adding some lifted cover inequalities to the current formulation $F_i^k$, forming a new formulation $F_i^{k+1}$. This new formulation then goes back to the LP relaxation step for optimization. For efficiency, at most three cuts are added at each iteration, according to an empirical study in [31]. This repeats until no more cutting planes are found. Lifted cover cutting is discussed in detail in Section 4.3.

4.2.6 Pruning and Bounding

After processing a node $N_i$, this step considers whether the node should be pruned.
To determine this, it compares the local upper bound $\overline{z}_i$ obtained from the LP relaxation at node $N_i$ (after the $k$th iteration) with the global lower bound $\underline{z}$ of the 0-1 ILP solution of the original 0-1 ILP: if $\overline{z}_i \le \underline{z}$, node $N_i$ cannot contain a better solution and is pruned; if the LP solution at $N_i$ is binary feasible with $\overline{z}_i > \underline{z}$, then $\underline{z}$ and the current best solution are updated.

4.2.7 Branching

If the solution of the LP relaxation at node $N_i$ is fractional, the branching step creates two child nodes of $N_i$. First, it chooses a fractional variable $x_j$ in the current fractional optimal solution of the LP relaxation and performs the branching. We use a common choice, namely the most fractional variable [11], to select $x_j$: the fractional variable whose value is closest to $0.5$, i.e., $j = \arg\min_{j'} |x_{j'} - 0.5|$ over the fractional variables. Then, the two child nodes, obtained by fixing $x_j = 0$ and $x_j = 1$, are placed into the list $L$ for further processing. The size of the search tree may grow exponentially if branching is not controlled properly. To effectively reduce the size of the tree, we use a heuristic based on the number of specified questions $N$ in the generated test paper: consider a path from the root to a given unevaluated node of the tree; if the number of branching variables with value 1 along the path is larger than or equal to $N$, we stop branching at that node. The reason is that we only need $N$ questions in the generated test paper.
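The most-fractional-variable rule and the path-depth heuristic can be sketched as two small helper functions; the names and data are illustrative.

```python
def most_fractional_index(x, eps=1e-9):
    """Pick the branching variable whose LP value is closest to 0.5
    (the 'most fractional variable' rule). Returns None when x is
    already numerically binary.
    """
    candidates = [(abs(xj - 0.5), j) for j, xj in enumerate(x)
                  if eps < xj < 1 - eps]
    if not candidates:
        return None
    return min(candidates)[1]

def should_stop_branching(path_fixed_ones, N):
    """Path-depth heuristic: a test paper needs only N questions, so
    stop branching once N variables are fixed to 1 along this path.
    """
    return path_fixed_ones >= N

x_lp = [1.0, 0.3, 0.55, 0.0, 0.9]
j = most_fractional_index(x_lp)  # x_2 at 0.55 is closest to 0.5
```

In a full solver these helpers would be called at every node; here they only illustrate the two selection rules described above.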
4.2.8 Termination

This step returns the current best solution $x^{best}$ of the 0-1 ILP.

4.2.9 An Example

Suppose that we need to generate a test paper from the Math data set given in Table 1 based on a specification $\mathcal{S}$. We associate each question $q_i$ with a binary variable $x_i$. However, we eliminate the inappropriate variables that cannot satisfy the specification $\mathcal{S}$. Here, we formulate the problem as a 0-1 fractional ILP with five binary variables, which is shown in Fig. 5a. The 0-1 fractional ILP problem is then transformed into a standard 0-1 ILP, which is shown in Fig. 5b.
Fig. 5. The ILP formulation example.
Fig. 6 shows an example of the construction of the enumeration tree of subproblems during the BAC-STG process. Initially, the LP relaxation of this problem is solved by the PDIP algorithm to obtain the fractional optimal solution $x^0$. Next, the fractional optimal solution is updated when new lifted cover cutting planes are added, yielding a new fractional optimal solution. Then, the local upper bound $\overline{z}_0$ at $N_0$ is set to the objective value of this solution; the global lower bound is set as $\underline{z} = -\infty$; and the current best 0-1 solution is set as $x^{best} = \emptyset$.
Fig. 6. The enumeration tree example.
After that, the root node $N_0$ is branched on a fractional variable to create two child nodes, $N_1$ and $N_2$. Then, the unevaluated node with the largest local upper bound is selected for processing. After processing and branching at that node, two further child nodes are created, one of which yields a feasible binary solution; the global lower bound $\underline{z}$ and best solution $x^{best}$ are then updated accordingly. Any node whose local upper bound is less than the current global lower bound is pruned. Branching continues in this manner, with newly created nodes pruned in the same way, until no unevaluated nodes remain.
Finally, the best solution $x^{best}$ is obtained. It corresponds to the generated test paper for the specification $\mathcal{S}$, together with its average discrimination degree and resulting specification.
4.3 Lifted Cover Cutting
This step aims to generate the lifted cover cutting planes from the fractional optimal solution $x^*$ and its simplex tableau for the LP relaxation step. Before discussing this step in detail, we need to define some basic terminology. Consider the set $X = \{x \in \{0,1\}^n : \sum_{j=1}^{n} a_j x_j \le b\}$, which represents a row of the simplex tableau.
Definition 1 (Dominance). If $\sum_j u_j x_j \le u_0$ and $\sum_j v_j x_j \le v_0$ are two valid inequalities for $X$, the first dominates the second if $u_j \ge v_j$ for all $j$ and $u_0 \le v_0$.
If there exists any negative coefficient $a_j$ in $\sum_j a_j x_j \le b$, the variable $x_j$ can be replaced by its complementary variable $1 - x_j$, so that $X$ contains only coefficients $a_j \ge 0$. As all coefficients on the LHS of $\sum_j a_j x_j \le b$ are now nonnegative, we may assume that the RHS $b > 0$. Let $N = \{1, 2, \ldots, n\}$ in the following definitions.
Definition 2 (Cover). Let $C \subseteq N$ be a set such that $\sum_{j \in C} a_j > b$; then $C$ is a cover. A cover $C$ is minimal if $C \setminus \{j\}$ is not a cover for any $j \in C$.
The following two propositions are derived directly from the cover definition:
Proposition 1 (Cover inequality). Let $C \subseteq N$ be a cover for $X$; then the cover inequality $\sum_{j \in C} x_j \le |C| - 1$ is valid for $X$, where $|C|$ is the cardinality of $C$.
Proposition 2 (Extended cover inequality). Let $C \subseteq N$ be a cover for $X$; then the extended cover inequality $\sum_{j \in E(C)} x_j \le |C| - 1$ is valid for $X$, where $E(C) = C \cup \{j \in N : a_j \ge a_i \text{ for all } i \in C\}$.
Definition 3 (Lifted cover inequality (LCI)). LCI is an extended cover inequality which is not dominated by any other extended cover inequalities.
In general, the problem of finding an LCI is equivalent to finding the best possible values α_j for j ∈ N \ C such that the inequality Σ_{j∈C} x_j + Σ_{j∈N\C} α_j x_j <= |C| - 1 is valid for X, where Σ_{j∈C} x_j <= |C| - 1 is a minimal cover inequality.
When the matrix A in the 0-1 ILP is sparse, the lifted cover cutting plane defined by an LCI is an effective cutting plane for the pruning step in the BAC method. However, the problem of finding LCIs has been shown to be NP-hard [32]. In this research, we propose to generate LCIs efficiently as follows:
First, we find a minimal cover inequality based on the most significant basic variables in the fractional optimal solution of the LP relaxation, where the number of candidate variables is bounded by the number of questions given in the test paper specification. Specifically, consider a row of the simplex tableau. The coefficients of the basic variables of the fractional optimal solution are sorted in nonincreasing order. Let L1 be the list of the largest coefficients and L2 the list of the remaining coefficients of the sorted order. If p is the minimal number such that the sum of the coefficients of the first p basic variables of L1 exceeds the RHS b, then the set C of these p variables is a minimal cover.
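A minimal sketch of this greedy construction, with illustrative names: the shortest prefix of the coefficients sorted in nonincreasing order whose sum exceeds b is a cover, and it is minimal because removing any chosen element (each at least as large as the last one added) drops the sum back to at most b.

```python
def minimal_cover(coeffs, b):
    """Greedy minimal cover: take the shortest prefix of the indices,
    sorted by coefficient in nonincreasing order, whose coefficients
    sum past b.  Removing any chosen index reduces the sum by at least
    the last (smallest) coefficient added, hence to <= b, so the
    resulting cover is minimal."""
    order = sorted(range(len(coeffs)), key=lambda j: coeffs[j], reverse=True)
    total, C = 0.0, []
    for j in order:
        C.append(j)
        total += coeffs[j]
        if total > b:
            return C
    return None          # the constraint can never be violated: no cover
```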
Next, we generate extended cover inequalities from the minimal cover by applying Proposition 2.
To generate LCIs from the extended cover inequalities, we need to calculate the largest lifting value α_j for each variable x_j with j ∈ N \ C. This can be done by using an incremental algorithm that computes the lifting values one variable at a time: the algorithm starts by calculating the first lifting value, and each result is then used when calculating the next. To obtain each α_j, j ∈ N \ C, we need to solve the following 0-1 knapsack problem (0-1 KP): maximize the LHS of the current inequality over the variables already in the inequality, subject to Σ_i a_i x_i <= b - a_j. The largest lifting value is then calculated as α_j = |C| - 1 - z_j, where z_j is the objective function value of the optimal solution obtained from the 0-1 KP. It can be seen that z_j computes the maximum weight attainable by the variables already in the LCI when x_j = 1.
Gu et al. [31] solved the 0-1 KP by using a dynamic programming algorithm with high computational complexity. Their experimental results have shown that the runtime performance is poor when handling test cases with a few thousand variables. This is not acceptable for STG, in which the 0-1 ILP may have tens of thousands of variables that need to be solved efficiently. In this research, we instead apply an approximation algorithm from Martello and Toth [38] to solve the 0-1 KP efficiently at a substantially lower computational cost.
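The sequential lifting computation can be sketched as follows. For illustration we solve each 0-1 KP exactly with a small dynamic program rather than the Martello-Toth approximation, and all names are ours: each lifting coefficient is alpha_j = |C| - 1 - z_j, where z_j maximizes the current LHS under capacity b - a_j.

```python
def knapsack_max(profits, weights, cap):
    """Exact 0-1 knapsack by dynamic programming over used capacity
    (an illustrative stand-in for the faster Martello-Toth scheme)."""
    best = {0: 0}                       # used capacity -> best profit
    for p, w in zip(profits, weights):
        nxt = dict(best)
        for c, v in best.items():
            if c + w <= cap:
                nxt[c + w] = max(nxt.get(c + w, 0), v + p)
        best = nxt
    return max(best.values())

def lift_cover(a, b, C):
    """Sequential up-lifting of the cover inequality
    sum_{j in C} x_j <= |C| - 1 into an LCI: for each j not in C,
    alpha_j = |C| - 1 - z_j, with z_j the best LHS value attainable
    under the residual capacity b - a[j]."""
    rhs = len(C) - 1
    alpha = {j: 1 for j in C}           # cover variables keep coefficient 1
    for j in range(len(a)):
        if j in C:
            continue
        idx = list(alpha)               # variables already in the inequality
        z = knapsack_max([alpha[i] for i in idx],
                         [a[i] for i in idx], b - a[j])
        if rhs - z > 0:                 # only add variables that strengthen
            alpha[j] = rhs - z
    return alpha, rhs
```

For a = [5, 5, 5, 10] with b = 10 and cover C = {0, 1, 2}, setting x_3 = 1 consumes the whole capacity, so z_3 = 0 and the lifted inequality becomes x_0 + x_1 + x_2 + 2 x_3 <= 2.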
5. Performance Evaluation
In this section, we evaluate the performance of the proposed BAC-STG approach for STG. The experiments are conducted on a Windows XP environment, using an Intel Core 2 Quad 2.66 GHz CPU with 3.37 GB of memory. The BAC-STG approach is implemented in Java with the CPLEX API package, version 11.0 [17]. From the CPLEX package, we use the PDIP and simplex methods. The performance of BAC-STG is measured and compared with other techniques including GA [39], PSO [21], DE [14], ACO [12], TS [9], DAC [23], and the conventional BAC method [31]. These techniques are reimplemented based on the published articles. We compare the BAC-STG approach with the conventional BAC technique, which has been shown to be more effective than commercial software for large-scale 0-1 ILP with the sparse matrix property [31].
5.1 Experiments
We have conducted two sets of experiments. In the first set, we analyze the quality and runtime efficiency of our BAC-STG approach on four large-scale data sets using different specifications. In the second set, we evaluate the effectiveness of our proposed approach by conducting a user evaluation of the quality of test papers generated from different specifications based on the G.C.E. A-Level math and undergraduate engineering math data sets.
In the experiments, the test paper generation process is repeated until one of the following two termination conditions is reached:
Quality satisfaction. The algorithm terminates once a high-quality test paper has been generated.
Maximum number of evaluated nodes with no better solution found. This parameter is experimentally set to 300 nodes for BAC and BAC-STG. Similarly, for the other heuristic techniques, this parameter is set to the maximum number of iterations in which no better solution is found, generally 200 iterations.
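The two termination conditions above can be captured by a small generic driver. This is a schematic sketch in Python (the actual system is implemented in Java), with illustrative names for the search step and the quality test.

```python
def run_with_termination(step, is_high_quality, max_stall=300):
    """Search driver with the two stopping rules: quality satisfaction,
    or max_stall consecutive evaluations without improvement."""
    best, stall = None, 0
    while stall < max_stall:
        candidate = step()              # evaluate one node / iteration
        if candidate is None:
            break                       # search space exhausted
        if best is None or candidate > best:
            best, stall = candidate, 0  # improvement resets the counter
        else:
            stall += 1
        if best is not None and is_high_quality(best):
            break                       # quality satisfaction
    return best
```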
5.2 Quality Measures
The performance of the proposed BACSTG approach is evaluated based on paper quality and runtime. To evaluate the quality, we define mean discrimination degree and mean constraint violation (CV).
Definition 4 (Mean discrimination degree). Let P_1, ..., P_k be the test papers generated from a question data set w.r.t. k different test paper specifications S_1, ..., S_k. The mean discrimination degree is defined as the average (1/k) Σ_{i=1}^{k} E(P_i), where E(P_i) is the average discrimination degree of P_i.
CV indicates the difference between the test paper specification and the generated test paper. Let S be a test paper specification and S' be the specification realized by a generated test paper. CVs can be measured according to total time, average difficulty degree, topic distribution, and question type distribution between the test paper specification S and the generated test paper specification S'. For total time and average difficulty degree, the violation is the scaled relative difference between the specified and generated values. For the topic and question type distributions, the violation is measured by the Kullback-Leibler divergence [40], which captures the statistical difference of the distributions between S and S'; a constant is used to scale each value between 0 and 100. The CV of a generated test paper w.r.t. the test paper specification can then be calculated as the average of the four violations.
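A schematic of this computation, assuming a specification is summarized by a total time, an average difficulty degree, and topic and question type distributions (the field names and scaling constant are illustrative, not the paper's exact formulation):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q) of discrete distributions;
    eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def constraint_violation(spec, paper, scale=100.0):
    """Average of the four violations between a specification and a
    generated paper: relative time and difficulty gaps plus the KL
    divergences of the topic and question type distributions."""
    v_time = abs(spec["time"] - paper["time"]) / spec["time"]
    v_diff = abs(spec["difficulty"] - paper["difficulty"]) / spec["difficulty"]
    v_topic = kl_divergence(spec["topics"], paper["topics"])
    v_type = kl_divergence(spec["types"], paper["types"])
    return scale * (v_time + v_diff + v_topic + v_type) / 4.0
```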
Definition 5 (Mean CV). The mean CV of the generated test papers P_1, ..., P_k on a question data set w.r.t. the test paper specifications S_1, ..., S_k is defined as (1/k) Σ_{i=1}^{k} CV(P_i, S_i), where CV(P_i, S_i) is the CV of P_i w.r.t. S_i.
A high-quality test paper should maximize the average discrimination degree and minimize CVs; in other words, it should have a high mean discrimination degree and a low constraint violation. Hence, the overall quality of a generated test paper depends on the user's preference between these two aspects. From a pedagogical perspective, users often pay more attention to the constraint satisfaction of test papers. To determine the quality of a generated test paper, the CVs can be bounded within a certain range. Here, we set four thresholds on the CVs of a high-quality test paper; the threshold values are obtained experimentally (as shown in Section 5.4). Based on these thresholds, generated test papers are classified as high, medium, or low quality according to their CV values.
5.3 Performance on Quality and Runtime
Data sets. As there is no benchmark data set available, we generate four large synthetic data sets, namely D1, D2, D3, and D4, for performance evaluation. These four data sets contain 20,000, 30,000, 40,000, and 50,000 questions, respectively. There are mainly three question types in each data set, namely fill-in-the-blank, multiple choice, and long question. In the first two data sets, D1 and D2, the value of each attribute is generated according to a uniform distribution, whereas in the other two, D3 and D4, the value of each attribute is generated according to a normal distribution. Our purpose is to measure the effectiveness and efficiency of the test paper generation process of each algorithm for both the balanced data sets (D1 and D2) and the imbalanced data sets (D3 and D4). Intuitively, it is more difficult to generate good quality test papers for D3 and D4 than for D1 and D2.
Table 2 summarizes the four data sets.
Table 2. Test Data Sets
Experimental procedures. To evaluate the performance of the BACSTG approach, we have designed 12 test specifications in the experiments. We vary the parameters in order to have different test criteria in the test specifications. The number of topics is specified between 2 and 40. The total time is set between 20 and 240 minutes, and it is set proportional to the number of selected topics for each specification. The average difficulty degree is specified randomly between 3 and 9. We perform the experiments according to the 12 test specifications for each of the following eight algorithms: GA, PSO, DE, ACO, TS, BAC, DAC, and BACSTG. We measure the runtime and quality of the generated test papers for each experiment.
Performance on quality. Fig. 7 shows the quality performance results of the eight techniques based on the mean discrimination degree and mean CV. As can be seen from Fig. 7a, BAC-STG consistently achieves a higher mean discrimination degree than the conventional BAC and the other heuristic techniques for the generated test papers. In particular, BAC-STG can generate test papers with quality close to the maximal achievable value. In addition, we also observe that BAC-STG consistently outperforms the other techniques on mean CV across the four data sets. The mean CVs of BAC-STG tend to decrease, whereas those of the other techniques increase quite fast when the data set size or the number of specified constraints gets larger. In particular, BAC-STG generates high-quality test papers for all data sets. Also, BAC-STG is able to generate higher quality test papers on larger data sets, while the other techniques generally degrade in generated paper quality as the data set size increases.
Fig. 7. Performance results based on the average quality of the 12 specifications.
Performance on runtime. Fig. 8 compares the runtime performance of the eight techniques on the four data sets. Here, the 12 specifications are sorted in increasing order of the number of topics in the constraints. The results clearly show that the proposed BAC-STG approach outperforms the conventional BAC and the other heuristic techniques, except DAC, in runtime on the different data sets. BAC-STG generally requires less than 2 minutes to complete the paper generation process. Moreover, the proposed BAC-STG approach is quite scalable in runtime across different data set sizes and distributions. In contrast, the other techniques (except DAC) are not efficient at generating high-quality test papers. In particular, the runtime performance of these techniques degrades quite badly as the data set size or the number of specified constraints gets larger, especially for the imbalanced data sets.
Fig. 8. Performance results based on the runtime of the 12 specifications.
Discussion. The good performance of BAC-STG is due to three main reasons. First, as BAC-STG is based on LP relaxation, it can maximize the average discrimination degree effectively and efficiently while satisfying the multiple constraints without using weighting parameters. As such, BAC-STG achieves better paper quality and runtime efficiency than the heuristic-based techniques. Second, BAC-STG has a more effective branching strategy than conventional BAC, thereby pruning the search space more effectively. This helps BAC-STG improve runtime and search promising unvisited subproblem nodes of the search tree for paper quality enhancement. Third, BAC-STG uses an efficient algorithm to generate the lifted cover inequalities, which also improves its computational efficiency on large-scale data sets compared with the conventional BAC technique. Moreover, as there are more questions with different attribute values in larger data sets and LP relaxation is effective for global optimization, BAC-STG is able to generate higher-quality test papers. Therefore, the BAC-STG approach is effective for STG in terms of paper quality and runtime efficiency.
Comparison between BAC-STG and DAC. DAC [23] is an efficient STG approach for online test paper generation. Table 3 gives the performance comparison between BAC-STG and DAC, based on the average results of the 12 test specifications for each data set. As shown in Fig. 7, the quality performance of BAC-STG is consistently better than that of DAC on the four data sets. However, the runtime performance of BAC-STG and DAC depends on the user specifications and data set distributions, as shown in Table 3 and Fig. 8.
Table 3. Performance Comparison between BACSTG and DAC
For the balanced uniform distribution data sets, DAC outperforms BAC-STG in runtime. DAC is in fact very fast, as it is designed to achieve online runtime requirements. As these data sets may contain enough relevant questions for DAC to optimize its solution without getting stuck in a local optimum, DAC outperforms BAC-STG in runtime in this situation because it is a heuristic-based technique, whereas BAC-STG is a global optimization method. For the imbalanced normal distribution data sets, BAC-STG achieves better runtime performance if the specified test papers contain many topics or have a high total time; otherwise, DAC achieves better runtime performance. The main reason is that the sparse matrix property of the 0-1 ILP formulation of BAC-STG is satisfied in this situation, which makes lifted cover cuttings more effective for 0-1 ILP optimization. Furthermore, as DAC focuses only on optimizing the constraint satisfaction of a unique initial solution, it can easily be trapped in a local optimum. This is especially the case for imbalanced data sets, where there may not be enough relevant questions for DAC to optimize the unique initial solution. As shown in Fig. 7, the quality performance of DAC degrades quite badly on both imbalanced data sets.
In short, BAC-STG consistently outperforms DAC on paper quality, while its runtime performance is comparable to that of DAC.
5.4 Expert Calibration of Test Quality Measures
Data sets. To gain further insight into the quality of the papers generated by the proposed BAC-STG approach, we have conducted a user evaluation. Here, we use two math data sets, namely AMath and UMath, which are constructed from G.C.E. A-Level Math and Undergraduate Math, respectively. For experimental purposes, the question type and topic attributes of each question are calibrated automatically, whereas the difficulty degree attribute is calibrated by 10 tutors of the first-year undergraduate mathematics subject CZ1800 Engineering Mathematics in the School of Computer Engineering, Nanyang Technological University. These tutors have good knowledge of the math contents of the two data sets. In the tutor calibration, we adopt a rating method [41] in which the tutors rate the difficulty degree of each question from 1 to 7, corresponding to the seven discretized difficulty levels of IRT: extremely easy, very easy, easy, medium, hard, very hard, and extremely hard. In addition, according to the IRT Normal Ogive model [24], an initial value for the discrimination degree of a question can be computed from its relation with a fixed user proficiency value and the difficulty degree.
Table 5 summarizes the two data sets.
Experimental procedures. In the experiments, we have designed 12 new specifications for the AMath data set and 12 new specifications for the UMath data set with different parameter values. We then generate test papers for these specifications using the eight techniques, yielding a total of 96 papers for each data set. The same 10 tutors participated in the user evaluation. The tutors are asked to evaluate the quality of each generated test paper by comparing its attributes with its original specification, using their own intuition without knowing our defined measures. Based on the similarities, they evaluate the overall quality of the generated paper and classify it into one of three categories: high, medium, and low. As a result, a total of 960 test paper evaluations are conducted for each data set.
Expert calibration on quality results. Fig. 9 shows the user evaluation results on the quality of the test papers generated by the eight techniques. As can be seen, BAC-STG achieves better quality performance than the other techniques. For the AMath data set, BAC-STG achieves promising results with 81 percent (97 papers), 12 percent (14 papers), and 7 percent (9 papers) for high-, medium-, and low-quality generated papers, respectively, from a total of 120 paper evaluations. Similarly, for the UMath data set, BAC-STG achieves promising results with 91 percent (109 papers), 6 percent (7 papers), and 3 percent (4 papers) for high-, medium-, and low-quality generated papers, respectively. On average, BAC-STG achieves 86, 9, and 5 percent for high-, medium-, and low-quality generated papers, respectively. In addition, it can also be observed that the proposed BAC-STG approach is able to improve the quality of the generated papers with a larger data set: it performs better on the UMath data set than on the AMath data set.
Fig. 9. Expert calibration on generated test quality measures.
Fig. 10. Webbased testing framework.
Moreover, we have further analyzed the quality of the generated papers based on the user evaluation results. After all the generated papers are classified into high, medium, and low quality, we have analyzed the CVs on topic, question type, difficulty degree, and total time based on each of the two data sets.
Table 4 gives the analysis results in terms of the mean and standard deviation of the corresponding CVs for the high-, medium-, and low-quality generated papers on the AMath and UMath data sets. In fact, we have used the averages of these mean CVs as the thresholds when determining high, medium, or low quality for the generated papers during the performance evaluation on test paper quality.
Table 4. Quality Analysis of the Generated Test Papers According to the Expert Calibration Results
Table 5. Math Data Sets
6. Web-Based Testing Framework
In this research, we have investigated a web-based testing system for e-learning in mathematics. Fig. 10 shows the proposed system framework, which consists of the following components: web server, math question database server using MySQL, STG, automatic solution checking, and automatic question calibration. The STG component is implemented based on the proposed BAC-STG approach. The automatic solution checking component is implemented based on the mathematical equivalence checking algorithm [42]. The automatic question calibration component is currently under development based on the educational data mining techniques mentioned earlier in Section 3.1. These components are implemented in Java, and the system can be accessed through a web browser.
As equivalent mathematical expressions can be written in different forms, the automatic solution checking component automatically checks the equivalence of students' answers against the standard solutions to evaluate their correctness. Currently, it focuses on automatic answer checking for mathematical expressions, which are the most common form of required answer in math questions. To check mathematical answers, a randomized algorithm is used based on the probabilistic numerical testing method [42]. The algorithm has shown promising performance on different types of mathematical expressions such as multivariate polynomials and trigonometric functions. Compared with other web-based testing systems, which are only able to support multiple-choice answer checking, our proposed system has the additional advantage of supporting answer checking for advanced mathematical expressions.
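The idea of probabilistic numerical testing can be sketched as follows: two expressions are judged equivalent when they agree at many random sample points, so a wrong answer is rejected with high probability while a correct answer in a different algebraic form is accepted. This simplified sketch (our names, single-variable real case) omits the sampling strategy and error analysis of the actual algorithm.

```python
import math
import random

def probably_equivalent(f, g, domain=(-10.0, 10.0), trials=30, tol=1e-6):
    """Probabilistic numerical testing: judge two single-variable
    expressions equivalent when they agree at many random points."""
    lo, hi = domain
    for _ in range(trials):
        x = random.uniform(lo, hi)
        try:
            fx, gx = f(x), g(x)
        except (ValueError, ZeroDivisionError):
            continue                    # point outside a domain: resample
        if not math.isclose(fx, gx, rel_tol=tol, abs_tol=tol):
            return False                # a single mismatch is conclusive
    return True
```

For example, sin^2(x) + cos^2(x) agrees with the constant 1 at every sample point, while x and x + 1 disagree everywhere, so the test separates them reliably.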
In this paper, we have proposed an efficient ILP approach called BAC-STG for high-quality STG from large-scale question data sets. The proposed BAC-STG approach is based on the branch-and-bound and lifted cover cutting methods, and finds a near-optimal solution by exploiting the sparse matrix property of the 0-1 ILP. The performance results on various data sets and a user evaluation of generated test paper quality have shown that the BAC-STG approach achieves test paper generation with not only high quality but also runtime efficiency when compared with other STG techniques. As such, the proposed BAC-STG approach is particularly useful for web-based testing and assessment in online learning environments.
From a pedagogical perspective on generating high-quality test papers, the current web-based testing approach and system can be extended in the following four ways. First, automatic calibration of question attributes should be supported to generate large-scale data sets; we are currently investigating different data mining techniques for this purpose. Second, the current approach could be extended to support the generation of multiple test papers simultaneously from the same test paper specification. The quality of these generated test papers should be comparable, so that different test papers of equivalent or similar properties and quality can be used to evaluate students in an online class. Third, in the expert calibration, it would also be very useful to consider a more general quality definition based on the capability of a test to make a valid cognitive assessment of a user over a set of skills. Finally, the current system does not allow further changes to individual questions of a generated test paper after the generation process. This could be enhanced by allowing users to edit, modify, and update individual questions of a generated test paper, making the web-based testing system more flexible in creating high-quality test papers.
M.L. Nguyen and S.C. Hui are with the School of Computer Engineering, Nanyang Technological University, Block N4, B3c, DISCO Lab, 50 Nanyang Avenue, Singapore 639798.
Email: {NGUY0093, asschui}@ntu.edu.sg.
A.C.M. Fong is with the School of Computing and Math Sciences, Auckland University of Technology, New Zealand.
Email: acmfong@gmail.com.
Manuscript received 24 May 2012; revised 13 Sept. 2012; accepted 24 Nov. 2012; published online 29 Nov. 2012.
For information on obtaining reprints of this article, please send email to: lt@computer.org, and reference IEEECS Log Number TLT2012050075.
Digital Object Identifier no. 10.1109/TLT.2012.22.
1. http://www.khanacademy.org/.
2. https://www.coursera.org/.
3. http://www.udacity.com/.
4. http://ocw.mit.edu/index.htm.
5. http://wp.sigmod.org/?p=165.
6. http://www.ets.org.
7. http://www.artofproblemsolving.com/Forum/portal.php?ml=1.
Minh Luan Nguyen is currently working toward the PhD degree at the School of Computer Engineering, Nanyang Technological University, Singapore. He is a member of the IEEE.
Siu Cheung Hui received the BSc degree in mathematics in 1983 and the DPhil degree in computer science in 1987 from the University of Sussex, United Kingdom. He is an associate professor at the School of Computer Engineering, Nanyang Technological University, Singapore. He was with IBM China/Hong Kong Corporation as a system engineer from 1987 to 1990. His current research interests include data mining, web mining, Semantic Web, intelligent systems, information retrieval, intelligent tutoring systems, timetabling, and scheduling.
Alvis C.M. Fong is a professor at the Auckland University of Technology, New Zealand. Previously, he was an associate professor with Nanyang Technological University, Singapore. His research interests include information processing and management, multimedia, and communications. He is a senior member of the IEEE.