University of Michigan
Virginia Tech
University of Florida
Pages: pp. 72-77
Abstract—This approach uses genetic programming to automatically evolve new retrieval algorithms based on a user's evaluation of previously viewed documents.
Anyone who's used a computer to find information on the Web knows that the experience can be frustrating. Search engines are incorporating new techniques (such as examining document link structures) to increase effectiveness. However, searchers all too often face one of two outcomes: reviewing many more Web pages than they'd prefer or failing to find as much useful information as they really want.
The problem might be most frustrating to knowledge workers whose livelihoods depend on keeping current with some aspect of the world. Even when such individuals have unchanging information needs, this consistency doesn't yield any extra advantage in terms of improved retrieval performance. Search engines don't typically tailor their algorithms to knowledge workers' persistent information needs.
Here, we introduce a new retrieval technique that exploits users' persistent information needs. These users might include business analysts specializing in genetic technologies, stockbrokers keeping abreast of wireless communications, and legislators needing to understand computer privacy and security developments. To help such searchers, we evolve effective search programs by using feedback based on users' judgments about the relevance of the documents they've retrieved. The algorithm we use is based on genetics and uses selection and modification to "breed" increasingly effective retrieval algorithms.
Historically, information retrieval has attempted to develop ways to effectively retrieve documents. Computer programs can't read and understand text written about an unrestricted range of topics. So, IR has adopted certain heuristics to try to represent documents' contents so that we can store these representations in computer databases and later retrieve the documents they represent.
By far the most common heuristic for doing this is attaching keywords to documents, with each keyword suggesting a topic that the document is supposed to address. Similarly, searchers typically enter queries using keywords. A matching function compares the information in document representations with the information in the query representation to assign each document a retrieval score. The application presents documents to the user in decreasing order of this retrieval score. One of the most popular matching functions, for example, is the cosine measure. 1 It computes the cosine of the angle between a query keyword vector and a document keyword vector. The smaller the angle between these two vectors, the higher the document's matching score.
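As a minimal sketch of the cosine measure, the following Python represents the query and each document as sparse keyword vectors (dictionaries mapping a term to a weight). The function names `cosine_score` and `rank`, and the dictionary representation, are illustrative assumptions rather than part of any particular retrieval system.

```python
import math


def cosine_score(query_vec, doc_vec):
    """Cosine of the angle between two sparse keyword vectors (term -> weight)."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (q_norm * d_norm)


def rank(query_vec, docs):
    """Return document ids in decreasing order of cosine score."""
    return sorted(docs, key=lambda doc_id: cosine_score(query_vec, docs[doc_id]),
                  reverse=True)
```

A document that shares no keywords with the query scores 0, and a document whose vector points in the same direction as the query scores 1, regardless of vector length.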
A problem with using keyword representations is that, although two keywords might each accurately describe the same document, one might be more applicable given the document's contents.
To accommodate such differences, IR algorithms have many methods of weighting terms and phrases. For instance, we presume that a document that uses a particular term frequently is better represented by that term than a document that uses it less frequently. Similarly, when a relatively rare term—such as "accordion"—occurs in a document, we might deem it more indicative of the document's subject than when a more common term—such as "and" or "computer"—occurs. So, term-weighting strategies often weight a term on the basis of its number of occurrences in a document as well as its rarity in the document collection. Obviously, different term-weighting strategies would result in different retrieval performance because the retrieval score would differ with each strategy.
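One widely used instance of this idea multiplies a term's frequency in the document by the logarithm of its inverse document frequency. The sketch below assumes that particular formula (tf × log(N/df)); it is only one of the many weighting strategies alluded to here, and the function and parameter names are illustrative.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """Weight that grows with in-document frequency (tf) and with rarity (low df).

    tf: occurrences of the term in the document
    df: number of documents in the collection that contain the term
    n_docs: total number of documents in the collection
    """
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)
```

Under this formula, a term that appears in every document receives a weight of zero, whereas a rare term such as "accordion" receives a large weight each time it occurs.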
No uniformly best approach exists for supplying term weights. Various careful studies have compared different term-weighting strategies, but none consistently outperforms the others when measured by its ability both to identify relevant documents and to insulate the user from nonrelevant documents. 1,2 Several factors exacerbate the difficulty of developing effective weighting strategies: differences among document collections (their contents, style, and vocabularies), among users (from novice to sophisticated), and among queries (number of terms used and terms' specificity).
However, although a single best general-purpose approach to weighting terms might not exist, a best method does exist for weighting terms for a particular user with a recurring need for information. By employing an adaptive algorithm using genetic programming (see the "Intro to Genetic Programming" sidebar), we seek to evolve a matching function that approximates that method.
To determine an effective IR weighting program for a specific user, we first choose the appropriate features (see table 1). We can easily compute each of these descriptive statistics for any term, any document, and any document collection. Additionally, we used five operators to construct our "weighting trees": +, -, ×, /, and log.
The token frequency (tf) function returns a number indicating how frequently a given word or phrase occurs in a document. As we've suggested, this measure provides information about the term's suitability as an index term: terms that occur more frequently are thought to be better descriptors and to deserve larger weights. The document frequency (df) function returns the number of documents in the entire collection that contain a particular term. This also provides useful information, on the assumption that a rare term's appearance offers more meaning than that of a more common term. Table 1 describes other terminals such as tf_max and tf_avg.
Genetic programming can help us generate a program that effectively combines document features for a particular user with a persistent information need. For example, figure 1 shows one combination of features that the program might produce.
Figure 1 A sample genetic programming tree (tf denotes token frequency; df denotes document frequency).
The program's semantics gives each term in a document this weight: one-eighth of its token frequency times the difference between its document frequency and its token frequency. Although this is simply an example, the fact that our program could have evolved it is important for two related reasons. First, genetic programming often produces feature combinations that differ significantly from what a knowledgeable designer might think up. (Most weighting algorithms involve a ratio-like relationship between a term's token frequency and its document frequency.) Second, despite the strange (even cumbersome) programs that genetic programming produces, they often produce strikingly effective results, even if an expert programmer would have an extremely difficult time reading, let alone writing, the program. What matters is performance, not comprehensibility or ease of maintenance.
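To make the semantics concrete, here is a minimal sketch of how such a weighting tree could be stored and evaluated. The nested-tuple encoding, the `evaluate` function, and the "protected" division and logarithm (which return a fallback value instead of failing on zero) are assumptions made for illustration; the exact shape of the tree in figure 1 may differ, although the sample tree below computes the weight just described.

```python
import math
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the divisor is zero (an assumed convention)."""
    return a / b if b != 0 else 1.0


def protected_log(a):
    """Natural log of |a|, or 0.0 when a is zero (an assumed convention)."""
    return math.log(abs(a)) if a != 0 else 0.0


# The five operators used to build weighting trees.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
    "log": protected_log,
}


def evaluate(tree, features):
    """Evaluate a tree given a dict of feature values such as {"tf": 4, "df": 10}."""
    if isinstance(tree, tuple):          # non-leaf node: (operator, child, ...)
        op, *children = tree
        return FUNCTIONS[op](*(evaluate(c, features) for c in children))
    if isinstance(tree, str):            # leaf node: a named feature (terminal)
        return features[tree]
    return float(tree)                   # leaf node: a numeric constant


# (tf / 8) * (df - tf), one encoding of the weight described in the text.
SAMPLE_TREE = ("*", ("/", "tf", 8), ("-", "df", "tf"))
```

For a term with a token frequency of 4 and a document frequency of 10, `evaluate(SAMPLE_TREE, {"tf": 4, "df": 10})` gives (4/8) × (10 − 4) = 3.0.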
Genetic algorithms are attractive because of their intrinsically parallel search mechanisms and powerful global exploration capabilities in a high-dimensional space. Programmers have used genetic algorithms to adapt document descriptions, modify query representations, and fine-tune parameters in matching functions. 3-7 However, up to now, genetic programming has seen little application in IR.
We ran 35 separate mini-experiments to determine how well we could evolve effective weighting programs. For each experiment, we began with a query selected from a TREC (Text Retrieval Conference) test collection of stories from the Associated Press. TREC is the primary conference on IR techniques for large-scale text collections. 8 Many IR systems and commercial engines use TREC data as a test bed for validating and evaluating their performance. Figures 2a and 2b show a sample TREC document and query. We randomly divided the TREC data into three sets: one each for training, validation, and testing. Each data set contained approximately 80,000 documents. A three-data-set design is a common practice in machine learning experiments to avoid overfitting a solution to the given data. We used the training data set to train the algorithm and the validation data set to pick the best tree available from the training phase (in the hope that it would perform well on unseen documents). Finally, we used the test data set to test the algorithm on unseen documents and to report the results.
Figure 2 A sample (a) TREC (Text Retrieval Conference) document and (b) TREC query (topic).
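The random three-way division of the collection described above can be sketched as follows. The function name, the fixed seed, and the equal-thirds split are illustrative assumptions; the text says only that the three sets were chosen at random and were of roughly equal size.

```python
import random


def three_way_split(doc_ids, seed=0):
    """Randomly divide document ids into training, validation, and test sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)   # seeded so the split is reproducible
    third = len(ids) // 3
    training = ids[:third]
    validation = ids[third:2 * third]
    testing = ids[2 * third:]          # any remainder goes to the test set
    return training, validation, testing
```

Keeping the test set entirely separate from the sets used for training and for selecting the final program is what makes the reported results an estimate of performance on unseen documents.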
Table 2 lists the experimental settings. Beginning with 200 randomly generated weighting programs associated with a particular user's query, we determined the effectiveness of each in retrieving 500 documents from a training set of nearly 80,000 documents. That is, for each of the nearly 80,000 documents, a weighting program computed document weights for each associated term in the user's query. Added together, the weights produced an overall matching score for that document, according to that particular weighting program. By imposing a document cut-off value (DCV) (the number of documents the user is willing to see) of 500 and using TREC's predetermined judgments of which documents were relevant for this query, we could tell how many of the top 500 ranked documents were and weren't relevant. This let us calculate the fitness function, P_Avg (defined in table 3), for the individual weighting program.
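The scoring and fitness computation just described can be sketched as follows. Table 3 holds the actual definition of P_Avg; this sketch assumes it is the non-interpolated average precision commonly reported in TREC evaluations, computed over the top DCV documents and normalized by the total number of relevant documents. The function names and the data layout (a dictionary of per-term feature dictionaries for each document) are also assumptions.

```python
def score_documents(weight_fn, query_terms, collection):
    """Sum the weights of the query terms present in each document.

    collection: doc_id -> {term -> feature dict}, for example {"tf": 3, "df": 120}
    weight_fn: a weighting program that maps a feature dict to a number
    """
    return {
        doc_id: sum(weight_fn(terms[t]) for t in query_terms if t in terms)
        for doc_id, terms in collection.items()
    }


def average_precision(scores, relevant_ids, dcv=500):
    """Assumed form of P_Avg: average precision over the top `dcv` documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)[:dcv]
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    return precision_sum / len(relevant)
```

A weighting program that places every relevant document ahead of every nonrelevant one scores 1.0 under this measure; pushing relevant documents down the ranking, or below the cut-off, lowers the score.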
We separately performed a similar set of computations and evaluations for each of the other 199 weighting programs in use for this query. The top 10 percent of the programs ranked by fitness (the reproduction rate) automatically entered the next generation. We used tournament selection to select the remaining trees for the next generation. We reproduced more copies of the better-performing programs, subjected most to crossover, and tested this new set of programs against the same training documents. We repeated this process 30 times. We also noted, for each generation, the best 10 of the 200 programs.
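One generation of this process might look like the following sketch. The `fitness` and `crossover` callables are supplied by the caller, and the tournament size of 6 is a placeholder: the actual settings are in table 2 and are not reproduced here. All function and parameter names are illustrative.

```python
import random


def tournament_select(scored, rng, size):
    """Draw `size` (fitness, program) pairs at random and return the fittest program."""
    contenders = rng.sample(scored, size)
    return max(contenders, key=lambda pair: pair[0])[1]


def next_generation(population, fitness, crossover, rng,
                    reproduction_rate=0.10, tournament_size=6):
    """Build the next population from the current one.

    fitness: program -> number (higher is better)
    crossover: (parent_a, parent_b, rng) -> iterable of child programs
    """
    scored = sorted(((fitness(p), p) for p in population),
                    key=lambda pair: pair[0], reverse=True)
    n_copied = round(len(population) * reproduction_rate)
    new_population = [p for _, p in scored[:n_copied]]   # fittest copied unchanged
    while len(new_population) < len(population):
        parent_a = tournament_select(scored, rng, tournament_size)
        parent_b = tournament_select(scored, rng, tournament_size)
        for child in crossover(parent_a, parent_b, rng):
            if len(new_population) < len(population):
                new_population.append(child)
    return new_population
```

Calling `next_generation` repeatedly, and recording the best few programs after each call, corresponds to the loop of generations described in the text.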
At the end of a given query's training period (conducted to evolve programs that we hoped would be effective for retrieval), we used the 300 saved weighting programs (the best 10 from each of the 30 generations) on a different set of validation documents in the TREC collection to select the best-performing weighting program, that is, the one we thought would work well for new, unseen data. We ran this program against the test documents in the TREC collection to obtain the final performance result. For each of the 35 mini-experiments, we followed the same process using a different query.
We compared the best performing of these weighting programs in several ways. (All results reported in this section are statistically significant at p < 0.05.) Table 3 shows the performance measures. First, we compared our program's performance against a retrieval program called SMART, which has performed extremely well in various retrieval conditions in the TREC competitions. 9 Using a DCV of 1,000 for testing and TREC's benchmark statistic P_Avg, our genetic programming approach was superior for 34 out of 35 test queries, usually many times better. We also compared the two systems' performances when only the first 10 documents were retrieved (P_10), which resembles many real-life retrieval situations where users require only a few documents. Again, our system outperformed SMART, this time for all 35 queries.
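As a small illustration of the second measure, the sketch below assumes that P_10 is precision at a cut-off of 10 documents, that is, the fraction of the first 10 retrieved documents that are relevant. Table 3 holds the actual definition; the function name and signature are assumptions.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top k ranked documents that are relevant (assumed P_10 for k=10)."""
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k
```

Unlike average precision, this measure ignores the ordering within the top k and everything ranked below it, which suits users who look at only the first page of results.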
Next, we compared our program's performance against a competing adaptive approach—a neural network. Neural networks have also been used widely in IR and routing experiments. 10 In terms of P_Avg, genetic programming outperformed the neural network approach 30 out of 35 times. Most of the performance comparisons are quite dramatic (more than 100 percent improvement). We obtained similar performance results when considering only the top 10 documents retrieved. Overall, genetic programming outperformed the neural network by more than 100 percent in both P_Avg and P_10. However, a neural network's performance partly depends on its tuning. Although we strived to find effective parameter settings, we selected them empirically, not systematically through an optimization routine.
We repeated all our experiments with a slight variation. Whereas our original experiments used only the title and the description portions of a topic such as the one shown in figure 2b as the user query (we call this a short query), our modified experiments used these fields plus other fields in figure 2b, such as the narrative and concepts, to form a long query. With this modified query representation, genetic programming still outperformed SMART and the neural network.
Figures 3a–c graphically compare the results of the three systems for P_Avg, P_10, and T_Rel_Ret, for both the short and long queries.
Figure 3 Comparison of system performance for (a) P_Avg, (b) P_10, and (c) T_Rel_Ret.
Interestingly, as the basis of query representation expanded to include narrative and concepts as well as title and description (that is, as we began to use long, not short, queries), all three systems improved. For all three performance measures, the neural network improved the most on a percentage basis; for P_Avg and P_10, genetic programming improved approximately 20 percent more than did the SMART system. For T_Rel_Ret, the two systems improved about the same. These results imply that adaptive retrieval systems benefit from rich query representations.
We've shown that genetic programming offers a new way to develop IR programs. Translated into everyday practice, the technique we developed can help provide knowledge workers regularly seeking information on the same topic with increasingly effective, customized retrieval programs.
Much exciting work remains in applying genetic programming to the problem of IR. Others might use a broader set of features than what we used in our work (functions, operators, and terminals). And although we applied the same weight to every term under consideration, you don't have to have identical weights for all terms.
This work doesn't represent the last word on evolving programs that are highly effective in retrieving information from the World Wide Web or other large data stores. But, we believe it is the first word, and it represents a promising avenue of research and application.
Genetic programming is based on genetic algorithms, which mimic natural genetic operations in which the genetic material of fitter entities flourishes, and crossover and modifications to their genes ensure the exploration of new genetic combinations. 1 With genetic algorithms, although strings of bits or real values typically represent individual chromosomes, other data structures are possible. John Koza showed that you can evolve tree structures representing programs, resulting in a technique he called genetic programming. 2
In genetic programming, each program that represents a potential solution to a problem is encoded as a tree, and each generation contains many such solutions (or trees). Each solution's fitness is computed, and the solutions are ranked by fitness. Fitter trees from one generation get more representation in the next generation (the selection operation). The trees also undergo crossover, an operation that exchanges genetic material between two parent trees. Sometimes another operation, mutation, is applied to randomly selected nodes in the trees to keep the solution trees from getting stuck in a locally optimal solution. Selection, crossover, and mutation repeat for a sufficient number of generations until a solution is found or until no further significant performance improvement occurs from one generation to the next.
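The crossover operation can be sketched as an exchange of randomly chosen subtrees between two parents. The nested-tuple tree encoding, in which a non-leaf node is written as (operator, child, ...), and all function names here are assumptions chosen for illustration.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (a tuple of child positions) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get_subtree(tree, path):
    """Return the subtree found by following `path` from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced by `new_subtree`."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new_subtree),) + tree[i + 1:]


def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen subtree of each parent, giving two children."""
    path_a = rng.choice(list(paths(parent_a)))
    path_b = rng.choice(list(paths(parent_b)))
    child_a = replace_subtree(parent_a, path_a, get_subtree(parent_b, path_b))
    child_b = replace_subtree(parent_b, path_b, get_subtree(parent_a, path_a))
    return child_a, child_b
```

Because the trees are immutable tuples, the parents are left untouched and can still be copied into the next generation. A mutation operator could be built on the same helpers by replacing a randomly chosen node with a new random leaf or subtree.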
For example, suppose you wanted to evolve a program that could compute a number's square root. (Naturally, this wouldn't be a typical genetic-programming application, which normally evolves a program for which there is no known solution or acceptable approximation.) At some intermediate generation, we might have two programs in the population of possible solutions, as shown in figures A1 and A2. Each of these solutions (or trees) operates on the input value n.
Figure A Solutions for the square root problem: the (1) first, (2) second, and (3) third trees, and (4) the trees after crossover.
Table A shows how the two programs would perform for several values of n. Although neither program computes an effective approximation of the square root function, the second performs consistently better. So, its constituent parts would appear more often in the intermediate stage between generations, just before crossover. Crossover would involve the exchange of subtrees. Crossing the second tree with the tree in figure A3 could result in the two trees shown in figure A4. (We say "could" because the resulting trees depend on the crossover points used in each tree.) If applied, a mutation operator could randomly change the value of one or more nodes in these figures. These trees' performance would be computed and compared against the performance of all other trees during the then-current generation, thus determining how often each of these trees would be represented in the next intermediate generation before crossover occurs again. Typically, the trees become increasingly fit (that is, they represent the solution better and better) from one generation to the next. Then, they eventually settle down on or near the optimal solution.
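Comparing candidate programs in this way amounts to measuring each one's error over a set of inputs. In the sketch below, the two candidates are hypothetical stand-ins invented for illustration; they are not the trees in figures A1 and A2, and the inputs are not the values in table A.

```python
import math


def total_error(program, inputs):
    """Sum of absolute differences between the program's output and the true square root."""
    return sum(abs(program(n) - math.sqrt(n)) for n in inputs)


def candidate_a(n):
    """Hypothetical candidate program: n / 2."""
    return n / 2


def candidate_b(n):
    """Hypothetical candidate program: n / 4 + 1."""
    return n / 4 + 1


INPUTS = [1, 4, 9, 16]   # hypothetical test values
```

Over these inputs, `candidate_b` has the smaller total error, so a selection step that favors lower error would give its subtrees more copies in the intermediate population before crossover.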
As this example illustrates, genetic programming features two different types of nodes. Leaf nodes are constants or variables that have their own value. Nonleaf nodes are operators and functions that can compute values based on subtrees, variables, or constants. Because computing a square root is a numeric problem, we chose numeric leaf nodes and operators for the example. Programmers choose the "raw materials" for genetic programming before adaptation, and this choice is an important factor in determining whether the evolution will produce a favorable result. We might say that programmers must determine the right features for a problem to be solved, and the genetic program itself attempts to determine the best combination of those features.
References
1. J.H. Holland, Adaptation in Natural and Artificial Systems, 2nd ed., MIT Press, 1992.
2. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992.