SEPTEMBER/OCTOBER 2006 (Vol. 21, No. 5) pp. 72-77
1541-1672/06/$31.00 © 2006 IEEE
Published by the IEEE Computer Society
Adaptive Web Search: Evolving a Program That Finds Information
This approach uses genetic programming to automatically evolve new retrieval algorithms based on a user's evaluation of previously viewed documents.
Anyone who's used a computer to find information on the Web knows that the experience can be frustrating. Search engines are incorporating new techniques (such as examining document link structures) to increase effectiveness. However, searchers all too often face one of two outcomes: reviewing many more Web pages than they'd prefer or failing to find as much useful information as they really want.
The problem might be most frustrating to knowledge workers whose livelihoods depend on keeping current with some aspect of the world. Even when such individuals have unchanging information needs, this consistency doesn't yield any extra advantage in terms of improved retrieval performance. Search engines don't typically tailor their algorithms to knowledge workers' persistent information needs.
Here, we introduce a new retrieval technique that exploits users' persistent information needs. These users might include business analysts specializing in genetic technologies, stockbrokers keeping abreast of wireless communications, and legislators needing to understand computer privacy and security developments. To help such searchers, we evolve effective search programs by using feedback based on users' judgments about the relevance of the documents they've retrieved. The algorithm we use is based on genetics and uses selection and modification to "breed" increasingly effective retrieval algorithms.
Information retrieval and evolutionary algorithms
Historically, information retrieval has attempted to develop ways to effectively retrieve documents. Computer programs can't read and understand text written about an unrestricted range of topics. So, IR has adopted certain heuristics to try to represent documents' contents so that we can store these representations in computer databases and later retrieve the documents they represent.
By far the most common heuristic for doing this is attaching keywords to documents, with each keyword suggesting a topic that the document is supposed to address. Similarly, searchers typically enter queries using keywords. A matching function compares the information in document representations with the information in the query representation to assign each document a retrieval score. The application presents documents to the user in decreasing order of this retrieval score. One of the most popular matching functions, for example, is the cosine measure [1]. It finds the angle between a query keyword vector and a document keyword vector. The closer these two vectors are to each other, the higher the document's matching score.
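The cosine measure described above can be sketched as follows. This is a minimal illustration, assuming queries and documents are stored as sparse dictionaries that map keywords to weights; the function names `cosine_score` and `rank_documents` are hypothetical and do not come from the article.

```python
import math

def cosine_score(query_vec, doc_vec):
    """Cosine of the angle between two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (q_norm * d_norm)

def rank_documents(query_vec, docs):
    """Return document ids in decreasing order of retrieval score."""
    scores = {doc_id: cosine_score(query_vec, vec) for doc_id, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

query = {"wireless": 1.0, "network": 1.0}
docs = {
    "d1": {"wireless": 1.0, "network": 1.0},
    "d2": {"wireless": 1.0, "accordion": 1.0},
    "d3": {"accordion": 1.0},
}
ranking = rank_documents(query, docs)
```

A document that shares every query keyword points in the same direction as the query and scores 1, while a document with no keywords in common scores 0.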
A problem with using keyword representations is that, although two keywords might each accurately describe the same document, one might be more applicable given the document's contents.
To accommodate such differences, IR algorithms have many methods of weighting terms and phrases. For instance, we presume that a document that uses a particular term frequently is better represented by that term than a document that uses it less frequently. Similarly, when a relatively rare term—such as "accordion"—occurs in a document, we might deem it more indicative of the document's subject than when a more common term—such as "and" or "computer"—occurs. So, term-weighting strategies often weight a term on the basis of its number of occurrences in a document as well as its rarity in the document collection. Obviously, different term-weighting strategies would result in different retrieval performance because the retrieval score would differ with each strategy.
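One widely used way to combine these two signals multiplies a term's count in the document by the logarithm of its inverse document frequency. The sketch below illustrates that general pattern only; it is not the weighting that the evolved programs use, and the function name and the example collection size of 80,000 are assumptions.

```python
import math

def tf_idf_weight(tf, df, n_docs):
    """Weight grows with occurrences in the document (tf) and with rarity
    in the collection (a small df relative to n_docs)."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# Same in-document frequency, very different rarity in the collection.
rare = tf_idf_weight(tf=2, df=5, n_docs=80000)        # a term like "accordion"
common = tf_idf_weight(tf=2, df=40000, n_docs=80000)  # a term like "computer"
```

Changing the formula (for instance, damping `tf` with a logarithm) changes every retrieval score, which is why different strategies yield different retrieval performance.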
No uniformly best approach exists for supplying term weights. Various careful studies have compared different term-weighting strategies, but none consistently outperforms the others when measured by its ability both to identify relevant documents and to insulate the user from nonrelevant documents [1, 2]. Several factors exacerbate the difficulty of developing effective weighting strategies: differences among document collections (their contents, style, and vocabularies), among users (from novice to sophisticated), and among queries (number of terms used and terms' specificity).
However, although a single best general-purpose approach to weighting terms might not exist, a best method does exist for weighting terms for a particular user with a recurring need for information. By employing an adaptive algorithm using genetic programming (see the "Intro to Genetic Programming" sidebar), we seek to evolve a matching function that approximates that method.
To determine an effective IR weighting program for a specific user, we first choose the appropriate features (see table 1). We can easily compute each of these descriptive statistics for any term, any document, and any document collection. Additionally, we used five operators to construct our "weighting trees": +, -, ×, /, and log.
The token frequency (tf) function returns a number indicating how frequently a given word or phrase occurs in a document. As we've suggested, this measure provides information about the term's suitability as an index term, with terms occurring more frequently thought to be better descriptors and to deserve larger weights. The document frequency (df) function returns a number that tells, for the entire document collection, how many documents contain a particular term. This also provides useful information on the assumption that a rare term's appearance offers more meaning than that of a more common term. Table 1 describes other terminals such as tf_max and tf_avg.
Genetic programming can help us generate a program that effectively combines document features for a particular user with a persistent information need. For example, figure 1 shows one combination of features that the program might produce.
The program's semantics gives each term in a document this weight: one-eighth of its token frequency multiplied by the difference between its document frequency and its token frequency. Although this is simply an example, the fact that our program could have evolved it is important for two related reasons. First, genetic programming often produces feature combinations that differ significantly from what a knowledgeable designer might think up. (Most weighting algorithms involve a ratio-like relationship between a term's token frequency and its document frequency.) Second, despite the strange (even cumbersome) programs that genetic programming produces, they often produce strikingly effective results, even if an expert programmer would have an extremely difficult time reading, let alone writing, the program. What matters is performance, not comprehensibility or ease of maintenance.
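The example weight above can be written as a small expression tree and evaluated recursively. The sketch below is one possible encoding, assuming nested tuples for operator nodes and strings for terminals; the exact shape of the tree in figure 1 is not reproduced here, and the guarded division and logarithm are assumptions (a common genetic programming convention), not details confirmed by the article.

```python
import math

# One tree with the stated meaning: (tf / 8) * (df - tf).
EXAMPLE_TREE = ("*", ("/", "tf", 8.0), ("-", "df", "tf"))

def evaluate(tree, features):
    """Evaluate a weighting tree against one term's feature values."""
    if isinstance(tree, str):           # terminal: look up a feature such as "tf"
        return float(features[tree])
    if isinstance(tree, (int, float)):  # terminal: numeric constant
        return float(tree)
    op, *args = tree
    vals = [evaluate(arg, features) for arg in args]
    if op == "+":
        return vals[0] + vals[1]
    if op == "-":
        return vals[0] - vals[1]
    if op == "*":
        return vals[0] * vals[1]
    if op == "/":  # assumed guard: division by zero yields 1.0
        return vals[0] / vals[1] if vals[1] != 0 else 1.0
    if op == "log":  # assumed guard: log of zero yields 0.0
        return math.log(abs(vals[0])) if vals[0] != 0 else 0.0
    raise ValueError(f"unknown operator: {op}")

weight = evaluate(EXAMPLE_TREE, {"tf": 4, "df": 20})  # (4 / 8) * (20 - 4)
```

Because every tree built from these operators and terminals evaluates to a number, any randomly generated or recombined tree is a usable weighting program, however unreadable it may be.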
Genetic algorithms are attractive because of their intrinsically parallel search mechanisms and powerful global exploration capabilities in a high-dimensional space. Programmers have used genetic algorithms to adapt document descriptions, modify query representations, and fine-tune parameters in matching functions [3-7]. However, up to now, genetic programming has seen little application in IR.
Experiments and results
We ran 35 separate mini-experiments to determine how well we could evolve effective weighting programs. For each experiment, we began with a query selected from a TREC (Text Retrieval Conference) test collection of stories from the Associated Press. TREC is the primary conference on IR techniques for large-scale text collections [8]. Many IR systems and commercial engines use TREC data as a test-bed for validating and evaluating their performance. Figures 2a and 2b show a sample TREC document and query. We randomly divided the TREC data into three sets: one each for training, validation, and testing. Each data set contained approximately 80,000 documents. A three-data-set design is a common practice in machine learning experiments to avoid overfitting a solution to the given data. We used the training data set to train the algorithm and the validation data set to pick the best tree available from the training phase (in the hope that it would perform well on unseen documents). Finally, we used the test data set to test the algorithm on unseen documents and to report the results.
Table 2 lists the experimental settings. Beginning with 200 randomly generated weighting programs associated with a particular user's query, we determined the effectiveness of each in retrieving 500 documents from a training set of nearly 80,000 documents. That is, for each of the nearly 80,000 documents, a weighting program computed document weights for each associated term in the user's query. Added together, the weights produced an overall matching score for that document, according to that particular weighting program. By imposing a document cut-off value (DCV) (the number of documents the user is willing to see) of 500 and using TREC's predetermined judgments of which documents were relevant for this query, we could tell how many of the top 500 ranked documents were and weren't relevant. This let us calculate the fitness function, P_Avg (defined in table 3), for the individual weighting program.
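The scoring and fitness computation just described can be sketched as follows. Table 3 holds the actual definition of P_Avg, which is not reproduced in this text; the sketch assumes the usual non-interpolated average precision over the ranked list, truncated at the DCV and normalized by the total number of relevant documents. All function names are hypothetical.

```python
def score_document(query_terms, doc_term_weights):
    """Sum the weights a weighting program assigned to the query's terms."""
    return sum(doc_term_weights.get(term, 0.0) for term in query_terms)

def rank_by_score(query_terms, weighted_docs):
    """Order document ids by decreasing matching score."""
    return sorted(
        weighted_docs,
        key=lambda doc_id: score_document(query_terms, weighted_docs[doc_id]),
        reverse=True,
    )

def average_precision(ranked_ids, relevant_ids, dcv):
    """Assumed form of P_Avg: mean of the precision values at each relevant
    document found within the top `dcv` ranks."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:dcv], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

weighted_docs = {"a": {"x": 2.0}, "b": {"x": 1.0, "y": 0.5}, "c": {"z": 3.0}}
ranking = rank_by_score(["x", "y"], weighted_docs)
fitness = average_precision(ranking, relevant_ids={"a", "c"}, dcv=3)
```

In the experiments the DCV was 500 during training; the tiny value here only keeps the example readable.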
We performed separately a similar set of computations and evaluations for each of the other 199 weighting programs in use for this query. The top 10 percent (the reproduction rate) of the programs in terms of fitness function were automatically entered into the next generation. We used tournament selection to select the trees for the next generation. We reproduced more copies of the better-performing programs, subjected most to crossover, and tested this new set of programs against the same training documents. We repeated this process 30 times. We also noted, for each generation, the best 10 of the 200 programs.
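One generation of this process might look like the sketch below. It assumes weighting trees are nested tuples and that crossover swaps a randomly chosen subtree of one parent for a randomly chosen subtree of another; the tournament size of 6 is an arbitrary assumption, since table 2 is not reproduced in this text. For brevity, every program outside the top 10 percent is produced by crossover, which simplifies "subjected most to crossover."

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node of a nested-tuple tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def crossover(parent_a, parent_b, rng):
    """Graft a random subtree of parent_b onto a random point of parent_a."""
    path, _ = rng.choice(list(subtrees(parent_a)))
    _, donor = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path, donor)

def tournament(population, fitnesses, size, rng):
    """Pick `size` random programs and return the fittest of them."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def next_generation(population, fitness_fn, rng,
                    reproduction_rate=0.10, tournament_size=6):
    fitnesses = [fitness_fn(tree) for tree in population]
    order = sorted(range(len(population)), key=lambda i: fitnesses[i], reverse=True)
    n_copied = max(1, round(len(population) * reproduction_rate))
    new_population = [population[i] for i in order[:n_copied]]  # top 10 percent
    while len(new_population) < len(population):
        a = tournament(population, fitnesses, tournament_size, rng)
        b = tournament(population, fitnesses, tournament_size, rng)
        new_population.append(crossover(a, b, rng))
    return new_population
```

Calling `next_generation` 30 times in a row, with `fitness_fn` measuring retrieval effectiveness on the training documents, corresponds to the 30 repetitions described above.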
At the end of a given document's training period (conducted to evolve programs that we hoped would be effective for retrieval), we used the 300 saved weighting programs on a different set of validation documents in the TREC collection to select the best-performing weighting program—that is, one we thought would work well for new, unseen data. We ran this program against the test documents in the TREC collection to obtain the final performance result. For each of the 35 mini-experiments, we followed the same process using a different query.
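The bookkeeping behind this selection step can be sketched as below: the 10 best programs from each of 30 generations give the 300 saved candidates, and the validation documents decide among them. The callables `train_fitness`, `valid_fitness`, and `step` are hypothetical placeholders for fitness on the training set, fitness on the validation set, and the routine that produces the next generation.

```python
def train_and_select(population, train_fitness, valid_fitness, step,
                     generations=30, keep_per_generation=10):
    """Save the best programs of every generation, then choose the saved
    program that scores highest on the validation documents."""
    saved = []
    for _ in range(generations):
        ranked = sorted(population, key=train_fitness, reverse=True)
        saved.extend(ranked[:keep_per_generation])
        population = step(population)
    best = max(saved, key=valid_fitness)
    return best, saved

# Toy stand-in: "programs" are integers so the bookkeeping is easy to follow.
best, saved = train_and_select(
    population=list(range(20)),
    train_fitness=lambda p: p,             # training favors large values
    valid_fitness=lambda p: -abs(p - 12),  # validation favors values near 12
    step=lambda pop: [p + 1 for p in pop],
)
```

In the toy run the program that is best on training data is not the one chosen, which is the point of holding out validation documents: a program that merely fits the training set should not win automatically.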
We compared the best performing of these weighting programs in several ways. (All results reported in this section are statistically significant at p < 0.05.) Table 3 shows the performance measures. First, we compared our program's performance against a retrieval program called SMART, which has performed extremely well in various retrieval conditions in the TREC competitions [9]. Using a DCV of 1,000 for testing and TREC's benchmark statistic P_Avg, our genetic programming approach was superior for 34 out of 35 test queries, usually many times better. We also compared the two systems' performances when only the first 10 documents were retrieved (P_10), a situation that resembles many real-life retrieval situations where users require only a few documents. Again, our system outperformed SMART, this time for all 35 queries.
Next, we compared our program's performance against a competing adaptive approach, a neural network. Neural networks have also been used widely in IR and routing experiments [10]. In terms of P_Avg, genetic programming outperformed the neural network approach 30 out of 35 times. Most of the performance comparisons are quite dramatic (more than 100 percent improvement). We obtained similar performance results when considering only the top 10 documents retrieved. Overall, genetic programming outperformed the neural network by more than 100 percent in both P_Avg and P_10. However, a neural network's performance partly depends on its tuning. Although we strived to find effective parameter settings, we selected them empirically, not systematically through an optimization routine.
We repeated all our experiments with a slight variation. Whereas our original experiments took into account only the title and the description portions of topics such as shown in figure 2b to serve as a user query (we call this a short query), our modified experiments considered these fields plus other fields in figure 2b, such as the narrative and concepts, to form a long query. With this modified query representation, genetic programming still outperformed SMART and the neural network.
Figures 3a–c graphically compare the results of the three systems for P_Avg, P_10, and T_Rel_Ret, for both the short and long queries.
Interestingly, as the basis of query representation expanded to include narrative and concepts as well as title and description (that is, as we began to use long, not short, queries), all three systems improved. For all three performance measures, the neural network improved the most on a percentage basis; for P_Avg and P_10, genetic programming improved approximately 20 percent more than did the SMART system. For T_Rel_Ret, the two systems improved about the same. These results imply that adaptive retrieval systems benefit from rich query representations.
We've shown that genetic programming offers a new way to develop IR programs. Translated into everyday practice, the technique we developed can help provide knowledge workers regularly seeking information on the same topic with increasingly effective, customized retrieval programs.
Much exciting work remains in applying genetic programming to the problem of IR. Others might use a broader set of features than what we used in our work (functions, operators, and terminals). And although we applied the same weight to every term under consideration, you don't have to have identical weights for all terms.
This work doesn't represent the last word on evolving programs that are highly effective in retrieving information from the World Wide Web or other large data stores. But we believe it is the first word, and it represents a promising avenue of research and application.
Michael Gordon is a professor of business information technology at the Ross School of Business at the University of Michigan. His research interests include using information and communication technology to help alleviate poverty and improve health and education; the retrieval and discovery-based uses of textual information; information-based communities; and appropriate uses of technology to support teaching, learning, and information sharing. He received his PhD in computer science from the University of Michigan. Contact him at the Univ. of Michigan, Ross School of Business, Wyly, Ann Arbor, MI 48109; email@example.com.
Weiguo (Patrick) Fan is an associate professor of information systems and computer science at the Virginia Polytechnic Institute and State University. His research interests focus on the design and development of novel information technologies (data mining, text and Web mining, personalization, and knowledge management techniques) to support better management of business information and improved decision making. He received his PhD in computer and information systems from the University of Michigan's Ross School of Business. Contact him at Virginia Tech, 3007 Pamplin Hall, Blacksburg, VA 24061; firstname.lastname@example.org.
Praveen Pathak is an assistant professor of decision and information sciences at the University of Florida. His research interests include information retrieval, adaptive algorithms, AI, Web mining, e-commerce, and knowledge management. He received his PhD in computer and information systems from the University of Michigan's Ross School of Business. Contact him at the Univ. of Florida, Warrington College of Business, PO Box 117169, Gainesville, FL 32611; email@example.com.