Issue No.03 - March (2005 vol.38)
Published by the IEEE Computer Society
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MC.2005.103
As PC hard drives get bigger and new information sources become available, users will have much more data of different types, including multimedia, on their computers. This makes it increasingly difficult to find documents, e-mail messages, spreadsheets, audio clips, and other files. Current desktop-based search capabilities, such as those in Windows, are inadequate to meet this challenge.
In response, major Web search providers and other companies are offering engines for searching PC hard drives. This requires new search approaches because desktop-based documents are generally structured differently than those on the Web.
A number of smaller vendors such as Accona Industrier, Autonomy, Blinkx, Copernic Technologies, dtSearch, and X1 Technologies are upgrading or providing free basic versions of their existing desktop search engines.
Google has introduced a free beta version of an integrated desktop and Web search engine. Search providers Ask Jeeves, HotBot, Lycos, Microsoft, and Yahoo, as well as major Internet service providers such as AOL and Earthlink, are developing similar technologies.
One important factor in the competition is the desire by some Web search providers to use desktop search as a way to convince people to always or at least regularly use their portals. This would create a large user base that could encourage businesses to either advertise on the portals or buy other services.
In addition, some desktop search providers may want to generate revenue by charging businesses for sending their advertisements, targeted to user queries, along with responses. Such advertising has generated considerable revenue for Web search providers.
A user could work with several desktop search engines, said Larry Grothaus, lead product manager for Microsoft's MSN Desktop search products. "But practically speaking, the average consumer will stick with the most attractive, easy-to-use, and familiar alternative."
Some desktop search approaches present security and privacy problems. Nonetheless, search providers are pushing ahead and adding usability features to attract users.
Desktop search challenges
Desktop search features built into current operating systems, e-mail programs, and other applications have far fewer capabilities than Web search engines. They generally offer only simple keyword searches of a set of files, usually of a single file type.
On the Web, search engines can exploit information organized into a common HTML format with standardized ways of identifying various document elements. The engines can use this information, along with links to other documents, to make statistical guesses that increase the likelihood of returning relevant results.
The desktop is more complicated to search because Microsoft Word and other applications format different types of documents in various ways. In addition, desktop files can be either structured or unstructured.
The function and meaning of structured files—such as information in a relational database or a text document with embedded tags—are clearly reflected in their structure. The easily identified structure makes searching such files easier. This is not the case with unstructured information, which includes natural-language documents, unformatted text files, speech, audio, images, and video.
Therefore, desktop search engines must add capabilities in different ways than Web search applications.
The Boolean AND, OR, and NOT mechanisms and keyword-indexing algorithms by which searches are conducted on the desktop are similar to those used for years on the Web, said Daniel Burns, CEO of X1.
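As a rough illustration of the Boolean keyword mechanisms Burns describes, the following minimal Python sketch builds a keyword index over a few in-memory documents and combines the results with AND, OR, and NOT. The function names, parameters, and sample documents are hypothetical and are not taken from any vendor's product.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase keyword to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, all_docs, must=(), any_of=(), must_not=()):
    """Combine keyword matches with Boolean AND, OR, and NOT."""
    results = set(all_docs)
    for word in must:        # AND: keep documents containing every word
        results &= index.get(word, set())
    if any_of:               # OR: keep documents containing at least one word
        results &= set().union(*(index.get(w, set()) for w in any_of))
    for word in must_not:    # NOT: drop documents containing the word
        results -= index.get(word, set())
    return results


# Hypothetical sample documents, for illustration only.
docs = {
    "budget.xls": "quarterly budget forecast",
    "memo.doc": "budget meeting memo",
    "trip.txt": "flight and hotel for the meeting",
}
index = build_index(docs)
```

For example, `search(index, docs, must=["budget"], must_not=["memo"])` keeps only the documents that mention "budget" but not "memo".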
However, desktop search engines face the additional challenge of recognizing which of the many file types they are dealing with. The engines also must derive whatever metadata authors have chosen to include in e-mail notes, database files, and other document types.
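One common way to recognize a file type is to inspect the first few bytes of the file and fall back on the file extension when no known signature matches. The sketch below assumes that approach; the article does not say which technique any particular engine uses, and the `sniff_type` function and its short signature table are illustrative only.

```python
# Well-known leading byte signatures for a few formats (a deliberately short list).
SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Word/Excel/PowerPoint container
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_type(header: bytes, filename: str = "") -> str:
    """Guess a file type from its leading bytes, then from its extension."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    # No signature matched: fall back on the extension, if there is one.
    if "." in filename:
        return filename.rsplit(".", 1)[1].lower()
    return "unknown"
```

In this sketch the content signature wins over the extension, so a PDF file that has been renamed to `report.doc` is still reported as `pdf`.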
While conducting searches, desktop engines must be efficient and avoid imposing a substantial processing or memory load on the computer.
"A Web search service can set aside entire server farms to do only searches, while the desktop search engine has to be as efficient as possible within the constraints of the user's computing resources," explained Susan Feldman, search-engine market analyst at IDC, a market research firm.
To gain these desktop search capabilities, some Web search vendors have either acquired or licensed desktop-based technology, noted Nelson Mattos, distinguished engineer and director of information integration at IBM. For example, Mamma.com bought part of Copernic, Ask Jeeves purchased Tukaroo, and AOL and Yahoo have licensed X1's technology.
Desktop search methodologies
Desktop search engines employ one or more file crawler programs—similar to those used by Web search engines—that, upon installation, move through disk drives. As Figure 1 shows, the crawlers use an indexer to create an index of files; their location on a hard drive's hierarchical tree file structure; file names, types, and extensions (such as .doc or .jpg); and keywords. Once existing files are indexed, the crawler indexes new documents in real time. During searches, the engine matches queries to indexed items to find relevant files faster.
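The crawl-and-index cycle described above can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's design: it parses only plain-text files for keywords, and it detects new or changed files by comparing modification times on each pass, whereas a real engine would more likely react to file-system change notifications. The `crawl` and `find` names are hypothetical.

```python
import os
import re


def crawl(root, index=None):
    """Walk the tree under root and (re)index new or modified files."""
    index = {} if index is None else index
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            mtime = os.path.getmtime(path)
            entry = index.get(path)
            if entry and entry["mtime"] == mtime:
                continue  # unchanged since the last pass
            ext = os.path.splitext(name)[1].lower()
            keywords = set()
            if ext == ".txt":  # only plain text is parsed in this sketch
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    keywords = set(re.findall(r"[a-z0-9]+", fh.read().lower()))
            index[path] = {"name": name, "ext": ext,
                           "mtime": mtime, "keywords": keywords}
    return index


def find(index, term):
    """Return paths whose keywords or file name match the term."""
    term = term.lower()
    return sorted(path for path, entry in index.items()
                  if term in entry["keywords"] or term in entry["name"].lower())
```

Calling `crawl(root)` once builds the initial index; calling `crawl(root, index)` again later touches only files that are new or have changed, and `find(index, "budget")` answers queries from the index without rereading the disk.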
The crawlers also collect metadata, which lets the engine access files more intelligently by providing additional search parameters, according to X1's Burns.
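To illustrate how metadata adds search parameters, the sketch below uses Python's standard `email` package to pull header fields out of a raw message and then restricts a query to one field. The sample message, the choice of fields, and the `email_metadata` and `matches` helpers are assumptions made for this example.

```python
from email import message_from_string


def email_metadata(raw):
    """Extract a few header fields from a raw e-mail message."""
    msg = message_from_string(raw)
    return {field: msg[field]
            for field in ("From", "To", "Subject")
            if msg[field] is not None}


def matches(meta, field, term):
    """Field-restricted search: does the given header contain the term?"""
    return term.lower() in meta.get(field, "").lower()


# Hypothetical message, for illustration only.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Budget review\n"
    "\n"
    "See the attached forecast.\n"
)
meta = email_metadata(raw)
```

With the fields stored separately, a query such as `matches(meta, "To", "bob")` can target the recipient alone instead of scanning the whole message.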
Several desktop search engines are integrated with the providers' Web engines and simultaneously run both types of searches on queries. These providers are putting considerable effort into desktop feature sets and interfaces that will be as familiar and easy to use as their Web-based counterparts, said IDC's Feldman.
Because they want to reach the largest number of users, all Web search providers entering the desktop arena work only with the market-leading Windows and Internet Explorer platforms, explained Ray Wagner, search-engine analyst at Gartner, a market research firm. Some providers that offer only desktop search engines have versions for other operating systems and browsers.
Much of the industry's attention is focused on three major companies: Google, Microsoft, and Yahoo.
Google Desktop Search
Google was the first major Web search company to release a desktop search beta application (http://desktop.google.com), a free, simple, lightweight (400-Kbyte) plug-in.
The Google Desktop Search beta is configured as a local proxy server that stands in for the Web search engine. It performs desktop searches only via Internet Explorer. By default, GDS returns desktop and Web search results together, but users can configure it to return them separately. The GDS beta does not let users search a specific field within a file, such as e-mail messages' "To" and "From" fields.
Google expects to release a commercial GDS version this year.
The company's search-related business model relies on revenue generated from real-time advertisements selected to match query terms and search results. With the Web and desktop search engines operating in tandem, the latter maintains a link with the former, which connects to a server responsible for providing advertising that relates to search terms.
GDS tracks and fully indexes Outlook and Outlook Express e-mail messages; AOL instant messages; the Internet Explorer history log; and Microsoft Word, Excel, and PowerPoint documents. Currently it does not index PDF files. And for nondocument files such as those with images, video, and audio, GDS indexes only file names.
Reflecting GDS's use of a Web server as the main mechanism for coordinating desktop and Web searches, the search engine indexes URLs for Web pages saved to the Internet Explorer favorites or history list, noted Nikhil Bhatla, product manager for desktop search at Google.
GDS uses a single crawler that indexes all file types.
MSN Desktop Search
MSN's 400-Kbyte desktop search application, part of the MSN Toolbar Suite (http://beta.toolbar.msn.com), is closely integrated with Windows.
When the search utility is available commercially, slated for later this year, users will see it as part of the MSN Deskbar, noted Grothaus. The Deskbar, which appears on the Taskbar when Windows boots, contains buttons for direct access to MSN services. The engine also appears as MSN search bars within Outlook, Windows Explorer, and Internet Explorer.
Unlike Google's tool, MSN's application doesn't search local files and the Web at the same time. However, the MSN tool can index and search files on network-based drives, which Google's and Yahoo's engines cannot do.
Grothaus said Microsoft doesn't plan to display advertisements along with the results of desktop searches.
The Deskbar tool enables searches for any supported file type: Outlook and Outlook Express e-mail; Microsoft Office's Word, Excel, PowerPoint, Calendar, Task, and Notes files; plain-text and PDF documents; MSN Messenger conversation logs; HTML file names; and many types of media files.
By default, the Outlook-based toolbar searches only Outlook and Outlook Express e-mail files, and the Internet Explorer-based toolbar enables searches only of HTML and e-mail files. The Windows Explorer toolbar allows keyword searches of all drives and maintains a history of previous searches.
The MSN desktop search engine uses separate file crawlers, each coded to search only for video or documents or any other supported file type, according to Grothaus. On the desktop, he explained, it's important not to use more computing resources than necessary. MSN has tailored each desktop crawler to perform only the work necessary to do its job.
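The idea of one narrowly scoped crawler per file type can be pictured as a dispatch table that hands each file to the crawler responsible for it. The sketch below is only an illustration of that idea and does not reflect how MSN's engine is implemented; the crawler functions, the extension table, and the record layout are all hypothetical.

```python
import os
import re


def crawl_document(name, data):
    """Document crawler: extracts keywords from the file's text."""
    text = data.decode("utf-8", errors="ignore").lower()
    return {"name": name, "keywords": set(re.findall(r"[a-z0-9]+", text))}


def crawl_media(name, data):
    """Media crawler: records only the name and size, skipping content parsing."""
    return {"name": name, "size": len(data)}


# Each extension maps to the one crawler that handles it (hypothetical table).
CRAWLERS = {
    ".txt": crawl_document,
    ".htm": crawl_document,
    ".mp3": crawl_media,
    ".avi": crawl_media,
}


def index_file(name, data):
    """Dispatch a file to its crawler, or skip it if the type is unsupported."""
    ext = os.path.splitext(name)[1].lower()
    crawler = CRAWLERS.get(ext)
    return crawler(name, data) if crawler else None
```

Because the media crawler never decodes or tokenizes file contents, large audio and video files cost almost nothing to index in this sketch, which is the kind of saving the per-type design is meant to achieve.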
Yahoo Desktop Search
The Yahoo Desktop Search beta (http://desktop.yahoo.com) is a stand-alone application that runs on Windows. Designed to look and feel like the Yahoo Web search engine, the YDS beta is built on X1's commercial tool. For the upcoming commercial version, Yahoo says, it intends to create additional customized features and layer them on top of the X1 technology it licensed.
Unlike some other desktop engines, YDS also searches compressed ZIP files as well as Adobe PDF, Illustrator, and Photoshop files. Users can find and play audio and video files without launching a separate media player.
YDS can search only Outlook and Outlook Express e-mail, unlike X1's engine, which also handles Eudora and Mozilla/Netscape mail.
A YDS convenience that neither GDS nor the MSN Desktop Search offers is the ability to preview files before opening them.
Yahoo's tool searches HTML pages that users download from the Web and those they create locally. However, Yahoo says, YDS doesn't index Internet Explorer history or favorites files or the browser's hard-drive-based cache memory, to keep others from accessing Web files that previous users have viewed.
Users can control and change settings to index only specific files or file types or files smaller than a given size.
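User-controlled indexing rules of this kind amount to a filter applied before a file is indexed. The following sketch shows one possible shape for such a filter; the `IndexSettings` class and its field names are hypothetical and are not taken from YDS.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class IndexSettings:
    """Hypothetical user settings that limit what gets indexed."""
    allowed_extensions: Optional[FrozenSet[str]] = None  # None means all types
    max_size_bytes: Optional[int] = None                 # None means no size limit

    def should_index(self, name: str, size: int) -> bool:
        ext = os.path.splitext(name)[1].lower()
        if self.allowed_extensions is not None and ext not in self.allowed_extensions:
            return False
        # Index only files strictly smaller than the configured size.
        if self.max_size_bytes is not None and size >= self.max_size_bytes:
            return False
        return True


settings = IndexSettings(allowed_extensions=frozenset({".doc", ".pdf"}),
                         max_size_bytes=1_000_000)
```

With these settings, a 20-Kbyte `report.pdf` would be indexed, while an `.mp3` file or a 5-Mbyte `.doc` file would be skipped.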
In the future, Yahoo says, it hopes to make the desktop search tool particularly useful by tying it to the company's portal offerings, including its e-mail, calendar, photo, music, and chat services.
Security and privacy issues
Integrating desktop and Web search capabilities into the same application presents security and privacy challenges.
Integrated search engines use a local proxy-server program on the desktop to coordinate the delivery of real-time targeted advertising from Web servers for placement along with search results.
This could open a security hole in the connection between the PC and the Web, according to Daniel Wallach, Rice University assistant professor of computer science. "The more tightly the two are coupled," he said, "the more likely there are to be holes that hackers can breach."
Also, hackers in some cases could insert an applet to open a control channel within the proxy server, letting them issue queries to obtain private information.
Providers are taking steps to block these attacks.
Some desktop search engines' use of the browser cache to look for previously viewed Web pages could lead to other security breaches. "Access to the browser cache through the integrated search interface is an extraordinary lure to potential hackers," said Richard Smith, Internet security analyst at ComputerBytesMan.com.
Blinkx's desktop search engine prevents this by encrypting the cache, as well as communications between server and client.
Some integrated search tools make stored personal files, including e-mail and AOL chat logs, viewable on the Web browser, which could prove embarrassing if someone else has access to the computer.
And some tools also allow searches of recently viewed Web sites, a feature that has raised privacy concerns, particularly for users of shared PCs.
Microsoft's desktop tool doesn't index or allow searches of recently viewed Web sites, although it hasn't eliminated the possibility of doing so in the future, Grothaus said. YDS doesn't index the browser cache or the browser history or favorites files.
Also, Microsoft's tool searches for information based on each user who logs in. If one person uses a computer for personal banking, the next person logging into that machine can't access the sensitive data, Grothaus said.
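Per-user separation of this kind can be modeled by keeping a distinct keyword index for each login, so a query never consults another user's entries. The sketch below illustrates the principle only; it is not a description of Microsoft's implementation, and the `PerUserIndex` class is hypothetical.

```python
from collections import defaultdict


class PerUserIndex:
    """Keeps one keyword index per user account."""

    def __init__(self):
        # user -> keyword -> set of file paths
        self._by_user = defaultdict(lambda: defaultdict(set))

    def add(self, user, path, keywords):
        for word in keywords:
            self._by_user[user][word.lower()].add(path)

    def search(self, user, word):
        """Search only the index that belongs to the logged-in user."""
        user_index = self._by_user.get(user)
        if user_index is None:
            return set()
        return set(user_index.get(word.lower(), set()))


index = PerUserIndex()
index.add("alice", "bank-statement.pdf", ["bank", "statement"])
index.add("bob", "recipes.doc", ["recipes"])
```

Here a search for "bank" issued under Bob's login returns nothing, even though Alice's statement sits on the same machine.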
According to Gartner's Wagner, the deciding factors in the marketplace competition between desktop search engines "will be the unique usability features they bring to the game and how well they deal with a number of perceived, rather than actual, security and privacy issues that have emerged."
However, said IBM's Mattos, search engine technology on the Web and the desktop needs radical changes to become truly useful. "On the Web, when a user puts in a sequence of keywords, even with advanced keyword search capabilities, he is liable to get a page telling him there are a million files that match the requirements," he said. "Searches on the desktop are not much better. They yield several hundred or several thousand. What is needed is something more fine-grained and able to pinpoint more exactly what you are looking for."
The goal of a desktop search is different from that of a Web search. On the Web, you are looking for information, not necessarily a specific document, explained X1's Burns. "On the desktop," he said, "you know that what you are searching for is there. You don't want to wade through pages and pages of possibilities to find it. You want it now—not several possibilities, but the right file."
Many industry observers are thus waiting to see the new XML-based WinFS file system (http://msdn.microsoft.com/data/winfs) that Microsoft plans to incorporate in future Windows versions. The company originally anticipated including WinFS in its upcoming Longhorn version of Windows but apparently won't be able to do so.
According to Blinkx cofounder Suranga Chandratillake, moving to an XML-based structure is difficult and won't occur for years. The Web and local storage are growing rapidly, and most of the growing number of data types they contain work with traditional file structures, he explained. Imposing a new file structure on all this data is impractical, he said.
He concluded, "The alternative that I favor, and that offers the only hope of keeping up with the growth and the increasing diversity of information on both the desktop and the Web, is wrestling with data, finding clever ways to add metadata, and discovering better search mechanisms that work within the file structures with which we are already familiar."
Bernard Cole is a freelance technology writer based in Flagstaff, Arizona. Contact him at BernardCole@techrite-associates.com.