• the inherently open, nondeterministic nature of the Web and
• the complex, leakage-prone information flow of many Web-based transactions that involve the transfer of sensitive, personal information.
• Personal data include information such as a person's name, marital status, mailing and email addresses, phone numbers, financial information, and health information.
• Digital behavior refers to Web users' activities while using the Web, including the sites they visit, frequency and duration of these visits, and online shopping patterns.
• Communication includes Web users' electronic messages, postings to electronic boards, and votes submitted to online polls and surveys.
Explicitly collecting information through online registration. Online registration requires users to provide personal information such as name, address, telephone number, email address, and so on. More importantly, during registration, users might have to disclose other sensitive information, such as their credit card or checking account numbers, to make online payments.
Identifying users through IP addresses. Generally, each time a person accesses a Web server, several things about that person are revealed to that server. In particular, a user's request to access a given Web page contains the IP address of the user's machine. Web servers can use this address to track the user's online behavior. In many situations, the address can uniquely identify the actual user "behind" it.
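The following is a minimal sketch of how a server could turn the IP addresses in its request log into per-address browsing profiles. The log entries, the addresses (taken from a documentation-only range), and the function name `profile_by_ip` are assumptions made for illustration; they do not describe any particular server.

```python
from collections import defaultdict

# Hypothetical access-log entries: (client IP address, requested path).
# The addresses come from the documentation range 192.0.2.0/24.
ACCESS_LOG = [
    ("192.0.2.10", "/index.html"),
    ("192.0.2.77", "/index.html"),
    ("192.0.2.10", "/health/diabetes.html"),
    ("192.0.2.10", "/shop/cart"),
]


def profile_by_ip(log):
    """Group requested paths by the IP address that requested them."""
    profiles = defaultdict(list)
    for ip, path in log:
        profiles[ip].append(path)
    return dict(profiles)


profiles = profile_by_ip(ACCESS_LOG)
for ip, paths in profiles.items():
    print(ip, "->", paths)
```

If an address maps to a single user, each profile is effectively a record of that person's visits.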
Software downloads. Companies that let their customers download their software via the Internet typically require a unique identifier from each user. In some cases, companies use these identifiers to track users' online activity. For example, in 1999, RealNetworks came under fire for its alleged use of unique identifiers to track the music CDs or MP3 files that users played with its RealPlayer software.
Cookies. A cookie is a piece of information that a server and a client pass back and forth [9]. In a typical scenario, a server sends a cookie to a client, which stores it locally. The client then returns the cookie with its subsequent requests to that server. Cookies are generally used to overcome HTTP's stateless nature; they let a server remember a client's state at the time of their most recent interaction. They also let Web servers track Web users' online activities, for example, the Web pages they visit, the items they access, and the duration of their access to each Web page. In many situations, this monitoring constitutes a violation of users' privacy.
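The cookie exchange can be sketched with Python's standard `http.cookies` module. The cookie name `visitor_id`, the three helper functions, and the in-memory `visits` table are assumptions chosen for illustration; a real server would sit behind an HTTP framework.

```python
from http.cookies import SimpleCookie
import uuid


def server_first_response():
    """Server side: build a Set-Cookie header value carrying a fresh visitor ID."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = uuid.uuid4().hex
    cookie["visitor_id"]["path"] = "/"
    return cookie["visitor_id"].OutputString()


def client_store(set_cookie_value):
    """Client side: keep only the name=value pair to send on later requests."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def server_read(cookie_header):
    """Server side: recover the visitor ID from a returning request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie["visitor_id"].value


# Hypothetical tracking table: visitor ID -> pages requested.
visits = {}

set_cookie = server_first_response()      # first response sets the cookie
cookie_header = client_store(set_cookie)  # client stores it locally
for page in ["/home", "/products/42"]:    # later requests carry it back
    visitor = server_read(cookie_header)
    visits.setdefault(visitor, []).append(page)

print(visits)
```

The same identifier that restores session state also links every later request to one visitor, which is what makes tracking possible.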
Trojan horses. These applications might seem benign but can have destructive effects when they run on a user's computer. Examples include programs that users install as antivirus software but that actually introduce viruses to their computers. A Trojan attack might start, for instance, when a user downloads and installs free software from a Web site. The installation procedure might then launch a process that sends sensitive personal information stored on the local computer back to the attack initiator.
Web beacons. A Web beacon, also known as a Web bug, pixel tag, or clear GIF, is a small transparent graphic image that is used in conjunction with cookies to monitor users' actions [8]. A Web beacon is placed in the code of a Web site or a commercial email to let the provider monitor the behavior of Web site visitors or of those reading the email. When the HTML code associated with a Web beacon is invoked (to retrieve the image), it can simultaneously transfer information such as the IP address of the computer that retrieved the image, when the Web beacon was viewed, for how long, and so forth.
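As a rough sketch of the mechanism, the code below builds the HTML for a 1x1 image and shows what a tracking server could record when that image is fetched. The host `tracker.example`, the query parameter names `c` and `r`, and both function names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BEACON_HOST = "https://tracker.example"  # hypothetical tracking server


def beacon_tag(campaign, recipient_id):
    """Return HTML for a 1x1 image whose URL identifies the viewer."""
    query = urlencode({"c": campaign, "r": recipient_id})
    return f'<img src="{BEACON_HOST}/pixel.gif?{query}" width="1" height="1" alt="">'


def log_beacon_hit(request_url, client_ip, timestamp):
    """Server side: turn one image request into a tracking record."""
    params = parse_qs(urlparse(request_url).query)
    return {
        "campaign": params["c"][0],
        "recipient": params["r"][0],
        "ip": client_ip,
        "viewed_at": timestamp,
    }


tag = beacon_tag("spring-sale", "user-1842")
url = tag.split('"')[1]  # the src attribute value
record = log_beacon_hit(url, "192.0.2.10", "2004-05-01T09:30:00Z")
print(tag)
print(record)
```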
Screen scraping. Screen scraping is a process that uses programs to capture valuable information from Web pages. The basic idea is to parse the Web pages' HTML content with programs designed to recognize particular patterns of content, such as personal email addresses. One case that illustrates how screen scraping can violate privacy is the US Federal Trade Commission's allegation that ReverseAuction.com had illegally harvested data from the online auction site eBay.com to gain access to eBay's customers [8].
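A minimal sketch of pattern-based harvesting, assuming a simplified email pattern and a made-up page, might look as follows. The regular expression is deliberately loose and is not a full address grammar.

```python
import re
from html import unescape

# Simplified email pattern, an assumption for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest_emails(html_text):
    """Return the distinct email addresses found in an HTML page."""
    text = unescape(html_text)  # decode entities such as &#64; for "@"
    return sorted(set(EMAIL_PATTERN.findall(text)))


# Hypothetical page content.
PAGE = """
<html><body>
  <p>Seller: <a href="mailto:alice@example.com">alice@example.com</a></p>
  <p>Contact bob&#64;example.org for shipping questions.</p>
</body></html>
"""

emails = harvest_emails(PAGE)
print(emails)
```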
Federated identity. A Web user's federated identity is a form of identity (for example, a user name and password pair) that lets a user access several Web resources. Microsoft's .Net My Services is an example of one architecture that provides a federated identity mechanism, with which a user can create an identity at one Web site and use it to access another Web site's services. This extensive sharing of users' private information raises concerns about the misuse of that information.
Indirectly collecting information. Users can authorize organizations or businesses to collect some of their private information. However, their privacy can be implicitly violated if their information undergoes analysis processes that produce new knowledge about their personality, wealth, behavior, and so on. This deductive analysis might, for example, use data mining techniques to draw conclusions and produce new facts about the users' shopping patterns, hobbies, or preferences. These facts might be used in recommender systems through a process called personalization, in which the systems use personalized information (collected and derived from customers' past activity) to predict or affect their future shopping patterns. Undeniably, personalization makes users' shopping experience more convenient. However, in more aggressive marketing practices (such as advertising phone calls) it can negatively affect customers' privacy.
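To illustrate how analysis can derive new facts from authorized data, here is a small sketch that infers likely interests from co-purchase counts. The purchase histories, function names, and scoring rule are assumptions; real recommender systems use far richer models.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, one set of items per customer.
BASKETS = {
    "u1": {"tent", "sleeping bag", "stove"},
    "u2": {"tent", "sleeping bag"},
    "u3": {"tent", "stove", "boots"},
    "u4": {"novel"},
}


def co_purchase_counts(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, baskets, top_n=2):
    """Predict items a user may want, based on what similar baskets contain."""
    counts = co_purchase_counts(baskets)
    owned = baskets[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("u2", BASKETS))
```

The user "u2" never disclosed an interest in stoves or boots; the system derives it from other customers' behavior.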
Privacy can also be violated through the misuse of statistical databases, which contain information about numerous individuals. Examples include databases that provide general information about the health, education, or employment of groups of individuals living in a city, state, or country. Typical queries to statistical databases provide aggregated information such as sums, averages, pth percentiles, and so on. A privacy-related challenge is to provide statistical information without disclosing sensitive information about the individuals whose information is part of the database.
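The disclosure risk can be sketched with a toy database that answers only sum queries and refuses queries that match too few records. The records, the threshold `MIN_QUERY_SET_SIZE`, and the function names are assumptions for illustration. The sketch shows that two permitted aggregate queries can still reveal one individual's value.

```python
# Hypothetical records in a statistical database.
RECORDS = [
    {"name": "Alice", "dept": "radiology", "salary": 52000},
    {"name": "Bob", "dept": "radiology", "salary": 61000},
    {"name": "Carol", "dept": "radiology", "salary": 75000},
    {"name": "Dave", "dept": "oncology", "salary": 58000},
]

MIN_QUERY_SET_SIZE = 2  # assumed policy: refuse queries over fewer rows


def query_sum(predicate, field="salary"):
    """Answer an aggregate query, refusing very small query sets."""
    matched = [r[field] for r in RECORDS if predicate(r)]
    if len(matched) < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small to release")
    return sum(matched)


def direct_query_refused():
    """Check that asking for one person's salary directly is refused."""
    try:
        query_sum(lambda r: r["name"] == "Carol")
    except PermissionError:
        return True
    return False


# Two permitted queries whose difference isolates one individual.
total = query_sum(lambda r: r["dept"] == "radiology")
others = query_sum(lambda r: r["dept"] == "radiology" and r["name"] != "Carol")
inferred = total - others

print("direct query refused:", direct_query_refused())
print("inferred salary:", inferred)
```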
Client-based solutions. These solutions target privacy aspects relevant to individual users. Examples include protecting personal data stored on a personal computer, protecting email addresses, deleting any trace of Web access, and hiding Web surfers' real identities. We discuss four types of solutions: personal firewalls, remailers, trace removers, and anonymizers (see Figure 1).
A firewall is a software and/or hardware system that provides a private network with bidirectional protection from external entities gaining unauthorized access. Generally, firewalls protect medium-to-large networks (such as an enterprise's intranet). A personal firewall is a software firewall that protects a single user's system (typically, a single machine). It runs in the background on a PC or a server and watches for any malicious behavior. A user might even configure the firewall to detect specific types of unwanted events—for example, access from a specific IP address or a given port.
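A user-configured rule set of this kind can be sketched as an ordered list of rules checked against each incoming connection. The `Rule` class, the first-match policy, the default-block behavior, and the example addresses are assumptions for illustration, not the design of any specific product.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Rule:
    action: str                 # "allow" or "block"
    source: str = "0.0.0.0/0"   # source network in CIDR notation
    port: Optional[int] = None  # destination port; None matches any port

    def matches(self, src_ip, dst_port):
        """Return True if the connection falls under this rule."""
        in_network = ip_address(src_ip) in ip_network(self.source)
        return in_network and (self.port is None or self.port == dst_port)


def decide(rules, src_ip, dst_port, default="block"):
    """Apply the first matching rule; fall back to the default action."""
    for rule in rules:
        if rule.matches(src_ip, dst_port):
            return rule.action
    return default


# Hypothetical configuration, evaluated top to bottom.
RULES = [
    Rule("block", source="203.0.113.5/32"),  # a specific unwanted address
    Rule("block", port=23),                  # a specific unwanted port
    Rule("allow", port=443),
]

print(decide(RULES, "203.0.113.5", 443))
print(decide(RULES, "198.51.100.7", 443))
print(decide(RULES, "198.51.100.7", 23))
```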
Personal firewalls have recently become a significant market. Many software firms offer personal firewalls with different capabilities. Examples include ZoneAlarm, NetBoz, and Outpost. In addition, general Web users can use network address translation (NAT) devices to help preserve network privacy. NAT was originally proposed to let a set of machines share one IP address, thus providing a single point of entry for that network. While NAT provides relative anonymity, its main strength is that it acts as a firewall offering reasonable security against external attacks.
A remailer is an application that receives emails from senders and forwards them to their respective recipients after it alters them so that the recipients cannot identify the actual senders. If necessary, a recipient can send a reply to the remailer, which then forwards it to the sender of the original message. Babel and Mixminion are examples of remailers.
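The forwarding and reply steps can be sketched with Python's standard `email` package. The `Remailer` class, its alias format, and the domain `remailer.example` are hypothetical. The sketch covers only header rewriting, not the cryptographic protections that real remailers add.

```python
from email.message import EmailMessage
import secrets


class Remailer:
    """Toy remailer: hides the sender behind a random alias."""

    def __init__(self, domain="remailer.example"):
        self.domain = domain
        self._alias_to_sender = {}  # lets replies reach the real sender

    def forward(self, message):
        """Build a new message that carries no sender-identifying headers."""
        alias = f"anon-{secrets.token_hex(4)}@{self.domain}"
        self._alias_to_sender[alias] = str(message["From"])
        outgoing = EmailMessage()
        outgoing["From"] = alias
        outgoing["To"] = str(message["To"])
        outgoing["Subject"] = str(message["Subject"])
        outgoing.set_content(message.get_content())
        return outgoing

    def resolve_reply(self, alias):
        """Map an alias back to the original sender for a reply."""
        return self._alias_to_sender[str(alias)]


original = EmailMessage()
original["From"] = "alice@example.com"
original["To"] = "bob@example.org"
original["Subject"] = "Lunch"
original["X-Originating-IP"] = "192.0.2.10"
original.set_content("Meet at noon.")

remailer = Remailer()
forwarded = remailer.forward(original)
print(forwarded["From"], "->", forwarded["To"])
```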
When users navigate through the Web, their browsers or any other external code (such as a downloaded script) can store different types of information on their computers. This navigation trace provides details of users' surfing behavior, including the sites they visit, the time and duration of each visit, what files they download, and so on. Trace removers are available as a conservative measure to prevent disclosure of users' Web navigation history. They simply erase users' navigation histories from their computers. Examples of trace removers include Bullet Proof Soft and No Trace.
For many reasons, Web users would like to visit a Web site with the guarantee that neither that site nor any other party can identify them. Researchers have proposed several techniques to provide this anonymous Web surfing. These solutions' basic principle is preventing requests to a Web site from being linked to specific IP addresses. We can classify anonymizing techniques into four types:
• Proxy-based anonymizers. A proxy-based anonymizer uses a proxy host to which users' HTTP requests are first submitted. The proxy then transforms the requests in such a way that the final destination cannot identify its source. Requests received at the destination contain only the anonymizer's IP address. Examples of proxy-based anonymizers include Anonymizer, Lucent Personal Web Assistant (LPWA), iPrivacy, and WebSecure. Some proxy-based anonymizers can also be used to access registration-based Web sites. For example, LPWA uses alias generators, giving users consistent access to registration-based systems without revealing potentially sensitive personal data. More effective proxy-based anonymizers such as iPrivacy can conceal users' identity even while making online purchases that, normally, would require them to disclose their actual identities.
• Routing-based anonymizers. This class of anonymizers has Web requests traverse several hosts before delivering them to their final destination so that the destination cannot determine the requests' sources. An example of a tool that uses this technique is Crowds [10]. Its philosophy is that a good way to become invisible is to get lost in a crowd. The solution is to group Web users geographically into different groups, or crowds. A crowd performs Web transactions on behalf of its members. When users join a crowd, a process called jondo starts running on their local machines. This process represents the users in the crowd. It engages in a protocol to join the crowd, during which it is informed of the current crowd members. Once users' jondos have been admitted to the crowd, they can use the crowd to anonymously issue requests to Web servers. Users' requests are routed through a random sequence of jondos before they are finally delivered to their destinations. Neither the Web servers nor any other crowd members can determine who initiated a specific request.
• Mix-based anonymizers. Mix-based anonymizers are typically used to protect communication privacy. In particular, they protect against traffic-analysis attacks, which aim to identify who is talking to whom, but not necessarily the content of the conversation. One technique that protects against traffic-analysis attacks is onion routing [2]. It is based on the idea that mingling connections from different users and applications makes them difficult to distinguish. The technique operates by dynamically building anonymous connections within a network of real-time Chaum mixes [1]. A Chaum mix is a store-and-forward device that accepts fixed-length messages from numerous sources, performs cryptographic transformations on the messages, and then forwards them to the next destination in a random order.
• Peer-to-peer anonymizers. Mix-based anonymizers generally use static sets of mixes to route traffic. This obviously poses three major problems: scalability, performance, and reliability. One way to overcome these drawbacks is to use peer-to-peer (P2P) anonymizers, which distribute the anonymizing tasks uniformly on a set of hosts. Examples of P2P anonymizers include Tarzan, MorphMix, and P5 (Peer-to-Peer Personal Privacy Protocol). For example, Tarzan uses a pool of voluntary nodes that form mix relays. It operates transparently at the IP level and, therefore, works for any Internet application.
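The random routing used by routing-based anonymizers such as Crowds can be sketched as follows. The text above states only that requests pass through a random sequence of jondos. The coin-flip rule with a fixed `forward_probability`, the jondo names, and the function names are assumptions added for illustration.

```python
import random


def route_request(initiator, crowd, forward_probability=0.75, rng=None):
    """Return the sequence of jondos a request passes through.

    Assumption: each jondo forwards to another randomly chosen member with
    probability forward_probability; otherwise it submits the request to
    the Web server.
    """
    if rng is None:
        rng = random.Random()
    path = [initiator]
    path.append(rng.choice(crowd))  # the first hop is always a random member
    while rng.random() < forward_probability:
        path.append(rng.choice(crowd))
    return path


def server_view(path):
    """The Web server sees only the last jondo on the path."""
    return path[-1]


CROWD = ["jondo-A", "jondo-B", "jondo-C", "jondo-D", "jondo-E"]

path = route_request("jondo-A", CROWD, rng=random.Random(7))
print("path:", " -> ".join(path))
print("server sees:", server_view(path))
```

Because any member may be relaying on behalf of another, the jondo that contacts the server is not necessarily the one that initiated the request.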
Server-based solutions. Server-based solutions target aspects of Web privacy relevant to large organizations such as enterprises and government agencies. For example, an organization might deploy a server-based privacy-preserving solution to protect hospital patients' records or a customer database. Privacy preservation in these solutions is a side effect of the strong security mechanisms typically employed in large organizations. Virtual private networks (VPNs) and firewalls are two mechanisms that have been particularly effective in protecting security and privacy at an enterprise scale. VPNs are secure virtual networks built on top of public networks such as the Internet. They generally use several security mechanisms (such as encryption, authentication, and digital certificates) and are often used in conjunction with firewalls to provide more stringent levels of security and privacy enforcement.
Client-server-based solutions. In these solutions, clients and servers cooperate to achieve a given set of privacy requirements. Two examples illustrate this: negotiation- and encryption-based solutions.
Encryption-based solutions encrypt the information exchanged between two or more Web hosts so that only legitimate recipients can decrypt it. Web users might use encryption in different Web activities and to enforce several privacy requirements. One of these requirements is the privacy of personal communication, or email. Typically, Internet-based communication is exchanged in clear text. One encryption-based protocol designed specifically to protect email is Pretty Good Privacy (PGP), which has become the de facto standard for email encryption. It enables people to exchange messages and to secure files, disk volumes, and network connections with both privacy and strong authentication. It ensures privacy by encrypting emails or documents so that only the intended recipient can read them.
Mandatory regulation. Several countries and political entities have adopted laws and legal measures to address the Web privacy problem. A notable example of privacy-preserving regulation is the European Union's Data Protection Directive, adopted in October 1995. The directive limits access to electronic data held in the EU member nations. According to the directive, certain personal information (such as an individual's race, creed, sexual orientation, or medical records) cannot leave the EU unless it is going to a nation with laws offering privacy levels that the EU has deemed adequate.
Governments might also impose privacy-related regulations on their own agencies. The US has passed statutes and laws to regulate its federal agencies' data collection. In fact, some of these laws were passed even before the Web era. One example is the Privacy Act passed in 1974. The act aimed at regulating activities of all agencies that collect and maintain personal information.