, University of California, Los Angeles
, Yahoo! Research
Pages: pp. 13-15
The past few years have witnessed the rapid rise of social media Web sites such as Flickr, del.icio.us, YouTube, Myspace, and Facebook, as well as the proliferation of "mashup" applications created when users combine services from multiple sources. These sites contain user-generated content in various forms, from plain text to rich multimedia. In fact, most publicly available text content created during the next 24 hours will be generated by end users, rather than professional writers, journalists, corporate communications departments, or others whose job it is to create and publish content. Furthermore, end users will generate an additional two orders of magnitude more text that they will send privately to other users through a communications channel such as email. 1 The emergence of user content as the dominant content form on the Web raises various questions about the most effective approach to processing it.
Much user-generated content is hosted on social media Web sites, which commonly allow users to form communities based on shared interests, and to associate tags, reviews, recommendations, and comments with content. These metadata are invaluable in helping assess the highly variable quality of content that end users are creating. Visitors to these sites often seek not just the content itself, but also an understanding of the individuals who posted it. Furthermore, they might visit a site without any particular goal or informational need, but rather based on the simple desire to "get an update" or "be entertained" during their spare time.
Social media innovation occurs largely in the corporate sector, with many offerings arising from small Internet start-ups. Academic work in this area has focused primarily on studying the dynamics of social media generation or consumption. Significant literature exists on the dynamics of personal publishing through blogs and of distributed metadata generation through tagging. Academic work also exists on bulletin boards, wikis, and other creation modalities as well as on comments, reviews, ratings, bookmarks, and other forms of metadata. Workshops in various disciplines have sprung up around this area, and 2007 witnessed the first annual International Conference on Weblogs and Social Media (ICWSM).
Work in information retrieval has only recently begun to address social media corpora and to incorporate social media metadata as features. 2 General-purpose Web search engines index and return social media content in response to queries, and specialized search engines perform even more targeted analysis of particular social media — technorati.com or blogpulse.com for blogs, boardreader.com for bulletin boards, and so forth. To maintain a competitive advantage, however, these companies typically don't publish their techniques.
At the fringes of social search are implicit techniques that capture users' consumption behavior to modify retrieval for new users. Substantial literature discusses collaborative filtering in this space, and emerging work considers user click behavior as a feature in Web search.
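As a rough illustration of the collaborative filtering idea mentioned above, the following minimal sketch ranks unseen items for one user by the similarity-weighted votes of other users. It is only one simple variant (user-based, with cosine similarity over sets of consumed items), not a description of any particular system; the function names `cosine` and `recommend` and the sample `history` data are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from math import sqrt


def cosine(a: set, b: set) -> float:
    """Cosine similarity between two sets of consumed item IDs."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(history: dict, user: str, k: int = 3) -> list:
    """Rank items the user has not consumed by similarity-weighted votes."""
    seen = history[user]
    scores = defaultdict(float)
    for other, items in history.items():
        if other == user:
            continue
        weight = cosine(seen, items)
        if weight == 0.0:
            continue  # users with no overlap contribute nothing
        for item in items - seen:
            scores[item] += weight
    # Sort by descending score; break ties by item ID for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical consumption logs: user -> set of items clicked or bookmarked.
history = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "cat": {"c", "e"},
    "dan": {"f"},
}

print(recommend(history, "ann"))
```

In this toy data, "bob" overlaps most with "ann", so the item only he has consumed ranks first. A user such as "dan", who shares no items with anyone, receives no recommendations, which hints at the cold-start difficulty that implicit techniques face with new users.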
The proliferation of user-generated content and the resulting associated metadata on the Web introduce new challenges and opportunities in search. For example, the rich metadata users provide help distinguish the high-quality content from the vast amount of noise, but might also be susceptible to user manipulation. In particular, the following characteristics make searching such user-generated content more challenging:
Despite these challenges, the richer context that social media content provides gives us exciting opportunities. For example, users often form explicit and implicit communities around their interests, letting us apply collaborative filtering techniques at an unprecedented scale. Users also provide a rich body of metadata, in the form of tags, bookmarks, and favorites, and they leave detailed interaction history while they explore the content.
The three articles selected for this special issue present some early work in understanding the characteristics of user-generated content and its metadata, and in making high-quality content more accessible and comprehensible.
In "Social Information Processing in News Aggregation," Kristina Lerman studies the mechanisms by which a broad user community produces a set of community recommendations for new articles. She investigates several factors influencing an article's overall popularity on news aggregation site Digg, including how well the article's author is "connected" within the online community, and tries to quantify these factors' impact by fitting the data against a mathematical model.
Next, in "Social Bookmarking for Scholarly Digital Libraries," Umer Farooq and his colleagues study the process by which users save references to objects (in this case, technical papers) for later discovery or for social discovery. In particular, they study tag use on a social bibliography site, CiteULike, and suggest how such a site might be improved.
In the last article, "Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges," Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina study an increasingly prevalent problem in search and information discovery: malicious manipulation of content or metadata to influence results. They survey three common countermeasures against such spam — detection, demotion, and prevention — from the standpoint of social media Web sites and discuss some differences and challenges of fighting spam in this context.
The articles in this special issue represent just a sample of the early findings in this fledgling research area. As users become more familiar with social media, and as service providers gain a better understanding of their users, user behavior and content and metadata characteristics are likely to change, necessitating the continuous reevaluation of what we've learned before.
Social media analysis is well positioned to continue advancing its understanding of content dynamics and metadata generation. At the same time, social media is becoming big business and is driving a significant fraction of worldwide page views on the Web. It is imperative to develop better and more effective technologies to cope with commercial users' ongoing attempts to manipulate the system to their advantage. Likewise, as competition continues to increase in these domains, social search will become a differentiating technology, resulting in continued investment and material advances beyond the state of the art today. The high volume of social media consumption will also raise another critical problem: the monetization of social media sites. Here, the problem is one of searching for relevant advertisements based on user properties as well as on content properties. We expect these two problems, manipulation and monetization, to receive increasing attention over the next few years.
IC's prior and current editors in chief, Robert E. Filman and Fred Douglis, helped us greatly with this issue. We thank them for giving us the opportunity to create a special issue for this new and emerging research area and for their constant commitment and encouragement. We also thank the many academics and practitioners who submitted their excellent research results to this special issue. Finally, numerous reviewers spent an enormous amount of time evaluating the submitted articles and providing thoughtful comments and guidance to the accepted articles. Without their help, it wouldn't have been possible to reach the level of quality the articles in this issue represent.