NEWS


News Archive

July 2012

Gig.U Project Aims for an Ultrafast US Internet

June 2012

Bringing Location and Navigation Technology Indoors

May 2012

Plans Under Way for Roaming between Cellular and Wi-Fi Networks

Encryption System Flaw Threatens Internet Security

April 2012

For Business Intelligence, the Trend Is Location, Location, Location

Corpus Linguistics Keep Up-to-Date with Language

March 2012

Are Tomorrow's Firewalls Finally Here Today?

February 2012

Spatial Humanities Brings History to Life

December 2011

Could Hackers Take Your Car for a Ride?

November 2011

What to Do about Supercookies?

October 2011

Lights, Camera, Virtual Moviemaking

September 2011

Revolutionizing Wall Street with News Analytics

August 2011

Growing Network-Encryption Use Puts Systems at Risk

New Project Could Promote Semantic Web

July 2011

FBI Employs New Botnet Eradication Tactics

Google and Twitter "Like" Social Indexing

June 2011

Computing Commodities Market in the Cloud

May 2011

Intel Chips Step up to 3D

Apple Programming Error Raises Privacy Concerns

Thunderbolt Promises Lightning Speed

April 2011

Industrial Control Systems Face More Security Challenges

Microsoft Effort Takes Down Massive Botnet

March 2011

IP Addresses Getting Security Upgrade

February 2011

Studios Agree on DRM Infrastructure

January 2011

New Web Protocol Promises to Reduce Browser Latency

To Be or NAT to Be?

December 2010

Intel Gets inside the Helmet

Tuning Body-to-Body Networks with RF Modeling

November 2010

New Wi-Fi Spec Simplifies Connectivity

Expanded Top-Level Domains Could Spur Internet Real Estate Boom

October 2010

New Weapon in War on Botnets

September 2010

Content-Centered Internet Architecture Gets a Boost

Gesturing Going Mainstream

August 2010

Is Context-Aware Computing Ready for the Limelight?

Flexible Routing in the Cloud

Signal Congestion Rejuvenates Interest in Cell Paging-Channel Protocol

July 2010

New Protocol Improves Interaction among Networked Devices and Applications

Security for Domain Name System Takes a Big Step Forward

The ROADM to Smarter Optical Networking

Distributed Cache Goes Mainstream

June 2010

New Application Protects Mobile-Phone Passwords

WiGig Alliance Reveals Ultrafast Wireless Specification

Cognitive Radio Adds Intelligence to Wireless Technology

May 2010

New Product Uses Light Connections in Blade Server

April 2010

Browser Fingerprints Threaten Privacy

New Animation Technique Uses Motion Frequencies to Shake Trees

March 2010

Researchers Take Promising Approach to Chemical Computing

Screen-Capture Programming: What You See is What You Script

Research Project Sends Data Wirelessly at High Speeds via Light

February 2010

Faster Testing for Complex Software Systems

IEEE 802.1Qbg/h to Simplify Data Center Virtual LAN Management

Distributed Data-Analysis Approach Gains Popularity

Twitter Tweak Helps Haiti Relief Effort

January 2010

2010 Rings in Some Y2K-like Problems

Infrastructure Sensors Improve Home Monitoring

Internet Search Takes a Semantic Turn

December 2009

Phase-Change Memory Technology Moves toward Mass Production

IBM Crowdsources Translation Software

Digital Ants Promise New Security Paradigm

November 2009

Program Uses Mobile Technology to Help with Crises

More Cores Keep Power Down

White-Space Networking Goes Live

Mobile Web 2.0 Experiences Growing Pains

October 2009

More Spectrum Sought for Body Sensor Networks

Optics for Universal I/O and Speed

High-Performance Computing Adds Virtualization to the Mix

ICANN Accountability Goes Multinational

RFID Tags Chat Their Way to Energy Efficiency

September 2009

Delay-Tolerant Networks in Your Pocket

Flash Cookies Stir Privacy Concerns

Addressing the Challenge of Cloud-Computing Interoperability

Ephemeralizing the Web

August 2009

Bluetooth Speeds Up

Grids Get Closer

DCN Gets Ready for Production

The Sims Meet Science

Sexy Space Threat Comes to Mobile Phones

July 2009

WiGig Alliance Makes Push for HD Specification

New Dilemmas, Same Principles:
Changing Landscape Requires IT Ethics to Go Mainstream

Synthetic DNS Stirs Controversy:
Why Breaking Is a Good Thing

New Approach Fights Microchip Piracy

Technique Makes Strong Encryption Easier to Use

New Adobe Flash Streams Internet Directly to TVs

June 2009

Aging Satellites Spark GPS Concerns

The Changing World of Outsourcing

North American CS Enrollment Rises for First Time in Seven Years

Materials Breakthrough Could Eliminate Bootups

April 2009

Trusted Computing Shapes Self-Encrypting Drives

March 2009

Google, Publishers to Try New Advertising Methods

Siftables Offer New Interaction Model for Serious Games

Hulu Boxed In by Media Conglomerates

February 2009

Chips on Verge of Reaching 32 nm Nodes

Hathaway to Lead Cybersecurity Review

A Match Made in Heaven: Gaming Enters the Cloud

January 2009

Government Support Could Spell Big Year for Open Source

25 Reasons For Better Programming

Web Guide Turns PlayStation 3 Consoles into Supercomputing Cluster

Flagbearers for Technology: Contemporary Techniques Showcase US Artifacts and European Treasures

December 2008

.Tel TLD Debuts As New Way to Network

Science Exchange

November 2008

The Future is Reconfigurable

Corpus Linguistics Keep Up-to-Date with Language

by George Lawton

Language is a living thing that changes regularly. New words are added, and others become archaic. Words take on new meanings, and some old uses fall out of fashion.

This can leave traditional dictionaries years behind in terms of current language usage, said Erin McKean, founder of Wordnik, a corpus-linguistic database.

Creating traditional dictionaries is a huge project, with linguistic experts and others studying the language to determine which additions and changes are appropriate. Thus, new dictionaries can appear only every few years.

For more up-to-date information, a growing number of researchers and organizations are turning to corpus linguistic (CL) techniques to analyze large quantities of text and speech and automatically generate language databases. These databases provide information about words' various meanings, usages, and contexts.

CL applications send bots to automatically search news feeds, blogs, and other sources on the Internet — as well as previously curated and tagged bodies of text — for new instances of language usage.

They also provide more metadata than natural-language-processing (NLP) techniques do, as well as information about the sentiments associated with various word usages and the contexts in which words take on certain meanings.

Researchers are beginning to use CL tools to improve dictionaries, language learning, and the sorting of large document sets based on various criteria.

In the future, those tools could enable more accurate speech recognition and speech synthesis, said University of Georgia professor William Kretzschmar.

Computer processing and storage advances are enabling systems to analyze and index ever larger datasets, noted University of Pennsylvania linguistics professor Mark Liberman.

Categorizing the data and its associated metadata enables CL systems to find more precise relationships among words, which leads to more accurate document processing and speech recognition.

Linguistic Databases in Detail

CL emerged from British linguistics professor John Firth's research in the first half of the 20th century.

This work led to the release of the one-million-word Brown University Standard Corpus of Present-Day American English in the 1960s.

Interest in CL techniques was minimal for years because there wasn’t enough computational power and storage available to maximize their effectiveness.

Interest increased as processing and storage capabilities grew, leading to the Collins Birmingham University International Language Database (COBUILD), a University of Birmingham research facility established in 1980 and funded by the publisher William Collins & Sons. The facility released the Collins COBUILD English Language Dictionary in 1987.

The BNC Consortium, an industrial/academic organization led by the Oxford University Press, completed the 100-million-word British National Corpus in 1994.

In 2008, Google produced a linguistic dataset from Google Books with 155 billion words but relatively little metadata, noted the University of Georgia's Kretzschmar.

Technology

CL applications use bots that aggregate online material such as news feeds, transcripts of spoken presentations, academic journals, and books.

They also use data-mining tools — such as Oxford University Press WordSmith Tools or the open-source AntConc — that search through curated material, scan textual datasets, and categorize word usage across multiple domains, cultures, and individuals.

A linguist can curate content by deciding which sources to sample, a process that could involve manually collecting material or sending bots to specific websites. Curators must select useful sets of sources to ensure that their CL linguistic databases are effective.

After they categorize words, CL programs add relevant metadata and then organize the collection into a linguistic database.

In some cases, the programs generate indexes that produce faster responses to queries but that could also increase the amount of material in a CL database tenfold, noted Brigham Young University (BYU) linguistics professor Mark Davies.
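
To make the indexing step concrete, here is a minimal Python sketch that builds a positional index over a tiny, hand-curated corpus; the sources, metadata fields, and index layout are illustrative assumptions rather than any vendor's actual pipeline.

    # Minimal sketch: build a positional index over a tiny, hand-curated corpus.
    # The sources, metadata fields, and index layout are illustrative only.
    import re
    from collections import defaultdict

    corpus = [
        {"source": "news-feed", "year": 2012,
         "text": "The bank raised interest rates again this quarter."},
        {"source": "blog", "year": 2012,
         "text": "We walked along the river bank at sunset."},
    ]

    def tokenize(text):
        """Lowercase and split on non-letter characters."""
        return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

    # word -> list of (document id, token position); metadata stays on the documents
    index = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for pos, token in enumerate(tokenize(doc["text"])):
            index[token].append((doc_id, pos))

    # Queries now answer from the index instead of rescanning the text, but the
    # postings lists can easily outweigh the raw corpus they describe.
    print(index["bank"])                           # [(0, 1), (1, 5)]
    print(corpus[index["bank"][1][0]]["source"])   # 'blog'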

CL techniques determine definitions by analyzing words near, as well as next to, the target words in text. This differs from NLP, which generally analyzes only words adjacent to the targets.

More processing power is thus required to assemble and index a CL database.

However, said the University of Georgia's Kretzschmar, this approach makes it easier to find relevant definitions and improves text processing, speech recognition, and speech generation.
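
The distinction between nearby and strictly adjacent words can be illustrated with a short Python sketch; the window size and raw frequency counts are assumptions for illustration, and production CL tools typically add statistical association measures on top.

    # Count collocates of a target word within a +/-4-word window and compare
    # them with strictly adjacent words. Window size and scoring are illustrative.
    from collections import Counter

    tokens = ("we opened an account at the local bank and the bank "
              "teller explained the interest rates").split()

    def collocates(tokens, target, window):
        """Count words appearing within `window` positions of each target hit."""
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
        return counts

    print(collocates(tokens, "bank", window=1).most_common(3))  # adjacent words only
    print(collocates(tokens, "bank", window=4).most_common(3))  # wider CL-style window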

Wordnik

The Wordnik database combines CL and NLP techniques to generate a dictionary with about 6 million fully indexed and 50 million partially indexed English words.

This compares with about 600,000 words in the Oxford English Dictionary, one of the largest traditional English dictionaries.

To update its database, Wordnik uses its own technology that analyzes multiple news feeds and data gleaned from various news, cultural, and scientific sources.

The technology processes the data into word graphs stored in a MongoDB nonrelational database. These graphs present visualizations of the relationships among collections of words and thus are similar to social-network graphs, Wordnik's McKean explained. They let users more easily look for and identify associations among words.
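
A rough sketch of what storing word-graph edges in a MongoDB collection might look like appears below; the database name, collection name, and document schema are hypothetical and are not Wordnik's actual design. It assumes a local MongoDB server and the pymongo driver.

    # Hypothetical word-graph storage in MongoDB; schema and names are invented.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")   # assumes a local server
    edges = client["corpus_demo"]["word_edges"]

    # Each document is one weighted edge between two words, tagged with the
    # kind of relationship observed in the corpus.
    edges.insert_many([
        {"word": "bank", "neighbor": "river", "relation": "collocate", "count": 12},
        {"word": "bank", "neighbor": "loan", "relation": "collocate", "count": 31},
        {"word": "bank", "neighbor": "institution", "relation": "hypernym", "count": 4},
    ])
    edges.create_index([("word", ASCENDING), ("count", ASCENDING)])

    # Neighbors of "bank", strongest associations first.
    for edge in edges.find({"word": "bank"}).sort("count", -1):
        print(edge["neighbor"], edge["count"])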

The company provides an API that lets organizations and individuals design programs that access the Wordnik database for various purposes, such as researching word usage and providing websites with a better dictionary for their content.
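
A hedged example of calling such an API from Python follows; the endpoint path, query parameters, response fields, and placeholder key are assumptions based on Wordnik's published v4 REST interface, so the current documentation at developer.wordnik.com should be treated as authoritative.

    # Illustrative Wordnik API client; the v4 path, parameters, and response
    # fields used here are assumptions -- consult the official documentation.
    import requests

    API_KEY = "YOUR_WORDNIK_API_KEY"   # placeholder; obtain a real key from Wordnik

    def definitions(word, limit=3):
        """Fetch up to `limit` definitions for `word` from the Wordnik API."""
        url = f"https://api.wordnik.com/v4/word.json/{word}/definitions"
        resp = requests.get(url, params={"limit": limit, "api_key": API_KEY},
                            timeout=10)
        resp.raise_for_status()
        # Each item is expected to be a dict with partOfSpeech and text fields.
        return [(d.get("partOfSpeech"), d.get("text")) for d in resp.json()]

    for pos, text in definitions("serendipity"):
        print(pos, "-", text)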

Corpus of Contemporary American English, 1990–2011

BYU's Mark Davies and his team have amassed the Corpus of Contemporary American English, 1990–2011, the world's largest open source CL database.

COCA has 425 million words and adds 25 million new entries annually. It is not as large as some commercial CL databases, such as one that Ebsco Publishing has developed.

COCA's database consists of five categories with 85 million words each: fiction, academic, popular culture, news, and spoken English. Metadata about entries' literary genre, context, and era of popularity helps identify word-usage trends.

COCA also lets users see words that were located near a new entry in textual sources, helping them identify how the new word is used in context.
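
A keyword-in-context (KWIC) view of this kind can be sketched in a few lines of Python; the sample sentence, target word, and column widths are invented for illustration and do not come from COCA.

    # Minimal keyword-in-context display: show each hit with its left and
    # right context, aligned for easy scanning.
    def kwic(tokens, target, width=3):
        """Yield formatted context lines for each occurrence of `target`."""
        for i, tok in enumerate(tokens):
            if tok == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                yield f"{left:>30} | {tok} | {right}"

    sample = ("critics argue the new selfie culture rewards the perfect selfie "
              "over genuine conversation").split()
    for line in kwic(sample, "selfie"):
        print(line)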

According to Davies, COCA is used primarily for teaching English; conducting academic linguistics research; and improving human-computer interaction by, for example, enabling better speech recognition and more realistic text-to-speech synthesis.

Most of COCA's data is available for free at www.wordfrequency.info. A richer set of information with more metadata is also available for commercial licensing at www.wordfrequency.info/purchase.asp.

Talking about the Future

A hurdle to widespread CL-database adoption is that the technology can require considerable manual effort to curate content and tag information with metadata, Kretzschmar said.

Indexing CL-database content is also a challenge. Research is thus focused on developing better algorithms and parallel-processing techniques for indexing.

The University of Pennsylvania's Liberman said better techniques are also needed for determining the sentiment a word reflects and its usage in different contexts, as well as for word-sense disambiguation, which associates individual words with their multiple meanings in different settings.

The University of Georgia's Kretzschmar predicted that CL will enable many types of speech-recognition and speech-synthesis applications because the metadata the technology provides will allow greater linguistic precision.

He also said companies will use CL techniques to scan and then sort documents for compliance with governmental record-storage, auditing, and reporting requirements.

According to BYU's Davies, CL's Holy Grail lies in analyzing the large datasets of informal language used on social-networking sites such as Facebook and in ordinary users' day-to-day online communications.

However, he said, it will be at least 15 years before the available computing horsepower, storage capacity, and analysis techniques enable this.