20110222

Catherine Havasi: Digital Intuition through Natural Language Processing

ConceptNet Map describing relationships between concepts
20110221

As my faithful readers will know, I have recently become very interested in data processing.  This season's ATLAS lecture series has been excellent so far, and I couldn't pass this one up.  Catherine Havasi conducts research at the MIT Media Lab in the field of Natural Language Processing (NLP), the field that is defining the way we use computers to analyze data, specifically linguistic data.  For the past 12 years, she has been working on a research project, Open Mind Common Sense, which shifts the focus toward Natural Language Understanding.  Using assumptions about how things work and about the connections between objects and concepts, the project can draw conclusions from very large data sets.  She presents these conclusions in AnalogySpace, a vector space visualized in three dimensions (built in Processing), and in a database known as ConceptNet.

AnalogySpace
A few common themes have been emerging in this year's lecture series.  For one, the analysis of linguistic principles has become more and more prominent.  This makes sense, as the majority of our programming experiences and interactions rely on concepts like NLP and on the simple idea that coding requires grammar and syntax.  The rules of language apply here just as much as, if not more than, in 'real world' interactions.

The other theme I have been noticing more and more is the use of video games to conduct research.  Ian Bogost's lecture on Art and the Constraints of Programming used video game platforms to examine their limitations and their place within art and society.  Catherine uses video games to collect data and have it analyzed by real human beings.  The project does this in half a dozen languages and processes the results in about 150 dimensions.


Cat vs. Dog vs. Airplane
The program that is the 'face' of her research is called AnalogySpace.  This program analyzes shifts in language relationships (along up to 150 dimensions, including time) that occur on the internet in sources such as Twitter, blogs, and other online conversation.  The program then takes each vector, which describes an object, idea, or concept, and compares it with another: for example, Cat vs. Dog, Cat vs. Airplane, or the big picture, Dog vs. Cat vs. Airplane.  The features that describe an object and the relationships it can form define its vector.  Simple (if you understand it) vector addition gives a final representation of the comparison in question.  Remember, this goes up to 150 dimensions, so displaying this information in a 3D space (which is actually 2D on a computer screen) is difficult, if not impossible.
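To make the comparison idea a little more concrete, here is a minimal sketch in Python.  It is a toy illustration, not AnalogySpace's actual code: the feature list, the +1/-1 values, and the function names are all invented, and cosine similarity is an assumed choice of comparison measure rather than something stated in the lecture.

```python
import math

# Toy illustration only: these features and +1/-1 values are invented,
# not taken from ConceptNet or AnalogySpace.
FEATURES = ["is an animal", "is a pet", "has fur", "can fly",
            "is a machine", "purrs", "can move"]

CONCEPTS = {
    # +1 = the concept has the feature, -1 = it does not
    "cat":      [1, 1, 1, -1, -1, 1, 1],
    "dog":      [1, 1, 1, -1, -1, -1, 1],
    "airplane": [-1, -1, -1, 1, 1, -1, 1],
}


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity: 1 = same direction, -1 = opposite direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


cat, dog, airplane = (CONCEPTS[name] for name in ("cat", "dog", "airplane"))

# Pairwise comparisons
print("cat vs. dog:     ", round(cosine(cat, dog), 3))
print("cat vs. airplane:", round(cosine(cat, airplane), 3))

# "Big picture": add cat and dog into one combined vector,
# then compare that sum with airplane.
pets = add(cat, dog)
print("cat+dog vs. airplane:", round(cosine(pets, airplane), 3))
```

With these made-up numbers, cat and dog point in nearly the same direction while both point away from airplane, which is the intuition behind the Cat vs. Dog vs. Airplane picture.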

The best example she gave is presented here.  Along the y-axis is Good vs. Bad; along the x-axis is the feasibility of each action occurring.  Examine the picture for a while, and you will see some amusing relationships, as well as begin to understand how AnalogySpace works.  Look for "know everything" or "die in war", for example.  The goal of all this research is to build a system, based on human input, to make sense of everyday life.
Good/Bad (y-axis) vs. Impossible/Possible (x-axis)
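One way to picture how a many-dimensional vector ends up as a point on a two-axis chart is projection onto two chosen directions.  The sketch below assumes exactly that; the vectors, the axis directions, and the helper names are hypothetical, and the quadrants where the two actions land are a guess at where one would expect them, not values read off the picture.

```python
import math

# Hypothetical 4-dimensional vectors for two actions named in the picture.
# The numbers are invented; real vectors would have up to 150 dimensions.
ACTIONS = {
    "know everything": [-0.9, 0.8, 0.3, 0.1],
    "die in war":      [0.7, -0.9, -0.2, 0.4],
}

# Assumed axis directions in that same space (also invented).
POSSIBLE_AXIS = [1.0, 0.0, 0.0, 0.0]  # x-axis: impossible (-) to possible (+)
GOOD_AXIS = [0.0, 1.0, 0.0, 0.0]      # y-axis: bad (-) to good (+)


def project(vector, axis):
    """Scalar projection of vector onto the direction of axis."""
    axis_length = math.sqrt(sum(a * a for a in axis))
    return sum(v * a for v, a in zip(vector, axis)) / axis_length


def to_2d(vector):
    """Collapse a high-dimensional vector to an (x, y) point."""
    return project(vector, POSSIBLE_AXIS), project(vector, GOOD_AXIS)


def describe(point):
    """Name the quadrant a point falls in."""
    x, y = point
    feasibility = "possible" if x >= 0 else "impossible"
    value = "good" if y >= 0 else "bad"
    return f"{value}, {feasibility}"


for name, vector in ACTIONS.items():
    point = to_2d(vector)
    print(f"{name}: {point} -> {describe(point)}")
```

Everything outside the two chosen directions is simply thrown away, which is why squeezing 150 dimensions onto a screen loses so much information.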
The dataset relationships are merely representations of life, but they provide key sociological and cultural insight into what is happening in the world.  Companies like Google and Yahoo have used NLP for years to analyze the way we use language to find information on the internet.  However, projects like this one are designed to evolve the system beyond processing into understanding.  In other words, Catherine is developing the search engine of the future.  What it will look like, I cannot say, but for it to be successful, computers must understand the ever-changing way in which humans communicate.

"The goal is to find the signal in the noise."