The Challenge of Natural Language Processing with Big Data

7th July 2015

You have mass quantities of every sort of data imaginable. And yet, to tease any insight out of it, you have to learn the equivalent of SQL or hire a team of data scientists to write queries for you full time. What is the alternative to this? It is called natural language processing, and it is the future of Big Data. Specifically, Big Data will become Big Democratized Data when computers and servers and services can understand regular English; that is, when users can ask questions in conversational language.

However, natural language processing is a very challenging field and still has a long way to go before it becomes a mainstream way to interact with data. Why is having a computer understand conversational questions such a difficult problem to solve? There are three main reasons:

  • Human languages are varied and heterogeneous. Human brains can learn to understand the many dialects and idioms and get used to teasing meaning out of them, but computers are literal devices. They do not know how to adjust for these sorts of idiosyncrasies, so people almost have to learn a separate dialect just to interact with the computer. For example, have you ever sat up a little straighter and modified your voice when you called an airline and tried to spell out your confirmation code over the phone? That sort of forced inflection and constrained language has to be overcome before natural language processing can really take off.
  • The intent of a question is not always clear. The easiest way to explain this is that users do not know what they do not know, so by asking a certain question a certain way, they may want a piece of information or a set of data that they do not even know exists. For example, if you were to ask a computer, “what is the average age of my youngest buyers?” the software has to figure out what product you are referring to, how to separate that buyer list into useful age cohorts, how to calculate the age, and how to account for customers for whom you have no age data. That is a lot of information to glean from a simple English statement, and yet those calculations and algorithms must be run every time a natural language query is issued. Meanwhile, the user may have just wanted to know whether the key 25-49 demographic skewed higher or lower across everything, but he did not know how to state that question in a way the computer would understand, so he asked something different. Additionally, the computer must be as aware, based on the context of the question, of what is not being asked as it is of what is being asked. Sometimes that distinction is a nuance, but sometimes it makes all the difference between finding out what you want to know and finding out something irrelevant to the current query.
  • There is more to natural language processing than just the words. In addition to intent and structure, there is the sentiment or emotion involved in a statement. People can be excited, nervous, angry, or frustrated, and they speak differently and choose different words and key phrases when each of these emotions is present. Natural language processing algorithms must cut through this emotion to get to the point of a query. But there is also predictive value in that emotion, and these algorithms can use it to build context and predict questions a user might want to ask. This is particularly helpful when analyzing data around certain situations: when something goes wrong with a customer experience, when a deal is about to close, or when a customer might be about to default on his or her obligations.
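To make the intent problem above concrete, here is a minimal, hypothetical sketch of the kind of translation a natural language query engine must perform: mapping a conversational question onto a structured query. Real systems use far more sophisticated statistical models; every keyword table and field name below is invented purely for illustration.

```python
# Hypothetical rule-based intent parsing: turn a plain-English question
# into a structured query description. All vocabulary here is made up.

AGGREGATES = {"average": "AVG", "total": "SUM", "count": "COUNT"}
FIELDS = {"age": "customer_age", "sales": "sales_amount"}

def parse_question(question):
    """Map recognized keywords in a question to query components."""
    tokens = question.lower().replace("?", "").split()
    query = {"aggregate": None, "field": None}
    for token in tokens:
        if token in AGGREGATES:
            query["aggregate"] = AGGREGATES[token]
        if token in FIELDS:
            query["field"] = FIELDS[token]
    return query

result = parse_question("What is the average age of my youngest buyers?")
print(result)  # {'aggregate': 'AVG', 'field': 'customer_age'}
```

Note what this toy version cannot do: it has no idea what “youngest buyers” means, which cohort boundaries to draw, or which product is in scope. Resolving exactly those ambiguities is the hard part the bullet points above describe.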

The best example of natural language processing I know on the market today is Microsoft’s Power Query, which uses Excel Services and PowerPivot along with Office 365 to let users type a limited set of natural language queries about their data. For example, a user can type “best selling drink per region” into a search box, and the service will return a map of the relevant geographic area with the most popular item highlighted in the correct location. A preview of the future, if you will.

Are you using natural language processing applications? What would you like to be able to ask of a computer when it comes to your data?
