In the alphabet soup of artificial intelligence (AI) acronyms, NLP is one of many tools that could have a huge impact on data science. NLP stands for natural language processing: the practice of teaching computers to comprehend and interpret human language. It has the potential to unlock a massive trove of unstructured data for analysis, yet it often feels like more of a pipe dream than a reality.
But human language comes with more than just words. It has intent, dialect, slang, and sarcasm that are all difficult for computers to understand and translate into meaningful and consistent results. However, tools are making strides in NLP. In this article, we’ll go over some of the key terms you need to know about NLP and how it works within data science.
Key terms in natural language processing.
There are several different ways NLP tools approach reading and understanding language. Here are some terms you need to understand about the different processes and approaches.
Tokenization: algorithms segment text into sentences or words while ignoring or removing punctuation and extra symbols.
Stop Words Removal: removes common words and articles, such as “and”, “the”, and “a” in English. This reduces the number of words the tool analyzes and keeps the focus on the words that carry the most meaning.
Stemming: uses algorithms to reduce words to their root form, so machines can understand that words like stop, stopped, and stopping all share the same base meaning.
Word embeddings: represents words as vectors of numbers. Words with similar meanings have vectors that sit close together.
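Term Frequency-Inverse Document Frequency: also called TF-IDF, a statistical measure of how important a word is in a specific document. A word scores higher the more often it appears in that document and the less often it appears across the rest of the collection.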
Topic modeling: extracts the main topics from a collection of text data or documents. It is used in processes like extracting underlying trends in data, classifying text, and grouping different text sets.
Sentiment Analysis: identifies and extracts the subjective information contained within text data, such as whether a passage reads as positive, negative, or neutral, so the text can be grouped by sentiment.
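As a rough illustration of several of the terms above, the following Python sketch chains tokenization, stop word removal, a deliberately naive stemmer, and TF-IDF scoring using only the standard library. The stop word list, the suffix rules, and the sample sentences are assumptions made up for this example. Real NLP tools typically use much longer stop word lists and more careful algorithms, such as the Porter stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "at"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "stopp" -> "stop".
    if len(token) > 3 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / total terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [preprocess(doc) for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample documents, for illustration only.
docs = [
    "The patient stopped taking the medication.",
    "The patient is stopping at the clinic.",
    "Medication costs are rising.",
]
stems = [naive_stem(w) for w in ("stop", "stopped", "stopping")]
scores = tf_idf(docs)
```

Here `stems` comes out as `['stop', 'stop', 'stop']`, matching the stemming example above. In `scores[0]`, a term that appears in only one document ranks above "patient", which appears in two, because rarer terms receive a larger inverse document frequency.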
What impact does NLP have on data science?
In many instances, NLP can feel like the Holy Grail of AI tools in data science: so powerful and sought after by so many data scientists, but difficult to actually find. Part of the reason NLP is desirable is that it scales across industries and departments. Nearly every data science team deals with unstructured, free text data that has the potential to provide impactful insights.
There are two ways NLP will greatly impact data science. The first is on the query end: translating questions users ask into actionable intelligence. If any analyst or business user could type in queries and NLP tools could understand the free-text question and serve up actionable insight, it could be business-changing. It would open up data tools to less technical business users, letting them find their own insights without needing to understand the tools or data behind them. Some data science tools out there, like Domo, have chatbots already working on their platforms.
The second way NLP impacts data science is on the data ingestion side. For example, the medical field holds abundant data on people’s health. But the critical details of a person’s interactions with their medical provider can be difficult to capture because doctors so often record their visit notes as free text. NLP tools can be a critical cog in helping get the complete data picture on a person’s health.
There are other job functions that use NLP:
News websites can use NLP to track what users are searching for and serve up more relevant content on their front pages.
HR teams can use NLP to filter candidate resumes and online job posting responses to better prioritize candidates beyond just how well they match keywords.
Financial traders would benefit from real-time NLP analysis of conversations about key companies and potential trades or mergers.
From marketing teams that want to analyze brand sentiment on social media to researchers parsing through free text responses on surveys, it’s easy to imagine what data science could do with NLP tools that give them easy access to unstructured text data.
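To make the brand sentiment use case a little more concrete, here is a minimal Python sketch of lexicon-based sentiment scoring. The word lists and sample posts are invented for illustration, and the approach is an assumption chosen for simplicity. Real sentiment tools generally rely on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"love", "great", "fast", "helpful", "easy"}
NEGATIVE = {"hate", "slow", "broken", "confusing", "terrible"}


def sentiment_score(text):
    """Return (positive hits - negative hits) / total words, or 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def label(text):
    """Map the numeric score onto a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up social media posts, for illustration only.
posts = [
    "Love the new dashboard, setup was easy!",
    "The app is slow and the export is broken.",
    "Just renewed our subscription.",
]
labels = [label(p) for p in posts]
```

With these word lists, `labels` is `['positive', 'negative', 'neutral']`. The sketch also shows why sarcasm is hard: `label("Oh great, it crashed again.")` returns `"positive"`, because the scorer only counts words and has no sense of tone.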
Best practices for natural language processing in data science.
Don’t rely on it entirely. It’s still in its infancy.
There are high-profile examples of how NLP tools show a lot of promise but still require a lot of tweaking. Google Flu Trends launched with much fanfare in 2008, using search data to estimate flu activity. But it struggled with accuracy, notably overestimating the 2012-2013 flu season, and Google stopped publishing its estimates in 2015.
Consider the ethics of NLP.
Microsoft researchers once demonstrated that by analyzing search engine queries they could identify users who had pancreatic cancer, some before they had received a diagnosis. As with all AI, the potential for good here is mind-blowing. However, telling someone they may have a life-threatening disease raises serious ethical questions, especially because AI analysis of unstructured data can come with a huge error rate.
As with all AI tools, the potential for NLP is huge. But the reality currently requires users to focus on smaller and more niche applications of the technology. Rather than looking at every part of your unstructured, free text data, train NLP tools to look in specific areas for specific things.