Text Analytics with Text Mining and NLP

Have you ever commented on a social platform about a new product you bought or a movie you watched? Data like this, found all over the web, is key to understanding customer behavior and is actively sought by companies and product owners. Most of it is text, i.e. unstructured data, and this is where text mining comes into the picture. The use of text mining has grown as unstructured data continues to increase exponentially in both relevance and quantity. Text mining extracts useful information from text using a variety of techniques; the goal is to turn raw text into meaningful data for analysis by applying analytical methods and Natural Language Processing (NLP). Its ability to reveal insights even in very large volumes of unstructured data is driving its rapid adoption in the business world!

With text analytics, you can spot patterns in massive collections of textual data that an individual human mind could never detect. On the other hand, text often contains grammatical errors, slang, sarcasm, double meanings and so on, which humans interpret naturally but machines must be trained to handle. This means that before a machine can perform pattern recognition, it needs to extract the linguistic meaning from the text. In other words, the machine must read the text before it can truly analyze it.

The best way to do that is with natural language processing tools. NLP transforms text through linguistic analysis. The goal is to achieve human-like language processing by having the machine learn and interpret language the way our brains have been tuned to interpret it.


Where do we mine text from?
Typically from social networking sites, news, company reports, magazines, wikis, blogs, community forums and discussion boards, reviews and ratings, e-mail, and knowledge management repositories.
The source data goes through various pre-processing and core mining techniques. The text processing workflow can be visualized as a pipeline, running from raw data to output.
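To make the pipeline idea concrete, here is a minimal sketch in plain Python: each stage is an ordinary function, and the pipeline just applies them in order. The stage names and toy implementations are illustrative, not any particular library's API.

```python
# Toy text-processing pipeline: raw string in, tokens out.
# Each stage is a plain function applied in sequence.

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Keep only letters, digits and whitespace.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def tokenize(text):
    return text.split()

def run_pipeline(raw, stages):
    data = raw
    for stage in stages:
        data = stage(data)
    return data

tokens = run_pipeline("Text mining turns raw text into data!",
                      [lowercase, strip_punctuation, tokenize])
print(tokens)  # ['text', 'mining', 'turns', 'raw', 'text', 'into', 'data']
```

Real pipelines swap in stages like those described below (tokenization, stopword removal, stemming), but the shape stays the same.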

Text Analytics Steps

I will share with you the steps I followed and some of the powerful text mining techniques I applied.

In Python, there is an excellent package called NLTK, the Natural Language Toolkit. NLTK helps you select, filter, clean and standardize the data before running the analysis. The NLTK and NLP techniques I used were sentence splitting, noun phrase chunking, lemmatization, stemming, POS tagging, tokenization, parsing, named entity recognition, text classification and relation extraction.


Tokenization – Tokenization is the first step toward structuring text for further analysis; otherwise the text is treated as one unstructured “string.” Once the string has been broken into individual words, or “tokenized,” you can determine word frequency in the document and begin identifying relationships between words. Both sentence and word tokenization techniques are used.


Sentence tokenization (or boundary detection) – finds sentence breaks in a given text. Sentences are usually separated by punctuation marks (like “.”, “!” or “?”), but you also need to be aware of the context in which the punctuation is used. For example, you cannot end a sentence at a punctuation mark used in an abbreviation or a compound word (like “Dr. Smith” or “quick-thinking”).

Example of sentence tokenizing with nltk:

from nltk import sent_tokenize

sentence = "Dr. Hawking is a great scientist who always solves the problems of his fellow-citizens by means of his scientific theories."

print(sent_tokenize(sentence))


['Dr. Hawking is a great scientist who always solves the problems of his fellow-citizens by means of his scientific theories.']


Word tokenization – separates continuous text into individual words. In most languages, words are separated by spaces, but for languages like Japanese, where words aren’t delimited by spaces, this can be a problem!

Example of word tokenizing with nltk on the sentence “He solves problems with his inventions and his quick-thinking.” With the help of nltk, we get a list of all the words in the sentence, including hyphenated compounds.

from nltk import word_tokenize

text = "He solves problems with his inventions and his quick-thinking."

print(word_tokenize(text))


['He', 'solves', 'problems', 'with', 'his', 'inventions', 'and', 'his', 'quick-thinking', '.']


Stopword removal – Textual data contains lots of small, insignificant yet highly common words: “to,” “for,” “of,” “the,” etc. If you analyze the frequency of words in a document, these will always rise to the top. Stopword removal filters out these words so you can focus on the longer, more significant ones. There are built-in stopword lists for many languages, and you can also build your own.
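The filtering step itself is simple. Here is a minimal sketch with a small hand-rolled stopword set; NLTK ships fuller lists per language (`nltk.corpus.stopwords`), but the idea is the same: drop any token that appears in the set.

```python
# Minimal stopword filter. The stopword set here is a tiny illustrative
# sample, not a complete list.

STOPWORDS = {"to", "for", "of", "the", "a", "an", "and", "is", "in"}

def remove_stopwords(tokens):
    # Compare case-insensitively so "The" is filtered like "the".
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "goal", "is", "to", "turn", "text", "into", "meaningful", "data"]
print(remove_stopwords(tokens))
# ['goal', 'turn', 'text', 'into', 'meaningful', 'data']
```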


Part-of-speech (POS) tagging (or grammatical tagging) identifies the class of each word: noun, verb, adjective, adverb, etc. Many words, like “outside,” can serve as multiple parts of speech, depending entirely on the context. This is essential because words receive different importance scores depending on their part of speech.

Example: depending on the context, the word “outside” can be a noun, a preposition or an adverb.

The outside (noun) of this bungalow looks impressive.

Dr. Smith rarely works outside (preposition) his clinic.

All the patients are waiting outside (adverb).

The same example can be processed with nltk's word_tokenize and nltk.pos_tag to get the classification:

import nltk
from nltk import word_tokenize

text = word_tokenize("All the patients are waiting outside.")

nltk.pos_tag(text)

[('All', 'PDT'),
 ('the', 'DT'),
 ('patients', 'NNS'),
 ('are', 'VBP'),
 ('waiting', 'VBG'),
 ('outside', 'RB'),
 ('.', '.')]


Stemming – the process of reducing inflected words to their word stem, i.e. root. For example, “solv” is the stem of the words “solve” and “solved.”
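To illustrate the idea, here is a toy suffix-stripping stemmer. This is NOT the full Porter algorithm that `nltk.stem.PorterStemmer` implements; it just shows the core move: peel off common inflectional suffixes to approximate the stem.

```python
# Toy stemmer: strip the first matching suffix, keeping at least
# three characters of stem. Real stemmers (e.g. Porter) apply many
# more rules in several phases.

SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("solve", "solved", "solving"):
    print(w, "->", toy_stem(w))
# solve -> solv, solved -> solv, solving -> solv
```

Note that stems need not be real words: “solv” is not in any dictionary, which is exactly the limitation lemmatization addresses.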


Lemmatization – similar to stemming, lemmatization reduces words to their lemmas (dictionary forms), but it uses vocabulary and morphological analysis to do so. nltk provides a WordNet lemmatizer based on the large WordNet dictionary of English words.
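The lookup-based idea can be sketched with a tiny hand-made dictionary. Real lemmatizers such as nltk's WordNetLemmatizer combine the WordNet vocabulary with morphological rules; the `LEMMA_DICT` below is a hypothetical stand-in that only illustrates the mechanism.

```python
# Minimal dictionary-lookup lemmatizer sketch. Unknown words are
# returned unchanged.

LEMMA_DICT = {
    "solves": "solve",
    "solved": "solve",
    "better": "good",
    "geese": "goose",
}

def lemmatize(word):
    return LEMMA_DICT.get(word.lower(), word)

print(lemmatize("geese"))   # goose
print(lemmatize("better"))  # good
```

The contrast with stemming is visible here: no suffix-stripping rule would ever turn “geese” into “goose,” but a lemmatizer can, because it draws on vocabulary knowledge rather than surface patterns.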

The text is then classified based on the context and the business objectives you are pursuing. For example, since I was building a sentiment analysis algorithm, I was most interested in classifying text as positive or negative sentiment. Once the text classification is done, you can run a whole range of text analytics on top of it.
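The simplest form of sentiment classification is lexicon-based: count positive and negative words and compare. The word lists below are illustrative only; practical systems use learned classifiers (e.g. scikit-learn) or curated lexicons.

```python
# Toy lexicon-based sentiment classifier over a token list.

POSITIVE = {"great", "good", "excellent", "love", "impressive"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "disappointing"}

def sentiment(tokens):
    # Each positive hit adds 1, each negative hit subtracts 1.
    score = sum((t.lower() in POSITIVE) - (t.lower() in NEGATIVE)
                for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The movie was great and the cast was impressive".split()))
# positive
```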


Tools and packages that I used:

Python, NumPy, SciPy, scikit-learn, NLTK, gensim, TextBlob and tweepy, with the Twitter/Facebook APIs for integration


Where can we apply text mining & NLP?

These can be applied in just about any domain and field. A few key areas are highlighted below:

  • Customer Service – Text mining coupled with conversational interfaces like chatbots is being adopted to improve the customer experience. Sources such as trouble tickets, customer call notes and reviews are used to build a database of typical questions and their possible responses. Text analysis is then used to provide an automated response to the customer, reducing dependence on human call center operators and improving the quality, effectiveness and speed of problem resolution.
  • Spam filtering – With the widespread use of e-mail in the corporate world, spam has become a major challenge for service providers, who incur costs on both the software and hardware side to keep it under control. Beyond conventional keyword filtering, modern spam filters examine what is inside the email content to decide whether or not it is spam. Text mining techniques are used to improve the effectiveness of these statistical e-mail filtering methods.
  • Sentiment analysis / Market Research / Customer Analytics – Social media is increasingly recognized as a valuable source of market and customer intelligence. Companies use text mining and analytics to measure brand/product effectiveness and to analyze and predict customer needs. Data is extracted by connecting to social media platforms and fetching unstructured data in the form of tweets and comments, which is then processed to extract sentiments and their relation to brands and products. With sentiment analysis, businesses can measure the ROI of campaigns, review marketing strategy, improve product quality and ultimately grow revenue.
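The statistical spam filtering mentioned above can be sketched with a toy Naive Bayes classifier built from scratch. The two-message training corpora are invented for illustration; production filters train on large corpora and typically use libraries such as scikit-learn's MultinomialNB.

```python
# Toy Naive Bayes spam filter: compare Laplace-smoothed log-likelihoods
# of a message under a "spam" and a "ham" word model.

from collections import Counter
import math

spam = ["win money now", "free money offer"]
ham = ["meeting at noon", "project status report"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    vocab = set(w for d in spam + ham for w in d.split())
    # Return a log-probability function with add-one smoothing.
    return lambda w: math.log((counts[w] + 1) / (total + len(vocab)))

spam_lp, ham_lp = train(spam), train(ham)

def is_spam(message):
    words = message.split()
    return sum(spam_lp(w) for w in words) > sum(ham_lp(w) for w in words)

print(is_spam("free money"))      # True
print(is_spam("status meeting"))  # False
```

With equal-sized corpora the class priors cancel, so they are omitted here for brevity.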