Aim:
The aim of this blog is to provide a detailed guide showing how, by combining the techniques of Natural Language Processing and Machine Learning into sentiment analysis, you can analyze people's reviews of a specific product on social media platforms such as Twitter, Facebook, and Instagram.
What will you need:
Before following along with this blog, you will need:
- Python 3.5 (or any later version), available at https://www.python.org/downloads/
- NLTK 3.4.5 (or any later version), which you need to install and then download data for. To install it, open Windows PowerShell and run the command “pip install nltk”. Once the installation is complete, open the Python IDLE, type “import nltk”, and then run the command “nltk.download()”.
- PyCharm Community (or any other edition, of any version).
Outline:
- History
- Methodology
- Conclusion
History:
The Web is a huge source of information where people exchange ideas and share information about a specific product, topic, or brand. People mostly share their opinions on review and rating sites. With the rapid increase in the use of these sites, companies have also turned their attention to them for publicity. These sites contain a huge amount of data, and manual analysis of such big data is impossible. So analysts started looking for a way to analyze this huge amount of data automatically, and their efforts led them to Sentiment Analysis.
“Sentiment Analysis is the computational study of public moods and interests shared in the form of text.”
Nowadays, Sentiment Analysis is very popular among organizations. With the help of this blog, you will be able to see how a text can be analyzed with a combination of Natural Language Processing and Machine Learning techniques.
Methodology:
You can perform sentiment analysis of any text by passing it through the following phases:
- Data Extraction
- Preprocessing
- Subjectivity Classification
- Sentiment Analysis
Data Extraction:
Data extraction is an important phase of any such process. The accuracy of the result depends largely on the data extracted from social media (Facebook, Twitter, etc.). But extraction alone is not enough; we need to extract useful tweets or posts so that our results will be accurate. For loading data in Python, you can use the code given below:
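The original snippet is missing from the post, so here is a minimal sketch instead. It assumes the extracted reviews have already been saved to a CSV file with one review per row; the file name and column layout are placeholders, not something specified by the original post.

```python
import csv

def load_reviews(path):
    """Read reviews from a CSV file (one review per row, text in column 0)."""
    reviews = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if row:                    # skip blank rows
                reviews.append(row[0])
    return reviews

# Example usage (the file name "reviews.csv" is a placeholder):
# reviews = load_reviews("reviews.csv")
```

The resulting list of review strings is what the later phases operate on.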

Preprocessing:
“Preprocessing is the computational cleaning and preparation of data so that a machine can use it for further processing.” For Sentiment Analysis, the commonly used preprocessing techniques are:
- Tokenization
- Stopwords Removal
- Lemmatization
- Parts of Speech Tagging
Tokenization:
Processing a whole sentence at once is very difficult for a machine, so the data is first divided into smaller linguistic units known as tokens or lexicons.
“The process of splitting a large amount of text into smaller linguistic units is known as tokenization.”
For this purpose, NLTK (the Natural Language Toolkit) provides tokenizers with which we can tokenize large amounts of data. NLTK offers several different tokenizers, but since this blog is about sentiment analysis of social media reviews, we will use the word tokenizer. Below is the code showing how you can use it to create tokens:

Stopwords Removal:
In computing, clean data is required for processing, so after creating tokens we clean the text using a Natural Language Processing technique known as stopword removal.
“Stopwords are words that carry little or no meaning on their own.”
NLTK provides stopword lists for many languages. The English stopwords provided by NLTK are:

To remove stopwords, you can use the code given below:

Lemmatization:
“Lemmatization is the process of grouping together the inflected forms of a word so that they can be analyzed as a single item.”
Text processing includes both stemming and lemmatization. People often confuse these two terms and treat them as the same. Lemmatization is usually preferred over stemming because it performs a morphological analysis of the words: stemming cuts words down to stems without checking whether the meaning changes, while lemmatization produces lemmas that are checked against a dictionary, so the meaning is preserved. For this purpose we use NLTK's WordNetLemmatizer. You can create lemmas with WordNetLemmatizer using the code given below:

Parts of Speech Tagging:
“The process of assigning parts of speech tags to each individual token according to the grammatical category it holds is known as parts of speech tagging”.
Part-of-speech tagging sorts the tokens into grammatical categories that can then be used for subjectivity classification. The categories include verbs, adverbs, nouns, pronouns, adjectives, etc. Using the code given below, you can perform part-of-speech tagging on the tokens:

Subjectivity Classification:
Data extracted from social media may be both opinionative and non-opinionative, so we need to separate the opinionative data from the non-opinionative data.
“The process of separating opinionative tokens from non-opinionative tokens is known as subjectivity classification.”
Opinionative tokens are known as subjective tokens, while non-opinionative tokens are known as objective tokens. For sentiment analysis we only need the subjective tokens, such as adverbs, nouns, and adjectives. For this purpose, we apply the startswith function to the part-of-speech tags so that the desired tokens are separated out from the full token list. Using the code given below, you can collect the desired tokens into a single list for sentiment analysis or any other purpose:

Sentiment Analysis:
“Sentiment Analysis is the computational study of public moods shared in the form of text”.
Sentiment analysis was first carried out manually in the 1950s, but with organizations' growing interest in it, automated analysis systems were introduced so that larger amounts of data could be processed and people's opinions analyzed. For this purpose, SentiWordNet scores are used to analyze people's opinions. You can perform sentiment analysis using the code given below:

Conclusion:
This blog presented an improved method, based on a four-way rule-based classification scheme, for detecting and classifying the sentiments expressed by users in online reviews and discussion forums. The proposed method comprises the following modules:
- Acquire a set of reviews in which users comment on different products;
- Apply noise-reduction (preprocessing) steps;
- Use subjectivity classification to obtain the subjective tokens;
- Apply sentiment classification of words using a SentiWordNet-based classifier.
Great work Sharjeel! I was going through the post and found some grammar mistakes; besides that, it's a very useful article.
Sharjeel, good work. One thing: how do you analyze Roman Urdu text like “Daraz ki products achi nahi hoti hain” (“Daraz's products are not good”)? Do you keep track of such words?
And where are you getting the dictionary [NLTK] word list?
It would consider such words as fake; in the context of NLP these are called acronyms. I will explain in the next blog, with examples, how we can handle such words and obtain a sentiment score for them. There is an acronym dictionary of about 30,000 words (some positive, some negative, some neutral) that tracks such words, so we can use that dictionary to handle these fake words, or as you say, Roman Urdu words.
These words are not fake, just written in another language, so we must consider them. We could also build a dictionary of such words ourselves; what do you think?
Yes, we can build a dictionary of such words, then compare tokens against it and score them.
Without any dictionary, NLTK can handle such words.