Chapter 1


This chapter explains sentiment analysis and its applications.

1.1  Sentiment Analysis

Sentiment analysis is a Natural Language Processing task. It identifies the
sentiments expressed in text and reveals the customer's attitude behind a
comment. 1

Using sentiment analysis, researchers and decision makers can better understand
the customer's point of view and make decisions accordingly. Business analysis
applications can be developed with this technique. 2

Sentiment analysis is a computational methodology for extracting sentiments
from text, speech, or datasets. It can classify emotions, attitudes, opinions,
and subjective impressions by polarity. 3

Applications of sentiment analysis

and Recommendation

Whenever a customer wants to buy a product, he or she first checks the reviews
for that product. From the reviews, the customer learns about the reputation of
the product and the opinions of other customers.


Companies consider the opinions of customers to improve their products. Based
on those opinions, a company focuses on the unsatisfying aspects of a product
and works on them.


Using customer opinions, appropriate internet ads for products can be proposed
automatically.






Chapter 2

Literature Survey

In this chapter, a literature survey of existing systems is presented, with
reference to the performance and approach of current systems. The journal and
conference papers I have read and studied are as follows:

2.2 Random Forest

Random Forest is developed as an ensemble approach that combines the predicted
values of individual decision trees. It uses majority voting: the algorithm
returns the class with the most votes. A single decision tree can sometimes
grow very deep, overfit, and learn irregular patterns. 23

The Random Forest classifier provides two types of randomness: 1) randomness
with respect to data and 2) randomness with respect to features. It uses the
concepts of bagging and bootstrapping.

Features of Random Forest

There are two features of random forest, which are as follows:


There are a number of decision trees in a random forest, and random forest uses
the concept of bootstrapping, so each tree considers a random subset of the
training data. That means each tree trains on different data, which makes the
forest much more robust to noise.


Random forest uses the concept of bagging, in which the outputs of all
classifiers are averaged to produce the final output. Giving a huge amount of
data to a single classifier may not return an appropriate result, but if the
data is divided among a number of classifiers, averaging their results gives a
consistent solution. 5
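
The two ideas above, bootstrapping and combining the trees by vote, can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions: the helper names `bootstrap_sample` and `majority_vote`, the toy
data, and the hard-coded votes are hypothetical, and the training of the
individual trees is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
data = list(range(10))            # stand-in for ten training reviews
# Each tree would be trained on its own bootstrap sample of the data.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Hypothetical predictions of three trees for one review.
votes = ["positive", "negative", "positive"]
final = majority_vote(votes)
```

Each sample has the same size as the original data but may repeat some rows
and omit others, which is why every tree sees slightly different data.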

2.2.1 Decision trees (DT)

Decision tree algorithms are becoming more popular in machine learning because
of their predictive modeling approach. DTs are commonly used because they are
easy to understand and interpret, require little data preparation, handle both
numerical and categorical data, and are comparatively fast.

A DT is represented using nodes and edges. The starting point of a DT is its
root node, which is the node with no incoming edges. All other nodes have
incoming edges. Nodes that also have outgoing edges are called internal nodes.
Nodes that have no outgoing edges are called leaves, terminal nodes, or
decision nodes. 7 8
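
A small sketch may make the node terminology concrete. The `Node` class, the
`node_kinds` helper, and the example tree are hypothetical and serve only as
an illustration; edges are represented implicitly by each node's list of
children.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # outgoing edges

def node_kinds(root):
    """Label every node as 'root', 'internal', or 'leaf'."""
    kinds = {}
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        if is_root:
            kinds[node.name] = "root"        # no incoming edge
        elif node.children:
            kinds[node.name] = "internal"    # incoming and outgoing edges
        else:
            kinds[node.name] = "leaf"        # no outgoing edges
        stack.extend((child, False) for child in node.children)
    return kinds

tree = Node("contains 'good'?", [
    Node("positive"),
    Node("contains 'not'?", [Node("negative"), Node("neutral")]),
])
kinds = node_kinds(tree)
```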

Classification and Regression Trees (CART)

Classification and Regression Trees (CART) was developed by Breiman, Friedman,
Olshen, and Stone in 1984. CART is a machine learning algorithm that uses
binary splits. Its main feature is that it can also be used for regression
problems. CART grows the tree first and then prunes it, which yields smaller
decision trees.


CART can easily handle
both numerical and categorical variables.

The CART algorithm itself identifies the most significant variables and
eliminates nonsignificant ones.

CART can easily handle


CART may produce an unstable decision tree. An insignificant modification of
the learning sample, such as eliminating several observations, can cause
changes in the decision tree: an increase or decrease in tree complexity, or
changes in splitting variables and values.

CART splits only by one
variable. 7 8
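
As an illustration of a binary split on a single variable, the sketch below
searches for the threshold that minimizes the weighted Gini index, one of the
splitting criteria commonly used with CART. The function names and the toy
data are assumptions made for this example, not part of any cited
implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split value <= threshold."""
    best = (None, float("inf"))
    n = len(labels)
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

scores = [1, 2, 3, 8, 9, 10]                        # one numeric variable
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
threshold, impurity = best_binary_split(scores, labels)
```

On this toy data the split `value <= 3` separates the two classes completely,
so the weighted impurity drops to zero.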


The C4.5 decision tree was presented by Quinlan in 1993. It is an extension of
the ID3 decision tree algorithm. The decision tree grows using a depth-first
strategy. C4.5 allows pruning and can handle noisy data, missing values, and
numeric attributes.


C4.5 can handle both
continuous and discrete attributes.

C4.5 goes back through
the tree once it’s been created and attempts to remove branches that do not
help by replacing them with leaf nodes.


C4.5 constructs empty branches. It creates many nodes with zero or
close-to-zero values. These values neither contribute to generating rules nor
help to construct any class for the classification task. Rather, they make the
tree bigger and more complex.

The C4.5 algorithm constructs the tree and grows its branches to perfectly
classify the data. This strategy performs well with noise-free data but
overfits the training examples when the data is noisy.

C4.5 is susceptible to noise. 7
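
C4.5 is generally described as choosing splits by gain ratio, that is,
information gain normalized by the entropy of the split itself. The sketch
below computes it for a discrete attribute; the function names and the toy
data are assumptions for illustration only, and handling of continuous
attributes and missing values is omitted.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(attribute_values, labels):
    """Information gain of the attribute divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0

words = ["good", "good", "bad", "bad"]       # attribute: a word in the review
labels = ["pos", "pos", "neg", "neg"]        # class of each review
ratio = gain_ratio(words, labels)
```

An attribute that separates the classes perfectly, as here, gets a gain ratio
of 1, while an attribute unrelated to the class gets 0.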


The Chi-squared Automatic Interaction Detector (CHAID) was developed by Kass in
1980. This tree is based on statistical hypothesis testing. It is mostly used
as a tool to segment or grow trees. It can be used as a classification tree as
well as a regression tree. It is not a binary-split decision tree; it is a
multiway-splitting tree. It can analyze complex interactions between
variables. 7 9


CHAID output is highly visual and easy to interpret. 9

2.3 SVM

Support Vector Machine (SVM) is a supervised machine learning algorithm used
for classification and regression. It is very popular because of its predictive
performance in both tasks. It is often considered an extension of the
traditional feedforward network. SVM solves classification using a hyperplane
that ensures maximum separation between the classes of data. The data used in
SVM can have more than 2 dimensions, and SVM separates it using an
(n-1)-dimensional hyperplane, where n is the dimension of the data. A kernel
function can be used in SVM to convert a low-dimensional nonlinear problem into
a high-dimensional linear problem. There are many kernels, such as the linear
kernel, the Gaussian kernel (a radial basis function), and so on. 267
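
The following sketch shows the linear and Gaussian kernels as plain functions,
together with the side-of-hyperplane test for the linear case. It is a minimal
illustration: the function names, the `gamma` default, and the example weights
are assumptions, and no training (finding the maximum-margin hyperplane) is
performed.

```python
import math

def linear_kernel(x, z):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(x, z))

def gaussian_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); gamma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    return 1 if linear_kernel(w, x) + b >= 0 else -1

# Hypothetical hyperplane in 2 dimensions: the line x1 - x2 = 0.
w, b = [1.0, -1.0], 0.0
side_a = decision(w, b, [2.0, 1.0])
side_b = decision(w, b, [1.0, 2.0])
```

For 2-dimensional data the hyperplane is a line (1 dimension), which matches
the (n-1) rule stated above.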


Bhavitha, B. K.,
Anisha P. Rodrigues, and Niranjan N. Chiplunkar. “Comparative study of
machine learning techniques in sentimental analysis.” Inventive Communication and
Computational Technologies (ICICCT), 2017 International Conference on. IEEE, 2017.

This paper discusses sentiment classification with machine learning methods in
detail. It shows that the Random Forest classifier achieves greater accuracy
and performance, but requires high processing power and training time. Support
Vector Machine provides excellent accuracy compared to many other classifiers.
Lexicon-based approaches are labor-intensive because they require manual work
on the documents. Maximum Entropy also performs well but suffers from
overfitting.


Wan, Yun, and
Qigang Gao. “An ensemble sentiment classification system of twitter data
for airline services analysis.” Data Mining Workshop (ICDMW), 2015 IEEE International Conference on. IEEE, 2015.

In this paper, the authors discuss a business analysis application using
sentiment classification. In feature selection, features can be unigrams,
bigrams, trigrams, and more. The sentiment analysis is implemented on both a
three-class dataset and a two-class dataset. Accuracy is evaluated using
recall, precision, F-measure, and error rate.
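
For reference, the four evaluation measures named above can be computed as in
the sketch below for the two-class case. The function name `binary_metrics`
and the toy labels are hypothetical; the formulas are the standard
definitions, and the paper's own evaluation code is not reproduced here.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Precision, recall, F-measure, and error rate for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    error_rate = sum(t != p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "error_rate": error_rate}

y_true = ["pos", "pos", "neg", "neg"]   # actual sentiment
y_pred = ["pos", "neg", "neg", "pos"]   # hypothetical classifier output
metrics = binary_metrics(y_true, y_pred)
```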


Yashaswini, and S. K. Padma. “Sentiment Analysis Using Random Forest
Ensemble for Mobile Product Reviews in Kannada.” Advance Computing Conference
(IACC), 2017 IEEE 7th International. IEEE, 2017.

This paper discusses sentiment analysis of the Kannada language. Generally, the
accuracy of sentiment analysis depends on preprocessing and sentiment
extraction. The accuracy of the classifier also depends on feature selection
and the efficiency of the classification algorithm.


Collomb, Anaïs,
et al. “A study and comparison of sentiment analysis methods for
reputation evaluation.” Rapport de recherche RR-LIRIS-2014-002 (2014).

This paper gives an idea of the applications of sentiment analysis. Sentiment
analysis can be classified from different points of view, such as technique,
text view, and rating level. The types of solutions for sentiment analysis
covered are: Lexical Contextual Sentence Structure, Combining Lexicon and
Learning based Approaches, Interdependent Latent Dirichlet Allocation, A Joint
Model of Feature Mining and Sentiment Analysis, Opinion Digger, and Latent
Aspect Rating Analysis on Review Text Data.


Parmar, Hitesh,
Sanjay Bhanderi, and Glory Shah. “Sentiment Mining of Movie Reviews using
Random Forest with Tuned Hyperparameters.” (2014).

This paper discusses different feature selection models for text
classification and sentiment classification. The Random Forest classifier
provides two types of randomness: the first is with respect to data and the
second is with respect to features. Random Forest is considered to be an
accurate and robust classifier.


Zhao, Yan, Suyu
Dong, and Leixiao Li. “Sentiment analysis on news comments based on
supervised learning method.” (2014).

In this paper, the authors adopted 3 feature selection methods, 4 feature
representation methods, and 5 machine learning algorithms for sentiment
analysis of Chinese news comments.


Kuzey, Cemil,
Ali Uyar, and Dursun Delen. “An Investigation of the Factors Influencing
Cost System Functionality Using Decision Trees, Support Vector Machines and
Logistic Regression.” (2018).

This paper discusses C5, CART, CHAID, and SVM. SVM had the second highest
accuracy level. It also discusses the main reasons for DT popularity, which
include (1) intuitiveness, (2) expressiveness, (3) transparency, (4)
efficiency, (5) robustness, (6) accuracy, and (7) deploy

Singh, Sonia,
and Priyanka Gupta. “Comparative study ID3, cart and C4. 5 decision tree
algorithm: a survey.” (2014).

This is a survey paper on the ID3, CART, and C4.5 decision trees. It explains
their advantages and disadvantages in detail. Decision tree algorithms require
a splitting criterion for splitting a node to form a tree: Entropy, Gini Index,
Classification Error, Information Gain, Gain Ratio, or Twoing Criteria. The
splitting phase continues until a stopping criterion is triggered.


Bhargava, Rupal,
and Yashvardhan Sharma. “MSATS: Multilingual sentiment analysis via text
summarization.” Cloud Computing, Data Science & Engineering-Confluence, 2017 7th
International Conference on. IEEE, 2017.

This paper gives a methodology for multilingual sentiment analysis. It converts
different languages to a standard language, i.e., English, and then performs
sentiment analysis using text summarization and a hybrid approach, i.e.,
lexicon and machine learning.

Rosenthal, Sara,
Noura Farra, and Preslav Nakov. “SemEval-2017 task 4: Sentiment analysis
in Twitter.” Proceedings of the 11th International Workshop on Semantic Evaluation
(SemEval-2017). 2017.

In this paper, multiclass classification is used for multilingual sentiment
analysis in English and Arabic. It uses Twitter data, including user profiles
and reviews, and performs sentiment analysis using different combinations of
topics and reviews.


Cerňak, Miloš.
“A comparison of decision tree classifiers for automatic diagnosis of
speech recognition errors.” Computing and Informatics 29.3 (2012): 489-501.

This paper reports on the performance of the CART and C4.5 decision trees. The
three most popular CART styles use 1) the Gini index, 2) information gain, and
3) twoing. The lower the misclassification rate, the better the classifier
(predictor) of the error made. CART with information gain has the lowest
misclassification rate, so it is the best CART DT style.


Peng, Haiyun,
Erik Cambria, and Amir Hussain. “A review of sentiment analysis research
in chinese language.” Cognitive Computation (2017): 1-13.

This is a review paper on two main approaches to sentiment analysis research:
the monolingual approach and the bilingual approach. It covers corpus and
lexicon approaches. Monolingual approaches: the machine learning based approach
treats sentiment classification as a topic-based categorization problem. The
knowledge-based method uses a sentiment lexicon, which consists of the
sentiment polarity of each word, to label the sentiments of
words. The machine learning approach does not need predefined semantic rules
but requires a labelled dataset. Multilingual approaches: 1) machine
translation, which is error-prone and can reduce accuracy; 2) structural
correspondence learning (SCL), which links two languages at the feature level.


Pham, Binh Thai,
Khabat Khosravi, and Indra Prakash. “Application and comparison of
decision tree-based machine learning methods in landside susceptibility
assessment at Pauri Garhwal area, Uttarakhand, India.” Environmental Processes 4.3 (2017):

In this paper, the RF model showed the best performance, followed by the LMT,
BFDT, and CART models. The RF method has many advantages: (i) it does not need
assumptions on the distribution of explicative factors; (ii) it is capable of
calculating interactions between factors; (iii) the random predictor selection
used in RF holds low bias; and (iv) it is able to deal with unbalanced data and

Shivaraju. Classification of Sentiment
Analysis on Tweets using Machine Learning Techniques. Diss. 2015.

This paper gives information about POS tagging. POS tagging helps to find the
part of speech of each word. It is done using a Hidden Markov Model (HMM),
which is used to tokenize and tag the words, and further to name elements.


Dixit, Apurva,
et al. “Emotion Detection Using Decision Tree.” Development 4.2 (2017).

This paper gives an idea about methodology. Adjectives, adverbs, and verbs
prove to be the most useful in emotion detection. Useful content from a tweet
is extracted using Natural Language Processing, and the exact emotion is then
classified using a machine learning technique. In the first phase, the data is
cleaned of regex patterns, hashtags, non-letter characters, usernames, URLs,
and emails. The second phase is learning or training, where a model is built to
classify opinion data. In this phase, the text is weighted using the Term
Frequency-Inverse Document Frequency (TF-IDF) algorithm. The pre-processing
steps carried out include tokenization, filtering, lemmatization, and stemming.
POS tagging is then done on the extracted data, from which all special symbols
and hashtags are removed. Parts of speech are also known as lexical categories
or word classes.
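
To illustrate the TF-IDF weighting mentioned above, here is a minimal sketch
using one common, unsmoothed variant: term frequency times the logarithm of
(number of documents / number of documents containing the term). The exact
variant used in the paper is not stated, so this formula, the function name
`tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["good", "phone"], ["bad", "phone"]]   # already tokenized tweets
weights = tf_idf(docs)
```

A term that occurs in every document, such as "phone" here, receives a weight
of zero, while terms that distinguish a document receive a positive weight.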


Soni, Rishabh,
and K. James Mathai. “Improved Twitter Sentiment Prediction through
Cluster-then-Predict Model.” arXiv preprint arXiv:1509.02437 (2015).

This paper explains the procedure for collecting Twitter data. The Python
library Tweepy was used to access Twitter's Streaming API. It provides the
functionality to collect streaming Twitter data. The incoming tweets were
stored in CSV (Comma Separated Values) file format in real time using the
functions of Python's CSV library.


Sannikumar, et al. “Sentiment Analysis: Comparative Analysis of
Multilingual Sentiment and Opinion Classification Techniques.” World Academy of Science,
Engineering and Technology, International Journal of Computer, Electrical,
Automation, Control and Information Engineering 11.6 (2017): 565-571.

This paper gives information about machine translation (MT), the process of
translating text from one language to another. Rule-based, statistical,
example-based, hybrid, and neural machine translation are some MT methods.


Hssina, Badr, et
al. “A comparative study of decision tree ID3 and C4. 5.” International Journal of
Advanced Computer Science and Applications 4.2 (2014): 13-19.

This paper gives a comparative study of ID3, C4.5, C5.0, and CART. One
limitation of ID3 is that it is overly sensitive to features with large numbers
of values. The accuracy of C4.5 is higher than that of ID3, and the execution
time of ID3 is longer than that of C4.5.


Vimalkumar B., and Bhumika M. Jadav. “Analysis of Various Sentiment
Classification Techniques.” Analysis 140.3 (2016).

This paper aims to find approaches that generate output with good accuracy. It
suggests that careful feature selection combined with existing classification
approaches can give better accuracy. SVM gives good accuracy and can be
improved by modifying
Chapter 3

Problem Statement

The objective of sentiment analysis is to identify the polarity of reviews or
opinions given by customers. Considering Indian regional languages like Hindi
and Marathi will help to extract information from these languages. Multi-class
classification in sentiment analysis gives a better idea of the polarity of
reviews. Feature selection and machine learning algorithms like Random Forest
can improve the multi-class classification of sentiment analysis.
Chapter 4

Problem Definition

Chapter 5

Proposed Work

The proposed system predicts the polarity of opinions given by users. The
system will perform sentiment analysis using machine learning algorithms like
Random Forest.

5.1 Proposed system:

The proposed system will work as shown in the block diagram. The stepwise flow
of the system is given as follows.

Figure 5.1: Block diagram of Proposed System


Step 1: Language

The collected data is multilingual, so for further operations it must first be
translated into a standard language, i.e. English.

Step 2: Preprocessing

The process of cleaning and preparing data as input to the classifier is known
as preprocessing. It includes many subtasks, such as tokenization, stop word
and punctuation removal, stemming, etc.
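
The preprocessing subtasks can be sketched as a single function. This is a
rough illustration under stated assumptions: the stop word list is a tiny
hypothetical sample, the tokenizer keeps only ASCII letters and digits, and
`naive_stem` is a toy suffix stripper rather than a real stemming algorithm.

```python
import re

# Illustrative stop word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "this"}

def naive_stem(token):
    """Strip a few common English suffixes (toy stand-in for a stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop punctuation and stop words, then stem."""
    # Matching only letters and digits discards punctuation during tokenization.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("This phone is amazing, and the camera works!")
# tokens == ["phone", "amaz", "camera", "work"]
```

The resulting token list is what would be passed on to feature extraction and
then to the classifier in Step 3.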

Step 3: Apply
Machine Learning algorithm

Apply machine learning algorithms to perform classification on the dataset. The
Random Forest algorithm and SVM are used for this.

Step 4: Polarity

By applying Random Forest and SVM to the dataset, the proposed system gives the
polarity of sentiments.

Chapter 6



