Sentiment analysis refers to analyzing text data and assigning a sentiment to it. For example, suppose we see a movie review on the IMDb website such as “It was a good movie”. We can tell that the viewer liked the movie, so this review can be assigned a positive sentiment. Similarly, if the review was “It was a bad movie”, we can treat it as negative feedback and assign it a negative sentiment. So how do we go about creating a model that takes a review or statement as input and outputs the corresponding sentiment?
One way to go about it is to create a dictionary of words that signal positive feedback and another of words that signal negative feedback, then search the statement to see which of these words are present. This is a very simple example of how sentiment analysis can be done. More on this later, but first, why would we need a sentiment analysis model?
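Before turning to that question, here is a minimal sketch of the word-list approach just described. The two word sets and the function name lexicon_sentiment are hypothetical and chosen only for illustration; a usable lexicon would be far larger, and this sketch makes no attempt to handle negation such as “not good”.

```python
import re

# Tiny illustrative word lists (assumed for this sketch, not a real lexicon).
POSITIVE_WORDS = {"good", "great", "excellent", "loved", "enjoyable"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "hated", "boring"}


def lexicon_sentiment(text):
    """Label text by counting matches against the two word lists."""
    # Lowercase the text and pull out runs of letters as words.
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"  # no matches, or an equal number of each


print(lexicon_sentiment("It was a good movie"))
print(lexicon_sentiment("It was a bad movie"))
```

A statement with no matching words, or with as many positive as negative matches, falls through to "neutral", which is one reason this approach is only a starting point.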
Most companies provide their customers with a platform, such as social media or a company website, to express their feelings about a product or service, so that the company can use this data to improve the quality of its products and services. Since people can post feedback with ease from their mobile phones, this generates a huge amount of data. Going through all of it manually is a labor-intensive process. Hence, sentiment analysis has become an important tool for companies to track and monitor their online feedback and brand value. This is just one example; there are other areas where a tool such as sentiment analysis can be used. Later in this article, we look at how it can help with a challenge in the service industry.
Defining our Input to the model:
Before we discuss the details of the model, we need a way to represent sentences that a model can understand. The first approach is bag of words. In this approach, we first create a vocabulary of all the words in our training data. This vocabulary then forms our feature space, or our X’s for training. Given the number of words in the English language, we could have 10,000 words in our vocabulary. We will see how to reduce this feature space later.
Now that we have our X’s, we need a way to represent a sentence. We can do so by recording the count of each word at its position in the feature space. Coming back to our previous example, “It was a good movie”, we have the following counts, or word frequencies: It: 1, was: 1, a: 1, good: 1, movie: 1. The values for all other words are 0. Each word has a fixed position in the feature space, so the sentence becomes a long vector such as 0, 0, 0, …, 1, 0, …, 1, 0, …, 0, with non-zero counts only at the positions of the words in our current example. This count-based encoding is closely related to one-hot encoding, in which a single word is represented by a vector with a 1 at its own position and 0 everywhere else; the bag-of-words vector of a sentence is the sum of the one-hot vectors of its words.
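To make the encoding concrete, here is a minimal pure-Python sketch. The helper names (tokenize, build_vocabulary, to_count_vector) and the three training sentences are hypothetical and used only for illustration. Splitting on whitespace is a simplifying assumption; a real tokenizer would also deal with punctuation.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return sentence.lower().split()


def build_vocabulary(sentences):
    """Map every distinct training word to a fixed position in the vector."""
    words = sorted({word for s in sentences for word in tokenize(s)})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(sentence, vocabulary):
    """Return the word counts of a sentence, laid out by vocabulary position."""
    counts = Counter(tokenize(sentence))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words never seen in training are ignored
            vector[vocabulary[word]] = count
    return vector


# Made-up training sentences, only to give the vocabulary a few entries.
training_sentences = [
    "It was a good movie",
    "It was a bad movie",
    "The plot was good",
]

vocabulary = build_vocabulary(training_sentences)
vector = to_count_vector("It was a good movie", vocabulary)

print(vocabulary)
print(vector)
```

With a realistic vocabulary of thousands of words, almost every entry of the vector is zero, which is why libraries usually store these vectors in a sparse format.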
We can limit the size of our vocabulary with a few tricks. For instance, we can remove words like a, the, this, and is. These words, called stop words, generally add little meaning to our sentences. To limit the vocabulary further, we can keep only those words whose frequency is above a certain threshold. Depending on the data and the threshold chosen, this can shrink the vocabulary substantially, for example to around one tenth of its initial size. Now we come to encoding our target variable. Since “It was a good movie” is a good review, its target value is 1. Note that we are only trying to classify reviews as good or bad, so using 1 for good and 0 for bad is sufficient for training our model. All of this can be achieved in a few lines of code using Python’s NLTK (Natural Language Toolkit) library and Python’s scikit-learn library.
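As a rough sketch of these steps, the snippet below uses scikit-learn’s CountVectorizer. Several choices here are assumptions made for illustration: the four reviews and their labels are made up, scikit-learn’s built-in English stop-word list is used in place of NLTK’s so that nothing needs to be downloaded, and the frequency threshold is expressed as min_df=2, meaning a word is kept only if it appears in at least two reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews and targets (1 = good review, 0 = bad review).
reviews = [
    "It was a good movie",
    "It was a bad movie",
    "The movie had a good story and a good cast",
    "This is a bad story",
]
labels = [1, 0, 1, 0]

# stop_words="english" drops common words such as "it", "was" and "the";
# min_df=2 drops any word that appears in fewer than two reviews.
vectorizer = CountVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(reviews)

# Remaining words, ordered by their column position in X.
kept_words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(kept_words)
print(X.toarray())
```

Each row of X is the count vector for one review, restricted to the words that survived both filters, and labels holds the matching targets.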
Defining our model:
Machine learning can broadly be divided into two categories: supervised and unsupervised learning. In supervised learning, we provide both inputs and targets as training data; it is generally used for classification or regression tasks. In unsupervised learning, the training data contains only inputs, with no associated outputs; it is used for problems such as clustering, where the algorithm tries to group the input data into clusters. Since we are classifying each review as either good or bad, this is a supervised classification problem, and we will rely on either logistic regression or a random forest classifier.
We will go into the details of logistic regression and random forests in a different article. For the purposes of this article, we pass each review to the model as a Python list (or array) of word counts, built using the encoding described above, and get a one or a zero as output. Let’s assume the model we build classifies a given review correctly around 80% of the time. This level of accuracy is not necessarily difficult to achieve for a task such as this. All of this can be done fairly easily using Python’s scikit-learn library.
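The sketch below shows what this could look like with scikit-learn, using logistic regression. The six training reviews are made up for illustration and are far too few to say anything about accuracy, so the snippet only demonstrates the mechanics of fitting and predicting; the 80% figure above is not something this toy example measures.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = good review, 0 = bad review.
train_reviews = [
    "It was a good movie",
    "A great film with a good story",
    "I loved the acting",
    "It was a bad movie",
    "A terrible film with a bad story",
    "I hated the acting",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline turns each review into word counts, then fits the classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_reviews, train_labels)

# Classify two reviews the model has not seen during training.
new_reviews = ["Good story, loved it", "Bad acting, terrible film"]
predictions = model.predict(new_reviews)

for review, label in zip(new_reviews, predictions):
    print(review, "->", "good" if label == 1 else "bad")
```

Swapping LogisticRegression for sklearn.ensemble.RandomForestClassifier would give the random forest variant without changing the rest of the pipeline. With real data, accuracy should be measured on a held-out test set rather than on the training reviews.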
Problems faced by the service Industry and Solution with Sentiment Analysis:
An insurance company wants to find out whether imparting behavioral training to their service staff has an impact on the overall feedback they receive as reviews.
We start by defining a metric for evaluation. Let our performance indicator be the number of positive reviews divided by the total number of reviews. We assume there are at least 30 reviews for each staff member under consideration. We collect the reviews for all staff before and after the behavioral training, put them through our model, and generate positive and negative outcomes for these reviews. We then compare the percentage of positive and negative reviews before and after the training. This can show a correlation between training and a change in reviews. To make a case for causation, we also need treatment and control groups. The treatment group consists of the staff who receive the training, and the control group consists of the staff who do not. Comparing the change in the treatment group with the change in the control group tells us whether the training has an impact.
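The comparison can be sketched in a few lines of Python. The function names and all of the numbers below are hypothetical: each list stands for the model’s outputs (1 = positive, 0 = negative) on 30 reviews, and the values are invented purely to show the arithmetic. The sketch compares the two changes directly and does not test whether the difference is statistically significant.

```python
def positive_share(predictions):
    """Performance indicator: positive reviews / total reviews."""
    if not predictions:
        raise ValueError("at least one review is required")
    return sum(predictions) / len(predictions)


def training_effect(treat_before, treat_after, control_before, control_after):
    """Change in the treatment group minus change in the control group."""
    treatment_change = positive_share(treat_after) - positive_share(treat_before)
    control_change = positive_share(control_after) - positive_share(control_before)
    return treatment_change - control_change


# Invented model outputs for 30 reviews per group and period.
treat_before = [1] * 18 + [0] * 12    # 60% positive before training
treat_after = [1] * 24 + [0] * 6      # 80% positive after training
control_before = [1] * 18 + [0] * 12  # 60% positive in the earlier period
control_after = [1] * 21 + [0] * 9    # 70% positive in the later period

effect = training_effect(treat_before, treat_after, control_before, control_after)
print(f"Estimated effect of training: {effect:+.1%} positive reviews")
```

In this invented example the treatment group improves by 20 percentage points and the control group by 10, so only the remaining 10 points would be attributed to the training. Because the model itself misclassifies some reviews, these shares are estimates rather than exact values.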
If, like other businesses, you are looking for sentiment analysis solutions, Mindfire Solutions can be your partner of choice. We have deep expertise in AI and ML, and a team of highly skilled and certified software professionals who have developed many custom solutions for our global clients over the years.