Homework 4

Sentiment Analysis

In this assignment, you will be performing sentiment analysis on movie reviews by classifying the reviews as positive or negative. Sentiment analysis is used to extract people's opinions on a variety of topics at many levels of granularity. Our goal for this assignment, however, is to look at an entire movie review and classify it as positive or negative.

You and Naïve Bayes

We will be using Naïve Bayes, following the pseudocode in Chapter 6 of Jurafsky and Martin, with Laplace smoothing. The classifier you build will use words as features, add the log probability scores for each token, and make binary decisions between positive and negative sentiment. You will also implement the binary version of Naïve Bayes with boolean features and explore the impact of stop-word filtering. Stop-word filtering helps to improve performance by removing common words like "this", "the", "a", "of", and "it" from your train and test sets. A list of stop words is included in the starter code at data/english.stop.
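As a concrete illustration, filtering might look something like the sketch below; the stop-word file path comes from the starter code, but the function names here are illustrative, not the starter code's actual API.

    # Load the stop-word list shipped with the starter code (one word per line)
    def load_stop_words(path="data/english.stop"):
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    # Keep only tokens that are not in the stop list
    def filter_stop_words(words, stop_words):
        return [w for w in words if w not in stop_words]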

Finally, you will implement your own features and/or heuristics to improve performance. To help you understand some key concepts, algorithms, and strategies for this assignment, we have included the paper from Pang and Lee.

What is Naïve Bayes?

Naïve Bayes is a classification technique that uses Bayes' Theorem with an underlying assumption of independence among predictors. Naïve Bayes classifiers assume that the presence of one feature is unrelated to the presence of any other feature. For example, a fruit might be classified as an orange if it is orange, round, and about 4 inches in diameter, regardless of whether any of these features depend on one another. Bayes' Theorem is the following:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where A is the class and B is the predictor. There are many examples online; I will leave those outside of this document.
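Still, to make the formula concrete, here is a quick worked example with invented, purely illustrative numbers: suppose the two classes are equally likely, $P(\text{pos}) = P(\text{neg}) = 0.5$, and that $P(\text{great} \mid \text{pos}) = 0.10$ while $P(\text{great} \mid \text{neg}) = 0.02$. Then

$$P(\text{pos} \mid \text{great}) = \frac{0.10 \times 0.5}{0.10 \times 0.5 + 0.02 \times 0.5} \approx 0.83$$

so a review containing the word great is far more likely to be positive.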

There are several pros and cons to Naïve Bayes.

Pros

  • Easy and fast to predict the class of test data. Performs well with multi-class prediction.

  • If the assumption of independence holds, a Naïve Bayes classifier performs better than comparable models such as logistic regression, and needs less training data.

Cons

  • If a categorical variable has a category that was not observed in training, the model will assign it zero (or near-zero) probability and will be unable to make a prediction. To solve this "zero frequency" problem, we can use a smoothing technique such as Laplace estimation.

  • Naïve Bayes is known to be a bad probability estimator, so its probability outputs should not be taken too seriously.

  • The assumption of independence rarely holds within a real data set.

  • Performs poorly on a set where more false positives are identified than true positives.

So, naturally, how can we improve Naïve Bayes? That will be part of the assignment.

Details and Tasks

With the IMDB data set from the original Pang and Lee paper (included in the starter code), train a Naïve Bayes classifier. The code is already set up to perform 10-fold cross-validation training and testing. Cross-validation simply splits the data into several sections (folds) and then trains and tests the classifier repeatedly. Specifically, this means training on 9 folds and testing on the held-out fold; the accuracy is then averaged across the 10 iterations. This prevents bias towards a particular partition.

When using a movie review for training, use the fact that the review is positive or negative to compute the correct counts; when using the same review for testing, use its label only to compute accuracy. The data comes with cross-validation sections that are defined in the file data/poldata.README.2.0.
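The starter code already implements the cross-validation mechanics, so the following is only a sketch of the idea, not the starter code's actual crossValidationSplits() implementation; the train and test callables are assumed for illustration.

    # A sketch of 10-fold cross-validation: split the documents into 10
    # folds, train on 9, test on the held-out fold, and average accuracy.
    def ten_fold_accuracy(documents, train, test, k=10):
        folds = [documents[i::k] for i in range(k)]  # simple round-robin split
        accuracies = []
        for held_out in range(k):
            train_docs = [doc for i, fold in enumerate(folds)
                          if i != held_out for doc in fold]
            model = train(train_docs)                        # fit on the other 9 folds
            accuracies.append(test(model, folds[held_out]))  # accuracy on held-out fold
        return sum(accuracies) / k                           # average over the 10 runs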

Task 1: Implement the Classifier

Implement a Naïve Bayes classifier training and test method and evaluate them using the provided cross-validation mechanism. While Wikipedia (https://en.wikipedia.org/wiki/Naive_Bayes_classifier) has the critical information for understanding Naïve Bayes, there are also great tutorials available from Jason Brownlee (https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/).

Splitting Data For 10-fold Cross Validation

The data has been split for you across all documents, providing 10 splits.

Adding Documents

With the data split, we need to perform a series of operations on each split (or set of data). Notice that in the train method, for every document, we call the addDocument function, providing a classification (pos, neg) and a set of words. This function's task is to add the information the document provides to our Naïve Bayes model. Information worth tracking in dictionary counts (similar to the unigram counts in assignment 1) might be: the frequency of each class, the frequency of each word, the frequency of each word per class, and the number of words per class. A similar tally over the unique words in each document (for binary Naïve Bayes) is also potentially worthwhile.
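For instance, the bookkeeping might look like the sketch below; the attribute and method names are illustrative choices, not prescribed by the starter code.

    from collections import defaultdict

    # Illustrative bookkeeping for a Naive Bayes model: per-class document
    # counts, per-class word frequencies, per-class token totals, and the
    # vocabulary of all words seen in training.
    class Counts:
        def __init__(self):
            self.class_doc_count = defaultdict(int)  # documents per class
            self.word_count = defaultdict(lambda: defaultdict(int))  # word freq per class
            self.total_words = defaultdict(int)      # total tokens per class
            self.vocabulary = set()                  # all distinct words

        def add_document(self, klass, words):
            self.class_doc_count[klass] += 1
            for w in words:
                self.word_count[klass][w] += 1
                self.total_words[klass] += 1
                self.vocabulary.add(w)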

Adding the Classifier

In order to predict the sentiment (classification), we need to calculate the probability of each sentiment. This is done with the following formula, where $c_j$ represents the particular classification:

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{words}} P(w_i \mid c_j)$$

$P(c_j)$ represents the probability of that classification among all possible classifications (pos, neg). This is simply the fraction of documents classified with that sentiment! So the formal equation follows:

$$P(c_j) = \frac{N_{c_j}}{N_{doc}}$$

This gives us the first part of the classification equation above. Now let's look at the product over the individual words of that sentiment classification.

The probability of a word given a sentiment can be formalized by this equation:

$$P(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$

where $count(w_i, c)$ is the frequency of the word $w_i$ in documents of sentiment $c$, normalized by the total count of all words in that sentiment. This should remind you of homework 1 and the unigram counts over the corpus!

Note that it is easiest to prevent numerical underflow by accumulating the log of each of these probabilities as they are seen, rather than multiplying the raw probabilities.
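Putting the prior, the smoothed likelihoods, and the log-space trick together, classification might look like this sketch; it assumes the illustrative Counts object from the earlier sketch and add-one (Laplace) smoothing.

    import math

    # Score each class as log P(c) plus the sum of log P(w|c) over the
    # document's tokens, with add-one (Laplace) smoothing, and return the
    # highest-scoring class.
    def classify(counts, words):
        total_docs = sum(counts.class_doc_count.values())
        vocab_size = len(counts.vocabulary)
        best_class, best_score = None, float("-inf")
        for c in counts.class_doc_count:
            score = math.log(counts.class_doc_count[c] / total_docs)  # log P(c)
            denom = counts.total_words[c] + vocab_size  # Laplace-smoothed denominator
            for w in words:
                score += math.log((counts.word_count[c][w] + 1) / denom)  # log P(w|c)
            if score > best_score:
                best_class, best_score = c, score
        return best_class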

Task 2: Evaluate Model

Now evaluate your model again with the stop words removed. The stop-word machinery has been provided and can be run with the following command:

python NaiveBayes.py -f data/imdb

How does this approach impact the average accuracy for the given data set?

Task 3: Binary Version of Naïve Bayes

Now implement a binary version of the Naïve Bayes classifier that uses the presence or absence of a feature rather than the feature counts. Let's look at a formal equation for binary Naïve Bayes:

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{k \in \text{words}} P(w_k \mid c_j)$$

where $P(w_k \mid c_j)$ is given by the binary event model,

$$P(w_k \mid c_j) = \begin{cases} \dfrac{\text{doc\_count}(w_k, c)}{\sum_{w \in V} \text{doc\_count}(w, c)} & \text{when } \text{doc\_count}(w_k, c) > 0 \\ 0 & \text{when } \text{doc\_count}(w_k, c) = 0 \end{cases}$$

where $w_k$ represents each word in the vocabulary. To clarify, the count in this case is the document frequency as opposed to the actual word frequency. Do not confuse this with Bernoulli Naïve Bayes, which penalizes the probability when a vocabulary word for a sentiment does not appear in the document; that variant might be worthwhile to implement when creating your best model.
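In code, the only change from the multinomial version is deduplicating each document's words before counting, so that the counts become document frequencies; again, this reuses the illustrative Counts object from the earlier sketch and is not the starter code's API.

    # Binary Naive Bayes bookkeeping: each document contributes each word
    # at most once, so word_count becomes a document frequency.
    def add_document_binary(counts, klass, words):
        counts.class_doc_count[klass] += 1
        for w in set(words):                   # deduplicate within the document
            counts.word_count[klass][w] += 1   # document frequency, not token count
            counts.total_words[klass] += 1
            counts.vocabulary.add(w)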

Task 4: Can Our Model Be Improved?

Experiment with different heuristics and features to increase the classifier's accuracy. Here are some pointers that should help you. We also recommend looking at the data and reading the suggested paper from Pang and Lee.

  • Adding more/other features: You have used the unigram words in each review as features in your model. What other features can you use that will boost your classifier's performance?

  • Feature selection: Feature selection is a tool that can be used to remove possibly-redundant features. Stop-word removal is a simplistic form of feature selection, and there are more sophisticated ones in the reading. See if any of these feature-selection techniques can be used here to give better cross-validation scores. One strategy, removing correlated features, is a good place to start, since highly correlated features are effectively counted twice, which can inflate their importance in the model.

  • Laplace smoothing: Laplace smoothing is a must for Naïve Bayes, and you should already be doing it in your model. However, Bayes classifiers have limited options for parameter tuning, so it is highly recommended to focus on pre-processing of the data and on feature selection rather than on specific parameter tuning. Are there other smoothing techniques that could help?


  • Weighting features: In the multinomial Naïve Bayes classifier, each feature was weighted according to its frequency; the binary version weighted features by presence or absence. Perhaps there are better ways to weight the features?

  • Classifier combination techniques: We would not recommend classifier combination techniques like bagging, boosting, and ensembling, because their purpose is to reduce variance, and Naïve Bayes has no variance to minimize.

This link has some great insights on improving Naïve Bayes performance: https://machinelearningmastery.com/better-naive-bayes/

There are also some common techniques to improve the performance of Naïve Bayes. Here are some example sentences where improvements could be made.

  • Many people thought this movie was very good, but I found it bad. This sentence has two strong and opposing sentiment words (good, bad), but it can be inferred that the sentiment of the sentence is determined by the word bad and not good. How can you weigh your features for 'good' and 'bad' differently to reflect this sentence's sentiment?

  • Paige's acting was neither realistic nor inspired. Currently, the feature set comprises individual words in a bag-of-words approach. Because of this, the words inspired and realistic are considered separately despite surrounding words like neither. How can your model take this word order into account? (One common idea, negation marking, is sketched after this list.)

  • The sites are chosen carefully; it is no wonder this comes across as a very beautiful movie. How can you change the weight of a feature like beautiful, since it is expressed more strongly here than in 'it is a beautiful film'? How does very impact the next word?
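One classic trick for the word-order examples above, described in the sentiment literature around Pang and Lee, is negation marking: prefix every token following a negation word (up to the next punctuation mark) so that, for example, good and NOT_good become distinct features. The word list and prefix below are illustrative choices, not part of the assignment.

    # Mark tokens that fall under the scope of a negation word so the
    # classifier can learn separate weights for, e.g., good vs. NOT_good.
    NEGATIONS = {"not", "no", "never", "neither", "nor", "n't"}
    PUNCTUATION = {".", ",", ";", ":", "!", "?"}

    def mark_negation(tokens):
        marked, negating = [], False
        for tok in tokens:
            if tok in PUNCTUATION:
                negating = False       # negation scope ends at punctuation
                marked.append(tok)
            elif negating:
                marked.append("NOT_" + tok)
            else:
                marked.append(tok)
                if tok.lower() in NEGATIONS:
                    negating = True    # start marking subsequent tokens
        return marked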

Some notes on generalization: it is important to come up with simple features that maintain generality and are not fixed to specific examples. Try to choose features that model both your training data and held-out/unseen data, as opposed to just the examples in the training set. Optimizations targeting only the provided data set will not do well on the held-out set.


  • Your Implementation

To ensure that your code functions properly from the command line and with our grader, you should limit your changes to the addDocument() and classify() prototypes in NaiveBayes.py. Feel free to add other elements invoked by these methods. Note that main() is not executed, so you cannot rely on anything added there.

The grader calls the following functions: crossValidationSplits(self, trainDir), trainSplit(self, trainDir), train(self, split), and test(self, split). It also relies on the definitions of the classes TrainSplit and Document with the associated flags. Note that for your best model, you should have the appropriate flags turned on, including, for example, stopWordsFilter if your best model uses it.

  • Evaluation

Your classifier will be evaluated on two datasets: the included IMDB set and a second, held-out test set. This will be done for Naïve Bayes with and without stop-word filtering, for binary Naïve Bayes, and for your best model.

Minimum Requirements

  1. Your best model should achieve higher accuracy than your basic Naïve Bayes classifier and the binary version.

  2. It will need to achieve at least 83% average accuracy with 10-fold cross-validation on the IMDB dataset.

  3. It will need to achieve higher accuracy than a TA model on the held-out dataset.

Competitive Task

Try to improve your classifier as much as you can! We will test all of your classifiers with a held-out dataset. The top 10% of classifiers by prediction accuracy will be awarded 5 bonus points for this assignment!

  • Running Your Classifier

Use the command line to run this code. You may have issues in IDEs like PyCharm, because NaiveBayes.py must be able to find the data directory in the default location. Your code will run with the command

  • python NaiveBayes.py data/imdb

Flags should be added as follows: flag -f adds stop-word filtering, flag -b uses your binarized model, and flag -m invokes the best model. These flags should be used one at a time; your assignment will be graded similarly, using one flag at a time.

If you wish to use your classifier on other data sets, you can specify a second directory to test on an entirely held-out second set! This can be done with the following command.

  • python NaiveBayes.py -fbm <train directory> <test directory>

Submitting to Canvas

Please archive the following in a .tgz (other formats not accepted) labeled hw4.tgz.

  • NaiveBayes.py

Don’t forget to cite your sources as well!

Errata

Please email Kevin Jesse for any errata in this document or code.
