Exactly How Helpful Are Reviews?

Purpose of Investigation

Amazon.com is a place where many people shop for a wide range of products. Shoppers often decide whether to make a purchase (online and offline) based on the reviews written about the products and services in which they are interested. Amazon solicits feedback on whether each review was helpful; for Amazon this is a means of encouraging others to purchase the same product, weeding out bad and/or fraudulent products, and prioritizing which reviews are presented to prospective customers.

 

Given that grading/labeling reviews is a time-intensive endeavor, requiring a significant number of purchasers to review items and then other individuals to rate those reviews as helpful or not, we decided to evaluate the feasibility of an algorithmic approach to labeling reviews based on their predicted (prospective) helpfulness to others. The ultimate goal of this analysis is to capitalize on a large and engaged customer base, prioritize the reviews most likely to factor into a purchase decision for a product or service, and provide relevant information for customers.

 

Holistically, this information creates a strong value-add for doing business with Amazon from a third-party seller's point of view, as potential customers see both aggregate ratings (1-5 stars) and customer stories (leveraging brand loyalty and word of mouth). Amazon itself also benefits, as this provides information on which brands and products will drive the most sales, minimize returns, and generate the most traffic. Of note, this kind of information more than likely informed the product choices and manufacturing of Amazon's AmazonBasics line, which competes with products already offered by other sellers within Amazon's marketplace but adds to Amazon's bottom line.

 

 

Analysis Questions

Can a prediction model generate the same classification of helpfulness as dozens or hundreds of prospective buyers? This would make it possible to determine which reviews will be helpful before they are seen by potential Amazon customers, and to prioritize reviews that may positively boost sales, flag defects or issues with a product, or surface customer service opportunities for Amazon or its third-party sellers.

 

Can a prediction model of review helpfulness be applied to other businesses and their products, to compare how their product reviews resemble those from Amazon? Can a predictive model trained on electronics reviews generalize well to products in other categories?

           

Can a modestly accurate classification algorithm provide keywords that can guide or add value to the current human review process from existing Amazon shoppers?

 

 

Scope

An analysis was conducted to determine the feasibility of predicting the relative helpfulness (helpful/unhelpful) ratings of online product reviews based on the review text entered by the user. Jupyter Notebooks and Rodeo were both used during the course of this project, with code executed in Python and pyspark (Spark 2.4).

 

 

Data Preparation

The Amazon Electronics reviews were first loaded into both Jupyter Notebooks and Rodeo. Several packages were used to take an initial look at the data and prepare for further analysis, among them numpy, pandas, seaborn, json, calendar, and datetime, which facilitated the analysis and processing of the text data.

 

Two functions were created to read the JSON file of the Amazon reviews: the first parsed the gzipped file line by line, and the second built a dataframe from the parsed records. The resulting pandas dataframe had 1,689,188 rows and 9 columns. Checking for missing values and NaN with isna and isnull returned 24,730 missing entries, representing only 1.46% of the data, so these rows were removed from the dataset. This left 1,664,458 rows and 9 columns.
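A loader along these lines is a minimal sketch of those two functions, assuming the dataset's gzipped JSON-lines format from the acquisition page; the function names and the use of `json.loads` are illustrative choices, not the project's exact code.

```python
import gzip
import json

import pandas as pd


def parse(path):
    """Yield one review dict per line of a gzipped JSON-lines file."""
    with gzip.open(path, "rb") as g:
        for line in g:
            yield json.loads(line)


def get_df(path):
    """Build a pandas DataFrame from the parsed review records."""
    return pd.DataFrame.from_records(parse(path))


# Sketch of the loading and cleaning steps described above:
# df = get_df("reviews_Electronics.json.gz")  # hypothetical filename
# df = df.dropna()  # drops the ~1.46% of rows with missing values
```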

 

Binning was also done on the Amazon dataset to incorporate the helpfulness ratio (helpful votes / total votes) of the reviews. Bins were created to categorize the helpfulness of the reviews regardless of whether the reviews were positive or negative, yielding three bins: 0, 1, and 2. The image below shows the minimum and maximum helpfulness ratio of each bin.

The reviews in each bin were then counted to show how many fell into each level of helpfulness.
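The binning step might be sketched as below. The cut points `low` and `high` are placeholders here, since the actual bin boundaries appear only in the image, and the handling of reviews with zero votes is an assumption.

```python
def helpfulness_bin(helpful_votes, total_votes, low=0.33, high=0.66):
    """Map a review's helpfulness ratio (helpful/total votes) to bin 0, 1, or 2.

    `low` and `high` are placeholder cut points; the real boundaries
    come from the bin min/max figure in the report.
    """
    if total_votes == 0:
        return 0  # assumption: reviews with no votes fall in the lowest bin
    ratio = helpful_votes / total_votes
    if ratio <= low:
        return 0
    if ratio <= high:
        return 1
    return 2
```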

Due to the size of the data and the significant skew of the classes, a stratified sample was taken from the original dataset for further analysis, consisting of 330,308 data points.
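The stratified sampling mentioned in the scope comparison below could be sketched in pandas along these lines; `n_per_class` is a hypothetical parameter, as the report does not state the per-bin counts.

```python
import pandas as pd


def stratified_sample(df, label_col, n_per_class, seed=42):
    """Draw up to n_per_class rows from each helpfulness bin,
    reducing the skew between classes before modeling."""
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(min(len(g), n_per_class), random_state=seed))
          .reset_index(drop=True)
    )
```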

Training and testing data sets were then created using spark.read.parquet.

The Amazon reviews then went through another round of tokenizing, removal of stop words, and hashing.

Initial Data Exploration

A word cloud was created to see the overall word frequency before more preparation was performed. This was executed to have an initial view of the most popular words to be able to set a base for comparison of further word clouds and prediction models.

From this initial word cloud we can see that terms such as “works well”, “works great”, “hard drive”, “sound quality”, and “easy use” seem fairly positive and likely relate to computers or sound systems. Further analysis would be needed to determine exactly which product each review describes, as that is not included in the retrieved data (only the proprietary ASIN was provided).

 

The data was cleaned further following this initial word cloud. All words were made lowercase, and punctuation and stop words were removed. The top 50 most common words were evaluated and removed, as they would not be significant in further analysis; the same was done for rare words. Misspelled words were corrected and tokenization was applied. Bigrams were also created using TextBlob.
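The core of that cleaning can be sketched in plain Python as below. The stop-word list here is a tiny illustrative subset (the project used pyspark's StopWordsRemover list), and `zip` stands in for TextBlob's `ngrams(2)` to show what a bigram is.

```python
import re

# tiny illustrative stop-word list; the real analysis used the full
# pyspark StopWordsRemover vocabulary
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "to", "and", "i"}


def clean_tokens(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def bigrams(tokens):
    """Adjacent word pairs, equivalent to TextBlob's ngrams(2) output."""
    return list(zip(tokens, tokens[1:]))
```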

 

Next, the term frequency was counted to see the top remaining words in the dataset.
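With the collections package already in the dependency list, that count might look like:

```python
from collections import Counter


def top_terms(token_lists, n=50):
    """Count term frequency across all reviews; return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)
```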

 

Several other columns were added to the dataframe, extracting information from the timestamp of each Amazon review: the year, month, and day on which the review was created. These were then plotted to examine the frequency distribution of when users wrote their reviews.
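Assuming the epoch-seconds timestamp field in the review JSON (named `unixReviewTime` in the public dataset), those columns can be derived with pandas along these lines:

```python
import pandas as pd


def add_time_columns(df, ts_col="unixReviewTime"):
    """Derive year, month name, and weekday name from the Unix timestamp."""
    dt = pd.to_datetime(df[ts_col], unit="s", utc=True)
    out = df.copy()
    out["year"] = dt.dt.year
    out["month"] = dt.dt.month_name()
    out["day"] = dt.dt.day_name()
    return out
```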

The visualizations above indicate that 2013 must have seen very popular electronic products offered online, or a surge of people purchasing on Amazon instead of at traditional retailers. December and January are the two months in which the most reviews were written, suggesting that people received many electronic items around the holidays, probably as presents, and wrote reviews after setting them up and using them. Looking at the number of reviews by day, it is intuitive that many people with Amazon Prime order items on Friday, have them delivered over the weekend, and write their reviews in the earlier part of the week.

 

A sentiment analysis was done to give a value to the reviews. Using the overall column of the Amazon dataset containing the ratings, new columns were added to the dataframe to indicate whether the rating was good, neutral, or bad. The ratings were originally on a scale of 1-5; ratings of 4 and 5 were labeled good, 3 neutral, and 1 and 2 bad.
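That mapping reduces to a small function (a sketch of the labeling rule stated above, not the project's exact code):

```python
def rating_label(overall):
    """Map the 1-5 star rating to the coarse sentiment label used
    for the good/neutral/bad word clouds."""
    if overall >= 4:
        return "good"
    if overall == 3:
        return "neutral"
    return "bad"
```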

 

Using the new columns, three more word clouds were created to visually see if there is a difference within the words used in the different review ratings.

Good Word Cloud

Neutral Word Cloud

Bad Word Cloud

Further Data Exploration

 

Random Forest Classification

After the initial cleaning of the Amazon review dataset and examination of the word clouds, a Random Forest model was created to predict the helpfulness of the reviews. The model produced a prediction column giving the bin in which each review would be placed, i.e. whether the review was helpful, along with the probability of that outcome.

 

The Random Forest model was evaluated for accuracy using MulticlassClassificationEvaluator from pyspark. The training set had an accuracy of 42.20% and the test set 42.10%, very close to one another. Below are the predicted helpfulness bins alongside the actual bins from the Random Forest model.

Random Forest Classification Probability Outcomes of Testing Data

Random Forest Predictions

From these results we can see that more reviews were predicted incorrectly than correctly in the 2.0 helpfulness bin, as well as in the 1.0 and 0.0 bins.

 

Logistic Regression

A Logistic Regression model was then created using the same helpfulness bins to predict whether a review will be helpful to someone making a purchase.

 

When the model was run on the training data it produced an accuracy of 45.90%, and on the test data 45.00%. These accuracy rates are better than those of the Random Forest model, but there are still parameters that need further tuning. The visualization below shows the counts of correctly predicted helpfulness for the reviews users wrote about electronics bought from Amazon.

Logistic Regression Probability Outcomes of Testing Data

Logistic Regression Predictions

The predictions made by the Logistic Regression, although better than the Random Forest's, are still lacking. However, as each model is created, trained, and tested, a pattern may be emerging even without fine-tuned parameters for better accuracy.

 

Naive Bayes Classification

The Naive Bayes Classification model is based on probability: it applies Bayes' theorem with a strong independence assumption between features, and despite that simplification it is considered one of the more reliable baseline models for text classification.
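To illustrate that underlying idea (this is a toy multinomial Naive Bayes with add-one smoothing, not the pyspark implementation used in the project):

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit multinomial Naive Bayes: per-class word counts, class counts,
    and the vocabulary, from tokenized documents."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(tokens, word_counts, class_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word | class),
    with add-one (Laplace) smoothing."""
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, cy in class_counts.items():
        score = math.log(cy / n)
        total = sum(word_counts[y].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[y][w] + 1) / total)
        if score > best_score:
            best, best_score = y, score
    return best
```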

 

Again, the same variables were used for the creation of the Naive Bayes Classification model to predict the helpfulness of the Amazon reviews.

Naive Bayes Probability Outcomes of Testing Data

Naive Bayes Predictions

The Naive Bayes Classification model had an accuracy rate of 41.06% on the training data and 40.64% on the test data. The accuracy outcomes of this model were the lowest of the three models. As with the other models we are also seeing that there are more inaccurate predictions for each helpfulness bin than there are correct predictions.

 

The helpfulness bin 0.0 had the most accurate predictions across all models. All three prediction models show a pattern suggesting that helpfulness depends on factors other than the wording of reviews alone; a review can be considered helpful even if it is negative or neutral.

 

 

Amazon Review Analysis Findings and Conclusion

 

After creating word frequencies and word clouds and running Random Forest, Logistic Regression, and Naive Bayes models, it can be concluded that further exploration and refinement of the models and the text analysis are likely needed, in particular more specificity in how n-grams are attributed. N-grams are instrumental in determining how the text should be interpreted, and the context of the reviews should be evaluated. When evaluating the word clouds, the words “highly recommend” appeared in the good, neutral, and bad clouds alike, indicating that they can occur in both positive and negative contexts: ‘I would highly recommend’ vs. ‘I would highly recommend you avoid this’ produce the same ‘highly recommend’ bigram that we frequently observed. The context in which reviews are written therefore weighs heavily on the analysis of Amazon reviews.

 

Can a prediction model generate the same classification of helpfulness as dozens or hundreds of prospective buyers?

 

During the analysis of the Amazon reviews, it became clear that it would be useful to weight reviews by severity and by the volume of people who found them helpful. In the current analysis, a 1-star rating could mean a product that was defective on arrival, or it could mean a defective battery that caused a device to catch fire and do serious damage to someone's home and property; yet the helpfulness ratings were weighted equally regardless of the total number of helpful/unhelpful votes (only the ratio was considered). With further analysis, weighting the words written in the reviews could indicate a completely different outcome than the one currently produced.

 

For further analysis, incorporating more complex text analytics tools (e.g., SyntaxNet) could potentially identify terminology more relevant to the helpfulness bins and yield better prediction models and accuracy rates.

 

Can the prediction model of reviews, either helpful or not be applied to other businesses and their products to compare how their product reviews are similar to those from Amazon? Can the predictive model trained on electronics reviews generalize well to products in other categories?

 

The prediction models could potentially be applied to other industries to help businesses evaluate their products and services and determine the expectations of potential consumers. The predictive and classification models can be tuned and trained to evaluate the context of the reviews; considering that context is crucial to understanding what needs to be trained and tuned. The most frequent words in reviews need to be removed, as do words that are unique; the words in the middle of the frequency range are the ones that should ultimately be used for model creation and prediction. This helps eliminate words or groups of words that appear in good, neutral, and bad reviews alike.
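That "keep the middle of the frequency range" idea can be sketched as a vocabulary filter; the `drop_top` and `min_count` thresholds are illustrative, as the report does not fix exact values.

```python
from collections import Counter


def mid_frequency_vocab(token_lists, drop_top=50, min_count=2):
    """Keep the middle of the frequency range: drop the drop_top most
    common words and any word seen fewer than min_count times."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    too_common = {w for w, _ in counts.most_common(drop_top)}
    return {w for w, c in counts.items() if c >= min_count and w not in too_common}
```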

Prediction models could also be trained to place electronics into generalized categories, but only if specific words are used. Such models could potentially rely on rare words within the reviews that are reflective only of certain electronics. However, it is hard to tell exactly what people will write, and whether they will use correct and consistent terminology, especially when it comes to electronics.

 

Can a modestly accurate classification algorithm provide keywords that can guide or add value to the current human review process from existing Amazon shoppers?

 

The accuracy rates produced by our models ranged from 40.64% to 45.90%. When dealing with real-world data, it is appropriate to conclude that these rates are not as bad as they appear: real-world data passed through prediction models often yields low accuracy, and with finer tuning these models could produce higher rates. They also have the potential to provide keywords that would guide a consumer. The better the models are tuned, and the more closely the context of the reviews is examined, the better we could formulate sets of keywords that serve as indicators in the current purchasing and reviewing process.

 

These keyword lists, in the wrong hands, could potentially sway people into a purchase under false pretenses. This could be detrimental to Amazon's reputation if reviews were falsely written or biased toward any algorithmic approach.

 

As the Amazon review analysis stands, more research and parsing needs to take place to determine the actual context in which the reviews were written. This would help to better train the models, along with removing more of the most frequent words. These tasks could lead to more developed findings and a more accurate prediction of helpfulness.

 

Why our project is larger in scope than the homework.

 

Our project starts with 1.37 GB of data, whereas most of the data used in the homework is much smaller. We also conducted stratified sampling before performing the analysis. The stated objective of the project requires building an analysis on top of the scripting; the homework assignments have been focused on scripting alone. We intend to focus on the analysis as the end goal of this project, with scripting in Python and pyspark as the means of accomplishing it.

 

We are focusing our programs on knowledge, prediction and classification models, and the parameters within the programs that give the best accuracy.

Data Acquisition

http://jmcauley.ucsd.edu/data/amazon/

 

Reference

R. He, J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.

Software and dependencies

Apache Spark

Dockerized (Docker CE) pyspark Spark-2.4 on Kubernetes

https://hub.docker.com/r/alexmilowski/jupyter-spark-2.4.0/tags/

 

Rodeo

https://www.yhat.com/products/rodeo

 

Python & pyspark packages used throughout the Amazon Review Analysis   

numpy (python)

pandas (python)

seaborn (python)

json (python)

calendar (python)

matplotlib (python)

collections (python)

gzip (python)

urllib (python)

datetime (python)

RegexTokenizer (pyspark)

Tokenizer (pyspark)

IDF, i.e. Inverse Document Frequency (pyspark)

HashingTF (pyspark)

StopWordsRemover (pyspark)

Pipeline (pyspark)

RandomForestClassifier (pyspark)



© 2019 Kirby Hood

 
