Joseph Cabinta

Data Cleaning and Sentiment Analysis on Amazon Sales Data using pandas, VADER, seaborn

View the Project on GitHub josephGZC/amazon_sentiment_analysis

Amazon Sales Sentiment Analysis

code behind the report → Jupyter Lab Sentiment Analysis


Table of Contents

1. Project Background
2. Executive Summary
3. Dataset Overview
4. Data Cleaning and Preprocessing
5. Sentiment Analysis Approach
6. Insights Deep-Dive
    6.1. Sentiment Overview
    6.2. Sentiment Score vs. Numerical Variables
    6.3. Highest and Lowest Sentiment Scores Across Product Categories
    6.4. Highest and Lowest Sentiment Scores Across Product Subcategories
           6.4.1. High Popularity Subcategories
           6.4.2. Low Popularity Subcategories
    6.5. Highest and Lowest Sentiment Scores Among Individual Products
           6.5.1. High Popularity Products
           6.5.2. Low Popularity Products
    6.6. Sentiment Trends in Popular vs. Niche Products
           6.6.1. Most Popular Products
           6.6.2. Least Popular Products
7. Recommendations
    7.1. Mid-to-High Review Volume (20k–200k), High-Sentiment Products
    7.2. Low-Review (<20k), High-Sentiment Products
    7.3. Stricter Quality Control & Product Improvement
    7.4. Implementation Notes


1. Project Background

[ back to contents ]

Amazon, the world’s largest online retailer, offers a vast selection of products across categories such as electronics, home goods, and office supplies. This sentiment analysis, based on January 2023 internal sales and review data, examines customer sentiment trends, identifies product quality concerns, and assesses sentiment correlations with price, discounts, and review volume. The findings guide ranking adjustments, quality control measures, and policy improvements to optimize marketplace efficiency.

2. Executive Summary

[ back to contents ]

This project analyzed Amazon sales data and customer sentiment across product categories using a lexicon-based approach. The analysis found an average sentiment score of 0.813 (on a -1 to +1 scale), with 72% of products classified as positive, indicating a generally favorable buyer experience. A Pearson correlation analysis (r = 0.12) showed that customer sentiment is only weakly related to price, suggesting that sentiment is driven more by product quality and user experience than by price. Certain subcategories, such as battery chargers (electronics) and exhaust fans (home & kitchen), showed negative sentiment, highlighting the need for stricter quality control. Separating products by review volume (≥20k vs. <20k reviews) identified under-marketed but high-quality items, including webcams, power LAN adapters, and tripods, which could benefit from targeted promotions. To improve customer satisfaction and marketplace efficiency, we recommend adjusting search ranking algorithms to prioritize well-rated yet under-marketed products (e.g., Philips GC1905 Steam Iron and Mi 108 cm Full HD Android LED TV) while enforcing stricter quality measures for consistently low-rated items (e.g., ENVIE ECR-20 Battery Charger and Fire-Boltt Ninja Calling Smartwatch). Policy refinements should also address recurring complaints in low-scoring subcategories, such as streaming clients and data dongles.

3. Dataset Overview

[ back to contents ]

The dataset consists of multiple product attributes, including customer reviews, ratings, and pricing details, summarized below:

| Column Name | Description |
|---|---|
| product_id | Unique identifier for each product |
| product_name | Name of the product |
| category | Main category and subcategories |
| discounted_price | Discounted price of the product |
| actual_price | Actual price before discounts |
| discount_percentage | Discount percentage |
| rating | Average product rating (out of 5) |
| rating_count | Number of users who rated the product |
| about_product | Product description |
| user_id | Unique user identifier |
| user_name | Name of the user who left a review |
| review_id | Unique review identifier |
| review_title | Short review summary |
| review_content | Detailed review content |
| img_link | Image URL of the product |
| product_link | Official product page link |
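
As an illustration of how these columns might be prepared for analysis, here is a minimal pandas sketch. It rests on assumptions not stated in the report: that prices and counts arrive as text with a currency symbol and thousands separators, that `discount_percentage` looks like "45%", and that category levels are joined with "|". The function name `clean_sales` and the derived columns `main_category` and `subcategory` are hypothetical.

```python
import pandas as pd


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric price/rating columns and split category levels."""
    out = df.copy()

    # Assumption: values such as "₹1,099" or "24,269" need non-numeric characters removed.
    for col in ["discounted_price", "actual_price", "rating_count"]:
        digits_only = out[col].astype(str).str.replace(r"[^\d.]", "", regex=True)
        out[col] = pd.to_numeric(digits_only, errors="coerce")

    # Assumption: discount_percentage is text such as "45%".
    out["discount_percentage"] = pd.to_numeric(
        out["discount_percentage"].astype(str).str.rstrip("%"), errors="coerce"
    )

    # Unparseable ratings become NaN instead of raising.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce")

    # Assumption: "Main|...|Sub" layout; first level is the main category,
    # last level is the most specific subcategory.
    levels = out["category"].astype(str).str.split("|")
    out["main_category"] = levels.str[0]
    out["subcategory"] = levels.str[-1]
    return out
```

Coercing with `errors="coerce"` keeps malformed rows visible as missing values, so they can be counted and handled explicitly rather than silently dropped.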

4. Data Cleaning & Preprocessing

[ back to contents ]

5. Sentiment Analysis Approach

[ back to contents ]

Sentiment analysis was conducted using VADER (Valence Aware Dictionary and Sentiment Reasoner), a lexicon- and rule-based tool designed for informal text, making it particularly well-suited for analyzing product reviews and social media posts. Each product’s star rating and review text were used to calculate a Sentiment Score, ranging from -1 (most negative) to +1 (most positive). To enhance interpretability, products were also classified into Sentiment Categories: Positive, Mixed Negative, Neutral, or Mixed Positive, based on the distribution of their ratings and textual reviews. Higher sentiment scores indicate greater customer satisfaction, whereas negative scores suggest unfavorable opinions.
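
The scoring step described above could look roughly like the sketch below. It only covers the review-text part: how the report combines star ratings with text, and how it defines its own sentiment categories, is not specified here, so neither is reproduced. Joining the title and body into one string and averaging per product are assumptions, and the helper names are hypothetical. The cutoffs of ±0.05 in `vader_label` are VADER's conventional thresholds, not the report's categories.

```python
from statistics import mean


def load_vader():
    """Create a VADER analyzer (requires the `vaderSentiment` package)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def review_score(analyzer, title: str, content: str) -> float:
    """Compound score in [-1, 1] for one review (assumption: title and body are joined)."""
    text = f"{title}. {content}".strip()
    return analyzer.polarity_scores(text)["compound"]


def product_score(analyzer, reviews) -> float:
    """Mean compound score over (title, content) pairs (assumption: simple average)."""
    return mean(review_score(analyzer, t, c) for t, c in reviews)


def vader_label(compound: float) -> str:
    """Conventional VADER cutoffs; not the report's sentiment categories."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

With the package installed, `product_score(load_vader(), [(title, content), ...])` returns one score per product. Any object exposing a `polarity_scores` method can stand in for the analyzer, which makes the helpers easy to check without the lexicon.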

6. Insights Deep-Dive

[ back to contents ]

6.1. Sentiment Overview [↑]

6.2. Sentiment Score vs. Numerical Variables [↑]
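
The executive summary reports a Pearson correlation between sentiment and price. A minimal sketch of how such correlations could be computed with pandas follows. It assumes a per-product table whose sentiment column is named `sentiment_score` (a hypothetical name) and whose numeric columns have already been converted from text.

```python
import pandas as pd

NUMERIC_COLS = (
    "discounted_price",
    "actual_price",
    "discount_percentage",
    "rating",
    "rating_count",
)


def sentiment_correlations(df: pd.DataFrame,
                           score_col: str = "sentiment_score",
                           numeric_cols=NUMERIC_COLS) -> pd.Series:
    """Pearson r between the sentiment score and each numeric column present in df."""
    cols = [c for c in numeric_cols if c in df.columns]
    # corrwith pairs each column with the score and ignores rows where either is missing.
    return df[cols].corrwith(df[score_col], method="pearson")
```

Values near 0 indicate a weak linear relationship. Pearson's r only captures linear association, so a scatter plot alongside the coefficient is a useful check.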


6.3. Highest and Lowest Sentiment Scores Across Product Categories [↑]

The main categories were ranked based on mean sentiment score using a bar chart, with colors representing the number of reviews.
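
The ranking behind such a chart might be computed as in this sketch. The column names `main_category` and `sentiment_score` are hypothetical, and treating the summed `rating_count` as the "number of reviews" is an assumption.

```python
import pandas as pd


def rank_by_sentiment(df: pd.DataFrame, group_col: str = "main_category") -> pd.DataFrame:
    """Mean sentiment and total review count per group, highest sentiment first."""
    return (
        df.groupby(group_col)
        .agg(
            mean_sentiment=("sentiment_score", "mean"),
            # Assumption: review volume is the sum of rating_count within the group.
            review_count=("rating_count", "sum"),
        )
        .sort_values("mean_sentiment", ascending=False)
    )
```

The resulting table has one row per category and can be passed to a plotting library, with `mean_sentiment` as bar length and `review_count` mapped to color.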


6.4. Highest and Lowest Sentiment Scores Across Product Subcategories [↑]

Because subcategories vary significantly in review counts, they were analyzed separately:

  1. High Review Counts (≥20k reviews):
    • Top 10 (highest sentiment scores)
    • Bottom 10 (lowest sentiment scores)
  2. Low Review Counts (<20k reviews):
    • Top 10 (highest sentiment scores)
    • Bottom 10 (lowest sentiment scores)

In all cases, bar colors represent review volume.
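
The four groups listed above can be derived from a per-subcategory summary table, as in this sketch. It assumes the table has columns named `mean_sentiment` and `review_count` (hypothetical names). The 20k cutoff is the one stated in the report.

```python
import pandas as pd

REVIEW_THRESHOLD = 20_000  # cutoff between high and low review counts


def top_bottom_by_volume(summary: pd.DataFrame, n: int = 10,
                         threshold: int = REVIEW_THRESHOLD) -> dict:
    """Split at `threshold` reviews, then take the n highest and n lowest scores in each half."""
    high = summary[summary["review_count"] >= threshold]
    low = summary[summary["review_count"] < threshold]
    return {
        "high_top": high.nlargest(n, "mean_sentiment"),
        "high_bottom": high.nsmallest(n, "mean_sentiment"),
        "low_top": low.nlargest(n, "mean_sentiment"),
        "low_bottom": low.nsmallest(n, "mean_sentiment"),
    }
```

Splitting before ranking keeps a subcategory with a handful of glowing reviews from crowding out one whose score rests on tens of thousands of reviews.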

6.4.1. High Popularity Subcategories [↑]

6.4.2. Low Popularity Subcategories [↑]


6.5. Highest and Lowest Sentiment Scores Among Individual Products [↑]

Products were categorized based on review count:

6.5.1. High Popularity Products [↑]

6.5.2. Low Popularity Products [↑]


6.6. Sentiment Trends in Popular vs. Niche Products [↑]

Products were analyzed by popularity, using mean sentiment scores in bar charts.


7. Recommendations

[ back to contents ]

7.1. Mid-to-High Review Volume (20k–200k), High-Sentiment Products [↑]

Subcategories to Consider

Products to Consider

Why It Matters

These products already have a substantial review base (above 20k) and maintain strong sentiment, but they have yet to achieve “blockbuster” status with 200k+ reviews. Elevating them in search results and promotional campaigns can help tap into their latent potential, moving them closer to top-tier performance.

Key Actions

  1. Targeted Promotions. Launch product- or category-specific ads, email campaigns, and homepage features to boost their visibility. Encourage cross-promotion where subcategories like Webcams or Memory are spotlighted alongside best-selling accessories.
  2. Refined Search Placement. Adjust search algorithms to reward high-sentiment products in the 20k–200k review range, ensuring they appear prominently for relevant keywords. Monitor conversion and engagement metrics for these items, dynamically updating placements to maximize growth.
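
One way to shortlist products for the actions above is a simple filter on review count and sentiment, sketched below. The 20k and 200k bounds come from this section. The `min_sentiment` default of 0.8 is a hypothetical cutoff chosen for illustration, and the `Product` type and function name are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    review_count: int
    sentiment: float  # sentiment score in [-1, 1]


def promotion_candidates(products, min_reviews=20_000, max_reviews=200_000,
                         min_sentiment=0.8):
    """Products with min_reviews <= reviews < max_reviews and sentiment >= min_sentiment.

    min_sentiment is a hypothetical threshold; tune it against the real score distribution.
    """
    picked = [
        p for p in products
        if min_reviews <= p.review_count < max_reviews and p.sentiment >= min_sentiment
    ]
    # Highest sentiment first, so the strongest candidates lead the list.
    return sorted(picked, key=lambda p: p.sentiment, reverse=True)
```

The returned list could feed a promotion queue or act as a boost signal in search ranking. How the signal is weighted against relevance is a separate design decision not covered here.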

7.2. Low-Review (<20k), High-Sentiment Products [↑]

Subcategories to Consider

Products to Consider

Why It Matters

Though these items have fewer than 20k reviews, they exhibit high sentiment scores, indicating quality and customer satisfaction. With strategic marketing, they could grow substantially, bridging the gap between early positive feedback and broader market adoption.

Key Actions

  1. Awareness Campaigns. Showcase these products in curated lists (e.g., “Top-Rated Essentials”) and utilize newsletters or social media posts to reach new audiences. Consider limited-time discounts to encourage trial and accelerate the review collection process.
  2. Partnerships & Bundling. Bundle these under-reviewed, high-sentiment products with related high-traffic items (e.g., pairing Espresso Machines with popular coffee bean brands). Highlight subcategory benefits (e.g., Fountain Pens for professionals, Tablets for students) in cross-category promotions to draw in complementary buyers.

7.3. Stricter Quality Control & Product Improvement [↑]

Subcategories to Consider

Products to Consider

Why It Matters

Despite strong sales volumes or established brand presence, these subcategories and products harbor chronic quality or performance issues. Ignoring negative feedback leads to higher return rates, dissatisfied customers, and potential brand damage.

Key Actions

  1. Focused Product Audits. Conduct comprehensive technical evaluations, user feedback analyses, and manufacturer reviews, particularly for high-volume items with negative scores. Enforce stricter production standards and supplier accountability to mitigate recurring complaints.
  2. Rapid Response & Remediation. Immediately address items scoring below –0.5 sentiment, prioritizing improvements in design, durability, or customer support options. Consider removing severely underperforming SKUs from listings until quality issues are resolved.
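
The two actions above imply a simple triage rule, sketched here under stated assumptions: scores below -0.5 go to immediate remediation (the cutoff named in this section), other negative scores go to audit, and everything else passes. The bucket names and function names are hypothetical.

```python
def triage(sentiment: float, remediation_cutoff: float = -0.5) -> str:
    """Map one sentiment score to a hypothetical action bucket."""
    if sentiment < remediation_cutoff:
        return "remediate"  # immediate response, possible delisting
    if sentiment < 0:
        return "audit"      # negative but above the cutoff
    return "ok"


def triage_all(scores: dict) -> dict:
    """Group SKU identifiers by bucket, given a {sku: sentiment} mapping."""
    buckets = {"remediate": [], "audit": [], "ok": []}
    for sku, score in scores.items():
        buckets[triage(score)].append(sku)
    return buckets
```

A score of exactly -0.5 lands in "audit" because the section says "below -0.5". Whether review volume should also gate the audit bucket is left open.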

7.4. Implementation Notes [↑]