Customer reviews are an extremely useful reference for anyone who wants to purchase products or services from a business. However, reading every review is impractical, especially for products with a very large number of reviews.
Hence, this project aims to summarise the customer reviews of a business using the Yelp dataset, specifically the positive and negative sentiment around the top five main aspects discussed in the reviews. Such a summary should give customers a quick view of the product and of the aspects that other customers care about, which will help them in decision making.
1. Data preprocessing on reviews dataset
- clean all symbols & numbers
- lower case
- remove stopwords
2. Main aspects extraction
- retain only the nouns in each review using a POS tagging function
- lemmatize the retained words
- count the frequency of the retained words and keep the top 5 as the main aspects
3. Use a pre-trained model to predict sentiment
- An XGBoost model is trained beforehand on TF-IDF features to predict the sentiment of a review (positive, negative, neutral). The sentiment label of each review is derived from its rating (≤ 2: negative, 3: neutral, ≥ 4: positive).
- The output of the pre-trained model is the score/probability that a review falls under each sentiment category.
- The top 5 highest-scoring positive and negative reviews are retained.
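
The three steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the stopword list is a tiny placeholder, the reviews passed to `top_aspects` are assumed to be already POS-tagged with Penn Treebank tags (the style a tagger such as NLTK's produces), the lemmatizer is a pluggable function that defaults to doing nothing, and the sentiment scores are assumed to come from an already trained classifier. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"the", "a", "an", "is", "was", "were", "and", "or",
             "to", "of", "in", "it", "i", "my", "but", "very"}


def preprocess(review):
    """Step 1: strip symbols and numbers, lowercase, drop stopwords."""
    letters_only = re.sub(r"[^A-Za-z\s]", " ", review)
    return [w for w in letters_only.lower().split() if w not in STOPWORDS]


def top_aspects(tagged_reviews, lemmatize=lambda word: word, k=5):
    """Step 2: count noun lemmas across reviews and keep the k most frequent.

    tagged_reviews holds one list of (word, tag) pairs per review, where
    noun tags start with "NN" (Penn Treebank convention).
    """
    counts = Counter(
        lemmatize(word)
        for review in tagged_reviews
        for word, tag in review
        if tag.startswith("NN")
    )
    return [aspect for aspect, _ in counts.most_common(k)]


def rating_to_sentiment(stars):
    """Step 3 labels: <= 2 is negative, >= 4 is positive, otherwise neutral."""
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return "neutral"


def top_reviews(scored_reviews, k=5):
    """Keep the k (review, score) pairs with the highest predicted score."""
    return sorted(scored_reviews, key=lambda pair: pair[1], reverse=True)[:k]
```

In practice `top_reviews` would be called once with the positive-class probabilities and once with the negative-class probabilities to build the two halves of the summary.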
The following picture shows an example aspect summary for a business. Clearly, "hospital" is a common term in the customer reviews.
The methodology above is able to summarise all the customer reviews of a business. However, the methods still have some flaws, and the suggested improvements are as follows:
1. Aspect extraction
The current aspect extraction has several flaws. For example, if a review covers several main aspects, it is classified under only the first one, which misrepresents the review. Hence, we should further break down the sentences in a review (nouns and their corresponding adjectives) and design a scoring function that weighs the main aspects of the review.
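
One possible shape for such a scoring function is sketched below. It is deliberately simpler than the noun and adjective breakdown proposed here: it only measures each aspect's share of all aspect mentions in a review, so that a review can be listed under every aspect it discusses rather than just the first. The function name and the weighting scheme are assumptions made for illustration.

```python
import re
from collections import Counter


def aspect_weights(review, aspects):
    """Return each mentioned aspect's share of all aspect mentions in the review.

    Aspects that never appear are omitted; the returned weights sum to 1
    whenever at least one aspect is mentioned.
    """
    tokens = re.findall(r"[a-z]+", review.lower())
    counts = Counter(token for token in tokens if token in aspects)
    total = sum(counts.values())
    return {a: counts[a] / total for a in aspects if counts[a]}
```

For example, a review that mentions "parking" twice and "doctor" once would get a weight of about 0.67 for parking and 0.33 for doctor, and could be filed under both aspects.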
2. Variety in the review summary
Another thing the method should account for is variety among the selected positive and negative reviews. The current method only sorts the reviews by the score predicted by the sentiment model, so several of the top reviews may talk about the same thing. For example, if the top five negative reviews all say that the product is very expensive, the summary is not very useful for customers who do not care much about the product's price.
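
One way to add variety is a greedy selection that skips a review when it overlaps too much with one already chosen. The sketch below uses Jaccard similarity over token sets as the overlap measure; both the measure and the 0.5 threshold are assumptions for illustration, and the function names are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token collections (0.0 if both are empty)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diverse_top_k(scored_reviews, k=5, max_similarity=0.5):
    """Pick up to k high-scoring reviews that are not near-duplicates.

    scored_reviews is a list of (tokens, score) pairs. Reviews are visited
    from highest to lowest score, and one is kept only if its similarity to
    every review kept so far is at most max_similarity.
    """
    selected = []
    ranked = sorted(scored_reviews, key=lambda pair: pair[1], reverse=True)
    for tokens, score in ranked:
        if all(jaccard(tokens, kept) <= max_similarity for kept, _ in selected):
            selected.append((tokens, score))
        if len(selected) == k:
            break
    return selected
```

With this filter, a second review that mostly repeats "price" and "expensive" would be dropped in favour of a lower-scoring review about a different topic.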
3. Adding a usefulness feature
The dataset has a feature named “usefulness”, which is an indicator of the aspects that customers pay attention to. Hence, it could help to rank the aspects that customers care about for a particular product.
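
As a rough sketch of how this could work, the aspect count can be weighted by usefulness, so that nouns from reviews many readers found useful count for more. The sketch assumes each review comes with its extracted nouns and a non-negative usefulness count; the log damping and the function name are illustrative choices, not part of the current method.

```python
import math
from collections import Counter


def weighted_aspects(reviews, k=5):
    """Rank aspects by usefulness-weighted mentions.

    reviews is a list of (nouns, useful_votes) pairs. Each review contributes
    1 + log(1 + useful_votes) to every distinct noun it contains, so a single
    very popular review cannot dominate the ranking outright.
    """
    scores = Counter()
    for nouns, useful_votes in reviews:
        weight = 1.0 + math.log1p(useful_votes)
        for noun in set(nouns):
            scores[noun] += weight
    return [aspect for aspect, _ in scores.most_common(k)]
```

Under this weighting, an aspect raised in one review with many usefulness votes can outrank an aspect raised in a few reviews that nobody marked as useful.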