Customized Book Search Engine Based on Keywords Extraction of Amazon Customer Reviews

Rui Ding
7 min read · Dec 14, 2021

Introduction

Amazon is one of the world’s largest e-commerce platforms. The ability to find almost any product on Amazon is the main reason consumers shop there. However, the fact that customers can find almost everything on Amazon does not mean it takes less effort for them to find what they are really looking for. Users often have specific expectations about certain characteristics of a product, but those features are not shown in the product title. As a result, it is hard for search engines to retrieve those “customized” features, which costs users a great deal of time choosing the product they really want.

For example, a mom may want to buy a toy for her daughter, and she may want the toy to be wooden (to prevent potential injury to kids), educational, durable, and suitable for girls. Amazon’s search engine does not perform well on retrieval queries like “wooden durable educational toys for girls”. Thus, we want to improve the online shopping experience by extracting useful information from millions of Amazon user reviews and elaborating product titles with feature tags summarized from those reviews.

This project aims to extract useful information from Amazon reviews to elaborate the product title, so that users can easily search for products that meet their personalized requirements. The deliverable is an improved product search engine that displays feature summaries in product titles on the user interface and at the same time provides better recommendations for customized search queries. Considering the scale of this project, we narrowed our scope from all Amazon products down to books. We also found that book titles on Amazon are extremely simple, often consisting of nothing but the book’s name, so adding review tags to books is especially meaningful for personalized recommendation.

Data

For this project, our data comes from a professional Amazon scraping service called Rainforest API, which provides clean, comprehensive, and high-quality Amazon product data.

We mainly used three API endpoints in this project. The first extracted all the sub-categories under the main book category; concretely, it listed all 30 book genres under the “Books” main category. The second provided detailed information for individual products, much like what users see on the shopping website; it contains everything except the reviews. We extracted 160 books from each subcategory. Finally, we used the endpoint that extracts the reviews of each product and appended the review information to the product information from the second step. The final data are in JSON format and contain product information for 4,800 books. One product data example is shown in the figure below.

Example of Product Information Data
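For reference, a request for one page of reviews might look like the minimal sketch below. The endpoint and parameter names follow Rainforest API’s public conventions, but the ASIN and API key are placeholders and error handling is kept to a bare minimum.

```python
import requests

API_KEY = "YOUR_RAINFOREST_API_KEY"  # placeholder; keep real keys out of source control
ENDPOINT = "https://api.rainforestapi.com/request"

def fetch_reviews(asin: str, page: int = 1) -> dict:
    """Fetch one page of customer reviews for a given book ASIN."""
    params = {
        "api_key": API_KEY,
        "type": "reviews",            # Rainforest request type for customer reviews
        "amazon_domain": "amazon.com",
        "asin": asin,
        "page": page,
    }
    response = requests.get(ENDPOINT, params=params)
    response.raise_for_status()
    return response.json()

# Example: pull the first page of reviews for a (hypothetical) ASIN
data = fetch_reviews("0000000000")
for review in data.get("reviews", []):
    print(review.get("rating"), review.get("body", "")[:80])
```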

Methods

Our work includes two parts. The first part is keyword extraction, which extracts tags from the reviews. The second part is information retrieval, which ranks the documents given queries. We used TF-IDF, BERT, and TextRank for keyword extraction. Then we used a latent Dirichlet allocation (LDA) model to get a list of books that are in the same category as the query. Finally, we used two different retrieval ranking methods, BERT and BM25, on titles as well as the tags to provide recommendations for search queries. The diagram of our method is shown below.

Diagram For Our Method

Keyword Extraction Methods — TF-IDF

For each document, we aim to extract the most representative words as its tags. We use the spaCy package in Python for text preprocessing. Then we use the TF-IDF formula to calculate a score for each word in the document and select the top k words with the highest scores.
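As a rough illustration, the sketch below (using sklearn’s TfidfVectorizer for brevity, not necessarily our exact pipeline) lemmatizes reviews with spaCy and keeps the top-k TF-IDF scored words per document:

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> str:
    """Lemmatize and drop stop words / punctuation."""
    doc = nlp(text)
    return " ".join(tok.lemma_.lower() for tok in doc
                    if tok.is_alpha and not tok.is_stop)

def top_k_keywords(docs, k=5):
    """Return the k highest TF-IDF scored words for each document."""
    cleaned = [preprocess(d) for d in docs]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(cleaned)
    vocab = vectorizer.get_feature_names_out()
    keywords = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:k]
        keywords.append([vocab[i] for i in top if row[i] > 0])
    return keywords

reviews = ["Sturdy wooden puzzle, very educational for my daughter.",
           "The pages tore quickly; not durable at all."]
print(top_k_keywords(reviews, k=3))
```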

Keyword Extraction Methods — BERT

For keyword extraction, we also tried BERT model using KeyBERT package in Python. KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document. First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.

BERT Model Architecture
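A minimal KeyBERT usage sketch looks like the following; the review text is made up, and the embedding model is whatever KeyBERT loads by default:

```python
from keybert import KeyBERT

kw_model = KeyBERT()  # defaults to a small sentence-transformers model

review_text = ("Beautiful wooden toy, very durable and great "
               "for teaching my daughter shapes and colors.")

# Extract unigram/bigram keyphrases ranked by cosine similarity
# between phrase embeddings and the document embedding.
keywords = kw_model.extract_keywords(
    review_text,
    keyphrase_ngram_range=(1, 2),
    stop_words="english",
    top_n=5,
)
print(keywords)  # e.g. [('wooden toy', 0.71), ...] (scores will vary)
```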

Keyword Extraction Methods — TextRank

The TextRank algorithm is a graph-based ranking model for text processing that can be used to find keywords. The overall flow of the algorithm is shown in the figure below. It splits an article into component units (e.g., words or sentences) and selects important components to build a graph model, on which a PageRank-style score is computed.

TextRank Diagram
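To make the graph construction concrete, here is a from-scratch sketch (not our exact implementation) that builds a co-occurrence graph over content words and ranks them with PageRank via networkx:

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")

def textrank_keywords(text: str, k: int = 5, window: int = 4):
    """Score candidate words with PageRank over a co-occurrence graph."""
    doc = nlp(text)
    # Candidate units: content-word lemmas (nouns and adjectives work well)
    words = [t.lemma_.lower() for t in doc
             if t.pos_ in {"NOUN", "PROPN", "ADJ"} and not t.is_stop]
    graph = nx.Graph()
    # Connect words that co-occur within a sliding window
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                graph.add_edge(w, u)
    ranks = nx.pagerank(graph)
    return sorted(ranks, key=ranks.get, reverse=True)[:k]

print(textrank_keywords(
    "Sturdy wooden puzzle, educational and durable, perfect gift for girls."))
```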

LDA Filtering

We use latent Dirichlet allocation (LDA) to generate a list of candidate books. LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words and each document is a mixture over a set of topics. Given M documents, N words, and a prior number of topics K, the model trains to output:

  • The distribution of words for each topic k.
  • The distribution of topics for each document m.

Each document can then be represented as a combination of several topics, and we used this information to help narrow the search.
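A minimal sketch with gensim (a common LDA implementation; the toy corpus and hyperparameters below are illustrative) shows both outputs:

```python
from gensim import corpora, models

# Toy corpus: each document is a preprocessed bag of review tokens
texts = [
    ["wooden", "toy", "educational", "durable", "girl"],
    ["mystery", "thriller", "plot", "twist", "detective"],
    ["recipe", "cooking", "kitchen", "ingredient", "easy"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Train LDA with K topics; alpha/eta keep their gensim defaults here
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                      random_state=42, passes=10)

# Per-topic word distributions and per-document topic mixtures
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
print(lda.get_document_topics(corpus[0]))
```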

Ranking Methods — BM25

Given a query, our task is to return the most relevant documents. We use BM25 to calculate the similarity between the query and each document, then select the k documents with the highest scores.
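A sketch using the rank_bm25 package (one common BM25 implementation; the toy corpus below mixes book titles with extracted tags, as in our setup):

```python
from rank_bm25 import BM25Okapi

# Each "document" here is a book title plus its extracted review tags
corpus = [
    "charlotte's web wooden classic animals children",
    "goodnight moon bedtime toddler durable board",
    "the hobbit adventure fantasy classic",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "durable book for toddler".split()
scores = bm25.get_scores(query)           # one score per document
top = bm25.get_top_n(query, corpus, n=2)  # k most relevant documents
print(scores, top, sep="\n")
```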

Ranking Methods — BERT

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also handle synonyms. The idea is to embed all entries in our corpus, whether they are sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from our corpus are retrieved. These entries should have a high semantic overlap with the query.

We use the sentence-transformers package to compute the BERT embeddings. The package provides a list of pretrained models for sentence embedding. We use a pretrained model to embed both the corpus and the query, then calculate cosine similarity to find the top k documents that are most relevant to the query.
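A minimal sketch with sentence-transformers follows; all-MiniLM-L6-v2 is one of the library’s pretrained options and stands in for whichever checkpoint is actually used:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one of the pretrained models

corpus = [
    "charlotte's web wooden classic animals children",
    "goodnight moon bedtime toddler durable board",
    "the hobbit adventure fantasy classic",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "sturdy bedtime story for a two-year-old"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine-similarity search for the top-k closest corpus entries
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```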

Results & Discussions

As discussed in Methods, we tried 6 combinations of keyword extraction and information retrieval methods, and we created 2 baselines to evaluate the results. The first is a random baseline that simply returns 15–30 random documents for each query. The second baseline is one of the 6 methods, the model combining the keyword extraction method TF-IDF with the ranking method BM25, since both are standard IR techniques.

We picked 20 queries and annotated 120 results for each query retrieved by each model on a 5-point relevance scale, where 1 represents not relevant and 5 represents highly relevant. Then we calculated NDCG at rank 10. The results are shown in the following tables.
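For clarity, NDCG@10 can be computed as below; this uses the linear-gain formulation (an exponential-gain variant is also common), and the relevance labels are hypothetical:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded judgments."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# 5-point relevance labels for one query's ranked results (hypothetical)
labels = [5, 4, 5, 2, 3, 1, 4, 2, 1, 1]
print(round(ndcg_at_k(labels, k=10), 3))
```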

Results of Different Models on 20 Queries
Performance of Three Extraction Methods Using BM25 ranking
Performance of Three Extraction Methods Using BERT Ranking

Our model combining the keyword extraction method TextRank with the IR method BERT performs best on almost all queries, and much better than the 2 baselines. Its NDCG at rank 10 fluctuates around 0.8–1, while the NDCG at rank 10 of the 2 baselines stays around 0.2–0.5, so our model is quite successful. This large performance improvement is expected, and the results should satisfy end users.

We have deployed our search engine here; you are welcome to try our book search engine with some fancy queries :) Note that the search engine is deployed on free servers, so it may not be stable, and it can take 10 to 30 seconds to respond to a search query.

Example of Retrieved Results

What’s Next?

Overall, the project is quite successful and meets our expectations, but there are still some points that could be further polished.

  • First, our dataset has some limitations. We only used reviews of books and did not consider other products, and the dataset contains only 4,800 products. Neither the diversity nor the amount of data is sufficient. We expect this method to generalize well at a larger scale, but that remains to be verified.
  • Second, we could try other deep learning models for keyword extraction as well as ranking, such as the DSSM model; more powerful models would hopefully improve the results.
  • Third, our deployed search engine takes a long time to respond to a given query! This is related to front-end and back-end interactions. While it is sufficient for demo purposes, it would need to be optimized for a real deployment.

