Content-Based Recommender System Using NLP

Arif Zainurrohman
5 min read · Feb 27, 2021

Most of us have wondered where the recommendations that Netflix, Amazon, and Google give us come from. The two main types of recommender systems are collaborative filtering and content-based filtering. I will use hotels and movies as examples, but keep in mind that the same process can be applied to any kind of product you watch, listen to, buy, and so on.

[Figure: Recommendation paradigms]

Recommendation System

In general, recommender systems are algorithms that suggest relevant items to users (such as movies to watch, books to read, or products to buy, depending on the industry).

Recommender systems usually use collaborative filtering, content-based filtering (also known as the personality-based approach), or both, as well as other approaches such as knowledge-based systems.

In this exercise, we will use content-based filtering.

Content-Based Filtering

Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. These methods are best suited to situations where data about an item is known (name, location, description, etc.) but data about the user is not. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on an item’s features.

How do content-based recommender systems work?

In this system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. These algorithms try to recommend items that are similar to those that a user liked in the past, or is examining in the present. In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research.

To create a user profile, the system mostly focuses on two types of information:

1. A model of the user’s preference.

2. A history of the user’s interaction with the recommender system.
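The idea of matching candidate items against a keyword-based user profile can be sketched in a few lines. This is only an illustration under stated assumptions: the hotel names, the keywords, and the helper functions build_profile, score, and recommend are all made up for this example and are not the code used later in this article.

```python
from collections import Counter

# Hypothetical catalogue: each item is described by a few keywords.
items = {
    "Hotel A": ["beach", "pool", "family"],
    "Hotel B": ["beach", "pool", "spa"],
    "Hotel C": ["city", "business", "airport"],
    "Hotel D": ["mountain", "hiking", "spa"],
}

def build_profile(liked):
    """Count how often each keyword appears among the items a user liked."""
    profile = Counter()
    for name in liked:
        profile.update(items[name])
    return profile

def score(profile, keywords):
    """Sum the profile weights of the keywords that describe a candidate."""
    return sum(profile[k] for k in keywords)

def recommend(liked, top_n=2):
    """Rank the items the user has not liked yet by their profile score."""
    profile = build_profile(liked)
    candidates = [name for name in items if name not in liked]
    ranked = sorted(candidates, key=lambda n: score(profile, items[n]), reverse=True)
    return ranked[:top_n]

print(recommend(["Hotel A"]))  # "Hotel B" ranks first: it shares "beach" and "pool"
```

A user who liked Hotel A gets Hotel B first because the two share the most keywords. The rest of the article replaces these raw keyword counts with TF-IDF weights and cosine similarity.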

The Dataset

[Figure: Hotel]

The data we use for content-based filtering is a dataset we built ourselves from booking.com, chosen because the listing information there is complete.

The dataset consists of 657 hotel records, including names, reviews, prices, and other fields.

Import Library

Load the data

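The original loading code is not reproduced here, so the following is only a minimal sketch of this step, assuming pandas is installed. The column names (name, price, desc) and the two sample rows are invented stand-ins for the scraped file, which has 657 rows.

```python
import io

import pandas as pd

# Inline stand-in for the scraped file; names, prices, and text are made up.
csv_text = """name,price,desc
Hotel Melati,450000,"Beachfront hotel with a private pool"
Hotel Kenanga,300000,"Business hotel near the airport"
"""

# With a real file, pass its path to pd.read_csv instead of the StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (2, 3)
print(df.columns.tolist())  # ['name', 'price', 'desc']
```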

TF-IDF Vectorizer

TF-IDF, or Term Frequency-Inverse Document Frequency, is a very common algorithm for transforming text into a meaningful numerical representation that can be used to fit a machine learning algorithm for prediction.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
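The two formulas above can be worked through by hand on a tiny corpus. The sketch below uses only the standard library and three made-up, already tokenised descriptions. The function names tf, idf, and tf_idf are chosen for this example.

```python
import math

# Three tiny, made-up hotel descriptions, already tokenised.
docs = [
    ["hotel", "near", "beach", "beach"],
    ["hotel", "in", "city", "centre"],
    ["hotel", "with", "pool"],
]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("beach", docs[0]))            # 0.5
print(round(idf("beach", docs), 3))    # 1.099
print(tf_idf("hotel", docs[0], docs))  # 0.0
```

The word "hotel" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while "beach" is both frequent in the first document and rare across the corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalises each row by default, so its numbers will differ from this textbook version.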

[Figure: TF-IDF]

Scikit-learn provides a pre-built TF-IDF vectorizer that calculates the TF-IDF score for each document’s description, word by word.

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

Cosine Similarity

Cosine similarity is a metric used to determine how similar documents are, irrespective of their size.

[Figure: Similarity Metrics]

Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance because of their size (for example, the word ‘jakarta’ appears 10 times in one document and 5 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity.
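A small numeric check makes the difference between the two measures concrete. The word counts below are invented for illustration: long_doc is short_doc with every count doubled, so the two vectors point in the same direction.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts for the words ("jakarta", "hotel", "pool").
long_doc = [10, 4, 2]
short_doc = [5, 2, 1]   # same proportions, half the length
other_doc = [0, 1, 6]   # a different mix of words

print(round(cosine_similarity(long_doc, short_doc), 3))   # 1.0
print(round(euclidean_distance(long_doc, short_doc), 3))  # 5.477
print(round(cosine_similarity(long_doc, other_doc), 3))   # 0.24
```

The first two documents are several units apart by Euclidean distance, yet their cosine similarity is 1 because only the length differs. The third document has a different word mix and scores much lower.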

Text Preprocessing

We need to preprocess the text so that it can be converted into numbers with TF-IDF and then compared with cosine similarity. Only the ‘desc_clean’ column will be used.
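The exact cleaning steps are not listed, so what follows is an assumed, typical sequence (lowercasing, stripping non-letter characters, removing stop words) rather than the article's own code. The function name clean_text and the short stop-word list are placeholders for this sketch.

```python
import re

# A small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "with", "from"}

def clean_text(text):
    """Lowercase, drop non-letter characters, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("A 5-star Hotel in Jakarta, with a rooftop POOL!"))
# star hotel jakarta rooftop pool
```

Applied to every raw description, a function like this would produce the text stored in a column such as ‘desc_clean’.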

Creating a TF-IDF Vectorizer and Cosine Similarity

[Figure: TF-IDF Vectorizer and Cosine Similarity metrics]
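A minimal sketch of this step, assuming scikit-learn is installed. The three descriptions are invented stand-ins for the ‘desc_clean’ column, and linear_kernel is one possible way to get the similarity matrix. The article's own call is not shown, so treat the choice as an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up cleaned descriptions standing in for the 'desc_clean' column.
desc_clean = [
    "beachfront hotel private pool sea view",
    "beach resort pool sea view restaurant",
    "business hotel city centre near airport",
]

# Unigrams, bigrams, and trigrams; min_df=1 keeps every term (the library default).
tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), min_df=1, stop_words="english")
tfidf_matrix = tf.fit_transform(desc_clean)  # sparse matrix, one row per description

# Rows are L2-normalised by default, so the dot product equals cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

print(cosine_similarities.shape)  # (3, 3)
print(cosine_similarities.round(2))
```

Entry [i, j] of the result is the similarity between description i and description j, so the diagonal is 1 and the matrix is symmetric. Here the two beach hotels score higher with each other than either does with the business hotel.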

Creating a Variable

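The variable created at this step is not shown. One common choice, assumed here, is a lookup from hotel name to row position so that a name can be translated into a row of the similarity matrix. The hotel names are invented.

```python
# Hypothetical hotel names, in the same order as the rows of the similarity matrix.
hotel_names = ["Hotel Melati", "Hotel Kenanga", "Hotel Mawar"]

# Map each name to its row position for quick lookups.
indices = {name: position for position, name in enumerate(hotel_names)}

print(indices["Hotel Kenanga"])  # 1
```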

Modelling

[Figure: Model]

In the modeling stage, we create a function that recommends similar hotels based on the TF-IDF results and the cosine similarity computed earlier. It displays the 10 hotels closest to the hotel name that we specify.
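A recommendation function of this kind might look roughly like the sketch below. It is not the article's implementation: the function name recommendations, its parameters, the hotel names, and the hand-written similarity matrix are all assumptions made so the example runs on its own.

```python
def recommendations(name, names, similarity, top_n=10):
    """Return up to top_n (name, score) pairs most similar to the given item."""
    idx = names.index(name)                           # row of the query item
    scores = list(enumerate(similarity[idx]))         # (row index, similarity) pairs
    scores = [(i, s) for i, s in scores if i != idx]  # drop the item itself
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(names[i], s) for i, s in scores[:top_n]]

# Made-up names and a hand-written symmetric similarity matrix for illustration.
names = ["Hotel Melati", "Hotel Kenanga", "Hotel Mawar", "Hotel Anggrek"]
similarity = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]

for hotel, score in recommendations("Hotel Melati", names, similarity, top_n=2):
    print(hotel, score)
# Hotel Kenanga 0.8
# Hotel Anggrek 0.4
```

In practice the similarity argument would be the cosine similarity matrix computed from the TF-IDF vectors, and top_n would stay at 10 to match the description above.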

Run Script Recommendation

[Figure: Hotel]
[Figure: Movie]

Conclusion

A recommender system built with content-based filtering can suggest hotels whose descriptions in the dataset have something in common.

