Menu

Genre Vectorization Using TF-IDF

Genre Vectorization Using TF-IDF

Machine learning models operate on numerical data, not text. Since movie genres are stored as text, they must be converted into numerical form before similarity calculations can be performed.

TF-IDF is used to convert genre text into numerical vectors that represent the importance of each genre for a movie.

Code:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(token_pattern='[a-zA-Z0-9]+')

tfidf_matrix = tfidf.fit_transform(movies['genres'])

After this step, each movie is represented as a vector of numbers. These vectors capture genre composition and allow the system to measure similarity between movies mathematically.