Data Preprocessing

Raw data is rarely suitable for direct use in machine learning models. In this lesson, we preprocess movie titles to improve search accuracy and usability within the recommendation function.

Movie titles often include the release year in parentheses, which can make matching user input difficult. To solve this, we extract the year and create a clean version of the title.

Code:

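# Extract the four-digit release year from the parentheses, e.g. "(1995)" -> "1995"
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)

# Remove the parenthesized year and trim whitespace (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
movies['clean_title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()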

Separating the year from the title lets users search by name alone, without having to reproduce the exact "Title (Year)" format. This makes title matching in the recommendation system more robust and removes a common source of matching errors.
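
To see the effect on a lookup, here is a minimal sketch on a small made-up DataFrame. The sample titles and the find_title helper are assumptions for illustration only; the lesson's actual movies data and recommendation function are defined elsewhere.

```python
import pandas as pd

# Hypothetical sample data; the lesson's real `movies` DataFrame is loaded elsewhere.
movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Heat (1995)", "Jumanji (1995)", "Untitled Film"]
})

# Same preprocessing as above: pull out the year, then strip it from the title.
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)", expand=False)
movies["clean_title"] = (
    movies["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()
)


def find_title(query, frame):
    """Hypothetical helper: case-insensitive exact match on clean_title."""
    normalized = query.strip().lower()
    return frame[frame["clean_title"].str.lower() == normalized]


# The user types the name without the year, with stray spaces and lowercase letters.
result = find_title("  toy story ", movies)
print(result[["clean_title", "year"]])
```

Note that the extracted year is a string, and titles without a parenthesized year get a missing value in the year column. If numeric years are needed, pd.to_numeric(movies["year"]) is one way to convert them.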