Lesson 5: Data Preprocessing

Raw data is not directly suitable for machine learning models. In this lesson, we clean missing values, create meaningful features, and convert categorical variables into numerical format so the churn prediction model can learn effectively.

a. Fixing Missing Values

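First, we make sure the TotalCharges column is numeric and handle its missing values. If the column was read as text, pd.to_numeric with errors='coerce' converts every entry it can parse and turns the rest (for example, blank strings) into NaN, so they show up as missing values.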

Code:

# df is assumed to be the DataFrame loaded in an earlier lesson
import pandas as pd

# Convert TotalCharges to numeric; unparseable entries become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check missing values
print("Missing values per column:")
print(df.isnull().sum())

# Drop rows with missing TotalCharges
df = df.dropna(subset=['TotalCharges'])

Output:

  • Displays the number of missing values in each column.
  • Removes rows where TotalCharges was missing.

With TotalCharges numeric and free of missing values, the column can be used safely in the calculations of the feature engineering step.
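To see what errors='coerce' does in practice, here is a minimal sketch on a made-up three-row frame. The name toy and its values are illustrative assumptions, not taken from the lesson's dataset. The blank string cannot be parsed as a number, so it becomes NaN and its row is then dropped.

```python
import pandas as pd

# Hypothetical toy data: TotalCharges arrives as strings, one of them blank
toy = pd.DataFrame({
    "tenure": [0, 12, 30],
    "TotalCharges": [" ", "350.5", "1200"],
})

# Unparseable entries become NaN instead of raising an error
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing = int(toy["TotalCharges"].isnull().sum())

# Drop the rows that could not be converted
toy_clean = toy.dropna(subset=["TotalCharges"])
print(missing, len(toy_clean))
```

Without errors='coerce', the same call would raise an error on the blank entry instead of marking it as missing.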

b. Feature Engineering

Next, we create new features to improve model performance.

Code:

# Tenure buckets
def tenure_group(tenure):
    if tenure <= 12:
        return '0-12'
    elif tenure <= 24:
        return '12-24'
    elif tenure <= 48:
        return '24-48'
    else:
        return '48+'

df['TenureGroup'] = df['tenure'].apply(tenure_group)

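# Average monthly spend (a tenure of 0 is replaced by 1 to avoid division by zero)
df['AvgMonthlySpend'] = df['TotalCharges'] / df['tenure'].replace(0, 1)

# Service count
# Assumption: these columns hold 'Yes' for an active service. InternetService
# usually holds a connection type (e.g. 'DSL') or 'No' rather than 'Yes',
# so it is counted separately as "anything other than 'No'".
yes_no_services = ['PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
                   'DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
df['ServiceCount'] = ((df[yes_no_services] == 'Yes').sum(axis=1)
                      + (df['InternetService'] != 'No').astype(int))

# Payment stability
df['PaymentStability'] = df['PaymentMethod'].apply(
    lambda x: 'Automatic' if 'automatic' in x.lower() else 'Manual'
)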

Output:

  • Creates new columns: TenureGroup, AvgMonthlySpend, ServiceCount, and PaymentStability.

These engineered features make tenure stage, spending rate, number of services, and payment habits explicit, so the model does not have to infer them from the raw columns.
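The tenure buckets can also be built without a row-by-row function. The sketch below swaps in pd.cut as a vectorized alternative to tenure_group; the sample tenure values are made up for illustration. It also makes the bucket boundaries explicit: upper bounds are inclusive, so a tenure of 12 lands in '0-12' and 13 lands in '12-24'.

```python
import pandas as pd

# Hypothetical tenure values chosen to hit the bucket boundaries
tenure = pd.Series([0, 12, 13, 24, 48, 60])

# Bins are right-inclusive by default: (-1, 12], (12, 24], (24, 48], (48, inf)
groups = pd.cut(
    tenure,
    bins=[-1, 12, 24, 48, float("inf")],
    labels=["0-12", "12-24", "24-48", "48+"],
).astype(str)

print(groups.tolist())
# ['0-12', '0-12', '12-24', '12-24', '24-48', '48+']
```

The lower bin edge of -1 is there so that a tenure of 0 still falls inside the first bucket.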

c. Encoding Categorical Variables

Most machine learning models, including those in scikit-learn, require numerical input, so we convert the categorical columns into numbers.

Code:

from sklearn.preprocessing import LabelEncoder

# Encode binary categorical columns
# LabelEncoder assigns integers in sorted order, e.g. 'No' -> 0 and 'Yes' -> 1
binary_cols = ['gender','Partner','Dependents','PhoneService',
               'PaperlessBilling','Churn','PaymentStability']
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])  # the encoder is refit for each column

# One-hot encode multi-class categorical columns
# drop_first=True drops one category per column to avoid redundant dummies;
# dtype=int yields 0/1 integers rather than booleans
multi_cols = ['MultipleLines','InternetService','OnlineSecurity','OnlineBackup',
              'DeviceProtection','TechSupport','StreamingTV','StreamingMovies',
              'Contract','PaymentMethod','TenureGroup']
df = pd.get_dummies(df, columns=multi_cols, drop_first=True, dtype=int)

Output:

  • Binary columns are converted to 0 and 1.
  • Multi-class columns are transformed into multiple dummy variables.

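After these preprocessing steps, every column processed above is numerical. Any other non-numeric column still present in df (for example, a customer ID) must be dropped or encoded before model training.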
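As a quick illustration of the two encoding styles, here is a small sketch on made-up data. The frame toy and its values are assumptions for demonstration only, and the explicit map call is an alternative to LabelEncoder that keeps the 0/1 meaning visible in the code. The last line checks that every resulting column is numeric.

```python
import pandas as pd

# Hypothetical toy data with one binary and one multi-class column
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Explicit mapping for a binary column (an alternative to LabelEncoder)
toy["Partner"] = toy["Partner"].map({"No": 0, "Yes": 1})

# drop_first=True drops the first category in sorted order as the baseline
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)
print(encoded.columns.tolist())

# Confirm that no text columns remain
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in encoded.dtypes)
print(all_numeric)
```

Here 'Month-to-month' is the dropped baseline: a row with 0 in both remaining Contract columns is a month-to-month customer.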