Lesson 5: Data Preprocessing
Raw data is rarely suitable for machine learning models as it stands. In this lesson, we handle missing values, create meaningful features, and convert categorical variables into numerical format so the churn prediction model can learn effectively.
a. Fixing Missing Values
First, we ensure that the TotalCharges column is numeric and handle missing values.
Code:
# Convert TotalCharges to numeric; values that cannot be parsed (e.g. blank strings) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Check missing values
print("Missing values per column:")
print(df.isnull().sum())
# Drop rows with missing TotalCharges
df = df.dropna(subset=['TotalCharges'])
Output:
- Displays the number of missing values in each column.
- Removes rows where TotalCharges was missing.
After this step, TotalCharges is numeric and has no missing values, so the dataset is ready for feature engineering.
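To see what errors='coerce' and dropna do in isolation, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the lesson's dataset.

```python
import pandas as pd

# Toy data (hypothetical values): a blank string stands in for a missing charge
toy = pd.DataFrame({
    "tenure": [0, 5, 30],
    "TotalCharges": [" ", "350.5", "2100"],
})

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the missing values, then drop the affected rows
missing_count = int(toy["TotalCharges"].isnull().sum())
toy_clean = toy.dropna(subset=["TotalCharges"])

print("Missing TotalCharges:", missing_count)
print("Rows kept:", len(toy_clean))
```

Dropping rows is reasonable when only a few are affected. If many rows were missing, filling them in might be a better choice than discarding them.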
b. Feature Engineering
Next, we create new features that may improve model performance.
Code:
# Tenure buckets
def tenure_group(tenure):
    if tenure <= 12:
        return '0-12'
    elif tenure <= 24:
        return '12-24'
    elif tenure <= 48:
        return '24-48'
    else:
        return '48+'
df['TenureGroup'] = df['tenure'].apply(tenure_group)
# Average monthly spend (a tenure of 0 is treated as 1 to avoid division by zero)
df['AvgMonthlySpend'] = df['TotalCharges'] / df['tenure'].replace(0, 1)
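# Service count
services = ['PhoneService','MultipleLines','InternetService','OnlineSecurity',
            'OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
# Assumption: InternetService stores the connection type (e.g. 'DSL') or 'No',
# never 'Yes', so it is counted whenever it is not 'No'
has_internet = (df['InternetService'] != 'No').astype(int)
yes_counts = (df[services].drop(columns='InternetService') == 'Yes').sum(axis=1)
df['ServiceCount'] = yes_counts + has_internet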
# Payment stability
df['PaymentStability'] = df['PaymentMethod'].apply(
    lambda x: 'Automatic' if 'automatic' in x.lower() else 'Manual'
)
Output:
- Creates new columns: TenureGroup, AvgMonthlySpend, ServiceCount, and PaymentStability.
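These engineered features give the model more direct signals about customer behavior: how long a customer has stayed, how much they spend per month, how many services they use, and whether they pay automatically.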
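As an alternative to a hand-written bucketing function, the same tenure groups can be produced with pd.cut. This is a sketch on toy data, not the lesson's own method. The bin edges are chosen here to mirror the <= comparisons in tenure_group, and the sample tenures are made up.

```python
import pandas as pd

# Toy tenures (hypothetical), picked to sit on and around the bucket boundaries
toy = pd.DataFrame({"tenure": [0, 12, 13, 24, 48, 60]})

# pd.cut uses right-closed intervals by default, so these edges give
# (-inf, 12], (12, 24], (24, 48], (48, inf), matching the <= checks
bins = [-float("inf"), 12, 24, 48, float("inf")]
labels = ["0-12", "12-24", "24-48", "48+"]

# pd.cut returns a categorical column; cast to str to get plain labels
toy["TenureGroup"] = pd.cut(toy["tenure"], bins=bins, labels=labels).astype(str)

print(toy)
```

Note that each upper bound is inclusive: a tenure of exactly 12 months lands in '0-12', and 13 months is the first value in '12-24'.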
c. Encoding Categorical Variables
Most machine learning models require numerical input, so we convert categorical columns into numbers.
Code:
from sklearn.preprocessing import LabelEncoder
# Encode binary categorical columns
binary_cols = ['gender','Partner','Dependents','PhoneService',
               'PaperlessBilling','Churn','PaymentStability']
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])
# One-hot encode multi-class categorical columns
multi_cols = ['MultipleLines','InternetService','OnlineSecurity','OnlineBackup',
              'DeviceProtection','TechSupport','StreamingTV','StreamingMovies',
              'Contract','PaymentMethod','TenureGroup']
df = pd.get_dummies(df, columns=multi_cols, drop_first=True)
Output:
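- Binary columns are converted to 0 and 1. LabelEncoder assigns codes in sorted label order, so 'No' becomes 0 and 'Yes' becomes 1.
- Multi-class columns are transformed into multiple dummy variables. Because drop_first=True, the first category of each column is dropped and acts as the baseline.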
After these preprocessing steps, the encoded columns are numerical and the dataset is ready for model training, provided no other text columns (for example, an ID column) remain.
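To make the effect of both encoders concrete, here is a small sketch on a toy DataFrame. The rows are made up for illustration. Only the Partner and Contract column names come from the lesson.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical rows): one binary column and one multi-class column
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Label-encode the binary column; classes_ lists the labels in code order
le = LabelEncoder()
toy["Partner"] = le.fit_transform(toy["Partner"])
partner_classes = list(le.classes_)

# One-hot encode the multi-class column; drop_first removes the first category
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(partner_classes)
print(encoded.columns.tolist())
```

Here 'Month-to-month' is the dropped baseline: a row with zeros (or False) in both remaining Contract columns represents that category.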
