Data Cleaning and Preparation
Lesson 4: Data Cleaning and Preparation
Once we understand the dataset structure, the next step is to clean and prepare the data so it can be used reliably for analysis. Real-world sales data often contains missing values and incorrect data types, which can lead to wrong results if not handled properly.
The first part of data cleaning focuses on handling missing values. In this project, some address-related columns contain missing entries. Instead of removing rows, we fill these missing values with meaningful defaults to keep the dataset complete.
Code to handle missing values:
# Fill ADDRESSLINE2 with empty string
df['ADDRESSLINE2'] = df['ADDRESSLINE2'].fillna('')
# Fill STATE with 'Unknown'
df['STATE'] = df['STATE'].fillna('Unknown')
# Fill POSTALCODE with '00000'
df['POSTALCODE'] = df['POSTALCODE'].fillna('00000')
# Fill TERRITORY with 'Not Assigned'
df['TERRITORY'] = df['TERRITORY'].fillna('Not Assigned')
Next, we convert important columns to the correct data types. Dates are converted to datetime format, and numeric columns are converted to numeric types. This step is essential for performing calculations, grouping, and time-based analysis.
Code to convert data types:
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'], errors='coerce')
df['SALES'] = pd.to_numeric(df['SALES'], errors='coerce')
df['QUANTITYORDERED'] = pd.to_numeric(df['QUANTITYORDERED'], errors='coerce')
df['PRICEEACH'] = pd.to_numeric(df['PRICEEACH'], errors='coerce')
df['MSRP'] = pd.to_numeric(df['MSRP'], errors='coerce')
After cleaning and conversion, the dataset becomes consistent and analysis-ready. This prepared data will now be used for grouping, aggregation, and exploratory analysis in the next lesson.










