Preparing data for machine learning algorithms involves several crucial steps to ensure that the data is in a suitable format for training models effectively. Proper data preparation can significantly impact the performance and accuracy of your machine learning models. Here’s a detailed walk-through of the process:
Example dataset:
| Customer Id | Age | Gender | Income | Occupation | Purchased |
|-------------|-----|--------|--------|------------|-----------|
| 1           | 25  | Male   | 50000  | Engineer   | 1         |
| 2           | NaN | Female | 60000  | Scientist  | 0         |
| 3           | 35  | Female | 45000  | Artist     | 1         |
| 4           | 40  | Male   | NaN    | Engineer   | 0         |
| 5           | 50  | Female | 52000  | Engineer   | 1         |
| 6           | 30  | Male   | 58000  | Doctor     | 0         |
| 7           | 28  | Female | 61000  | Scientist  | 1         |
| 8           | 45  | NaN    | 55000  | Artist     | 0         |
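To make the snippets below reproducible, the table above can be loaded into a pandas DataFrame; a minimal sketch (assuming pandas and NumPy are available, with `np.nan` standing in for the missing entries):

```python
import numpy as np
import pandas as pd

# Recreate the example dataset from the table above.
df = pd.DataFrame({
    'Customer Id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [25, np.nan, 35, 40, 50, 30, 28, 45],
    'Gender': ['Male', 'Female', 'Female', 'Male',
               'Female', 'Male', 'Female', np.nan],
    'Income': [50000, 60000, 45000, np.nan,
               52000, 58000, 61000, 55000],
    'Occupation': ['Engineer', 'Scientist', 'Artist', 'Engineer',
                   'Engineer', 'Doctor', 'Scientist', 'Artist'],
    'Purchased': [1, 0, 1, 0, 1, 0, 1, 0],
})
print(df.shape)  # (8, 6)
```

In practice the data would more likely come from `pd.read_csv` or a database query; the literal construction here just mirrors the example table.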
Data Understanding
Let's assume we have a dataset about customer information that includes the following columns:
- Age: Numerical
- Income: Numerical
- Gender: Categorical (e.g., 'Male', 'Female')
- Occupation: Categorical (e.g., 'Engineer', 'Doctor', 'Artist')
- Purchased: Target variable, binary (0 = No, 1 = Yes)
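A quick way to confirm these column types is to inspect the DataFrame's dtypes; a small standalone sketch using a two-row slice of the example data:

```python
import numpy as np
import pandas as pd

# A two-row slice of the example table, just to illustrate the type check.
sample = pd.DataFrame({
    'Age': [25, np.nan],
    'Income': [50000, 60000],
    'Gender': ['Male', 'Female'],
    'Occupation': ['Engineer', 'Scientist'],
    'Purchased': [1, 0],
})
# Numerical columns show up as float64/int64; categorical ones as object.
print(sample.dtypes)
print(sample.select_dtypes(include='number').columns.tolist())
```

Note that `Age` is float64 rather than int64 here because the NaN forces the column to a floating-point type.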
Data Cleaning
Handle Missing Data
- Age: Impute missing values with the median age.
- Income: Impute missing values with the median income.
- Gender: Impute missing values with the most frequent value (mode).
# Impute missing values for Age with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing values for Income with the median
df['Income'] = df['Income'].fillna(df['Income'].median())

# Impute missing values for Gender with the mode
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
Check Data After Imputation
print(df.isnull().sum())
Data Transformation
Encode Categorical Variables:
- Gender: Convert to binary values using label encoding.
- Occupation: Use one-hot encoding.
from sklearn.preprocessing import LabelEncoder

# Encode Gender using Label Encoding
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# One-Hot Encode Occupation
df = pd.get_dummies(df, columns=['Occupation'], drop_first=True)
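As a sanity check on the one-hot step: `pd.get_dummies` orders the dummy columns by the sorted category values, and `drop_first=True` drops the first one as the implicit baseline. A small standalone sketch:

```python
import pandas as pd

occ = pd.DataFrame({'Occupation': ['Engineer', 'Scientist', 'Artist', 'Doctor']})
encoded = pd.get_dummies(occ, columns=['Occupation'], drop_first=True)

# 'Artist' sorts first alphabetically, so it becomes the dropped baseline:
# a row with all zeros in these columns means 'Artist'.
print(encoded.columns.tolist())
```

Dropping the first level avoids redundant (perfectly collinear) columns, which matters for linear models.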
Separate Features and Targets:
# Define features and target variable
X = df.drop(columns=['Customer Id', 'Purchased'])
y = df['Purchased']
Feature Scaling (if needed)
Standardize numerical features like Age and Income if required by the algorithm (e.g., distance-based or gradient-based models).
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Standardize numerical features
X[['Age', 'Income']] = scaler.fit_transform(X[['Age', 'Income']])
Data Splitting
Split the data into training and test sets.
from sklearn.model_selection import train_test_split

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
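Putting all of the steps together, the prepared data can be fed to a simple classifier as an end-to-end sanity check. The model choice (LogisticRegression) is illustrative only; note that this sketch fits the scaler on the training split alone and then applies it to the test split, which avoids leaking test-set statistics:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Recreate the example dataset.
df = pd.DataFrame({
    'Customer Id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [25, np.nan, 35, 40, 50, 30, 28, 45],
    'Gender': ['Male', 'Female', 'Female', 'Male',
               'Female', 'Male', 'Female', np.nan],
    'Income': [50000, 60000, 45000, np.nan, 52000, 58000, 61000, 55000],
    'Occupation': ['Engineer', 'Scientist', 'Artist', 'Engineer',
                   'Engineer', 'Doctor', 'Scientist', 'Artist'],
    'Purchased': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Cleaning: impute missing values.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].median())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Transformation: encode categorical variables.
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
df = pd.get_dummies(df, columns=['Occupation'], drop_first=True)

# Features/target, then split before scaling.
X = df.drop(columns=['Customer Id', 'Purchased'])
y = df['Purchased']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training data only, then transform both splits.
scaler = StandardScaler()
X_train[['Age', 'Income']] = scaler.fit_transform(X_train[['Age', 'Income']])
X_test[['Age', 'Income']] = scaler.transform(X_test[['Age', 'Income']])

model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))
```

With only eight rows the accuracy score is meaningless; the point is simply that the prepared features flow through a model without errors.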