Preparing data for machine learning algorithms involves several crucial steps to ensure that the data is in a suitable format for training models effectively. Proper data preparation can significantly impact the performance and accuracy of your machine learning models. Here’s a detailed walk-through of the process:
Example dataset:
| Customer Id | Age | Gender | Income | Occupation | Purchased |
|-------------|-----|--------|--------|------------|-----------|
| 1           | 25  | Male   | 50000  | Engineer   | 1         |
| 2           | NaN | Female | 60000  | Scientist  | 0         |
| 3           | 35  | Female | 45000  | Artist     | 1         |
| 4           | 40  | Male   | NaN    | Engineer   | 0         |
| 5           | 50  | Female | 52000  | Engineer   | 1         |
| 6           | 30  | Male   | 58000  | Doctor     | 0         |
| 7           | 28  | Female | 61000  | Scientist  | 1         |
| 8           | 45  | NaN    | 55000  | Artist     | 0         |
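To make the snippets below reproducible, the table above can be loaded into a pandas DataFrame; a minimal sketch (assuming pandas and NumPy are available, with `np.nan` standing in for the missing entries):

```python
import numpy as np
import pandas as pd

# Recreate the example dataset from the table above.
df = pd.DataFrame({
    'Customer Id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [25, np.nan, 35, 40, 50, 30, 28, 45],
    'Gender': ['Male', 'Female', 'Female', 'Male',
               'Female', 'Male', 'Female', np.nan],
    'Income': [50000, 60000, 45000, np.nan,
               52000, 58000, 61000, 55000],
    'Occupation': ['Engineer', 'Scientist', 'Artist', 'Engineer',
                   'Engineer', 'Doctor', 'Scientist', 'Artist'],
    'Purchased': [1, 0, 1, 0, 1, 0, 1, 0],
})
print(df.shape)  # (8, 6)
```

In practice the data would more likely come from `pd.read_csv` or a database query; the literal construction here just mirrors the example table.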
Data Understanding
Let's assume we have a dataset about customer information that includes the following columns:
- Age: Numerical
- Income: Numerical
- Gender: Categorical (e.g., 'Male', 'Female')
- Occupation: Categorical (e.g., 'Engineer', 'Doctor', 'Artist')
- Purchased: Target variable, binary (0 = No, 1 = Yes)
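A quick way to confirm these column types is to inspect the DataFrame's dtypes; a small standalone sketch using a two-row slice of the example data:

```python
import numpy as np
import pandas as pd

# A two-row slice of the example table, just to illustrate the type check.
sample = pd.DataFrame({
    'Age': [25, np.nan],
    'Income': [50000, 60000],
    'Gender': ['Male', 'Female'],
    'Occupation': ['Engineer', 'Scientist'],
    'Purchased': [1, 0],
})
# Numerical columns show up as float64/int64; categorical ones as object.
print(sample.dtypes)
print(sample.select_dtypes(include='number').columns.tolist())
```

Note that `Age` is float64 rather than int64 here because the NaN forces the column to a floating-point type.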
Data Cleaning
Handle Missing Data
- Age: Impute missing values with the median age.
- Income: Impute missing values with the median income.
- Gender: Impute missing values with the most frequent value (mode).
# Impute missing values for Age with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing values for Income with the median
df['Income'] = df['Income'].fillna(df['Income'].median())

# Impute missing values for Gender with the mode
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
Check Data After Imputation
print(df.isnull().sum())
Data Transformation
Encode Categorical Variables:
- Gender: Convert to binary values using label encoding.
- Occupation: Use one-hot encoding.
from sklearn.preprocessing import LabelEncoder

# Encode Gender using Label Encoding
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# One-Hot Encode Occupation
df = pd.get_dummies(df, columns=['Occupation'], drop_first=True)
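As a sanity check on the one-hot step: `pd.get_dummies` orders the dummy columns by the sorted category values, and `drop_first=True` drops the first one as the implicit baseline. A small standalone sketch:

```python
import pandas as pd

occ = pd.DataFrame({'Occupation': ['Engineer', 'Scientist', 'Artist', 'Doctor']})
encoded = pd.get_dummies(occ, columns=['Occupation'], drop_first=True)

# 'Artist' sorts first alphabetically, so it becomes the dropped baseline:
# a row with all zeros in these columns means 'Artist'.
print(encoded.columns.tolist())
```

Dropping the first level avoids redundant (perfectly collinear) columns, which matters for linear models.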
Separate Features and Targets:
# Define features and target variable
X = df.drop(columns=['Customer Id', 'Purchased'])
y = df['Purchased']
Feature Scaling (if needed)
Standardize numerical features like Age and Income if required by the algorithm (e.g., distance-based or gradient-based models).
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Standardize numerical features
X[['Age', 'Income']] = scaler.fit_transform(X[['Age', 'Income']])
Data Splitting
Split the data into training and test sets.
from sklearn.model_selection import train_test_split

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
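Putting all of the steps together, the prepared data can be fed to a simple classifier as an end-to-end sanity check. The model choice (LogisticRegression) is illustrative only; note that this sketch fits the scaler on the training split alone and then applies it to the test split, which avoids leaking test-set statistics:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Recreate the example dataset.
df = pd.DataFrame({
    'Customer Id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [25, np.nan, 35, 40, 50, 30, 28, 45],
    'Gender': ['Male', 'Female', 'Female', 'Male',
               'Female', 'Male', 'Female', np.nan],
    'Income': [50000, 60000, 45000, np.nan, 52000, 58000, 61000, 55000],
    'Occupation': ['Engineer', 'Scientist', 'Artist', 'Engineer',
                   'Engineer', 'Doctor', 'Scientist', 'Artist'],
    'Purchased': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Cleaning: impute missing values.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].median())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Transformation: encode categorical variables.
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
df = pd.get_dummies(df, columns=['Occupation'], drop_first=True)

# Features/target, then split before scaling.
X = df.drop(columns=['Customer Id', 'Purchased'])
y = df['Purchased']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training data only, then transform both splits.
scaler = StandardScaler()
X_train[['Age', 'Income']] = scaler.fit_transform(X_train[['Age', 'Income']])
X_test[['Age', 'Income']] = scaler.transform(X_test[['Age', 'Income']])

model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))
```

With only eight rows the accuracy score is meaningless; the point is simply that the prepared features flow through a model without errors.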