Preparing the Data for ML Algorithms

Preparing data for machine learning algorithms involves several crucial steps to ensure the data is in a suitable format for training models effectively. Proper data preparation can significantly impact the performance and accuracy of your machine learning models. Here’s a detailed explanation of the process:

Example dataset:

Customer Id | Age | Gender | Income | Occupation | Purchased
------------|-----|--------|--------|------------|----------
1           | 25  | Male   | 50000  | Engineer   | 1
2           | NaN | Female | 60000  | Scientist  | 0
3           | 35  | Female | 45000  | Artist     | 1
4           | 40  | Male   | NaN    | Engineer   | 0
5           | 50  | Female | 52000  | Engineer   | 1
6           | 30  | Male   | 58000  | Doctor     | 0
7           | 28  | Female | 61000  | Scientist  | 1
8           | 45  | NaN    | 55000  | Artist     | 0

Data Understanding

Let's assume we have a dataset of customer information that includes the following columns:

  • Customer Id: Unique identifier for each customer
  • Age: Numerical
  • Income: Numerical
  • Gender: Categorical (e.g., 'Male', 'Female')
  • Occupation: Categorical (e.g., 'Engineer', 'Doctor', 'Artist')
  • Purchased: Target variable, binary (0 = No, 1 = Yes)
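Before the cleaning steps below, the table above can be loaded into a pandas DataFrame. This construction is just a sketch to make the later snippets reproducible; the column names follow the example table:

```python
import numpy as np
import pandas as pd

# Recreate the example dataset from the table above
df = pd.DataFrame({
    'Customer Id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [25, np.nan, 35, 40, 50, 30, 28, 45],
    'Gender': ['Male', 'Female', 'Female', 'Male',
               'Female', 'Male', 'Female', np.nan],
    'Income': [50000, 60000, 45000, np.nan, 52000, 58000, 61000, 55000],
    'Occupation': ['Engineer', 'Scientist', 'Artist', 'Engineer',
                   'Engineer', 'Doctor', 'Scientist', 'Artist'],
    'Purchased': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Age, Gender, and Income each contain one missing value
print(df.isnull().sum())
```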
Data Cleaning

Handle Missing Data
  • Age: Impute missing values with the median age.
  • Income: Impute missing values with the median income.
  • Gender: Impute missing values with the most frequent value (mode).
# Impute missing values for Age with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
# Impute missing values for Income with the median
df['Income'] = df['Income'].fillna(df['Income'].median())
# Impute missing values for Gender with the mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Check Data After Imputation
print(df.isnull().sum())
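On the example data above, the imputation values work out as follows. This is only a quick sanity check, treating each column as a standalone pandas Series:

```python
import numpy as np
import pandas as pd

# Columns from the example table, before imputation
age = pd.Series([25, np.nan, 35, 40, 50, 30, 28, 45])
income = pd.Series([50000, 60000, 45000, np.nan, 52000, 58000, 61000, 55000])
gender = pd.Series(['Male', 'Female', 'Female', 'Male',
                    'Female', 'Male', 'Female', np.nan])

print(age.median())      # 35.0 (median of the 7 non-missing ages)
print(income.median())   # 55000.0
print(gender.mode()[0])  # Female (4 of the 7 non-missing values)
```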

Data Transformation

Encode Categorical Variables:

  • Gender: Convert to binary values
  • Occupation: Use one-hot encoding.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode Gender using Label Encoding
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# One-Hot Encode Occupation
df = pd.get_dummies(df, columns=['Occupation'], drop_first=True)
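As a quick sketch of what drop_first=True does here: pd.get_dummies orders the categories alphabetically and drops the first one ('Artist' in this example), leaving one indicator column per remaining category:

```python
import pandas as pd

# Minimal illustration with the four occupations from the example
occ = pd.DataFrame({'Occupation': ['Engineer', 'Scientist', 'Artist', 'Doctor']})
encoded = pd.get_dummies(occ, columns=['Occupation'], drop_first=True)

# 'Artist' is dropped; a row of all zeros means 'Artist'
print(list(encoded.columns))
```

Dropping one category avoids redundant (perfectly collinear) columns, which matters for linear models.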

Separate Features and Target:

# Define features and target variable
X = df.drop(columns=['Customer Id', 'Purchased'])
y = df['Purchased']

Feature Scaling (if needed)

Standardize numerical features like Age and Income if the algorithm requires it (e.g., distance-based or gradient-based models). Note that to avoid data leakage, the scaler should ideally be fitted on the training set only and then applied to the test set; it is applied to the full feature matrix here for simplicity.
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()

# Standardize numerical features
X[['Age', 'Income']] = scaler.fit_transform(X[['Age', 'Income']])
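As a quick check, StandardScaler rescales a column to zero mean and unit (population) standard deviation. Here it is applied to the example Age values after median imputation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example Age column after the NaN was replaced by the median (35)
ages = np.array([[25.], [35.], [35.], [40.], [50.], [30.], [28.], [45.]])
scaled = StandardScaler().fit_transform(ages)

# Mean ~0 and standard deviation ~1 after scaling
print(round(float(scaled.mean()), 6), round(float(scaled.std()), 6))
```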

Data Splitting

Split the data into training and test sets.
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
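With test_size=0.2 on the 8-row example, scikit-learn rounds the test set up to ceil(0.2 × 8) = 2 rows, leaving 6 for training. A sketch with a stand-in feature matrix of the same shape:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for an 8-row, 2-feature matrix and its binary target
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (6, 2) (2, 2)
```

On a dataset this small and imbalanced, passing stratify=y would keep the class ratio similar in both splits.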


