
Preparing the Data for ML Algorithms

Preparing data for machine learning algorithms involves several crucial steps to ensure the data is in a suitable format for effective model training. Proper data preparation can significantly affect the performance and accuracy of your machine learning models. Here’s a detailed walkthrough of the process:

Example dataset:

Customer Id   Age   Gender   Income   Occupation   Purchased
1             25    Male     50000    Engineer     1
2             NaN   Female   60000    Scientist    0
3             35    Female   45000    Artist       1
4             40    Male     NaN      Engineer     0
5             50    Female   52000    Engineer     1
6             30    Male     58000    Doctor       0
7             28    Female   61000    Scientist    1
8             45    NaN      55000    Artist       0


Data Understanding

Let's assume we have a customer dataset with the following columns:

  • Age: numerical
  • Income: numerical
  • Gender: categorical (e.g., 'Male', 'Female')
  • Occupation: categorical (e.g., 'Engineer', 'Doctor', 'Artist')
  • Purchased: binary target variable (0 = No, 1 = Yes)
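
The description above can be materialized as a pandas DataFrame for the code examples that follow (a minimal sketch; the NaN entries mirror the example table at the top of the post):

```python
import numpy as np
import pandas as pd

# Build the example customer dataset, including the missing values (NaN)
df = pd.DataFrame({
    'Customer Id': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age':         [25, np.nan, 35, 40, 50, 30, 28, 45],
    'Gender':      ['Male', 'Female', 'Female', 'Male',
                    'Female', 'Male', 'Female', np.nan],
    'Income':      [50000, 60000, 45000, np.nan,
                    52000, 58000, 61000, 55000],
    'Occupation':  ['Engineer', 'Scientist', 'Artist', 'Engineer',
                    'Engineer', 'Doctor', 'Scientist', 'Artist'],
    'Purchased':   [1, 0, 1, 0, 1, 0, 1, 0],
})

print(df.shape)  # (8, 6)
```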
Data Cleaning

Handle Missing Data
  • Age: Impute missing values with the median age.
  • Income: Impute missing values with the median income.
  • Gender: Impute missing values with the most frequent value (mode).
# Impute missing values for Age with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
# Impute missing values for Income with the median
df['Income'] = df['Income'].fillna(df['Income'].median())
# Impute missing values for Gender with the mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Check Data After Imputation
print(df.isnull().sum())

Data Transformation

Encode Categorical Variables:

  • Gender: Convert to binary values
  • Occupation: Use one-hot encoding.
from sklearn.preprocessing import LabelEncoder

# Encode Gender as 0/1 using label encoding
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])

# One-hot encode Occupation (drop_first avoids a redundant dummy column)
df = pd.get_dummies(df, columns=['Occupation'], drop_first=True)

Separate Features and Target:

# Define features and target variable
X = df.drop(columns=['Customer Id', 'Purchased'])
y = df['Purchased']

Feature Scaling (if needed)

Standardize numerical features such as Age and Income if the algorithm is sensitive to feature scale (e.g., SVMs, k-nearest neighbors, or regularized linear models).
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()

# Standardize numerical features
X[['Age', 'Income']] = scaler.fit_transform(X[['Age', 'Income']])

Data Splitting

Split the data into training and test sets.
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
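
A quick sanity check on the split sizes (a toy sketch with synthetic data, independent of the customer dataset above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 reserves 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

With random_state fixed, the same rows land in the same split on every run, which keeps experiments reproducible.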


