
WAP using Python to classify a set of documents with a Naive Bayes model

The program below trains a Multinomial Naive Bayes classifier on TF-IDF features extracted from three categories of the 20 Newsgroups dataset, then evaluates it on a held-out test split.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load a sample dataset (e.g., 20 Newsgroups)
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)
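# Note: for a stricter evaluation, fetch_20newsgroups also accepts
# remove=('headers', 'footers', 'quotes') to strip metadata that can
# leak category information into the features.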
# Print information about the newsgroups dataset
print("=== Newsgroups Dataset Information ===")
print(f"Number of documents: {len(newsgroups.data)}")
print(f"Number of categories: {len(newsgroups.target_names)}")
print("Categories:", newsgroups.target_names)
print("First document sample:\n", newsgroups.data[0][:500])  # Print first 500 characters of the first document
print("\n")

# Step 2: Preprocess the text data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target
# Print information about the vectorizer
print("=== Vectorizer Information ===")
print(f"Number of features (unique words): {len(vectorizer.get_feature_names_out())}")
print("Sample feature names (words):", vectorizer.get_feature_names_out()[:20])  # Print first 20 feature names
print("Shape of the document-term matrix:", X.shape)
print("\n")

# Step 3: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)
# Step 5: Make predictions
y_pred = model.predict(X_test)

# Step 6: Evaluate the model
print("=== Model Evaluation ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=newsgroups.target_names))
