
TITANIC DATASET

The data includes demographic and travel information for 1,309 Titanic passengers and is used to predict their survival. The full Titanic dataset is available in several formats from the Department of Biostatistics at Vanderbilt University School of Medicine (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv). The go-to resource for information about the Titanic is the Encyclopaedia Titanica website (https://www.encyclopedia-titanica.org/), which includes a complete list of the passengers and crew along with facts, history, and data about the ship. The Titanic dataset is also the focus of the inaugural competition on Kaggle.com (https://www.kaggle.com/c/titanic; requires a Kaggle account). A CSV version is also available in the GitHub repository at https://github.com/alexperrier/packt-aml/blob/master/ch4.


Download Dataset:

https://drive.google.com/file/d/1mYc9-t_snfQUSkEm_hVXBmLscz7pl40S/view?usp=drive_link (with null values)

https://drive.google.com/file/d/1Z0csVhm0udDw1y3rUQhxxn-whWnmSDmF/view?usp=sharing (training set)
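
The Python snippets in this post all refer to a DataFrame named df, but the post never shows how it is created. A minimal sketch of loading it with pandas follows. The helper name load_titanic and the file name titanic.csv are hypothetical, and the three inline rows are made-up toy data used only so the example runs without the downloaded file.

```python
import io

import pandas as pd


def load_titanic(source):
    """Read a Titanic CSV (path, URL, or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Toy stand-in for the real file (made-up rows, for illustration only).
toy_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked\n"
    "1,0,3,male,22,1,0,7.25,S\n"
    "2,1,1,female,38,1,0,71.28,C\n"
    "3,1,3,female,,0,0,7.93,S\n"
)
df = load_titanic(toy_csv)

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # number of null values in each column

# With the real download, something like:
# df = load_titanic("titanic.csv")  # hypothetical local file name
```

Checking df.isnull().sum() first is a quick way to see how the "with null values" file differs from the training set.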

Data Dictionary

Variable    Definition                                     Key
survival    Survival                                       0 = No, 1 = Yes
pclass      Ticket class                                   1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                            C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
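
The age convention above can be checked in code. The sketch below labels each age as missing, infant, estimated, or recorded. The function name flag_age_kinds, the label strings, and the sample values are all hypothetical. Ages below 1 are tested first, so a value such as 0.5 is treated as an infant age and not as an estimate.

```python
import pandas as pd


def flag_age_kinds(ages):
    """Label each age as 'missing', 'infant' (< 1), 'estimated' (xx.5), or 'recorded'."""
    def kind(age):
        if pd.isna(age):
            return "missing"
        if age < 1:
            return "infant"      # fractional ages below 1 are genuine
        if age % 1 == 0.5:
            return "estimated"   # xx.5 marks an estimated age
        return "recorded"
    return ages.apply(kind)


toy_ages = pd.Series([0.42, 22.0, 28.5, None, 40.5])  # illustrative values only
kinds = flag_age_kinds(toy_ages)
print(kinds.value_counts())
```

On the real data this would be called as flag_age_kinds(df['Age']), assuming the column is named Age as in the Kaggle file.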

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
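
As a small illustration of how sibsp and parch combine, the sketch below counts the relatives aboard for each passenger and flags children recorded with parch=0. The column names SibSp, Parch, and Age follow the Kaggle file. The toy rows, the new column names, and the age cutoff of 18 are assumptions made for this example and are not defined by the dataset.

```python
import pandas as pd

# Toy rows (made-up values) using the Kaggle column names.
toy = pd.DataFrame({
    "Age":   [4.0, 35.0, 9.0, 28.0],
    "SibSp": [0, 1, 2, 0],
    "Parch": [0, 2, 1, 0],
})

# Relatives aboard = siblings/spouses + parents/children.
toy["Relatives"] = toy["SibSp"] + toy["Parch"]
toy["NoRelativesAboard"] = toy["Relatives"] == 0

# Children recorded with no parent aboard (for example, travelling with a nanny).
# The cutoff of 18 is an assumption, not part of the dataset definition.
CHILD_AGE_CUTOFF = 18
toy["ChildWithoutParent"] = (toy["Age"] < CHILD_AGE_CUTOFF) & (toy["Parch"] == 0)

print(toy)
```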


Questions:

1. Count the male and female passengers in the dataset and draw a graph using Excel and Python.
2. Count the male and female passengers by age and draw a graph using Excel and Python.
3. For each passenger class (Pclass), count the male and female passengers and draw a graph using Excel and Python.
4. For each passenger class (Pclass), count the male and female passengers by survival and draw a graph using Excel and Python.
+++++++++++++++++++++++++++++++***********+++++++++++++++++++++++++++++++++++++++++++


1. df['Sex'].value_counts()








USING SEABORN

import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df is the Titanic DataFrame, already loaded with pandas

# Group passengers by Sex and count the occurrences
sex_counts = df['Sex'].value_counts().reset_index()
sex_counts.columns = ['Sex', 'Count']

# Define a color palette with one color per bar
colors = sns.color_palette("pastel", len(sex_counts))

# Plot the counts as a bar chart using the palette
plt.figure(figsize=(6, 4))
ax = sns.barplot(x='Sex', y='Count', data=sex_counts, palette=colors)
plt.title('Count of Males and Females')
plt.xlabel('Sex')
plt.ylabel('Count')

# Annotate each bar with its count, placed just above the bar
for p in ax.patches:
    ax.annotate(str(int(p.get_height())),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 10), textcoords='offset points')

# Show the plot
plt.show()


OUTPUT:


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

USING MATPLOTLIB

import matplotlib.pyplot as plt

# Assumes df is the Titanic DataFrame, already loaded with pandas

# Group passengers by Sex and count the occurrences
sex_counts = df['Sex'].value_counts().reset_index()
sex_counts.columns = ['Sex', 'Count']

# Create a bar graph using matplotlib
plt.figure(figsize=(6, 4))
plt.bar(sex_counts['Sex'], sex_counts['Count'], color=['blue', 'pink'])
plt.title('Count of Males and Females')
plt.xlabel('Sex')
plt.ylabel('Count')

# Annotate each bar with its count, placed slightly above the bar
for i, count in enumerate(sex_counts['Count']):
    plt.text(i, count + 10, str(count), ha='center', va='bottom')

# Show the plot
plt.show()


OUTPUT:

 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
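
Questions 2 and 3 have no worked answer in this post. One possible approach, sketched here under stated assumptions, is to bin the ages with pd.cut and build both count tables with pd.crosstab. The toy rows are made-up values standing in for df, and the age bin edges and labels are arbitrary choices for illustration. The column names Sex, Age, and Pclass follow the Kaggle file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows (made-up values) standing in for the real df.
toy = pd.DataFrame({
    "Sex":    ["male", "female", "male", "female", "male", "female"],
    "Age":    [4.0, 15.0, 30.0, 45.0, 70.0, None],
    "Pclass": [3, 1, 3, 2, 1, 3],
})

# Question 2: bin the ages, then count males and females in each bin.
# Bins are right-inclusive, e.g. (12, 18]; edges and labels are assumptions.
age_bins = [0, 12, 18, 40, 60, 100]
age_labels = ["0-12", "12-18", "18-40", "40-60", "60+"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=age_bins, labels=age_labels)
age_sex_counts = pd.crosstab(toy["AgeGroup"], toy["Sex"])

# Question 3: count males and females within each passenger class.
pclass_sex_counts = pd.crosstab(toy["Pclass"], toy["Sex"])

print(age_sex_counts)
print(pclass_sex_counts)

# Draw one grouped bar chart per table.
for table, title in [(age_sex_counts, "Count by Age Group and Sex"),
                     (pclass_sex_counts, "Count by Pclass and Sex")]:
    ax = table.plot(kind="bar", figsize=(6, 4), rot=0, title=title)
    ax.set_ylabel("Count")
```

Calling plt.show() afterwards displays both figures. Rows with a missing age are left out of the age table, because pd.cut assigns them no bin.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++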

import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df is the Titanic DataFrame, already loaded with pandas

# Group passengers by Pclass, Sex, and Survival, then count the occurrences
pclass_sex_survived_counts = df.groupby(['Pclass', 'Sex', 'Survived']).size().reset_index(name='Count')

# Pivot so that each (Sex, Survived) pair is a row, each Pclass is a column, and the cells hold the counts
pclass_sex_survived_pivot = pclass_sex_survived_counts.pivot_table(index=['Sex', 'Survived'], columns='Pclass', values='Count', fill_value=0)

# Define a color palette with one color per Pclass
colors = sns.color_palette("Set1", pclass_sex_survived_pivot.shape[1])

# Plot stacked bars: one bar per (Sex, Survived) pair, one segment per Pclass
ax = pclass_sex_survived_pivot.plot(kind='bar', stacked=True, color=colors)
plt.title('Survival Counts in Each Pclass by Gender')
plt.xlabel('(Gender, Survival)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Pclass')

# Show the plot
plt.show()


OUTPUT:

















