The dataset contains demographic and travel information for 1,309 Titanic passengers; the goal is to predict their survival. The full Titanic dataset is available in several formats from the Department of Biostatistics at the Vanderbilt University School of Medicine (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv). The go-to resource for information about the Titanic is the Encyclopaedia Titanica website (https://www.encyclopedia-titanica.org/), which includes a complete list of the passengers and crew along with facts, history, and data about the ship. The Titanic dataset is also the focus of the inaugural competition on Kaggle.com (https://www.kaggle.com/c/titanic; requires a Kaggle account). Finally, a CSV version is available in the GitHub repository at https://github.com/alexperrier/packt-aml/blob/master/ch4.
Download Dataset:
https://drive.google.com/file/d/1mYc9-t_snfQUSkEm_hVXBmLscz7pl40S/view?usp=drive_link (with null values)
https://drive.google.com/file/d/1Z0csVhm0udDw1y3rUQhxxn-whWnmSDmF/view?usp=sharing (training set)
Data Dictionary
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
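As a small illustration of how the Key column can be applied, the sketch below maps the coded values to readable labels with pandas. The rows in `toy` are made up for the example and are not real passenger records, and the lowercase column names follow the data dictionary; the downloaded CSV may use different capitalisation (the code later on this page uses `Sex`, `Pclass`, and `Survived`).

```python
import pandas as pd

# Toy rows invented for illustration only; not real passenger records.
toy = pd.DataFrame({
    "survival": [0, 1, 1],
    "pclass": [3, 1, 2],
    "embarked": ["S", "C", "Q"],
})

# Mappings copied from the Key column of the data dictionary.
survival_labels = {0: "No", 1: "Yes"}
pclass_labels = {1: "1st", 2: "2nd", 3: "3rd"}
embarked_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Replace each code with its label, leaving the original frame untouched.
decoded = toy.assign(
    survival=toy["survival"].map(survival_labels),
    pclass=toy["pclass"].map(pclass_labels),
    embarked=toy["embarked"].map(embarked_labels),
)
print(decoded)
```

Any code that is missing from a mapping becomes `NaN` after `map`, which is a quick way to spot unexpected values.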
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
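The notes on age, sibsp, and parch can be turned into simple derived columns. The following is a minimal sketch on made-up rows, not real passenger data; the column names `is_infant`, `age_estimated`, and `relatives` are hypothetical names chosen for this example, and treating only ages of 1 or more as candidates for the xx.5 rule is an assumption.

```python
import pandas as pd

# Toy rows invented for illustration only; not real passenger records.
toy = pd.DataFrame({
    "age": [0.75, 28.5, 40.0, 5.0],
    "sibsp": [1, 0, 1, 0],
    "parch": [2, 0, 0, 0],
})

# Ages below 1 are recorded as fractions of a year.
toy["is_infant"] = toy["age"] < 1

# Estimated ages have the form xx.5. Infants are excluded here (an assumption),
# because their ages are fractional for a different reason.
toy["age_estimated"] = (toy["age"] >= 1) & (toy["age"] % 1 == 0.5)

# Relatives aboard: siblings/spouses plus parents/children.
toy["relatives"] = toy["sibsp"] + toy["parch"]
print(toy)
```

The last toy row mirrors the nanny case from the notes: a child with `parch` equal to 0 ends up with zero relatives even though the child was not travelling alone.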
Questions:
1. $ df['Sex'].value_counts()
USING SEABORN
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be the Titanic DataFrame, already loaded with pandas

# Group passengers by Sex and count the occurrences
sex_counts = df['Sex'].value_counts().reset_index()
sex_counts.columns = ['Sex', 'Count']

# Define a color palette
colors = sns.color_palette("pastel")

# Plot the counts as a bar chart using the color palette
plt.figure(figsize=(6, 4))
ax = sns.barplot(x='Sex', y='Count', data=sex_counts, palette=colors)
plt.title('Count of Males and Females')
plt.xlabel('Sex')
plt.ylabel('Count')

# Annotate each bar with its count, placed 10 points above the bar top
for p in ax.patches:
    ax.annotate(str(int(p.get_height())),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10),
                textcoords='offset points')

# Show the plot
plt.show()
OUTPUT:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
USING MATPLOTLIB
import matplotlib.pyplot as plt

# df is assumed to be the Titanic DataFrame, already loaded with pandas

# Group passengers by Sex and count the occurrences
sex_counts = df['Sex'].value_counts().reset_index()
sex_counts.columns = ['Sex', 'Count']

# Create a bar graph using matplotlib
plt.figure(figsize=(6, 4))
plt.bar(sex_counts['Sex'], sex_counts['Count'], color=['blue', 'pink'])
plt.title('Count of Males and Females')
plt.xlabel('Sex')
plt.ylabel('Count')

# Annotate each bar with its count, placed just above the bar top
for i, count in enumerate(sex_counts['Count']):
    plt.text(i, count + 10, str(count), ha='center', va='bottom')

# Show the plot
plt.show()
OUTPUT:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be the Titanic DataFrame, already loaded with pandas

# Group passengers by Pclass, Sex, and Survival, then count the occurrences
pclass_sex_survived_counts = df.groupby(['Pclass', 'Sex', 'Survived']).size().reset_index(name='Count')

# Pivot so that (Sex, Survived) pairs are rows, Pclass values are columns,
# and the counts are the cell values
pclass_sex_survived_pivot = pclass_sex_survived_counts.pivot_table(
    index=['Sex', 'Survived'], columns='Pclass', values='Count', fill_value=0)

# Define a color palette
colors = sns.color_palette("Set1")

# Plot stacked bars: one bar per (Sex, Survived) pair, one segment per Pclass
ax = pclass_sex_survived_pivot.plot(kind='bar', stacked=True, color=colors)
plt.title('Survival Counts in Each Pclass by Gender')
plt.xlabel('(Gender, Survival)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Pclass')

# Show the plot
plt.show()
OUTPUT:
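To make the shape of the pivoted table behind this chart concrete, here is a minimal sketch of the same group-and-pivot steps on a handful of made-up rows. The `toy` frame is invented for illustration and says nothing about the real survival counts; only the column names `Pclass`, `Sex`, and `Survived` are taken from the code above.

```python
import pandas as pd

# Toy rows invented for illustration only; not real passenger records.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Count passengers for every observed (Pclass, Sex, Survived) combination.
counts = toy.groupby(["Pclass", "Sex", "Survived"]).size().reset_index(name="Count")

# Rows become (Sex, Survived) pairs, columns become Pclass values,
# and combinations that never occur are filled with 0.
pivot = counts.pivot_table(index=["Sex", "Survived"], columns="Pclass",
                           values="Count", fill_value=0)
print(pivot)
```

Each row of `pivot` corresponds to one bar in the stacked chart, and each column to one coloured segment of that bar.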