http://patbaa.web.elte.hu/physdm/data/titanic.csv
On the link above you will find a dataset about the Titanic passengers. Your task is to explore the dataset.
Help for the columns:
Impute the missing values in a sensible way:
The imputation method affects different machine learning models in different ways, but for now we are only interested in EDA, so try to keep as much information as possible!
Which feature seems to play the most important role in surviving/not surviving? Explain how and why that feature could be important!
Impute the missing values in a sensible way:
The imputation method affects different machine learning models in different ways, but for now we are only interested in EDA, so try to keep as much information as possible!
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('titanic.csv')
data.head()
data = pd.read_csv('titanic.csv')
# Drop identifier-like columns that are not used in the exploration below.
data = data.drop(columns=['PassengerId', 'Name', 'Ticket'])
data.isna().sum()
plt.figure(figsize=(16, 4))
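# One row per column, one pixel column per passenger; highlighted cells are missing values.
plt.imshow(data.isna().T, aspect='auto', interpolation='nearest')
plt.yticks(range(data.shape[1]), data.columns)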
data.head()
Counter(data.Embarked)
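# Fill the missing ports with 'S' (chosen from the counts above); assign back instead of using inplace on a column.
data['Embarked'] = data['Embarked'].fillna('S')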
data['has_no_cabin'] = data.Cabin.isna().astype(int)
data['has_no_age'] = data.Age.isna().astype(int)
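The flag columns above record where values were missing, but the Age column itself still contains NaN. One possible way to fill it while keeping the flag is sketched below: each missing age is replaced by the median age of the passenger's class. The `toy` frame and its values are made up for illustration, and using the per-class median is an assumption, not the only sensible choice.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that reuses the Titanic column names.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, np.nan, 50.0, 20.0, 22.0, np.nan],
})

# Record which ages were missing before overwriting them.
toy['has_no_age'] = toy['Age'].isna().astype(int)

# Median age of each passenger's class, aligned row by row (NaN is skipped).
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages with the median of the matching class.
toy['Age'] = toy['Age'].fillna(class_median)

print(toy)
```

Because `has_no_age` is computed first, the information about which rows were imputed is not lost.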
grouped = data.groupby(['Survived', 'Pclass'])[['Sex']].count().reset_index()
grouped
pivoted = grouped.pivot(index='Pclass', columns='Survived', values='Sex')
pivoted
sns.heatmap(pivoted, annot=True, fmt='g')
plt.pcolor(pivoted)
plt.colorbar()
plt.xticks([0.5, 1.5], ['not survived', 'survived'])
plt.yticks([0.5, 1.5, 2.5], ['Pclass = 1', 'Pclass = 2', 'Pclass = 3'])
plt.show()
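The heatmaps above show raw passenger counts, which also depend on how many people traveled in each class. A minimal sketch of turning such counts into per-class survival rates with `pd.crosstab` follows. The `toy` frame is invented for illustration and its numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up sample that reuses the Titanic column names.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# normalize='index' divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')

print(rates)
```

The resulting table can be passed to `sns.heatmap` in the same way as `pivoted`.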
sns.boxplot(x='Pclass', y='Age', data=data)
# Hypothesis: older people --> more money --> higher class
Which feature seems to play the most important role in surviving/not surviving? Explain how and why that feature could be important!
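# Restrict to numeric columns; text columns such as Sex, Cabin and Embarked cannot be correlated directly.
data.corr(numeric_only=True)
sns.heatmap(data.corr(numeric_only=True), annot=True)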
The more expensive the ticket, the better the survival rate (it is roughly the same for the other two).
The most important features seem to be Pclass and having a cabin.
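A correlation table only covers numeric columns, so a text column such as Sex does not appear in it. As a rough sketch, a two-valued text column can be mapped to 0/1 first and the features then ranked by the absolute value of their correlation with Survived. The `toy` data and the `is_female` column name are hypothetical, and the resulting numbers are not results from the real dataset.

```python
import pandas as pd

# Made-up sample that reuses the Titanic column names.
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 1, 0],
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 2, 3, 3, 1, 2],
})

# Encode the text column as 0/1 so it can take part in the correlation.
toy['is_female'] = (toy['Sex'] == 'female').astype(int)

# Correlation of every numeric feature with Survived, strongest first.
corr_with_survival = (
    toy.corr(numeric_only=True)['Survived']
       .drop('Survived')
       .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr_with_survival)
```

Pearson correlation only captures linear relationships, so this ranking is a first hint rather than a definitive answer.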
sns.catplot(x='Sex', data=data[data.Survived == 0], kind='count')
plt.title('Not survived')
sns.catplot(x='Sex', data=data[data.Survived == 1], kind='count')
plt.title('Survived')
plt.show()
data.head()
sns.heatmap(data.groupby(['Sex'])[['Parch', 'SibSp']].mean(), annot=True)
Counter(data.Sex)
It seems that males often traveled alone!
Counter(data[(data.SibSp == 0) & (data.Parch == 0)].Sex)
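The counts of solo travelers depend on how many men and women were on board in total, so a share per sex is easier to compare. A minimal sketch, assuming "alone" means SibSp == 0 and Parch == 0 as in the filter above; the `toy` frame and the `is_alone` column name are made up for illustration.

```python
import pandas as pd

# Made-up sample that reuses the Titanic column names.
toy = pd.DataFrame({
    'Sex':   ['male', 'male', 'male', 'male', 'female', 'female'],
    'SibSp': [0, 0, 0, 1, 0, 1],
    'Parch': [0, 0, 0, 0, 0, 2],
})

# 1 if the passenger has no siblings/spouse and no parents/children aboard.
toy['is_alone'] = ((toy['SibSp'] == 0) & (toy['Parch'] == 0)).astype(int)

# The mean of a 0/1 column is the fraction of solo travelers within each sex.
alone_share = toy.groupby('Sex')['is_alone'].mean()

print(alone_share)
```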