Exploratory data analysis

http://patbaa.web.elte.hu/physdm/data/titanic.csv

The link above points to a dataset about the Titanic passengers. Your task is to explore the dataset.

Help for the columns:

SibSp - number of siblings/spouses on the ship
Parch - number of parents/children on the ship
Cabin - the cabin they slept in (if they had a cabin)
Embarked - port where the passenger boarded the ship
Pclass - passenger class (like on trains)

1. Load the above-linked csv file as a pandas dataframe. Check and plot whether any of the columns have missing values. If they do, investigate whether the missingness is random or not.

Impute the missing values in a sensible way:

if only a very small percentage is missing, imputing with the column-wise mean makes sense, as does removing the rows with missing values
if almost all the entries in a row are missing, it is worth removing that row
if a larger portion of a column is missing, it is usually worth encoding the missing entries with a value that does not appear in the dataset (e.g. -1).

The imputation method affects different machine learning models in different ways, but here we are interested only in EDA, so try to keep as much information as possible!

2. Create a heatmap which shows how many people survived and died in each Pclass. You need to create a table where the columns indicate whether a person survived or not, the rows indicate the different Pclass values, and each cell contains the number of people belonging to that category. The table should be colored based on the values of its cells.

3. Create a boxplot for each Pclass. Each boxplot should show the age distribution for the given Pclass. Plot all of them next to each other in a row to make them easier to compare!

4. Calculate the correlation matrix for the numerical columns. Also show it as a heatmap, as described in the 2nd task.

Which feature seems to play the most important role in surviving/not surviving? Explain how and why that feature could be important!

5. Create two plots which you think are meaningful. Interpret both of them. (E.g.: do older people buy more expensive tickets? Do people who buy more expensive tickets survive more often? etc.)

Hints:

In total you can get 10 points for fully completing all tasks.
Decorate your notebook with questions, explanations etc.; make it self-contained and understandable!
Comment your code when necessary
Write functions for repetitive tasks!
Use the pandas package for data loading and handling
Use matplotlib and seaborn for plotting, or bokeh and plotly for interactive investigation
Use the scikit-learn package for almost everything
Use for loops only if it is really necessary!
Code sharing is not allowed between students! Sharing code will result in zero points.
If you use code found on the web, that is OK, but make its source clear!

Solution

Note:

there are many different ways to get the same results
here we show one way; it is not necessarily the most elegant or the fastest

1. Load the above-linked csv file as a pandas dataframe. Check and plot whether any of the columns have missing values. If they do, investigate whether the missingness is random or not.

Impute the missing values in a sensible way:

if only a very small percentage is missing, imputing with the column-wise mean makes sense, as does removing the rows with missing values
if almost all the entries in a row are missing, it is worth removing that row
if a larger portion of a column is missing, it is usually worth encoding the missing entries with a value that does not appear in the dataset (e.g. -1).

The imputation method affects different machine learning models in different ways, but here we are interested only in EDA, so try to keep as much information as possible!
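The column-wise rules above can be sketched as a small function. This is only an illustration: `impute_numeric`, its `small_fraction` threshold and the `sentinel` value are hypothetical names and defaults chosen for this sketch, and the toy frame is made up rather than taken from the Titanic file. The row-removal rule is left out to keep the example short.

```python
import numpy as np
import pandas as pd


def impute_numeric(df, small_fraction=0.05, sentinel=-1):
    """Hypothetical helper; the threshold and the sentinel are illustrative assumptions."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        missing_fraction = out[col].isna().mean()
        if missing_fraction == 0:
            continue  # nothing to impute in this column
        if missing_fraction <= small_fraction:
            # Few gaps: fill them with the column-wise mean
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Many gaps: keep the column and mark gaps with a value absent from the data
            out[col] = out[col].fillna(sentinel)
    return out


# Made-up toy data, not the Titanic file
toy = pd.DataFrame({
    "few_missing": [1.0, 2.0, 3.0, np.nan],
    "many_missing": [np.nan, np.nan, 5.0, np.nan],
})
imputed = impute_numeric(toy, small_fraction=0.25)
print(imputed)
```

The sentinel only makes sense if it really cannot occur in the column; for Age or Fare a negative number satisfies that, but this should be checked for each column.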

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

# explicit pyplot import instead of the deprecated %pylab magic
%matplotlib inline

In [2]:

data = pd.read_csv('titanic.csv')
data.head()

Out[2]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

In [3]:

data = pd.read_csv('titanic.csv')
# drop identifier-like columns that are not used in this analysis
data = data.drop(columns=['PassengerId', 'Name', 'Ticket'])

In [4]:

data.isna().sum()

Out[4]:

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
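To decide which imputation rule applies, it helps to turn these counts into percentages. A minimal sketch, using the counts printed above and assuming 891 rows in total (the number implied by the Counter outputs in this notebook); on the loaded frame itself, `data.isna().mean()` gives the same fractions directly.

```python
import pandas as pd

n_rows = 891  # assumed total number of passengers in the file
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Percentage of missing entries per column, rounded to one decimal
missing_percent = (100 * missing_counts / n_rows).round(1)
print(missing_percent)
```

Embarked is missing in a tiny fraction of rows, Age in roughly a fifth, and Cabin in most rows, so each of the three falls under a different rule.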

In [5]:

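plt.figure(figsize=(16, 4))
# one row per column, one pixel column per passenger; aspect='auto' keeps the 9 x 891 image readable
plt.imshow(data.isna().T, aspect='auto')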

Out[5]:

<matplotlib.image.AxesImage at 0x7fa78900e828>

In [5]:

data.head()

Out[5]:

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
0	0	3	male	22.0	1	0	7.2500	NaN	S
1	1	1	female	38.0	1	0	71.2833	C85	C
2	1	3	female	26.0	0	0	7.9250	NaN	S
3	1	1	female	35.0	1	0	53.1000	C123	S
4	0	3	male	35.0	0	0	8.0500	NaN	S

In [6]:

Counter(data.Embarked)

Out[6]:

Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})

In [7]:

# only 2 values are missing: fill them with the most frequent port ('S')
data['Embarked'] = data['Embarked'].fillna('S')
data['has_no_cabin'] = data.Cabin.isna().astype(int)
data['has_no_age'] = data.Age.isna().astype(int)
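One way to probe whether the missingness is random is to compare the missing rate across groups of another column: roughly equal rates are consistent with random missingness, while large differences suggest it depends on that column. A minimal sketch on made-up toy rows (not the Titanic file); with the real data the same expression can be applied to the `data` frame.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age": [30.0, np.nan, 25.0, np.nan, np.nan, 40.0],
})

# Fraction of missing Age values within each passenger class
missing_rate = toy["Age"].isna().groupby(toy["Pclass"]).mean()
print(missing_rate)
```

The indicator columns created above serve the same purpose: their correlations with the other columns show up in the correlation matrix of task 4.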

2. Create a heatmap which shows how many people survived and died in each Pclass. You need to create a table where the columns indicate whether a person survived or not, the rows indicate the different Pclass values, and each cell contains the number of people belonging to that category. The table should be colored based on the values of its cells.

In [8]:

grouped = data.groupby(['Survived', 'Pclass'])[['Sex']].count().reset_index()
grouped

Out[8]:

	Survived	Pclass	Sex
0	0	1	80
1	0	2	97
2	0	3	372
3	1	1	136
4	1	2	87
5	1	3	119

In [9]:

pivoted = grouped.pivot(index='Pclass', columns='Survived', values='Sex')
pivoted

Out[9]:

Survived	0	1
Pclass
1	80	136
2	97	87
3	372	119
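As noted at the start of the solution, there are several ways to reach the same table. One alternative is `pd.crosstab`, which builds the count table in a single call instead of the groupby and pivot steps. A minimal sketch on made-up toy rows (not the Titanic file):

```python
import pandas as pd

# Made-up toy rows, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 1, 2, 3, 3],
})

# Rows: Pclass, columns: Survived, cells: number of passengers
counts = pd.crosstab(toy["Pclass"], toy["Survived"])
print(counts)
```

With the real data, `pd.crosstab(data['Pclass'], data['Survived'])` should give the same table as `pivoted`.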

In [11]:

sns.heatmap(pivoted, annot=True, fmt='g')

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa784d4c550>

In [12]:

plt.pcolor(pivoted)
plt.colorbar()

plt.xticks([0.5, 1.5], ['not survived', 'survived'])
plt.yticks([0.5, 1.5, 2.5], ['Pclass = 1', 'Pclass = 2', 'Pclass = 3'])
plt.show()
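Raw counts are hard to compare because the classes differ in size. Dividing each row by its total turns the table into survival rates per class. A minimal sketch that rebuilds `pivoted` from the counts printed in Out[9]:

```python
import pandas as pd

# Counts copied from Out[9]; columns are Survived = 0 and Survived = 1
pivoted = pd.DataFrame(
    {0: [80, 97, 372], 1: [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)
pivoted.columns.name = "Survived"

# Divide each row by its row total, so every row sums to 1
rates = pivoted.div(pivoted.sum(axis=1), axis=0)
print(rates.round(3))
```

About 63% of first-class passengers survived, compared with about 47% in second class and about 24% in third class. Passing `rates` to `sns.heatmap(rates, annot=True, fmt='.2f')` would show the same information as a heatmap.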

3. Create a boxplot for each Pclass. Each boxplot should show the age distribution for the given Pclass. Plot all of them next to each other in a row to make them easier to compare!

In [12]:

sns.boxplot(x='Pclass', y='Age', data=data)
# possible explanation for higher ages in higher classes: older people --> more money --> higher class

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e3eaf98>
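The boxes in such a plot are drawn from the quartiles of each group, so the same comparison can be read as numbers. A minimal sketch on made-up toy ages (not the Titanic file); missing ages are skipped by `quantile`.

```python
import pandas as pd

# Made-up toy rows, for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [30.0, 40.0, 50.0, 20.0, 22.0, 24.0],
})

# Lower quartile, median and upper quartile of Age within each class
quartiles = toy.groupby("Pclass")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

With the real data, replacing `toy` by `data` gives the box edges and medians shown in the boxplot.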

4. Calculate the correlation matrix for the numerical columns. Also show it as a heatmap, as described in the 2nd task.

Which feature seems to play the most important role in surviving/not surviving? Explain how and why that feature could be important!

In [13]:

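# numeric_only=True skips the string columns (Sex, Cabin, Embarked); needs pandas >= 1.5
data.corr(numeric_only=True)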

Out[13]:

	Survived	Pclass	Age	SibSp	Parch	Fare	has_no_cabin	has_no_age
Survived	1.000000	-0.338481	-0.077221	-0.035322	0.081629	0.257307	-0.316912	-0.092197
Pclass	-0.338481	1.000000	-0.369226	0.083081	0.018443	-0.549500	0.725541	0.172933
Age	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067	-0.249732	NaN
SibSp	-0.035322	0.083081	-0.308247	1.000000	0.414838	0.159651	0.040460	0.018958
Parch	0.081629	0.018443	-0.189119	0.414838	1.000000	0.216225	-0.036987	-0.124104
Fare	0.257307	-0.549500	0.096067	0.159651	0.216225	1.000000	-0.482075	-0.100707
has_no_cabin	-0.316912	0.725541	-0.249732	0.040460	-0.036987	-0.482075	1.000000	0.144111
has_no_age	-0.092197	0.172933	NaN	0.018958	-0.124104	-0.100707	0.144111	1.000000

In [14]:

sns.heatmap(data.corr(numeric_only=True), annot=True)

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e310588>

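Survived is negatively correlated with Pclass (-0.34) $\to$ travelling first class means a higher chance of survival
having no cabin recorded means a lower chance of survival (-0.32)
the more expensive the ticket, the better the survival rate (0.26); this carries much the same information as the two features above, since Fare is strongly correlated with both Pclass (-0.55) and has_no_cabin (-0.48)

The most important features are Pclass and having a cabin.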
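The ranking behind this conclusion can be read off the Survived row of the correlation matrix. A minimal sketch, with the values copied from Out[13]; on the real frame the same Series is the `Survived` column of the correlation matrix without its diagonal entry.

```python
import pandas as pd

# Correlation of each numerical column with Survived, copied from Out[13]
survived_corr = pd.Series({
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
    "has_no_cabin": -0.316912,
    "has_no_age": -0.092197,
})

# Rank by strength of the linear relationship, ignoring its sign
ranked = survived_corr.abs().sort_values(ascending=False)
print(ranked)
```

Sex is a string column and therefore absent from this matrix; it would need a numeric encoding before it could be ranked the same way.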

5. Create two plots which you think are meaningful. Interpret both of them. (E.g.: do older people buy more expensive tickets? Do people who buy more expensive tickets survive more often? etc.)

In [15]:

# factorplot has been renamed to catplot in seaborn
sns.catplot(x='Sex', data=data[data.Survived == 0], kind='count')
plt.title('Not survived')
sns.catplot(x='Sex', data=data[data.Survived == 1], kind='count')
plt.title('Survived')
plt.show()

In [16]:

data.head()

Out[16]:

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked	has_no_cabin	has_no_age
0	0	3	male	22.0	1	0	7.2500	NaN	S	1	0
1	1	1	female	38.0	1	0	71.2833	C85	C	0	0
2	1	3	female	26.0	0	0	7.9250	NaN	S	1	0
3	1	1	female	35.0	1	0	53.1000	C123	S	0	0
4	0	3	male	35.0	0	0	8.0500	NaN	S	1	0

In [17]:

sns.heatmap(data.groupby(['Sex'])[['Parch', 'SibSp']].mean(), annot=True)

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e307f28>

In [18]:

Counter(data.Sex)

Out[18]:

Counter({'male': 577, 'female': 314})

It seems that males often traveled alone! The next cell checks this by counting the passengers with no relatives on board.

In [19]:

Counter(data[(data.SibSp == 0) & (data.Parch == 0)].Sex)

Out[19]:

Counter({'female': 126, 'male': 411})
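Because there are many more males than females on board, the two counts are easier to compare as shares. A minimal sketch using the Counter outputs from Out[18] and Out[19]:

```python
from collections import Counter

# Counts copied from Out[18] (all passengers) and Out[19] (no siblings, spouses, parents or children on board)
total = Counter({'male': 577, 'female': 314})
alone = Counter({'female': 126, 'male': 411})

# Share of passengers of each sex who traveled without relatives
share_alone = {sex: alone[sex] / total[sex] for sex in total}
print({sex: round(share, 2) for sex, share in share_alone.items()})
```

About 71% of the males traveled without relatives, compared with about 40% of the females, which supports the observation above.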