Exploratory data analysis

http://patbaa.web.elte.hu/physdm/data/titanic.csv

At the link above you will find a dataset about the Titanic passengers. Your task is to explore the dataset.

Help for the columns:

  • SibSp - number of siblings/spouses on the ship
  • Parch - number of parents/children on the ship
  • Cabin - the cabin they slept in (if they had a cabin)
  • Embarked - port where the passenger boarded the ship
  • Pclass - passenger class (like on trains)

1. Load the above-linked csv file as a pandas dataframe. Check and plot whether any of the columns have missing values. If they do, investigate whether the missingness is random or not.

Impute the missing values in a sensible way:

  • if only a very small percentage is missing, imputing with the column-wise mean makes sense, as does removing the rows with missing values
  • if almost all the entries in a row are missing, it is worth removing that row
  • if a larger portion of a column is missing, it is usually worth encoding the missing entries with a value that does not appear in the dataset (e.g. -1).

The imputation method affects different machine learning models in different ways, but now we are interested only in EDA, so try to keep as much information as possible!

2. Create a heatmap which shows how many people survived and died in each Pclass. You need to create a table where the columns indicate whether a person survived or not, the rows indicate the different Pclass values, and the cell values contain the number of people belonging to that category. The table should be colored based on the values of its cells.

3. Create boxplots for each different Pclass. The boxplot should show the age distribution for the given Pclass. Plot all of these next to each other in a row to make it easier to compare!

4. Calculate the correlation matrix for the numerical columns. Also show it as a heatmap, as described in task 2.

Which feature seems to play the most important role in surviving/not surviving? Explain how and why that feature could be important!

5. Create two plots which you think are meaningful. Interpret both of them. (E.g.: do older people buy more expensive tickets? Do people with more expensive tickets survive more often? etc.)

Hints:

  • In total you can get 10 points for fully completing all tasks.
  • Decorate your notebook with questions, explanations, etc. Make it self-contained and understandable!
  • Comment your code when necessary
  • Write functions for repetitive tasks!
  • Use the pandas package for data loading and handling
  • Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation
  • Use the scikit-learn package for almost everything
  • Use for loops only if it is really necessary!
  • Code sharing is not allowed between students! Sharing code will result in zero points.
  • If you use code found on the web, that is OK, but make its source clear!

Solution

Note:

  • there are many different ways to get the same results
  • here we show one way; it is not necessarily the most elegant or the fastest

1. Load the above-linked csv file as a pandas dataframe. Check and plot whether any of the columns have missing values. If they do, investigate whether the missingness is random or not.

Impute the missing values in a sensible way:

  • if only a very small percentage is missing, imputing with the column-wise mean makes sense, as does removing the rows with missing values
  • if almost all the entries in a row are missing, it is worth removing that row
  • if a larger portion of a column is missing, it is usually worth encoding the missing entries with a value that does not appear in the dataset (e.g. -1).

The imputation method affects different machine learning models in different ways, but now we are interested only in EDA, so try to keep as much information as possible!
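The three rules of thumb above could be sketched as one helper. This is only an illustration, not the approach taken in the cells below: the function name `impute_for_eda`, the thresholds `row_thresh` and `small_frac`, the `'missing'` label for text columns and the toy frame are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def impute_for_eda(df, row_thresh=0.8, small_frac=0.05, sentinel=-1):
    """Apply the three rules of thumb; all thresholds are illustrative."""
    # Rule 2: drop rows in which almost all entries are missing.
    out = df[df.isna().mean(axis=1) < row_thresh].copy()
    for col in out.columns:
        frac = out[col].isna().mean()
        if frac == 0:
            continue
        few_missing = frac <= small_frac
        if pd.api.types.is_numeric_dtype(out[col]):
            # Rule 1: column-wise mean; Rule 3: a value that must not occur in the data.
            fill = out[col].mean() if few_missing else sentinel
        else:
            # Text columns (assumption): most frequent value, or an explicit label.
            fill = out[col].mode().iloc[0] if few_missing else 'missing'
        out[col] = out[col].fillna(fill)
    return out


# Hypothetical toy data, only to show the behaviour.
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, 5.0, np.nan],
    'y': [np.nan, np.nan, np.nan, 4.0, np.nan],
    'z': ['a', 'a', None, 'b', None],
})
result = impute_for_eda(toy, small_frac=0.3)
print(result)
```

In the toy frame the last row is dropped (every entry is missing), `x` is filled with its mean, `y` with the sentinel and `z` with its most frequent value. The sentinel only makes sense if it cannot occur as a real value in the column.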

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

%matplotlib inline
In [2]:
data = pd.read_csv('titanic.csv')
data.head()
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [3]:
# drop the identifier-like columns that are not used in this analysis
data = pd.read_csv('titanic.csv').drop(columns=['PassengerId', 'Name', 'Ticket'])
In [4]:
data.isna().sum()
Out[4]:
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
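The imputation rules depend on the fraction of missing values, not on the raw counts. With 891 rows, the counts above correspond to roughly 20% for Age, 77% for Cabin and 0.2% for Embarked. A minimal sketch of a helper that reports both numbers follows; the name `missing_summary` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, largest first."""
    summary = pd.DataFrame({
        'n_missing': df.isna().sum(),
        'fraction': df.isna().mean(),
    })
    # Keep only the columns that actually have missing values.
    summary = summary[summary['n_missing'] > 0]
    return summary.sort_values('fraction', ascending=False)


# Hypothetical toy data; in the notebook this would be missing_summary(data).
toy = pd.DataFrame({
    'Age': [22.0, None, 26.0, None],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})
summary = missing_summary(toy)
print(summary)
```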
In [5]:
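plt.figure(figsize=(16, 4))
# one row per column of the table, one pixel column per passenger; aspect='auto' stretches the thin image
plt.imshow(data.isna().T, aspect='auto')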
Out[5]:
<matplotlib.image.AxesImage at 0x7fa78900e828>
In [5]:
data.head()
Out[5]:
Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 0 3 male 22.0 1 0 7.2500 NaN S
1 1 1 female 38.0 1 0 71.2833 C85 C
2 1 3 female 26.0 0 0 7.9250 NaN S
3 1 1 female 35.0 1 0 53.1000 C123 S
4 0 3 male 35.0 0 0 8.0500 NaN S
In [6]:
Counter(data.Embarked)
Out[6]:
Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
In [7]:
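# 'S' is by far the most frequent port, so use it for the 2 missing values
data['Embarked'] = data['Embarked'].fillna('S')
# keep Age and Cabin as they are and only flag where they are missing
data['has_no_cabin'] = data.Cabin.isna().astype(int)
data['has_no_age'] = data.Age.isna().astype(int)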
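One way to look into whether the missingness is random is to compare the other columns between rows where a value is present and rows where it is missing. If the group means differ clearly, the missingness is probably not random. The sketch below uses a hypothetical helper name and made-up toy data. In the notebook, the correlation table further down points in the same direction: has_no_cabin and Pclass have a correlation of 0.73.

```python
import numpy as np
import pandas as pd


def compare_by_missingness(df, column, others):
    """Mean of `others` for rows where `column` is present (False) vs. missing (True)."""
    flag = df[column].isna().rename(f'{column}_missing')
    return df.groupby(flag)[others].mean()


# Hypothetical toy data; in the notebook one could call
# compare_by_missingness(data, 'Age', ['Survived', 'Pclass', 'Fare']).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan],
    'Pclass': [3, 3, 1, 3],
    'Survived': [0, 0, 1, 1],
})
table = compare_by_missingness(toy, 'Age', ['Pclass', 'Survived'])
print(table)
```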

2. Create a heatmap which shows how many people survived and died in each Pclass. You need to create a table where the columns indicate whether a person survived or not, the rows indicate the different Pclass values, and the cell values contain the number of people belonging to that category. The table should be colored based on the values of its cells.

In [8]:
grouped = data.groupby(['Survived', 'Pclass'])[['Sex']].count().reset_index()
grouped
Out[8]:
Survived Pclass Sex
0 0 1 80
1 0 2 97
2 0 3 372
3 1 1 136
4 1 2 87
5 1 3 119
In [9]:
pivoted = grouped.pivot(index='Pclass', columns='Survived', values='Sex')
pivoted
Out[9]:
Survived 0 1
Pclass
1 80 136
2 97 87
3 372 119
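As the note at the top of the solution says, there are several ways to get the same table. A possible alternative to groupby plus pivot is `pd.crosstab`, which builds the count table in one step. The toy frame below is made up so that the sketch runs on its own.

```python
import pandas as pd

# Hypothetical toy data; in the notebook: pd.crosstab(data['Pclass'], data['Survived'])
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Rows: Pclass, columns: Survived, cells: number of passengers.
table = pd.crosstab(toy['Pclass'], toy['Survived'])
print(table)
```

Unlike the pivot, `crosstab` fills combinations that do not occur with 0 instead of NaN (here: nobody in class 2 with Survived == 0).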
In [11]:
sns.heatmap(pivoted, annot=True, fmt='g')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa784d4c550>
In [12]:
plt.pcolor(pivoted)
plt.colorbar()

plt.xticks([0.5, 1.5], ['not survived', 'survived'])
plt.yticks([0.5, 1.5, 2.5], ['Pclass = 1', 'Pclass = 2', 'Pclass = 3'])
plt.show()
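The classes differ a lot in size, so the raw counts are hard to compare across rows. A small sketch of a row-normalized version of the same table, using the counts from Out[9] (the variable names are only illustrative):

```python
import pandas as pd

# Counts copied from the pivoted table above.
counts = pd.DataFrame(
    {0: [80, 97, 372], 1: [136, 87, 119]},
    index=pd.Index([1, 2, 3], name='Pclass'),
)
counts.columns.name = 'Survived'

# Divide each row by its total, so every row sums to 1.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(3))

# In the notebook this could be plotted with:
# sns.heatmap(rates, annot=True, fmt='.2f')
```

The share of survivors drops from about 63% in first class to about 47% in second and about 24% in third class.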

3. Create boxplots for each different Pclass. The boxplot should show the age distribution for the given Pclass. Plot all of these next to each other in a row to make it easier to compare!


In [12]:
sns.boxplot(x='Pclass', y='Age', data=data)
# possible reading: older people --> more money --> higher class
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e3eaf98>
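The boxes of a boxplot are drawn from the quartiles, so the same comparison can also be read as numbers. A minimal sketch, assuming a hypothetical helper name and made-up ages:

```python
import numpy as np
import pandas as pd


def age_quartiles(df):
    """Lower quartile, median and upper quartile of Age for each Pclass."""
    # Missing ages are skipped by quantile().
    return df.groupby('Pclass')['Age'].quantile([0.25, 0.5, 0.75]).unstack()


# Hypothetical toy data; in the notebook this would be age_quartiles(data).
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3, 3],
    'Age': [30.0, 40.0, 50.0, 10.0, 20.0, np.nan, 30.0],
})
quartiles = age_quartiles(toy)
print(quartiles)
```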

4. Calculate the correlation matrix for the numerical columns. Also show it as a heatmap, as described in task 2.

Which feature seems to play the most important role in surviving/not surviving? Explain how and why that feature could be important!

In [13]:
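# numeric_only=True skips the text columns (Sex, Cabin, Embarked); newer pandas versions raise an error without it
data.corr(numeric_only=True)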
Out[13]:
Survived Pclass Age SibSp Parch Fare has_no_cabin has_no_age
Survived 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307 -0.316912 -0.092197
Pclass -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500 0.725541 0.172933
Age -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067 -0.249732 NaN
SibSp -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651 0.040460 0.018958
Parch 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225 -0.036987 -0.124104
Fare 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000 -0.482075 -0.100707
has_no_cabin -0.316912 0.725541 -0.249732 0.040460 -0.036987 -0.482075 1.000000 0.144111
has_no_age -0.092197 0.172933 NaN 0.018958 -0.124104 -0.100707 0.144111 1.000000
In [14]:
sns.heatmap(data.corr(numeric_only=True), annot=True)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e310588>
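  • Survived is negatively correlated with Pclass (-0.34) $\to$ travelling in first class goes with a higher survival rate
  • Survived is negatively correlated with has_no_cabin (-0.32) $\to$ having no recorded cabin means a higher chance of not surviving
  • the more expensive the ticket, the better the survival rate (Fare: 0.26); Fare, Pclass and has_no_cabin are also correlated with each other, so they carry largely the same information

    Among the numerical columns, the most important ones are Pclass and having a cabin.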
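The ranking behind this conclusion can be made explicit by sorting the Survived row of the correlation matrix by absolute value. The sketch below copies the values from Out[13]; the variable names are only illustrative.

```python
import pandas as pd

# Survived row of the correlation matrix above (without the diagonal entry).
survived_corr = pd.Series({
    'Pclass': -0.338481,
    'Age': -0.077221,
    'SibSp': -0.035322,
    'Parch': 0.081629,
    'Fare': 0.257307,
    'has_no_cabin': -0.316912,
    'has_no_age': -0.092197,
})

# Order by strength of the linear relationship, keeping the sign.
order = survived_corr.abs().sort_values(ascending=False).index
ranking = survived_corr.reindex(order)
print(ranking)

# In the notebook the row could be taken directly from the matrix:
# data.corr(numeric_only=True)['Survived'].drop('Survived')
```

Two caveats: the correlation coefficient only measures linear association, and text columns such as Sex are not part of this matrix unless they are first encoded as numbers.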

5. Create two plots which you think are meaningful. Interpret both of them. (E.g.: do older people buy more expensive tickets? Do people with more expensive tickets survive more often? etc.)


In [15]:
# factorplot was renamed to catplot in newer seaborn versions
sns.catplot(x='Sex', data=data[data.Survived == 0], kind='count')
plt.title('Not survived')
sns.catplot(x='Sex', data=data[data.Survived == 1], kind='count')
plt.title('Survived')
plt.show()
In [16]:
data.head()
Out[16]:
Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked has_no_cabin has_no_age
0 0 3 male 22.0 1 0 7.2500 NaN S 1 0
1 1 1 female 38.0 1 0 71.2833 C85 C 0 0
2 1 3 female 26.0 0 0 7.9250 NaN S 1 0
3 1 1 female 35.0 1 0 53.1000 C123 S 0 0
4 0 3 male 35.0 0 0 8.0500 NaN S 1 0
In [17]:
sns.heatmap(data.groupby(['Sex'])[['Parch', 'SibSp']].mean(), annot=True)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e307f28>
In [18]:
Counter(data.Sex)
Out[18]:
Counter({'male': 577, 'female': 314})

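It seems that males often traveled alone! The next cell counts the passengers with no relatives on the ship (SibSp == 0 and Parch == 0) by sex: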

In [19]:
Counter(data[(data.SibSp == 0) & (data.Parch == 0)].Sex)
Out[19]:
Counter({'female': 126, 'male': 411})
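Since there are far more males than females on the list, the raw counts are easier to interpret as shares. A short sketch using the two Counter outputs above (the variable names are only illustrative):

```python
import pandas as pd

# Numbers copied from Out[18] (all passengers) and Out[19] (no relatives on the ship).
total = pd.Series({'male': 577, 'female': 314})
alone = pd.Series({'male': 411, 'female': 126})

# Fraction of each sex that traveled without siblings, spouses, parents or children.
share_alone = alone / total
print(share_alone.round(3))
```

About 71% of the males traveled alone, compared with about 40% of the females, which supports the observation above.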