Unsupervised learning & clustering¶

1. Reading data¶

The woldbank_development_2015.csv (can be found in the same folder with this notebook) file contains the World Development Indicators for the 2015 year, downloaded from The World Bank's webpage.

Look at the data in any text editor. Build up an overall sense how the data is built up and how is the missing values are represented.
Read the file into a pandas dataframe and tell pandas which special pattern means if a value is missing.
The data is in a long format. Convert it into a wide format, where each row is a single country and the columns are the measured features, the Series Codes (eg the first column is 'EG.CFT.ACCS.ZS', the second is 'EG.ELC.ACCS.ZS'. Order of the columns does not matter)!
Keep only those rows, which represents counties and NOT regions. Luckily they are well separated in the order they occur!
Convert the features to numeric format, which will be needed for modeling!

2. Data preprocessing and inspection¶

Visualize the missing values!
Keep only those countries which has less than 700 missing features in the original table and keep features which were missing in less than 20 country in the original table
Visualize the missing values again! Now really only a few entries are missing. Impute the missing values with its feature's mean value!
How many counties and features do we have left?
Read the kept features' descriptions. In the original table the Series Name describe the meaning of the features. What do you think, based only on these information, which counties are the most similar to Hungary? And Norway?

3. PCA¶

Perform PCA with 3 principal components on the filtered, imputed data (from now on, data refers to the filtered, imputed dataset)
Plot the three embedded 2D combination next to each other (0 vs 1, 0 vs 2 and 1 vs 2)
It seems that the embedding is really dominated by a single direction. Normalize the data (each feature should have zero mean and unit variance after normalization) and re-do the PCA and the plotting (do not delete the previous plots, just make new ones).

4. T-SNE¶

Perform T-SNE on the scaled data with 2 components
Plot the embeddings results. Add a text label for each point to make it possible to interpret the results. It will not be possible to read all, but try to make it useful, see the attached image as an example!
Highlight Hungary and Norway! Which countries are the closest one to Hungary and Norway?

5. Hierarchical clustering¶

Perform hierarchical clustering on the filtered and scaled data (hint: use seaborn)
Try to plot in a way that all country's name is visible
Write down your impressions that you got from this plot!

Hints:¶

On total you can get 10 points for fully completing all tasks.
Decorate your notebook with, questions, explanation etc, make it self contained and understandable!
Comments you code when necessary
Write functions for repetitive tasks!
Use the pandas package for data loading and handling
Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation
Use the scikit learn package for almost everything
Use for loops only if it is really necessary!
Code sharing is not allowed between student! Sharing code will result in zero points.
If you use code found on web, it is OK, but, make its source clear!

Solution¶

1. Reading data¶

The woldbank_development_2015.csv (can be found in the same folder with this notebook) file contains the World Development Indicators for the 2015 year, downloaded from The World Bank's webpage.

Look at the data in any text editor. Build up an overall sense how the data is built up and how is the missing values are represented.
Read the file into a pandas dataframe and tell pandas which special pattern means if a value is missing.
The data is in a long format. Convert it into a wide format, where each row is a single country and the columns are the measured features, the Series Codes (eg the first column is 'EG.CFT.ACCS.ZS', the second is 'EG.ELC.ACCS.ZS'. Order of the columns does not matter)!

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

%matplotlib inline

long = pd.read_csv('woldbank_development_2015.csv', na_values=['..', '...'])
long.head()

pd.unique(long['Country Name'])[210:225]

array(['Venezuela, RB', 'Vietnam', 'Virgin Islands (U.S.)',
       'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe',
       'Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area'],
      dtype=object)

%%time
series_codes = pd.unique(long['Series Code'].dropna())

wide = pd.DataFrame()
# had to manually split when countries stops and regions starts
for c in pd.unique(long['Country Name'])[:217]:
    tmp = long[long['Country Name'] == c][['Series Code', '2015 [YR2015]']].set_index('Series Code').T
    # tmp is here:
    ## index: 2015 [YR2015]
    ## columns: Series Code
    tmp['country'] = c
    
    wide = wide.append(tmp[['country'] + list(series_codes)])
    
wide = wide.set_index('country').astype(float)

CPU times: user 3.69 s, sys: 32 ms, total: 3.72 s
Wall time: 3.73 s

wide.head()

# keep only the countries first, then pivot it
long2 = long[long['Country Name'].isin(pd.unique(long['Country Name'])[:217])]
pivoted = long2.pivot(index='Country Name', columns='Series Code', values='2015 [YR2015]')
pivoted[series_codes].head()

np.allclose(pivoted[series_codes].fillna(0), wide.fillna(0))
# in general we do not want to compare floats with == too much

True

np.allclose??

2. Data preprocessing and inspection¶

Visualize the missing values!
Keep only those countries which has less than 700 missing features in the original table and keep features which were missing in less than 20 country in the original table
Visualize the missing values again! Now really only a few entries are missing. Impute the missing values with its feature's mean value!
How many counties and features do we have left?
Read the kept features' descriptions. In the original table the Series Name describe the meaning of the features. What do you think, based only on these information, which counties are the most similar to Hungary? And Norway?

wide.isna()

plt.imshow(wide.isna())

<matplotlib.image.AxesImage at 0x7f488e69be48>

keep_cols = wide.isna().sum()[wide.isna().sum() < 20].index.values
keep_rows = wide.isna().sum(1)[wide.isna().sum(1) < 700].index.values

filtered_wide = wide[wide.index.isin(keep_rows)][keep_cols]
filtered_wide.shape

(163, 116)

plt.imshow(filtered_wide.isna())

<matplotlib.image.AxesImage at 0x7f488d3ef9b0>

filtered_wide = filtered_wide.fillna(filtered_wide.mean())

long[long['Series Code'].isin(keep_cols)]['Series Name'].drop_duplicates().tolist()

['Access to electricity (% of population)',
 'Access to electricity, rural (% of rural population)',
 'Access to electricity, urban (% of urban population)',
 'Adjusted savings: carbon dioxide damage (current US$)',
 'Adjusted savings: education expenditure (% of GNI)',
 'Adjusted savings: energy depletion (current US$)',
 'Adjusted savings: mineral depletion (current US$)',
 'Agricultural land (% of land area)',
 'Agricultural land (sq. km)',
 'Arable land (% of land area)',
 'Arable land (hectares per person)',
 'Arable land (hectares)',
 'Birth rate, crude (per 1,000 people)',
 'Capture fisheries production (metric tons)',
 'CO2 emissions (kt)',
 'CO2 emissions (metric tons per capita)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'CO2 emissions from gaseous fuel consumption (kt)',
 'CO2 emissions from liquid fuel consumption (% of total)',
 'CO2 emissions from liquid fuel consumption (kt)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (kt)',
 'Crop production index (2004-2006 = 100)',
 'Death rate, crude (per 1,000 people)',
 'DEC alternative conversion factor (LCU per US$)',
 'Export unit value index (2000 = 100)',
 'Export value index (2000 = 100)',
 'Export volume index (2000 = 100)',
 'Fertility rate, total (births per woman)',
 'Fixed broadband subscriptions',
 'Fixed broadband subscriptions (per 100 people)',
 'Fixed telephone subscriptions',
 'Fixed telephone subscriptions (per 100 people)',
 'Food production index (2004-2006 = 100)',
 'Foreign direct investment, net inflows (BoP, current US$)',
 'Forest area (% of land area)',
 'Forest area (sq. km)',
 'Forest rents (% of GDP)',
 'GDP (constant 2010 US$)',
 'GDP (constant LCU)',
 'GDP (current LCU)',
 'GDP (current US$)',
 'GDP deflator (base year varies by country)',
 'GDP deflator: linked series (base year varies by country)',
 'GDP growth (annual %)',
 'GDP per capita (constant 2010 US$)',
 'GDP per capita (constant LCU)',
 'GDP per capita (current LCU)',
 'GDP per capita (current US$)',
 'GDP per capita growth (annual %)',
 'GDP: linked series (current LCU)',
 'Import unit value index (2000 = 100)',
 'Import value index (2000 = 100)',
 'Import volume index (2000 = 100)',
 'Incidence of tuberculosis (per 100,000 people)',
 'Individuals using the Internet (% of population)',
 'Inflation, GDP deflator (annual %)',
 'Inflation, GDP deflator: linked series (annual %)',
 'International migrant stock (% of population)',
 'International migrant stock, total',
 'Land area (sq. km)',
 'Life expectancy at birth, female (years)',
 'Life expectancy at birth, male (years)',
 'Life expectancy at birth, total (years)',
 'Livestock production index (2004-2006 = 100)',
 'Lower secondary school starting age (years)',
 'Merchandise exports (current US$)',
 'Merchandise exports by the reporting economy (current US$)',
 'Merchandise exports by the reporting economy, residual (% of total merchandise exports)',
 'Merchandise exports to economies in the Arab World (% of total merchandise exports)',
 'Merchandise exports to high-income economies (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in East Asia & Pacific (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in Europe & Central Asia (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in Latin America & the Caribbean (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in South Asia (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in Sub-Saharan Africa (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies outside region (% of total merchandise exports)',
 'Merchandise imports (current US$)',
 'Merchandise imports by the reporting economy (current US$)',
 'Merchandise imports by the reporting economy, residual (% of total merchandise imports)',
 'Merchandise imports from economies in the Arab World (% of total merchandise imports)',
 'Merchandise imports from high-income economies (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in East Asia & Pacific (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in Europe & Central Asia (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in Latin America & the Caribbean (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in South Asia (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies outside region (% of total merchandise imports)',
 'Mineral rents (% of GDP)',
 'Mobile cellular subscriptions',
 'Mobile cellular subscriptions (per 100 people)',
 'Net barter terms of trade index (2000 = 100)',
 'People practicing open defecation (% of population)',
 'People using at least basic drinking water services (% of population)',
 'People using at least basic sanitation services (% of population)',
 'Population density (people per sq. km of land area)',
 'Population growth (annual %)',
 'Population, total',
 'Preprimary education, duration (years)',
 'Primary education, duration (years)',
 'Primary school starting age (years)',
 'Renewable electricity output (% of total electricity output)',
 'Renewable energy consumption (% of total final energy consumption)',
 'Rural population',
 'Rural population (% of total population)',
 'Rural population growth (annual %)',
 'Secondary education, duration (years)',
 'Secure Internet servers',
 'Secure Internet servers (per 1 million people)',
 'Surface area (sq. km)',
 'Total fisheries production (metric tons)',
 'Total natural resources rents (% of GDP)',
 'Tuberculosis case detection rate (%, all forms)',
 'Urban population',
 'Urban population (% of total population)',
 'Urban population growth (annual %)']

3. PCA¶

Perform PCA with 3 principal components on the filtered, imputed data (from now on, data refers to the filtered, imputed dataset)
Plot the three embedded 2D combination next to each other (0 vs 1, 0 vs 2 and 1 vs 2)
It seems that the embedding is really dominated by a single direction. Normalize the data (each feature should have zero mean and unit variance after normalization) and re-do the PCA and the plotting (do not delete the previous plots, just make new ones).

pca = PCA(3)
pca_transformed = pca.fit_transform(filtered_wide)

plt.figure(figsize=(17, 6))
plt.subplot(131)
plt.scatter(pca_transformed[:,0], pca_transformed[:,1])

plt.subplot(132)
plt.scatter(pca_transformed[:,0], pca_transformed[:,2])

plt.subplot(133)
plt.scatter(pca_transformed[:,1], pca_transformed[:,2])

<matplotlib.collections.PathCollection at 0x7f488c2f5ef0>

filtered_scaled_wide = (filtered_wide - filtered_wide.mean())/filtered_wide.std()

pca = PCA(3)
pca_transformed = pca.fit_transform(filtered_scaled_wide)

plt.figure(figsize=(17, 6))
plt.subplot(131)
plt.scatter(pca_transformed[:,0], pca_transformed[:,1])

plt.subplot(132)
plt.scatter(pca_transformed[:,0], pca_transformed[:,2])

plt.subplot(133)
plt.scatter(pca_transformed[:,1], pca_transformed[:,2])

<matplotlib.collections.PathCollection at 0x7f488c1816a0>

4. T-SNE¶

Perform T-SNE on the scaled data with 2 components
Plot the embeddings results. Add a text label for each point to make it possible to interpret the results. It will not be possible to read all, but try to make it useful, see the attached image as an example!
Highlight Hungary and Norway! Which countries are the closest one to Hungary and Norway?

tsne = TSNE()
ts_transformed = tsne.fit_transform(filtered_scaled_wide)
plt.scatter(ts_transformed[:,0], ts_transformed[:,1])

<matplotlib.collections.PathCollection at 0x7f487073ccf8>

countries = filtered_scaled_wide.index.tolist()

plt.figure(figsize=(16, 12))
plt.scatter(ts_transformed[:,0], ts_transformed[:,1])
for idx, c in enumerate(countries):
    if c in ['Hungary', 'Norway']:
        plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=15)
        plt.scatter([ts_transformed[idx,0]], [ts_transformed[idx,1]], c='r', s=150)
    plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=8)

plt.figure(figsize=(30, 20))
plt.scatter(ts_transformed[:,0], ts_transformed[:,1])
for idx, c in enumerate(countries):
    if c in ['Hungary', 'Norway']:
        plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=15)
        plt.scatter([ts_transformed[idx,0]], [ts_transformed[idx,1]], c='r', s=150)
    plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=8)

5. Hierarchical clustering¶

Perform hierarchical clustering on the filtered and scaled data (hint: use seaborn)
Try to plot in a way that all country's name is visible
Write down your impressions that you got from this plot!

sns.clustermap(filtered_scaled_wide, col_cluster=False, figsize=(10, 45))

/home/pataki/.conda/envs/fastai/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)

<seaborn.matrix.ClusterGrid at 0x7f48703a4f60>

	Country Name	Country Code	Series Name	Series Code	2015 [YR2015]
0	Afghanistan	AFG	Access to clean fuels and technologies for coo...	EG.CFT.ACCS.ZS	30.100000
1	Afghanistan	AFG	Access to electricity (% of population)	EG.ELC.ACCS.ZS	71.500000
2	Afghanistan	AFG	Access to electricity, rural (% of rural popul...	EG.ELC.ACCS.RU.ZS	64.573354
3	Afghanistan	AFG	Access to electricity, urban (% of urban popul...	EG.ELC.ACCS.UR.ZS	92.500000
4	Afghanistan	AFG	Account ownership at a financial institution o...	FX.OWN.TOTL.ZS	NaN

Series Code	EG.CFT.ACCS.ZS	EG.ELC.ACCS.ZS	EG.ELC.ACCS.RU.ZS	EG.ELC.ACCS.UR.ZS	FX.OWN.TOTL.ZS	FX.OWN.TOTL.FE.ZS	FX.OWN.TOTL.MA.ZS	FX.OWN.TOTL.OL.ZS	FX.OWN.TOTL.40.ZS	FX.OWN.TOTL.PL.ZS	...	SG.DMK.ALLD.FN.ZS	SG.VAW.REAS.ZS	SG.VAW.ARGU.ZS	SG.VAW.BURN.ZS	SG.VAW.GOES.ZS	SG.VAW.NEGL.ZS	SG.VAW.REFU.ZS	SP.M15.2024.FE.ZS	SP.M18.2024.FE.ZS	SH.DYN.AIDS.FE.ZS
country
Afghanistan	30.10	71.500000	64.573354	92.5	NaN	NaN	NaN	NaN	NaN	NaN	...	32.6	80.2	59.2	18.2	66.9	48.4	33.4	8.8	34.8	28.6
Albania	75.37	100.000000	100.000000	100.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	23.5
Algeria	92.70	99.931366	99.764565	100.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	45.3
American Samoa	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Andorra	100.00	100.000000	100.000000	100.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Series Code	EG.CFT.ACCS.ZS	EG.ELC.ACCS.ZS	EG.ELC.ACCS.RU.ZS	EG.ELC.ACCS.UR.ZS	FX.OWN.TOTL.ZS	FX.OWN.TOTL.FE.ZS	FX.OWN.TOTL.MA.ZS	FX.OWN.TOTL.OL.ZS	FX.OWN.TOTL.40.ZS	FX.OWN.TOTL.PL.ZS	...	SG.DMK.ALLD.FN.ZS	SG.VAW.REAS.ZS	SG.VAW.ARGU.ZS	SG.VAW.BURN.ZS	SG.VAW.GOES.ZS	SG.VAW.NEGL.ZS	SG.VAW.REFU.ZS	SP.M15.2024.FE.ZS	SP.M18.2024.FE.ZS	SH.DYN.AIDS.FE.ZS
Country Name
Afghanistan	30.10	71.500000	64.573354	92.5	NaN	NaN	NaN	NaN	NaN	NaN	...	32.6	80.2	59.2	18.2	66.9	48.4	33.4	8.8	34.8	28.6
Albania	75.37	100.000000	100.000000	100.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	23.5
Algeria	92.70	99.931366	99.764565	100.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	45.3
American Samoa	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Andorra	100.00	100.000000	100.000000	100.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Series Code	EG.CFT.ACCS.ZS	EG.ELC.ACCS.ZS	EG.ELC.ACCS.RU.ZS	EG.ELC.ACCS.UR.ZS	FX.OWN.TOTL.ZS	FX.OWN.TOTL.FE.ZS	FX.OWN.TOTL.MA.ZS	FX.OWN.TOTL.OL.ZS	FX.OWN.TOTL.40.ZS	FX.OWN.TOTL.PL.ZS	...	SG.DMK.ALLD.FN.ZS	SG.VAW.REAS.ZS	SG.VAW.ARGU.ZS	SG.VAW.BURN.ZS	SG.VAW.GOES.ZS	SG.VAW.NEGL.ZS	SG.VAW.REFU.ZS	SP.M15.2024.FE.ZS	SP.M18.2024.FE.ZS	SH.DYN.AIDS.FE.ZS
country
Afghanistan	False	False	False	False	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False
Albania	False	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	False
Algeria	False	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	False
American Samoa	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
Andorra	False	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Virgin Islands (U.S.)	True	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
West Bank and Gaza	True	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
Yemen, Rep.	False	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	False
Zambia	False	False	False	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	False
Zimbabwe	False	False	False	False	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False