Unsupervised learning & clustering


1. Reading data

The woldbank_development_2015.csv (can be found in the same folder with this notebook) file contains the World Development Indicators for the 2015 year, downloaded from The World Bank's webpage.

  • Look at the data in any text editor. Build up an overall sense how the data is built up and how is the missing values are represented.
  • Read the file into a pandas dataframe and tell pandas which special pattern means if a value is missing.
  • The data is in a long format. Convert it into a wide format, where each row is a single country and the columns are the measured features, the Series Codes (eg the first column is 'EG.CFT.ACCS.ZS', the second is 'EG.ELC.ACCS.ZS'. Order of the columns does not matter)!
  • Keep only those rows, which represents counties and NOT regions. Luckily they are well separated in the order they occur!
  • Convert the features to numeric format, which will be needed for modeling!

2. Data preprocessing and inspection

  • Visualize the missing values!
  • Keep only those countries which has less than 700 missing features in the original table and keep features which were missing in less than 20 country in the original table
  • Visualize the missing values again! Now really only a few entries are missing. Impute the missing values with its feature's mean value!
  • How many counties and features do we have left?
  • Read the kept features' descriptions. In the original table the Series Name describe the meaning of the features. What do you think, based only on these information, which counties are the most similar to Hungary? And Norway?

3. PCA

  • Perform PCA with 3 principal components on the filtered, imputed data (from now on, data refers to the filtered, imputed dataset)
  • Plot the three embedded 2D combination next to each other (0 vs 1, 0 vs 2 and 1 vs 2)
  • It seems that the embedding is really dominated by a single direction. Normalize the data (each feature should have zero mean and unit variance after normalization) and re-do the PCA and the plotting (do not delete the previous plots, just make new ones).

4. T-SNE

  • Perform T-SNE on the scaled data with 2 components
  • Plot the embeddings results. Add a text label for each point to make it possible to interpret the results. It will not be possible to read all, but try to make it useful, see the attached image as an example!
  • Highlight Hungary and Norway! Which countries are the closest one to Hungary and Norway?

5. Hierarchical clustering

  • Perform hierarchical clustering on the filtered and scaled data (hint: use seaborn)
  • Try to plot in a way that all country's name is visible
  • Write down your impressions that you got from this plot!

Hints:

  • On total you can get 10 points for fully completing all tasks.
  • Decorate your notebook with, questions, explanation etc, make it self contained and understandable!
  • Comments you code when necessary
  • Write functions for repetitive tasks!
  • Use the pandas package for data loading and handling
  • Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation
  • Use the scikit learn package for almost everything
  • Use for loops only if it is really necessary!
  • Code sharing is not allowed between student! Sharing code will result in zero points.
  • If you use code found on web, it is OK, but, make its source clear!

Solution

1. Reading data

The woldbank_development_2015.csv (can be found in the same folder with this notebook) file contains the World Development Indicators for the 2015 year, downloaded from The World Bank's webpage.

  • Look at the data in any text editor. Build up an overall sense how the data is built up and how is the missing values are represented.
  • Read the file into a pandas dataframe and tell pandas which special pattern means if a value is missing.
  • The data is in a long format. Convert it into a wide format, where each row is a single country and the columns are the measured features, the Series Codes (eg the first column is 'EG.CFT.ACCS.ZS', the second is 'EG.ELC.ACCS.ZS'. Order of the columns does not matter)!

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

%matplotlib inline
In [2]:
long = pd.read_csv('woldbank_development_2015.csv', na_values=['..', '...'])
long.head()
Out[2]:
Country Name Country Code Series Name Series Code 2015 [YR2015]
0 Afghanistan AFG Access to clean fuels and technologies for coo... EG.CFT.ACCS.ZS 30.100000
1 Afghanistan AFG Access to electricity (% of population) EG.ELC.ACCS.ZS 71.500000
2 Afghanistan AFG Access to electricity, rural (% of rural popul... EG.ELC.ACCS.RU.ZS 64.573354
3 Afghanistan AFG Access to electricity, urban (% of urban popul... EG.ELC.ACCS.UR.ZS 92.500000
4 Afghanistan AFG Account ownership at a financial institution o... FX.OWN.TOTL.ZS NaN
In [3]:
pd.unique(long['Country Name'])[210:225]
Out[3]:
array(['Venezuela, RB', 'Vietnam', 'Virgin Islands (U.S.)',
       'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe',
       'Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area'],
      dtype=object)
In [4]:
%%time
series_codes = pd.unique(long['Series Code'].dropna())

wide = pd.DataFrame()
# had to manually split when countries stops and regions starts
for c in pd.unique(long['Country Name'])[:217]:
    tmp = long[long['Country Name'] == c][['Series Code', '2015 [YR2015]']].set_index('Series Code').T
    # tmp is here:
    ## index: 2015 [YR2015]
    ## columns: Series Code
    tmp['country'] = c
    
    wide = wide.append(tmp[['country'] + list(series_codes)])
    
wide = wide.set_index('country').astype(float)
CPU times: user 3.69 s, sys: 32 ms, total: 3.72 s
Wall time: 3.73 s
In [5]:
wide.head()
Out[5]:
Series Code EG.CFT.ACCS.ZS EG.ELC.ACCS.ZS EG.ELC.ACCS.RU.ZS EG.ELC.ACCS.UR.ZS FX.OWN.TOTL.ZS FX.OWN.TOTL.FE.ZS FX.OWN.TOTL.MA.ZS FX.OWN.TOTL.OL.ZS FX.OWN.TOTL.40.ZS FX.OWN.TOTL.PL.ZS ... SG.DMK.ALLD.FN.ZS SG.VAW.REAS.ZS SG.VAW.ARGU.ZS SG.VAW.BURN.ZS SG.VAW.GOES.ZS SG.VAW.NEGL.ZS SG.VAW.REFU.ZS SP.M15.2024.FE.ZS SP.M18.2024.FE.ZS SH.DYN.AIDS.FE.ZS
country
Afghanistan 30.10 71.500000 64.573354 92.5 NaN NaN NaN NaN NaN NaN ... 32.6 80.2 59.2 18.2 66.9 48.4 33.4 8.8 34.8 28.6
Albania 75.37 100.000000 100.000000 100.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.5
Algeria 92.70 99.931366 99.764565 100.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 45.3
American Samoa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Andorra 100.00 100.000000 100.000000 100.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1437 columns

In [6]:
# keep only the countries first, then pivot it
long2 = long[long['Country Name'].isin(pd.unique(long['Country Name'])[:217])]
pivoted = long2.pivot(index='Country Name', columns='Series Code', values='2015 [YR2015]')
pivoted[series_codes].head()
Out[6]:
Series Code EG.CFT.ACCS.ZS EG.ELC.ACCS.ZS EG.ELC.ACCS.RU.ZS EG.ELC.ACCS.UR.ZS FX.OWN.TOTL.ZS FX.OWN.TOTL.FE.ZS FX.OWN.TOTL.MA.ZS FX.OWN.TOTL.OL.ZS FX.OWN.TOTL.40.ZS FX.OWN.TOTL.PL.ZS ... SG.DMK.ALLD.FN.ZS SG.VAW.REAS.ZS SG.VAW.ARGU.ZS SG.VAW.BURN.ZS SG.VAW.GOES.ZS SG.VAW.NEGL.ZS SG.VAW.REFU.ZS SP.M15.2024.FE.ZS SP.M18.2024.FE.ZS SH.DYN.AIDS.FE.ZS
Country Name
Afghanistan 30.10 71.500000 64.573354 92.5 NaN NaN NaN NaN NaN NaN ... 32.6 80.2 59.2 18.2 66.9 48.4 33.4 8.8 34.8 28.6
Albania 75.37 100.000000 100.000000 100.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.5
Algeria 92.70 99.931366 99.764565 100.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 45.3
American Samoa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Andorra 100.00 100.000000 100.000000 100.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1437 columns

In [7]:
np.allclose(pivoted[series_codes].fillna(0), wide.fillna(0))
# in general we do not want to compare floats with == too much
Out[7]:
True
In [8]:
np.allclose??

2. Data preprocessing and inspection

  • Visualize the missing values!
  • Keep only those countries which has less than 700 missing features in the original table and keep features which were missing in less than 20 country in the original table
  • Visualize the missing values again! Now really only a few entries are missing. Impute the missing values with its feature's mean value!
  • How many counties and features do we have left?
  • Read the kept features' descriptions. In the original table the Series Name describe the meaning of the features. What do you think, based only on these information, which counties are the most similar to Hungary? And Norway?

In [10]:
wide.isna()
Out[10]:
Series Code EG.CFT.ACCS.ZS EG.ELC.ACCS.ZS EG.ELC.ACCS.RU.ZS EG.ELC.ACCS.UR.ZS FX.OWN.TOTL.ZS FX.OWN.TOTL.FE.ZS FX.OWN.TOTL.MA.ZS FX.OWN.TOTL.OL.ZS FX.OWN.TOTL.40.ZS FX.OWN.TOTL.PL.ZS ... SG.DMK.ALLD.FN.ZS SG.VAW.REAS.ZS SG.VAW.ARGU.ZS SG.VAW.BURN.ZS SG.VAW.GOES.ZS SG.VAW.NEGL.ZS SG.VAW.REFU.ZS SP.M15.2024.FE.ZS SP.M18.2024.FE.ZS SH.DYN.AIDS.FE.ZS
country
Afghanistan False False False False True True True True True True ... False False False False False False False False False False
Albania False False False False True True True True True True ... True True True True True True True True True False
Algeria False False False False True True True True True True ... True True True True True True True True True False
American Samoa True True True True True True True True True True ... True True True True True True True True True True
Andorra False False False False True True True True True True ... True True True True True True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Virgin Islands (U.S.) True False False False True True True True True True ... True True True True True True True True True True
West Bank and Gaza True False False False True True True True True True ... True True True True True True True True True True
Yemen, Rep. False False False False True True True True True True ... True True True True True True True True True False
Zambia False False False False True True True True True True ... True True True True True True True True True False
Zimbabwe False False False False True True True True True True ... False False False False False False False False False False

217 rows × 1437 columns

In [9]:
plt.imshow(wide.isna())
Out[9]:
<matplotlib.image.AxesImage at 0x7f488e69be48>
In [11]:
keep_cols = wide.isna().sum()[wide.isna().sum() < 20].index.values
keep_rows = wide.isna().sum(1)[wide.isna().sum(1) < 700].index.values

filtered_wide = wide[wide.index.isin(keep_rows)][keep_cols]
filtered_wide.shape
Out[11]:
(163, 116)
In [12]:
plt.imshow(filtered_wide.isna())
Out[12]:
<matplotlib.image.AxesImage at 0x7f488d3ef9b0>
In [13]:
filtered_wide = filtered_wide.fillna(filtered_wide.mean())
In [14]:
long[long['Series Code'].isin(keep_cols)]['Series Name'].drop_duplicates().tolist()
Out[14]:
['Access to electricity (% of population)',
 'Access to electricity, rural (% of rural population)',
 'Access to electricity, urban (% of urban population)',
 'Adjusted savings: carbon dioxide damage (current US$)',
 'Adjusted savings: education expenditure (% of GNI)',
 'Adjusted savings: energy depletion (current US$)',
 'Adjusted savings: mineral depletion (current US$)',
 'Agricultural land (% of land area)',
 'Agricultural land (sq. km)',
 'Arable land (% of land area)',
 'Arable land (hectares per person)',
 'Arable land (hectares)',
 'Birth rate, crude (per 1,000 people)',
 'Capture fisheries production (metric tons)',
 'CO2 emissions (kt)',
 'CO2 emissions (metric tons per capita)',
 'CO2 emissions from gaseous fuel consumption (% of total)',
 'CO2 emissions from gaseous fuel consumption (kt)',
 'CO2 emissions from liquid fuel consumption (% of total)',
 'CO2 emissions from liquid fuel consumption (kt)',
 'CO2 emissions from solid fuel consumption (% of total)',
 'CO2 emissions from solid fuel consumption (kt)',
 'Crop production index (2004-2006 = 100)',
 'Death rate, crude (per 1,000 people)',
 'DEC alternative conversion factor (LCU per US$)',
 'Export unit value index (2000 = 100)',
 'Export value index (2000 = 100)',
 'Export volume index (2000 = 100)',
 'Fertility rate, total (births per woman)',
 'Fixed broadband subscriptions',
 'Fixed broadband subscriptions (per 100 people)',
 'Fixed telephone subscriptions',
 'Fixed telephone subscriptions (per 100 people)',
 'Food production index (2004-2006 = 100)',
 'Foreign direct investment, net inflows (BoP, current US$)',
 'Forest area (% of land area)',
 'Forest area (sq. km)',
 'Forest rents (% of GDP)',
 'GDP (constant 2010 US$)',
 'GDP (constant LCU)',
 'GDP (current LCU)',
 'GDP (current US$)',
 'GDP deflator (base year varies by country)',
 'GDP deflator: linked series (base year varies by country)',
 'GDP growth (annual %)',
 'GDP per capita (constant 2010 US$)',
 'GDP per capita (constant LCU)',
 'GDP per capita (current LCU)',
 'GDP per capita (current US$)',
 'GDP per capita growth (annual %)',
 'GDP: linked series (current LCU)',
 'Import unit value index (2000 = 100)',
 'Import value index (2000 = 100)',
 'Import volume index (2000 = 100)',
 'Incidence of tuberculosis (per 100,000 people)',
 'Individuals using the Internet (% of population)',
 'Inflation, GDP deflator (annual %)',
 'Inflation, GDP deflator: linked series (annual %)',
 'International migrant stock (% of population)',
 'International migrant stock, total',
 'Land area (sq. km)',
 'Life expectancy at birth, female (years)',
 'Life expectancy at birth, male (years)',
 'Life expectancy at birth, total (years)',
 'Livestock production index (2004-2006 = 100)',
 'Lower secondary school starting age (years)',
 'Merchandise exports (current US$)',
 'Merchandise exports by the reporting economy (current US$)',
 'Merchandise exports by the reporting economy, residual (% of total merchandise exports)',
 'Merchandise exports to economies in the Arab World (% of total merchandise exports)',
 'Merchandise exports to high-income economies (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in East Asia & Pacific (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in Europe & Central Asia (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in Latin America & the Caribbean (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in South Asia (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies in Sub-Saharan Africa (% of total merchandise exports)',
 'Merchandise exports to low- and middle-income economies outside region (% of total merchandise exports)',
 'Merchandise imports (current US$)',
 'Merchandise imports by the reporting economy (current US$)',
 'Merchandise imports by the reporting economy, residual (% of total merchandise imports)',
 'Merchandise imports from economies in the Arab World (% of total merchandise imports)',
 'Merchandise imports from high-income economies (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in East Asia & Pacific (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in Europe & Central Asia (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in Latin America & the Caribbean (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in South Asia (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)',
 'Merchandise imports from low- and middle-income economies outside region (% of total merchandise imports)',
 'Mineral rents (% of GDP)',
 'Mobile cellular subscriptions',
 'Mobile cellular subscriptions (per 100 people)',
 'Net barter terms of trade index (2000 = 100)',
 'People practicing open defecation (% of population)',
 'People using at least basic drinking water services (% of population)',
 'People using at least basic sanitation services (% of population)',
 'Population density (people per sq. km of land area)',
 'Population growth (annual %)',
 'Population, total',
 'Preprimary education, duration (years)',
 'Primary education, duration (years)',
 'Primary school starting age (years)',
 'Renewable electricity output (% of total electricity output)',
 'Renewable energy consumption (% of total final energy consumption)',
 'Rural population',
 'Rural population (% of total population)',
 'Rural population growth (annual %)',
 'Secondary education, duration (years)',
 'Secure Internet servers',
 'Secure Internet servers (per 1 million people)',
 'Surface area (sq. km)',
 'Total fisheries production (metric tons)',
 'Total natural resources rents (% of GDP)',
 'Tuberculosis case detection rate (%, all forms)',
 'Urban population',
 'Urban population (% of total population)',
 'Urban population growth (annual %)']

3. PCA

  • Perform PCA with 3 principal components on the filtered, imputed data (from now on, data refers to the filtered, imputed dataset)
  • Plot the three embedded 2D combination next to each other (0 vs 1, 0 vs 2 and 1 vs 2)
  • It seems that the embedding is really dominated by a single direction. Normalize the data (each feature should have zero mean and unit variance after normalization) and re-do the PCA and the plotting (do not delete the previous plots, just make new ones).

In [15]:
pca = PCA(3)
pca_transformed = pca.fit_transform(filtered_wide)
In [16]:
plt.figure(figsize=(17, 6))
plt.subplot(131)
plt.scatter(pca_transformed[:,0], pca_transformed[:,1])

plt.subplot(132)
plt.scatter(pca_transformed[:,0], pca_transformed[:,2])

plt.subplot(133)
plt.scatter(pca_transformed[:,1], pca_transformed[:,2])
Out[16]:
<matplotlib.collections.PathCollection at 0x7f488c2f5ef0>
In [17]:
filtered_scaled_wide = (filtered_wide - filtered_wide.mean())/filtered_wide.std()
In [18]:
pca = PCA(3)
pca_transformed = pca.fit_transform(filtered_scaled_wide)
In [19]:
plt.figure(figsize=(17, 6))
plt.subplot(131)
plt.scatter(pca_transformed[:,0], pca_transformed[:,1])

plt.subplot(132)
plt.scatter(pca_transformed[:,0], pca_transformed[:,2])

plt.subplot(133)
plt.scatter(pca_transformed[:,1], pca_transformed[:,2])
Out[19]:
<matplotlib.collections.PathCollection at 0x7f488c1816a0>

4. T-SNE

  • Perform T-SNE on the scaled data with 2 components
  • Plot the embeddings results. Add a text label for each point to make it possible to interpret the results. It will not be possible to read all, but try to make it useful, see the attached image as an example!
  • Highlight Hungary and Norway! Which countries are the closest one to Hungary and Norway?

In [33]:
tsne = TSNE()
ts_transformed = tsne.fit_transform(filtered_scaled_wide)
plt.scatter(ts_transformed[:,0], ts_transformed[:,1])
Out[33]:
<matplotlib.collections.PathCollection at 0x7f487073ccf8>
In [21]:
countries = filtered_scaled_wide.index.tolist()
In [22]:
plt.figure(figsize=(16, 12))
plt.scatter(ts_transformed[:,0], ts_transformed[:,1])
for idx, c in enumerate(countries):
    if c in ['Hungary', 'Norway']:
        plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=15)
        plt.scatter([ts_transformed[idx,0]], [ts_transformed[idx,1]], c='r', s=150)
    plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=8)
In [23]:
plt.figure(figsize=(30, 20))
plt.scatter(ts_transformed[:,0], ts_transformed[:,1])
for idx, c in enumerate(countries):
    if c in ['Hungary', 'Norway']:
        plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=15)
        plt.scatter([ts_transformed[idx,0]], [ts_transformed[idx,1]], c='r', s=150)
    plt.text(ts_transformed[idx, 0], ts_transformed[idx, 1], c, fontsize=8)

5. Hierarchical clustering

  • Perform hierarchical clustering on the filtered and scaled data (hint: use seaborn)
  • Try to plot in a way that all country's name is visible
  • Write down your impressions that you got from this plot!

In [34]:
sns.clustermap(filtered_scaled_wide, col_cluster=False, figsize=(10, 45))
/home/pataki/.conda/envs/fastai/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[34]:
<seaborn.matrix.ClusterGrid at 0x7f48703a4f60>