In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

How to create, save and read pandas DataFrames

In [2]:
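lin = np.linspace(0, 1, 30)
R = 5  # circle radius
# 30 points on a circle of radius R; fi is the angle in radians
df = pd.DataFrame({'x': R * np.sin(lin*2*np.pi),
                   'y': R * np.cos(lin*2*np.pi),
                   'fi': lin*2*np.pi}).round(2)

# replace some values with NaN to get missing data to work with
df = df.replace(0, np.nan)
df = df.replace(3.14, np.nan)

df.head()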
Out[2]:
x y fi
0 NaN 5.00 NaN
1 1.07 4.88 0.22
2 2.10 4.54 0.43
3 3.03 3.98 0.65
4 3.81 3.24 0.87
In [3]:
df.to_csv('dummy.csv', index=False)
df2 = pd.read_csv('dummy.csv')
df2.head()
Out[3]:
x y fi
0 NaN 5.00 NaN
1 1.07 4.88 0.22
2 2.10 4.54 0.43
3 3.03 3.98 0.65
4 3.81 3.24 0.87
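
A quick way to check that nothing is lost in the CSV round trip, NaN values included. This is only a sketch: it uses a small made-up frame and an in-memory buffer (`io.StringIO`) instead of `dummy.csv`, so no file is written.

```python
import io

import numpy as np
import pandas as pd

# Small stand-in frame (made-up values, same shape of problem as above)
df = pd.DataFrame({'x': [np.nan, 1.07, 2.10],
                   'y': [5.00, 4.88, 4.54]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False: do not store the row index
csv_text = buffer.getvalue()

# Read it back from the start of the buffer
buffer.seek(0)
df2 = pd.read_csv(buffer)

# Raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(df, df2)
print(csv_text)
```

NaN is written as an empty field and read back as NaN, which is why the first row survives unchanged.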

Missing data?

Missing data is often not stored as NaN but marked with -999 or some other ad-hoc value chosen by whoever produced the dataset!
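
If the data uses such a marker, it has to be converted to NaN before `isna()` can find it. A minimal sketch, assuming a hypothetical CSV where -999 means "no measurement" (the content below is made up for illustration):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content, -999 stands for a missing measurement
raw = io.StringIO("x,y\n1.0,-999\n-999,2.5\n3.0,4.0\n")

# Option 1: tell read_csv which value marks missing data
df_a = pd.read_csv(raw, na_values=[-999])

# Option 2: read as is, then replace the marker afterwards
raw.seek(0)
df_b = pd.read_csv(raw).replace(-999, np.nan)

# Number of missing values per column
print(df_a.isna().sum())
```

Both options give the same frame here. Option 1 is convenient when the marker is known in advance, option 2 when it is only discovered after looking at the data.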

In [4]:
df.isna()
Out[4]:
x y fi
0 True False True
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False False
7 False False False
8 False False False
9 False False False
10 False False False
11 False False False
12 False False False
13 False False False
14 False False False
15 False False False
16 False False False
17 False False False
18 False False False
19 False False False
20 False False False
21 False False False
22 False False False
23 False False False
24 False False False
25 False False False
26 False False False
27 False False False
28 False False False
29 True False False
In [5]:
df.isna().sum()
Out[5]:
x     2
y     0
fi    1
dtype: int64
In [6]:
plt.imshow(df.isna())
Out[6]:
<matplotlib.image.AxesImage at 0x7fc91ff9e4e0>
In [7]:
plt.figure(figsize=(25, 6))
plt.imshow(df.isna().T)
Out[7]:
<matplotlib.image.AxesImage at 0x7fc91ff276a0>
In [8]:
df.mean()
Out[8]:
x    -2.379049e-17
y     1.666667e-01
fi    3.250000e+00
dtype: float64
In [9]:
df = df.fillna(df.mean())
In [10]:
df.describe()
Out[10]:
x y fi
count 3.000000e+01 30.000000 30.000000
mean -2.299748e-17 0.166667 3.250000
std 3.535402e+00 3.651452 1.812628
min -4.990000e+00 -4.970000 0.220000
25% -3.337500e+00 -3.425000 1.785000
50% -2.379049e-17 0.270000 3.250000
75% 3.337500e+00 3.795000 4.715000
max 4.990000e+00 5.000000 6.280000
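
Filling with the column mean is only one option. A small sketch on a made-up series comparing it with two other common choices, linear interpolation and dropping the missing rows. Which one is appropriate depends on the data, so treat this as an illustration rather than a recommendation:

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Every gap gets the same value, the mean of the known points
filled_mean = s.fillna(s.mean())

# Gaps are filled linearly from the neighbouring values
filled_interp = s.interpolate()

# Rows with missing values are removed
dropped = s.dropna()

print(pd.DataFrame({'mean': filled_mean, 'interp': filled_interp}))
```

For ordered data, such as the points along the circle above, interpolation uses the neighbouring values, while the mean ignores the order of the rows.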

Matplotlib example plots

In [11]:
plt.figure(figsize=(6, 6))
plt.scatter(df.x, df.y)
plt.xlabel('this is the X axis label', fontsize=20)
plt.ylabel('this is Y', fontsize=40)
plt.title('this is the title', fontsize=22)
Out[11]:
Text(0.5, 1.0, 'this is the title')
In [12]:
plt.hist(df.x, bins=10)
plt.show()

Seaborn

In [13]:
sns.jointplot(x='x', y='fi', data=df, kind='kde')
Out[13]:
<seaborn.axisgrid.JointGrid at 0x7fc91fe77048>
In [14]:
sns.scatterplot(x='y', y='fi', data=df)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc91fd1a860>
In [15]:
R, 2*np.pi
Out[15]:
(5, 6.283185307179586)
In [16]:
sns.boxplot(data=df)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc91fd0c358>

Materials:

Fiznum1 - physics BSc course: http://oroszl.web.elte.hu/fiznum1/

Data Exploration and Visualisation - physics MSc course: https://github.com/sdam-elte/data-exp-vis-2020

EDA (exploratory data analysis)

Kaggle datasets / competitions $\to$ notebooks $\to$ most voted

e.g.: https://www.kaggle.com/therealcyberlord/coronavirus-covid-19-visualization-prediction