pandas#
the basic tool for data manipulation and analysis in Python#
Unless specified differently, these slides are copyright CINECA 2019 and are released under the Attribution–NonCommercial–NoDerivs (CC BY-NC-ND) Creative Commons license, version 3.0.
Uses not allowed by the above license need explicit, written permission from the copyright owner. For more information see:
http://creativecommons.org/licenses/by-nc-nd/3.0/
Slides were authored by Susana Bueno.
from IPython.display import IFrame
IFrame(src='https://pandas.pydata.org/', width=900, height=320)
What is pandas?#
Pandas: Panel data system
Python data analysis library, built on top of
numpy30.000 lines of tested Python/Cython code
More details from PyData.org:
(Pandas is a) high-level building block for doing practical, real world data analysis in Python
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
(Pandas) provides fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive
Documentation#
pandas is an enormous library.
We will concentrate on only a few elementary data analysis tasks using pandas, but more extensive descriptions of the motivations and functionality of pandas can be found in the resources below:
Data structures#
Pandas provides two primary data structures:
Seriesis built for 1-dimensional data seriesDataFrameis built for 2-dimensional collections of tabular data
These data structures differ and are related as follows:
Series: 1-dimensional array of homogeneous data
DataFrame: 2-dimensional table of heterogeneous data, composed of multiple Series
Data structures index#
Series and DataFrame also contain an integrated index:
Series objects include a second array called the index that can contain homogeneous values of any type like integers, strings or datetime objects.
DataFrame objects include a column index. The index objects allow for very sophisticated selection operations using square brackets (
[]) along with some specialized selection functions.
Fast operations#
pandas is fast, built on numpy, but goes beyond:
SeriesandDataFramesare built uponnumpy.ndarraypandassupports the sameufuncoperations as innumpy, and the same fast vectorized computations.goes beyond
numpyby providing elementwise string and datetime operations on indexing.pandasuses and supports additional C extensions written in Cython.
Integration with ecosystem#
pandas is tightly integrated with the rest of the scientific Python ecosystem
pandasis built onnumpyarrays andufuncspandasdata structures can be passed intonumpy,matplotlib, andbokehmethodspandashas built-in visualization usingmatplotlibpandasis a dependency for Python statistics librarystatsmodels
Getting started with pandas#
# conventional way to import pandas:
import pandas as pd
# explore the local data files:
!ls -l data/
!head -5 data/SalesJan.csv
# create a new DataFrame reading from a file "dataset of transactions for a non-store online retail"
sales = pd.read_csv('data/SalesJan.csv',sep=";")
type(sales)
Data Inspection#
Use the method DataFrame.head() to inspect the first few rows of data:
great way to inspect smaller data sets
useful for verifying you’ve read the right file
sales.head()
#pd.read_csv?
sales = pd.read_csv('data/SalesJan.csv', sep=';', parse_dates=[0,8,9])
Data Inspection (2)#
Use the method DataFrame.tail() to inspect the last few rows of data:
# How to visualize the last five rows of the DataFrame
sales.tail()
Data Inspection (3)#
# How to visualize all the DataFrame
sales
Data Inspection (4)#
# How to set the number of rows to be visualized
pd.set_option('max_rows',8)
sales
Data Inspection (5)#
Now use the Pandas DataFrame.sample() method to select randomly e a bit more detail
sales.sample(10)
Data Inspection (6)#
Now use the Pandas DataFrame.info() method to see a bit more detail
sales.info()
We get the name of each column, the Numpy dtype of each column, how many real values are present and the amount of memory used
side note: Pandas has excellent support for not-a-number (NaN) entries in DataFrames and Series.
Data Statistics#
In Pandas we can very easily perform many types of statistical operations
# How to calculate summary statistics of a DataFrame
sales.describe()
# Describe all columns of a DataFrame of type object:
sales.describe(include='object')
# Describe all columns of a DataFrame regardless of data type:
pd.set_option('max_rows',14)
sales.describe(include='all')
DataFrame attributes#
# Number of rows and columns
sales.shape
# Data type of each column:
sales.dtypes
# How to get the names of columns
sales.columns
Renaming columns in a DataFrame#
# rename selected columns by using the 'rename' method
sales.rename(columns={ 'Name':'Customer_Name', 'City':'Customer_City', }, \
inplace=True)
sales.columns
# a list with the modified columns name
myColumns = list(sales.columns)
myColumns
# replace the col names during the file reading process
# by using 'names' parameter
sales2 = pd.read_table('data/SalesJan.csv', sep=';', header=0, names=myColumns, parse_dates=[0,8,9])
sales2.columns
Removing columns from a DataFrame (1)#
# remove (temporarily) a single column (axis=1 refers to columns)
sales2.drop('Latitude', axis=1).head(3)
sales2.info()
Removing columns from a DataFrame (2)#
To permanently remove a column from the DataFrame, you have to used the drop method, with the argument inplace=True
# To permanently remove the column from the DataFrame
sales2.drop('Latitude', axis=1, inplace=True)
sales2.info()
Removing columns from a DataFrame (3)#
# how to remove multiple columns at once
sales2.drop(['Account_Created','Last_Login','Customer_Name', 'Longitude'], axis=1, inplace=True)
sales2.head(3)
Removing rows from a DataFrame#
# remove multiple rows at once (axis=0 refers to rows)
sales2.drop([0, 1], axis=0, inplace=True)
sales2.head()
Hands-on #1: reading and inspecting data#
Create a new Dataframe reading from the file ‘data/drinks.csv’
Inspect data in the new DataFrame
Calculate statistics on the new DataFrame
Try to rename one or more columns of the DataFrame
Remove (temporarily and permanently) the column ‘total_litres_of_pure_alcohol’ from the DataFrame
Columns names as DataFrame attributes#
# Get columns of data DataFrame
sales.columns
# Each column name is an attribute of a DataFrame
sales.Product
The output of the previous command looks different than the typical DataFrame output! What is the type of
data.Invoice?
Series#
# how to get the type of orders.orderId
type(sales.Product)
A Pandas Series is a single vector of data (like a
numpyarray), with an index that labels each element in the vectorIt supports both integer-based and label-based indexing
How to make a series#
# Make a Series
S = pd.Series([632, 1638, 569, 115])
S
Series attributes#
# How to get the values of a series
S.values
# How to get the index of a series
S.index
A
numpyarray contains the values of the seriesIf an index is not specified, the natural sequence of integers is assigned as index
The index of a series is a
pandasindex object
pandas.Series.describe()#
# How to calculate summary statistics of a Series
S.describe()
Name what you see#
We can assign meaningful labels to the index, if they are available
S1 = pd.Series([632, 1638, 569, 115], index=['A', 'B', 'C', 'D'])
S1
# Now it looks more like a dictionary
S1['C']
Series from a dictionary#
myDict = {'A': 632, 'B': 1638, 'C': 569, 'D': 115}
pd.Series(myDict)
Naming index and array of values of a series#
# We can give both the array of values and the index
# meaningful labels themselves
S1.name = 'counts'
S1.index.name = 'Letter'
S1
Numpy and pandas series#
NumPy’s math functions and other operations can be applied to Series without losing the data structure
import numpy as np
np.log(S1)
The truth about Series#
Booleans mask
mask = [False, True, True, False]
S1[mask]
Find what is missing#
If we pass a custom index to Series it will select the corresponding values from the dict
S2 = pd.Series(myDict, index= ['E','A','B','C'])
S2
Indices without corresponding values are treat as missing
Pandas uses the NaN (not a number) type for missing values
# How to find what is missing
S2.isnull()
Adding two series#
The labels are used to align data when used in operations with other series.
S1 + S2
We have a different behavior from
numpy, where arrays of the same length are combined element-wiseIn the resulting series, values are the combination of original values with the same label; the missing values were propagated by addition
Back to DataFrame#
How to make a DataFrame from scratch#
You can make a DataFrame from scratch, using the class
pd.DataFrameDataFrame inputs could be:
a Python dictionary of 1D sequences (e.g. ndarrays, lists, dicts, ..)
a 2-D numpy.ndarray
pandas Series
another DataFrame
DataFrame from a dictionary#
import numpy as np
myDict = {'colA':[1,2,3,4,5,6,7,8,9,10],
'colB':np.linspace(0,np.pi,10),
'colC':0.0,
'colD':["a","b","c","a","b","c","a","b","c","a"]}
myDF = pd.DataFrame(myDict)
myDF
DataFrame from a 2D numpy array#
myMatrix = np.random.random((1000,4))
myDF = pd.DataFrame(myMatrix, \
columns=['firstCol','secondCol','thirdCol','fourCol'])
myDF
Reading a new DataFrame#
# How to read a DataFrame from the first 3 columns of a .csv.gz file
A = pd.read_csv('data/num.csv.gz', header=None, \
names=['Elevation', 'Aspect', 'Slope'], usecols=range(0, 3))
A.info()
Selecting and examining columns#
type(A['Aspect'])
# For each value, count number of occurrences
A['Aspect'].value_counts()
Aggregating data: the groupby method#
Problem: for each unique value in column Aspect of our DataFrame, we want to calculate the arithmetic mean of ALL other numeric columns
A.groupby('Aspect').mean()
# How to calculate the mean of 'Elevation' values grouped by 'Aspect'
A.groupby('Aspect').Elevation.mean()
New columns in a DataFrame#
How to add a new column as a function of existing columns
A['new_col1'] = A.Elevation * 10
A['new_col2'] = A['Slope'] + A['Aspect'] - 1
# Check it
A.info()
# rename a column
A.rename(columns={'new_col2':'a_sum'}, inplace=True)
A
Recap: attributes and methods#
A.columns # column names (which is "an index")
A.dtypes # data types of each column
A.shape # number of rows and columns
A.values # underlying numpy array
… and many more: just write A. [and press TAB]
Filtering data#
A first simple filter#
# Filter rows, based on column values
A[A['Aspect'] == 20]
Advanced logical filtering#
# use multiple conditions
condition = (A.Elevation < 2400) & (A.Slope == 17)
cols = ['Slope', 'Aspect']
A[condition][cols]
Note: AND and OR do not work inside data frame conditions
Filtering of specific values#
B = A[A['Aspect'].isin([13,17])]
B
B.shape
Hands-on #2: grouping data#
Using the dataset from the file ‘data/drinks.csv’, calculate:
the mean beer servings across the entire dataset
the mean beer servings just for countries in Africa
the mean and the max beer servings for each continent
Selecting rows and columns from a DataFrame#
The loc method is used to select rows and columns by label. You can pass it:
A single label
A list of labels
A slice of labels
A boolean Series
A colon (which indicates “all labels”)
# row 0, all columns
A.loc[0]
# How to get rows 0 and 1 and 2, all columns
A.loc[[0, 1, 2], :]
# How to get rows 1 through 2 (inclusive), all columns
A.loc[1:2, :]
# also this implies "all columns", but it's better to be explicit!
A.loc[1:2]
# How to get rows 0 through 2 (inclusive),
# columns 'Aspect' and 'Slope'
A.loc[0:2, ['Aspect', 'Slope']]
# How to get rows 0 through 2 (inclusive),
# columns 'Elevation' through 'Slope' (inclusive)
A.loc[0:2, 'Elevation':'Slope']
# How to get a single element of the DataFrame
A.loc[1,'Slope']
# How to get rows 0 through 2 (inclusive), column 'Slope'
A.loc[0:2, 'Slope']
Selecting rows and columns by position#
The iloc method is used to select rows and columns by position. Purely integer-location based indexing for selection by position
A.head(2)
# How to get the element in row 1 and column 4 ('a_sum')
A.iloc[1,4]
# How to get rows 0 through 2 (inclusive), column 4 ('a_sum')
A.iloc[0:3,4]
Ordering Data#
# How to sort rows by label
A.sort_index()
# How to sort rows by a specific column
A.sort_values(by='Aspect')
# How to sort using descending order instead
A.sort_values(by='Aspect', ascending=False)
# How to sort by multiple columns
A.sort_values(by=['Aspect', 'Slope'])
Hands-on #3#
Create a pandas structure from the dataset below
Find the letter(s) with maximum number of ones and twos
# Our small data set
d = {'one':[1,1,1,1,1,1,1],
'two':[2,2,2,2,2,2,2],
'letter':['a','a','a', 'b','b','c', 'c']}
So far#
Check data before getting started
Choose columns by label or index
Filter rows on index and values
Group by
Sort
note: just like databases
hint: we can do better
Uniqueness of data#
How to find what is duplicated?
# Series of booleans (True if a row is identical to a previous row)
A.duplicated()
Working with duplicated data#
# Count duplicates
A.duplicated().sum()
# Find the duplicate with minimum elevation value
A[A.duplicated()].sort_values(by='Elevation').head(1)
# Check if it really is a duplicate
A[(A.Elevation==2331) & (A.Aspect==185)]
Delete duplicated data#
pd.set_option('max_rows', 0)
# Drop duplicate rows
B = A.drop_duplicates()
# The DataFrame B doesn't have duplicated rows!
B.duplicated().sum()
# How to delete duplicated rows in A
A.drop_duplicates(inplace=True)
# Now also A doesn't have duplicated rows
A.duplicated().sum()
The index#
# Use a different example dataset
baseball = pd.read_csv("https://raw.githubusercontent.com/fonnesbeck/statistical-analysis-python-tutorial/master/data/baseball.csv", index_col='id')
pd.set_option('max_columns', 8)
pd.set_option('max_rows', 8)
baseball
Reindexing#
type(baseball.index)
Reindexing allows users to manipulate the data labels in a DataFrame
# Alter the order of the rows
baseball.reindex(baseball.index[::-1]).head()
# Let's try all the possible keys
cols = ['player','year','team']
id_range = range(baseball.index.values.min(),
baseball.index.values.max())
baseball[cols].reindex(id_range).head()
# Fill the blanks
baseball[cols].reindex(id_range, fill_value='MR Unknown', columns=['player']).head()
# Using different methods:
baseball[cols].reindex(id_range, method='ffill').head()
Multiple index#
If you are feeling forced to combine fields to obtain a unique index, that was not simply an integer value
A more elegant way is to create a hierarchical index from fields
baseball_h = baseball.set_index(['year', 'team', 'player'])
baseball_h.head(10)
baseball_h.index.is_unique
Combining DataFrames#
the index magic
df1 = pd.DataFrame({'A': ['A0','A1'], 'B': ['B0','B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A4','A5','A6'], \
'B': ['B4','B5','B6']},index=[4, 5, 6])
df3 = pd.DataFrame({'A': ['A7','A8'],'B': ['B7','B8']}, index=[7, 8])
# Combine the 3 DataFrames:
pd.concat([df1,df2,df3])
Combining DataFrames: an alternative method#
# Alternative combo
df1.append(df2).append(df3)
Joining data#
pandas has full-featured, high performance in-memory join operations, idiomatically very similar to relational databases like SQL.
df1 = pd.DataFrame({'A': ['A0','A1'], 'key': ['k1','k2'], \
'B': ['B0','B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A4','A5','A6','A7'], \
'key': ['k2','k1','k4','k3'], 'B': ['B4','B0','B6','B7']}, \
index=[4, 5, 6, 7])
df1
df2
The merge() method#
pd.merge(df1,df2, on='key')
# Join on multiple keys:
pd.merge(df1,df2, on=['key','B'])
pandas I/O#
Snapshot of a DataFrame#
Quickly write the current status of a DataFrame to a CSV file
csvfile = 'data/my_file.csv'
A.to_csv(csvfile)
# Warning: index is used as first column
!head {csvfile}
# Better to ignore the index column!
A.to_csv(csvfile, index=False)
!head {csvfile}
Reading from a remote source#
# read CSV file directly from a URL and save the results
data = pd.read_csv(\
'http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv', index_col=0)
data
pandas I/O in Excel files#
# Our xls file
xlsfile = 'data/output.xlsx'
# how to write on a xls file from A DataFrame
A[:2500].to_excel(xlsfile,'Sheet1')
# how to read data from a xls file
B = pd.read_excel(xlsfile, index_col=0, sheet_name='Sheet1')
B.head()
From dataframe to JSON#
jsonfile = 'data/mydata.json'
# Save the first 5 rows
B.loc[:5].to_json(jsonfile)
# Show what we get
!more {jsonfile}
hint: reverse is pd.read_json
pandas alchemy#
Pandas integrates well with
sqlalchemy, the most famousORM; you should use it, if you work with SQL databases in pythonList of url relational database engines for
sqlalchemysqlite
postgres
mysql
oracle
microsoft
write data to a SQL db#
Write records stored in a DataFrame to a SQL database.
Parameters
name : string Name of SQL table
con : SQLAlchemy engine or DBAPI2 connection (legacy mode)
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
source: pd.DataFrame.to_sql?
read data from a SQL db#
# read db table in 3 lines
from sqlalchemy import create_engine
con = create_engine('postgresql://user@hostname:port/database')
df = pd.read_sql_table('mytable', con)
## An uncomfortable DataFrame
# A new DataFrame
mydata = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes',
'value': 632},
1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})
mydata
## Transpose data
mydata = mydata.T
mydata
## Beware, pandas is very powerful!
# Take a Subseries and modify
vals = mydata.value
vals[5] = 0
vals
Some effect on mydata?#
mydata.loc[5,'value']
just like numpy, the series vals is a view of mydata
# a deep copy can help us
vals = mydata.value.copy()
vals[5] = 1000
mydata.loc[5,'value']
Plots#
%matplotlib inline
import matplotlib.pyplot as plt
sales = pd.read_csv('data/SalesJan.csv', sep=';', usecols=['Product','Price','Country'])
sales.sample(5)
# Top 10 sales' countries:
sales_p = (sales.groupby(by='Country')['Price','Product']\
.agg({'Price': 'sum', 'Product': 'count'}))\
.sort_values(by='Price', ascending=False)[:10].reset_index()
sales_p.rename(columns={'Price':'Total_Sales', 'Product':'Total_Product'}, inplace=True)
fig, (ax0, ax1) = plt.subplots(nrows=1,ncols=2, sharey=True, figsize=(10,4))
sales_p.plot(kind="barh", y="Total_Product", x="Country", ax=ax0)
sales_p.plot(kind="barh", y="Total_Sales", x="Country", ax=ax1)
ax1.set_xlim([0, 740000])
plt.xticks(rotation=45)
plt.show()
# Improve the style of the plot:
plt.style.use('ggplot')
fig, (ax0, ax1) = plt.subplots(nrows=1,ncols=2, sharey=True, figsize=(10,4))
sales_p.plot(kind="barh", y="Total_Product", x="Country", ax=ax0)
ax0.set(title='Products', xlabel='Total Product')
ax0.legend().set_visible(False)
sales_p.plot(kind="barh", y="Total_Sales", x="Country", ax=ax1)
ax1.set_xlim([0, 740000])
ax1.set(title='Sales', xlabel='Total Sales')
plt.xticks(rotation=45)
ax1.legend().set_visible(False)
plt.show()
A level up: seaborn library#
built on top of matplotlib#
import seaborn as sns
fig, (ax0, ax1) = plt.subplots(nrows=1,ncols=2, sharey=True, figsize=(10,4))
sns.barplot(y="Country", x="Total_Product", data=sales_p, label="Total", color='r', ax=ax0)
sns.barplot(y="Country", x="Total_Sales", data=sales_p, label="Total", color='r', ax=ax1)
# and many other plots...
# https://seaborn.pydata.org/examples/index.html
sns.distplot(sales_p['Total_Sales'], bins=16, color="red" )
Timeseries in pandas#
pandas has a built in support for time series
hint: time objects are the most difficult to make compatibles
# Monthly range between a time period
dates = pd.date_range('2010-01', '2015-06', freq='M')
# Pandas makes this easy!
dates
# Create a subset of our original dataframe
B = A.head(len(dates)).copy()
# Since the size is the same, we can add months as a new column
B['time'] = dates
B
B.set_index('time').plot()
B.Elevation.plot()
# Let's work now on a single Series
C = pd.Series(B.Aspect.values, copy=True, index=dates)
C.head(1)
# Select a time subset
C['2011':'2013'].plot()
D = pd.Series(B.Slope, copy=True)
D.index = dates
T = pd.DataFrame({'s1':C, 's2':D})
# Select a time subset
T['2011-11':'2013-03'].plot()
The pandas time-series API includes:
Creating date ranges
From files
From scratch
Manipulations
Shift, resample, filter
Field accessors
(e.g., hour of day)
Plotting
Time zones -localization and conversion
pandas requires “continuous training and meditation”#
Pandas is a very powerful and sometimes complex framework
The manual is > 2000 pages!
Many operations
Some operations can be obtained in different but equivalent ways
End of chapter#
# Versions
%load_ext version_information
%version_information numpy, scipy, pandas, matplotlib, seaborn