pandas

Contents

pandas#

the basic tool for data manipulation and analysis in Python#







Unless specified differently, these slides are copyright CINECA 2019 and are released under the Attribution–NonCommercial–NoDerivs (CC BY-NC-ND) Creative Commons license, version 3.0. Uses not allowed by the above license need explicit, written permission from the copyright owner. For more information see: http://creativecommons.org/licenses/by-nc-nd/3.0/

Slides were authored by Susana Bueno.

from IPython.display import IFrame

IFrame(src='https://pandas.pydata.org/', width=900, height=320)

What is pandas?#

  • Pandas: Panel data system

  • Python data analysis library, built on top of numpy

  • 30.000 lines of tested Python/Cython code

  • More details from PyData.org:

    • (Pandas is a) high-level building block for doing practical, real world data analysis in Python

    • Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language

    • (Pandas) provides fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive

Documentation#

pandas is an enormous library.

We will concentrate on only a few elementary data analysis tasks using pandas, but more extensive descriptions of the motivations and functionality of pandas can be found in the resources below:

Data structures#

Pandas provides two primary data structures:

  • Series is built for 1-dimensional data series

  • DataFrame is built for 2-dimensional collections of tabular data

These data structures differ and are related as follows:

  • Series: 1-dimensional array of homogeneous data

  • DataFrame: 2-dimensional table of heterogeneous data, composed of multiple Series

Data structures index#

Series and DataFrame also contain an integrated index:

  • Series objects include a second array called the index that can contain homogeneous values of any type like integers, strings or datetime objects.

  • DataFrame objects include a column index. The index objects allow for very sophisticated selection operations using square brackets ([]) along with some specialized selection functions.

Fast operations#

pandas is fast, built on numpy, but goes beyond:

  • Series and DataFrames are built upon numpy.ndarray

  • pandas supports the same ufunc operations as in numpy, and the same fast vectorized computations.

  • goes beyond numpy by providing elementwise string and datetime operations on indexing.

  • pandas uses and supports additional C extensions written in Cython.

Integration with ecosystem#

pandas is tightly integrated with the rest of the scientific Python ecosystem

  • pandas is built on numpy arrays and ufuncs

  • pandas data structures can be passed into numpy, matplotlib, and bokeh methods

  • pandas has built-in visualization using matplotlib

  • pandas is a dependency for Python statistics library statsmodels

Getting started with pandas#

# conventional way to import pandas:
import pandas as pd
# explore the local data files:
!ls -l data/
!head -5 data/SalesJan.csv
# create a new DataFrame reading from a file "dataset of transactions for a non-store online retail"
sales = pd.read_csv('data/SalesJan.csv',sep=";")

type(sales)

Data Inspection#

Use the method DataFrame.head() to inspect the first few rows of data:

  • great way to inspect smaller data sets

  • useful for verifying you’ve read the right file

sales.head()
#pd.read_csv?
sales = pd.read_csv('data/SalesJan.csv', sep=';', parse_dates=[0,8,9])

Data Inspection (2)#

Use the method DataFrame.tail() to inspect the last few rows of data:

# How to visualize the last five rows of the DataFrame
sales.tail()

Data Inspection (3)#

# How to visualize all the DataFrame
sales

Data Inspection (4)#

# How to set the number of rows to be visualized
pd.set_option('max_rows',8)
sales

Data Inspection (5)#

Now use the Pandas DataFrame.sample() method to select randomly e a bit more detail

sales.sample(10)

Data Inspection (6)#

Now use the Pandas DataFrame.info() method to see a bit more detail

sales.info()

We get the name of each column, the Numpy dtype of each column, how many real values are present and the amount of memory used

side note: Pandas has excellent support for not-a-number (NaN) entries in DataFrames and Series.

Data Statistics#

In Pandas we can very easily perform many types of statistical operations

# How to calculate summary statistics of a DataFrame
sales.describe()
# Describe all columns of a DataFrame of type object:
sales.describe(include='object')
# Describe all columns of a DataFrame regardless of data type:
pd.set_option('max_rows',14)
sales.describe(include='all')

DataFrame attributes#

# Number of rows and columns
sales.shape
# Data type of each column:
sales.dtypes
# How to get the names of columns
sales.columns

Renaming columns in a DataFrame#

# rename selected columns by using the 'rename' method
sales.rename(columns={ 'Name':'Customer_Name', 'City':'Customer_City', }, \
              inplace=True)
sales.columns
# a list with the modified columns name
myColumns = list(sales.columns) 
myColumns
# replace the col names during the file reading process 
# by using 'names' parameter
sales2 = pd.read_table('data/SalesJan.csv', sep=';', header=0, names=myColumns,  parse_dates=[0,8,9])
sales2.columns

Removing columns from a DataFrame (1)#

# remove (temporarily) a single column (axis=1 refers to columns)
sales2.drop('Latitude', axis=1).head(3)
sales2.info()

Removing columns from a DataFrame (2)#

To permanently remove a column from the DataFrame, you have to used the drop method, with the argument inplace=True

# To permanently remove the column from the DataFrame
sales2.drop('Latitude', axis=1, inplace=True)
sales2.info()

Removing columns from a DataFrame (3)#

# how to remove multiple columns at once
sales2.drop(['Account_Created','Last_Login','Customer_Name', 'Longitude'], axis=1, inplace=True)
sales2.head(3)

Removing rows from a DataFrame#

# remove multiple rows at once (axis=0 refers to rows)
sales2.drop([0, 1], axis=0, inplace=True)
sales2.head()

Hands-on #1: reading and inspecting data#

  1. Create a new Dataframe reading from the file ‘data/drinks.csv’

  2. Inspect data in the new DataFrame

  3. Calculate statistics on the new DataFrame

  4. Try to rename one or more columns of the DataFrame

  5. Remove (temporarily and permanently) the column ‘total_litres_of_pure_alcohol’ from the DataFrame

Columns names as DataFrame attributes#

# Get columns of data DataFrame
sales.columns
# Each column name is an attribute of a DataFrame
sales.Product
  • The output of the previous command looks different than the typical DataFrame output! What is the type of data.Invoice?

Series#

# how to get the type of orders.orderId
type(sales.Product)
  • A Pandas Series is a single vector of data (like a numpy array), with an index that labels each element in the vector

  • It supports both integer-based and label-based indexing

How to make a series#

# Make a Series
S = pd.Series([632, 1638, 569, 115])
S

Series attributes#

# How to get the values of a series
S.values
# How to get the index of a series
S.index
  • A numpy array contains the values of the series

  • If an index is not specified, the natural sequence of integers is assigned as index

  • The index of a series is a pandas index object

pandas.Series.describe()#

# How to calculate summary statistics of a Series
S.describe()

Name what you see#

  • We can assign meaningful labels to the index, if they are available

S1 = pd.Series([632, 1638, 569, 115], index=['A', 'B', 'C', 'D'])
S1
# Now it looks more like a dictionary
S1['C']

Series from a dictionary#

myDict = {'A': 632, 'B': 1638, 'C': 569, 'D': 115}
pd.Series(myDict)

Naming index and array of values of a series#

# We can give both the array of values and the index 
# meaningful labels themselves

S1.name = 'counts'
S1.index.name = 'Letter'
S1

Numpy and pandas series#

NumPy’s math functions and other operations can be applied to Series without losing the data structure

import numpy as np

np.log(S1)

The truth about Series#

Booleans mask

mask = [False, True, True, False]
S1[mask]

Find what is missing#

If we pass a custom index to Series it will select the corresponding values from the dict

S2 = pd.Series(myDict, index= ['E','A','B','C'])
S2

Indices without corresponding values are treat as missing

Pandas uses the NaN (not a number) type for missing values

# How to find what is missing
S2.isnull()

Adding two series#

The labels are used to align data when used in operations with other series.

S1 + S2
  • We have a different behavior from numpy, where arrays of the same length are combined element-wise

  • In the resulting series, values are the combination of original values with the same label; the missing values were propagated by addition

Back to DataFrame#

How to make a DataFrame from scratch#

  • You can make a DataFrame from scratch, using the class pd.DataFrame

  • DataFrame inputs could be:

    1. a Python dictionary of 1D sequences (e.g. ndarrays, lists, dicts, ..)

    2. a 2-D numpy.ndarray

    3. pandas Series

    4. another DataFrame

DataFrame from a dictionary#

import numpy as np

myDict = {'colA':[1,2,3,4,5,6,7,8,9,10],
          'colB':np.linspace(0,np.pi,10),
          'colC':0.0,
          'colD':["a","b","c","a","b","c","a","b","c","a"]}
myDF = pd.DataFrame(myDict)
myDF

DataFrame from a 2D numpy array#

myMatrix = np.random.random((1000,4))
myDF = pd.DataFrame(myMatrix, \
       columns=['firstCol','secondCol','thirdCol','fourCol'])
myDF

Reading a new DataFrame#

# How to read a DataFrame from the first 3 columns of a .csv.gz file
A = pd.read_csv('data/num.csv.gz', header=None, \
    names=['Elevation', 'Aspect', 'Slope'], usecols=range(0, 3))
A.info()

Selecting and examining columns#

type(A['Aspect'])
# For each value, count number of occurrences
A['Aspect'].value_counts()

Aggregating data: the groupby method#

Problem: for each unique value in column Aspect of our DataFrame, we want to calculate the arithmetic mean of ALL other numeric columns

A.groupby('Aspect').mean()
# How to calculate the mean of 'Elevation' values grouped by 'Aspect'
A.groupby('Aspect').Elevation.mean()

New columns in a DataFrame#

How to add a new column as a function of existing columns

A['new_col1'] = A.Elevation * 10
A['new_col2'] = A['Slope'] + A['Aspect'] - 1
# Check it
A.info()
# rename a column
A.rename(columns={'new_col2':'a_sum'}, inplace=True)
A

Recap: attributes and methods#

A.columns    # column names (which is "an index")
A.dtypes     # data types of each column
A.shape      # number of rows and columns
A.values     # underlying numpy array

… and many more: just write A. [and press TAB]

Filtering data#

A first simple filter#

# Filter rows, based on column values
A[A['Aspect'] == 20]

Advanced logical filtering#

# use multiple conditions
condition = (A.Elevation < 2400) & (A.Slope == 17)
cols = ['Slope', 'Aspect']
A[condition][cols]

Note: AND and OR do not work inside data frame conditions

Filtering of specific values#

B = A[A['Aspect'].isin([13,17])]
B
B.shape

Hands-on #2: grouping data#

Using the dataset from the file ‘data/drinks.csv’, calculate:

  1. the mean beer servings across the entire dataset

  2. the mean beer servings just for countries in Africa

  3. the mean and the max beer servings for each continent

Selecting rows and columns from a DataFrame#

The loc method is used to select rows and columns by label. You can pass it:

  • A single label

  • A list of labels

  • A slice of labels

  • A boolean Series

  • A colon (which indicates “all labels”)

# row 0, all columns
A.loc[0]
# How to get rows 0 and 1 and 2, all columns
A.loc[[0, 1, 2], :]
# How to get rows 1 through 2 (inclusive), all columns
A.loc[1:2, :]
# also this implies "all columns", but it's better to be explicit!
A.loc[1:2]
# How to get rows 0 through 2 (inclusive), 
# columns 'Aspect' and 'Slope'
A.loc[0:2, ['Aspect', 'Slope']]
# How to get rows 0 through 2 (inclusive), 
# columns 'Elevation' through 'Slope' (inclusive)
A.loc[0:2, 'Elevation':'Slope']
# How to get a single element of the DataFrame
A.loc[1,'Slope']
# How to get rows 0 through 2 (inclusive), column 'Slope'
A.loc[0:2, 'Slope']

Selecting rows and columns by position#

The iloc method is used to select rows and columns by position. Purely integer-location based indexing for selection by position

A.head(2)
# How to get the element in row 1 and column 4 ('a_sum')
A.iloc[1,4]
# How to get rows 0 through 2 (inclusive), column 4 ('a_sum')
A.iloc[0:3,4]

Ordering Data#

# How to sort rows by label
A.sort_index()
# How to sort rows by a specific column
A.sort_values(by='Aspect')
# How to sort using descending order instead
A.sort_values(by='Aspect', ascending=False)
# How to sort by multiple columns
A.sort_values(by=['Aspect', 'Slope'])

Hands-on #3#

  • Create a pandas structure from the dataset below

  • Find the letter(s) with maximum number of ones and twos

# Our small data set
d = {'one':[1,1,1,1,1,1,1],
     'two':[2,2,2,2,2,2,2],
     'letter':['a','a','a', 'b','b','c', 'c']}

So far#

  • Check data before getting started

  • Choose columns by label or index

  • Filter rows on index and values

  • Group by

  • Sort

note: just like databases

hint: we can do better

Uniqueness of data#

How to find what is duplicated?

# Series of booleans (True if a row is identical to a previous row)
A.duplicated()

Working with duplicated data#

# Count duplicates
A.duplicated().sum()
# Find the duplicate with minimum elevation value
A[A.duplicated()].sort_values(by='Elevation').head(1)
# Check if it really is a duplicate
A[(A.Elevation==2331) & (A.Aspect==185)]

Delete duplicated data#

pd.set_option('max_rows', 0)
# Drop duplicate rows
B = A.drop_duplicates()
# The DataFrame B doesn't have duplicated rows!
B.duplicated().sum()
# How to delete duplicated rows in A
A.drop_duplicates(inplace=True)
# Now also A doesn't have duplicated rows
A.duplicated().sum()

The index#

# Use a different example dataset
baseball = pd.read_csv("https://raw.githubusercontent.com/fonnesbeck/statistical-analysis-python-tutorial/master/data/baseball.csv", index_col='id')
pd.set_option('max_columns', 8)
pd.set_option('max_rows', 8)
baseball 

Reindexing#

type(baseball.index)

Reindexing allows users to manipulate the data labels in a DataFrame

# Alter the order of the rows
baseball.reindex(baseball.index[::-1]).head()
# Let's try all the possible keys
cols = ['player','year','team']
id_range = range(baseball.index.values.min(), 
                 baseball.index.values.max())
baseball[cols].reindex(id_range).head()
# Fill the blanks
baseball[cols].reindex(id_range, fill_value='MR Unknown', columns=['player']).head()
# Using different methods:
baseball[cols].reindex(id_range, method='ffill').head()

Multiple index#

If you are feeling forced to combine fields to obtain a unique index, that was not simply an integer value

A more elegant way is to create a hierarchical index from fields

baseball_h = baseball.set_index(['year', 'team', 'player'])
baseball_h.head(10)
baseball_h.index.is_unique

Combining DataFrames#

the index magic

df1 = pd.DataFrame({'A': ['A0','A1'], 'B': ['B0','B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A4','A5','A6'], \
                    'B': ['B4','B5','B6']},index=[4, 5, 6])
df3 = pd.DataFrame({'A': ['A7','A8'],'B': ['B7','B8']}, index=[7, 8])

# Combine the 3 DataFrames:
pd.concat([df1,df2,df3])

Combining DataFrames: an alternative method#

# Alternative combo
df1.append(df2).append(df3)

Joining data#

pandas has full-featured, high performance in-memory join operations, idiomatically very similar to relational databases like SQL.

df1 = pd.DataFrame({'A': ['A0','A1'], 'key': ['k1','k2'], \
     'B': ['B0','B1']}, index=[0, 1])

df2 = pd.DataFrame({'A': ['A4','A5','A6','A7'], \
     'key': ['k2','k1','k4','k3'], 'B': ['B4','B0','B6','B7']}, \
     index=[4, 5, 6, 7])
df1
df2

The merge() method#

pd.merge(df1,df2, on='key')
# Join on multiple keys:
pd.merge(df1,df2, on=['key','B'])

pandas I/O#

Snapshot of a DataFrame#

Quickly write the current status of a DataFrame to a CSV file

csvfile = 'data/my_file.csv'
A.to_csv(csvfile)
# Warning: index is used as first column
!head {csvfile}
# Better to ignore the index column!
A.to_csv(csvfile, index=False)
!head {csvfile}

Reading from a remote source#

# read CSV file directly from a URL and save the results
data = pd.read_csv(\
    'http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv', index_col=0)
data

pandas I/O in Excel files#

# Our xls file
xlsfile = 'data/output.xlsx'

# how to write on a xls file from A DataFrame
A[:2500].to_excel(xlsfile,'Sheet1')
# how to read data from a xls file
B = pd.read_excel(xlsfile, index_col=0, sheet_name='Sheet1')
B.head()

From dataframe to JSON#

jsonfile = 'data/mydata.json'

# Save the first 5 rows
B.loc[:5].to_json(jsonfile)

# Show what we get
!more {jsonfile}

hint: reverse is pd.read_json

pandas alchemy#

  • Pandas integrates well with sqlalchemy, the most famous ORM; you should use it, if you work with SQL databases in python

  • List of url relational database engines for sqlalchemy

    • sqlite

    • postgres

    • mysql

    • oracle

    • microsoft

write data to a SQL db#

Write records stored in a DataFrame to a SQL database.

Parameters

name : string Name of SQL table

con : SQLAlchemy engine or DBAPI2 connection (legacy mode)

Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.

source: pd.DataFrame.to_sql?

read data from a SQL db#

# read db table in 3 lines
from sqlalchemy import create_engine
con = create_engine('postgresql://user@hostname:port/database')
df = pd.read_sql_table('mytable', con)

## An uncomfortable DataFrame

# A new DataFrame
mydata = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes', 
                           'value': 632},
        1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
        2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
        3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
        4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
        5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
        6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
        7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})
mydata

## Transpose data

mydata = mydata.T
mydata  

## Beware, pandas is very powerful!

# Take a Subseries and modify
vals = mydata.value
vals[5] = 0
vals

Some effect on mydata?#

mydata.loc[5,'value']

just like numpy, the series vals is a view of mydata

# a deep copy can help us
vals = mydata.value.copy()
vals[5] = 1000
mydata.loc[5,'value']

Plots#

%matplotlib inline

import matplotlib.pyplot as plt
sales = pd.read_csv('data/SalesJan.csv', sep=';', usecols=['Product','Price','Country'])
sales.sample(5)
# Top 10 sales' countries:
sales_p = (sales.groupby(by='Country')['Price','Product']\
 .agg({'Price': 'sum', 'Product': 'count'}))\
 .sort_values(by='Price', ascending=False)[:10].reset_index()

sales_p.rename(columns={'Price':'Total_Sales', 'Product':'Total_Product'}, inplace=True)

fig, (ax0, ax1) = plt.subplots(nrows=1,ncols=2, sharey=True, figsize=(10,4))
sales_p.plot(kind="barh", y="Total_Product", x="Country", ax=ax0)
sales_p.plot(kind="barh", y="Total_Sales", x="Country", ax=ax1)
ax1.set_xlim([0, 740000])
plt.xticks(rotation=45)
plt.show()
# Improve the style of the plot:
plt.style.use('ggplot')

fig, (ax0, ax1) = plt.subplots(nrows=1,ncols=2, sharey=True, figsize=(10,4))
sales_p.plot(kind="barh", y="Total_Product", x="Country", ax=ax0)
ax0.set(title='Products', xlabel='Total Product')
ax0.legend().set_visible(False)

sales_p.plot(kind="barh", y="Total_Sales", x="Country", ax=ax1)
ax1.set_xlim([0, 740000])
ax1.set(title='Sales', xlabel='Total Sales')
plt.xticks(rotation=45)
ax1.legend().set_visible(False)
plt.show()

A level up: seaborn library#

built on top of matplotlib#

import seaborn as sns
fig, (ax0, ax1) = plt.subplots(nrows=1,ncols=2, sharey=True, figsize=(10,4))

sns.barplot(y="Country", x="Total_Product", data=sales_p, label="Total", color='r', ax=ax0)
sns.barplot(y="Country", x="Total_Sales", data=sales_p, label="Total", color='r', ax=ax1)
# and many other plots... 
# https://seaborn.pydata.org/examples/index.html

sns.distplot(sales_p['Total_Sales'], bins=16, color="red" )

Timeseries in pandas#

pandas has a built in support for time series

hint: time objects are the most difficult to make compatibles

# Monthly range between a time period
dates = pd.date_range('2010-01', '2015-06', freq='M')
# Pandas makes this easy!
dates
# Create a subset of our original dataframe
B = A.head(len(dates)).copy()
# Since the size is the same, we can add months as a new column
B['time'] = dates
B
B.set_index('time').plot()
B.Elevation.plot()
# Let's work now on a single Series
C = pd.Series(B.Aspect.values, copy=True, index=dates)
C.head(1)
# Select a time subset
C['2011':'2013'].plot()
D = pd.Series(B.Slope, copy=True)
D.index = dates
T = pd.DataFrame({'s1':C, 's2':D})
# Select a time subset
T['2011-11':'2013-03'].plot()

The pandas time-series API includes:

  • Creating date ranges

    • From files

    • From scratch

  • Manipulations

    • Shift, resample, filter

  • Field accessors

    • (e.g., hour of day)

  • Plotting

  • Time zones -localization and conversion

pandas requires “continuous training and meditation”#

  • Pandas is a very powerful and sometimes complex framework

  • Many operations

    • Some operations can be obtained in different but equivalent ways

End of chapter#

# Versions
%load_ext version_information
%version_information numpy, scipy, pandas, matplotlib, seaborn