This is one of the posts in a series of introductory Python articles. These posts explain programming principles in a simplified way and show very basic Python code that should help folks learning Python to get a better understanding of some concepts. In this post, I’ll share some notes about working with a popular data science and data analysis library - pandas
.
I have simplified this dataset that is part of the Census from this this portal to illustrate how pandas
work.
Introduction¶
pandas
is one of the most popular data analysis and data munging library implemented as a Python package. pandas
provides an easy-to-use interface for exploring all kinds of datasets. It has hundreds of useful functions and operations, so it is impossible to go through all of them. We will start exploring the basics of pandas
first by loading a .csv
file into a pandas
data frame.
import pandas as pd
pd.set_option('display.max_rows', 7)
df = pd.read_csv('~/data/counties.csv')
df
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
0 | Alabama | Autauga County | 54571.0 | 54597 |
1 | Alabama | Baldwin County | 182265.0 | 182265 |
2 | Alabama | Barbour County | 27457.0 | 27455 |
... | ... | ... | ... | ... |
3139 | Wyoming | Uinta County | 21118.0 | 21121 |
3140 | Wyoming | Washakie County | 8533.0 | 8528 |
3141 | Wyoming | Weston County | 7208.0 | 7208 |
3142 rows × 4 columns
Now we would like to create a new .csv
file with only counties in the Oregon state.
df = pd.read_csv('~/data/counties.csv')
oregon_counties = df[(df.STATE == 'Oregon')]
oregon_counties.to_csv('~/data/oregon.csv', index=False)
pd.read_csv('~/data/oregon.csv')
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
0 | Oregon | Baker County | 16134.0 | 16131 |
1 | Oregon | Benton County | 85579.0 | 85581 |
2 | Oregon | Clackamas County | 375992.0 | 375996 |
... | ... | ... | ... | ... |
33 | Oregon | Washington County | 529710.0 | 529862 |
34 | Oregon | Wheeler County | 1441.0 | 1439 |
35 | Oregon | Yamhill County | 99193.0 | 99216 |
36 rows × 4 columns
This kind of operations and many others that you will learn now can be done using any RDBMS such SQL Server, MySQL, or PostgreSQL or an office program such as Excel. However, not all of your data will originate or be stored in a RDBMS. We could load .csv
files into a RDBMS to do this kind of work, but if there is no need to manage the data over time in the database, we may use the wrong tool for the job.
Basic operations¶
As you see, it is very easy to load external data into something that is called a DataFrame
. You can think of this as a relational table that consists of rows and columns. This is what you would expect to have after importing a .csv
file into an Excel sheet. Let’s explore some of the basic operations that are available for a DataFrame
object:
# getting only the first 5 rows, equivalent to df[0:5]
df.head()
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
0 | Alabama | Autauga County | 54571.0 | 54597 |
1 | Alabama | Baldwin County | 182265.0 | 182265 |
2 | Alabama | Barbour County | 27457.0 | 27455 |
3 | Alabama | Bibb County | 22915.0 | 22915 |
4 | Alabama | Blount County | 57322.0 | 57322 |
# sort in-place to get most populated counties
df.sort_values('POP2010', ascending=False, inplace=True)
df.head()
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
204 | California | Los Angeles County | 9818605.0 | 9819968 |
610 | Illinois | Cook County | 5194675.0 | 5195026 |
2623 | Texas | Harris County | 4092459.0 | 4093176 |
103 | Arizona | Maricopa County | 3817117.0 | 3817365 |
222 | California | San Diego County | 3095313.0 | 3095349 |
# filter by condition, show only counties in California state
df[df.STATE == 'California']
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
204 | California | Los Angeles County | 9818605.0 | 9819968 |
222 | California | San Diego County | 3095313.0 | 3095349 |
215 | California | Orange County | 3010232.0 | 3008989 |
... | ... | ... | ... | ... |
210 | California | Modoc County | 9686.0 | 9682 |
231 | California | Sierra County | 3240.0 | 3239 |
187 | California | Alpine County | 1175.0 | 1175 |
58 rows × 4 columns
# show rows with missing values for POP2010 column
df[df.POP2010.isnull()]
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
9 | Alabama | Cherokee County | NaN | 25979 |
# show only counties within the two states and sort by state and then by population
df[df.STATE.isin(['Delaware', 'Texas'])].sort_values(
['STATE', 'POP2010'], ascending=[True, False])[:10]
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
317 | Delaware | New Castle County | 538479.0 | 538484 |
318 | Delaware | Sussex County | 197145.0 | 197103 |
316 | Delaware | Kent County | 162310.0 | 162350 |
... | ... | ... | ... | ... |
2749 | Texas | Travis County | 1024266.0 | 1024444 |
2593 | Texas | El Paso County | 800647.0 | 800633 |
2565 | Texas | Collin County | 782341.0 | 781419 |
10 rows × 4 columns
# getting total population for all counties in a state
df.POP2010[df.STATE == 'Texas'].sum()
25145561.0
# get number of counties within each state
df.STATE.value_counts()
Texas 254
Georgia 159
Virginia 133
...
Hawaii 5
Delaware 3
District of Columbia 1
Name: STATE, Length: 51, dtype: int64
# sum population of counties within a state;
# list sorted by population (omitting all other columns for brevity)
res = df[['STATE', 'POP2010']].groupby('STATE').sum()
res.sort_values('POP2010', ascending=False)
POP2010 | |
---|---|
STATE | |
California | 37253956.0 |
Texas | 25145561.0 |
New York | 19378102.0 |
... | ... |
Vermont | 625741.0 |
District of Columbia | 601723.0 |
Wyoming | 563626.0 |
51 rows × 1 columns
pandas
has a concept of boolean indexing which provides powerful techniques for selecting rows using functions and various criteria. In the example below, each item in the bool_index
list variable is either True
or False
depending on whether an item in the column STATE
when passed into the startswith()
function returned True
or False
. This list is used then to select rows in df
- those rows for which there is True
in the bool_index
will be returned.
Here we return those counties with population larger than a certain value which are in a state with the name of the state starting with W
.
bool_index = [x.startswith('W') for x in df['STATE']]
df[(df['POP2010'] < 6000) & bool_index]
STATE | COUNTY | POP2010 | EST2010 | |
---|---|---|---|---|
3072 | Wisconsin | Iron County | 5916.0 | 5916 |
3044 | West Virginia | Wirt County | 5717.0 | 5714 |
3127 | Wyoming | Hot Springs County | 4812.0 | 4812 |
... | ... | ... | ... | ... |
2987 | Washington | Wahkiakum County | 3978.0 | 3979 |
3132 | Wyoming | Niobrara County | 2484.0 | 2484 |
2964 | Washington | Garfield County | 2266.0 | 2266 |
9 rows × 4 columns
As you have seen, with pandas
it is very easy to get useful insights about the data and do some data filtering, sorting, and SQL like selections. What would require many lines of code in plain Python (using collections.Counter
, creating intermediate data structures, and copies of data) could be done in just one line in pandas
. Using this package can make you a more productive developer or an analyst. It can also simply be very handy to be able to clean the data as needed when using a graphical user interface isn’t very convenient.
Plotting¶
pandas
is able to load data from all kinds of sources such as csv
, Excel
, HTML
, and SQL
databases.
However, it is also possible to construct a DataFrame
from all kinds of Python data structures such as dictionaries, lists, and tuples as well as from numpy
arrays. This means that if any existing program already creates such a data structure, it’s very likely that you’ll be able to construct a DataFrame
from it.
pandas
also has built-in plotting functions which use matplotlib
features - one of the popular Python plotting libraries.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('~/data/states.csv')
df.sort_values('POP2010', ascending=False, inplace=True)
fig = df[['STATE', 'POP2010', 'EST2019']].head(10).plot(
kind='bar', x='STATE', alpha=0.8, legend=True, figsize=(15, 7),
title="Population change trend")
fig.legend(['Population 2010 (census)', 'Population 2019 (estimate)'])
fig.set_xlabel('')
fig.set_ylabel('Population (millions)')
Text(0, 0.5, 'Population (millions)')
df['POPDIFF'] = (df['EST2019'] - df['POP2010']) / df['EST2019']
df.sort_values('POPDIFF', ascending=False, inplace=True)
fig = df[['STATE', 'POPDIFF']].tail(15).plot(
kind='bar', x='STATE', figsize=(15, 7), legend=None,
title='Population change estimate from 2010 to 2019')
fig.set_xlabel('')
fig.set_ylabel('Population change (millions)')
fig.axhline(linewidth=2, color='g')
As you can see it is relatively easy to plot images with pandas
and matplotlib
. There are so many other options and graph types to choose from; you may explore more of those on the matplotlib
home page.
Happy analyzing!