Python basics - Pandas · Alexey Tereshenkov

This is one of the posts in a series of introductory Python articles. These posts explain programming principles in a simplified way and show very basic Python code that should help folks learning Python to get a better understanding of some concepts. In this post, I’ll share some notes about working with a popular data science and data analysis library - pandas.

I have simplified this dataset that is part of the Census from this this portal to illustrate how pandas work.

Introduction¶

pandas is one of the most popular data analysis and data munging library implemented as a Python package. pandas provides an easy-to-use interface for exploring all kinds of datasets. It has hundreds of useful functions and operations, so it is impossible to go through all of them. We will start exploring the basics of pandas first by loading a .csv file into a pandas data frame.

import pandas as pd
pd.set_option('display.max_rows', 7)

df = pd.read_csv('~/data/counties.csv')
df

	STATE	COUNTY	POP2010	EST2010
0	Alabama	Autauga County	54571.0	54597
1	Alabama	Baldwin County	182265.0	182265
2	Alabama	Barbour County	27457.0	27455
...	...	...	...	...
3139	Wyoming	Uinta County	21118.0	21121
3140	Wyoming	Washakie County	8533.0	8528
3141	Wyoming	Weston County	7208.0	7208

3142 rows × 4 columns

Now we would like to create a new .csv file with only counties in the Oregon state.

df = pd.read_csv('~/data/counties.csv')
oregon_counties = df[(df.STATE == 'Oregon')]
oregon_counties.to_csv('~/data/oregon.csv', index=False)

pd.read_csv('~/data/oregon.csv')

	STATE	COUNTY	POP2010	EST2010
0	Oregon	Baker County	16134.0	16131
1	Oregon	Benton County	85579.0	85581
2	Oregon	Clackamas County	375992.0	375996
...	...	...	...	...
33	Oregon	Washington County	529710.0	529862
34	Oregon	Wheeler County	1441.0	1439
35	Oregon	Yamhill County	99193.0	99216

36 rows × 4 columns

This kind of operations and many others that you will learn now can be done using any RDBMS such SQL Server, MySQL, or PostgreSQL or an office program such as Excel. However, not all of your data will originate or be stored in a RDBMS. We could load .csv files into a RDBMS to do this kind of work, but if there is no need to manage the data over time in the database, we may use the wrong tool for the job.

Basic operations¶

As you see, it is very easy to load external data into something that is called a DataFrame. You can think of this as a relational table that consists of rows and columns. This is what you would expect to have after importing a .csv file into an Excel sheet. Let’s explore some of the basic operations that are available for a DataFrame object:

# getting only the first 5 rows, equivalent to df[0:5]
df.head()

	STATE	COUNTY	POP2010	EST2010
0	Alabama	Autauga County	54571.0	54597
1	Alabama	Baldwin County	182265.0	182265
2	Alabama	Barbour County	27457.0	27455
3	Alabama	Bibb County	22915.0	22915
4	Alabama	Blount County	57322.0	57322

# sort in-place to get most populated counties
df.sort_values('POP2010', ascending=False, inplace=True)
df.head()

	STATE	COUNTY	POP2010	EST2010
204	California	Los Angeles County	9818605.0	9819968
610	Illinois	Cook County	5194675.0	5195026
2623	Texas	Harris County	4092459.0	4093176
103	Arizona	Maricopa County	3817117.0	3817365
222	California	San Diego County	3095313.0	3095349

# filter by condition, show only counties in California state
df[df.STATE == 'California']

	STATE	COUNTY	POP2010	EST2010
204	California	Los Angeles County	9818605.0	9819968
222	California	San Diego County	3095313.0	3095349
215	California	Orange County	3010232.0	3008989
...	...	...	...	...
210	California	Modoc County	9686.0	9682
231	California	Sierra County	3240.0	3239
187	California	Alpine County	1175.0	1175

58 rows × 4 columns

# show rows with missing values for POP2010 column
df[df.POP2010.isnull()]

	STATE	COUNTY	POP2010	EST2010
9	Alabama	Cherokee County	NaN	25979

# show only counties within the two states and sort by state and then by population
df[df.STATE.isin(['Delaware', 'Texas'])].sort_values(
    ['STATE', 'POP2010'], ascending=[True, False])[:10]

	STATE	COUNTY	POP2010	EST2010
317	Delaware	New Castle County	538479.0	538484
318	Delaware	Sussex County	197145.0	197103
316	Delaware	Kent County	162310.0	162350
...	...	...	...	...
2749	Texas	Travis County	1024266.0	1024444
2593	Texas	El Paso County	800647.0	800633
2565	Texas	Collin County	782341.0	781419

10 rows × 4 columns

# getting total population for all counties in a state
df.POP2010[df.STATE == 'Texas'].sum()

25145561.0

# get number of counties within each state
df.STATE.value_counts()

Texas                   254
Georgia                 159
Virginia                133
                       ... 
Hawaii                    5
Delaware                  3
District of Columbia      1
Name: STATE, Length: 51, dtype: int64

# sum population of counties within a state; 
# list sorted by population (omitting all other columns for brevity)
res = df[['STATE', 'POP2010']].groupby('STATE').sum()
res.sort_values('POP2010', ascending=False)

	POP2010
STATE
California	37253956.0
Texas	25145561.0
New York	19378102.0
...	...
Vermont	625741.0
District of Columbia	601723.0
Wyoming	563626.0

51 rows × 1 columns

pandas has a concept of boolean indexing which provides powerful techniques for selecting rows using functions and various criteria. In the example below, each item in the bool_index list variable is either True or False depending on whether an item in the column STATE when passed into the startswith() function returned True or False. This list is used then to select rows in df - those rows for which there is True in the bool_index will be returned.

Here we return those counties with population larger than a certain value which are in a state with the name of the state starting with W.

bool_index = [x.startswith('W') for x in df['STATE']] 
df[(df['POP2010'] < 6000) & bool_index]

	STATE	COUNTY	POP2010	EST2010
3072	Wisconsin	Iron County	5916.0	5916
3044	West Virginia	Wirt County	5717.0	5714
3127	Wyoming	Hot Springs County	4812.0	4812
...	...	...	...	...
2987	Washington	Wahkiakum County	3978.0	3979
3132	Wyoming	Niobrara County	2484.0	2484
2964	Washington	Garfield County	2266.0	2266

9 rows × 4 columns

As you have seen, with pandas it is very easy to get useful insights about the data and do some data filtering, sorting, and SQL like selections. What would require many lines of code in plain Python (using collections.Counter, creating intermediate data structures, and copies of data) could be done in just one line in pandas. Using this package can make you a more productive developer or an analyst. It can also simply be very handy to be able to clean the data as needed when using a graphical user interface isn’t very convenient.

Plotting¶

pandas is able to load data from all kinds of sources such as csv, Excel, HTML, and SQL databases. However, it is also possible to construct a DataFrame from all kinds of Python data structures such as dictionaries, lists, and tuples as well as from numpy arrays. This means that if any existing program already creates such a data structure, it’s very likely that you’ll be able to construct a DataFrame from it.

pandas also has built-in plotting functions which use matplotlib features - one of the popular Python plotting libraries.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

df = pd.read_csv('~/data/states.csv')
df.sort_values('POP2010', ascending=False, inplace=True)
fig = df[['STATE', 'POP2010', 'EST2019']].head(10).plot(
    kind='bar', x='STATE', alpha=0.8, legend=True, figsize=(15, 7),
    title="Population change trend")
fig.legend(['Population 2010 (census)', 'Population 2019 (estimate)'])
fig.set_xlabel('')
fig.set_ylabel('Population (millions)')

Text(0, 0.5, 'Population (millions)')

png

df['POPDIFF'] = (df['EST2019'] - df['POP2010']) / df['EST2019']
df.sort_values('POPDIFF', ascending=False, inplace=True)
fig = df[['STATE', 'POPDIFF']].tail(15).plot(
    kind='bar', x='STATE', figsize=(15, 7), legend=None,
    title='Population change estimate from 2010 to 2019')

fig.set_xlabel('')
fig.set_ylabel('Population change (millions)')
fig.axhline(linewidth=2, color='g')

png

As you can see it is relatively easy to plot images with pandas and matplotlib. There are so many other options and graph types to choose from; you may explore more of those on the matplotlib home page.

Happy analyzing!

Python basics - Pandas

Introduction¶

Basic operations¶

Plotting¶

Published

Category

Tags