import pandas as pd
import numpy as np
1 pandas series and data frames
In this lesson we introduce the two core objects in the pandas library, the pandas.Series and the pandas.DataFrame. The overall goal is to gain familiarity with these two objects, understand their relation to each other, and review Python data structures such as dictionaries and lists.
Learning objectives
By the end of this lesson, students will be able to:
- Explain the relation between pandas.Series and pandas.DataFrame
- Construct simple pandas.Series and pandas.DataFrame from scratch using different initialization methods
- Perform simple operations on pandas.Series
- Navigate the pandas documentation to look for attributes and methods of pandas.Series and pandas.DataFrame
pandas
pandas [1] [2] is a Python package to wrangle and analyze tabular data. It is built on top of NumPy and has become the core tool for doing data analysis in Python.
The standard abbreviation for pandas is pd. Here we will import it together with NumPy:
Always import all your packages in a single cell at the top of your notebook! Following the PEP 8 - Style Guide for Python Code [3], each package or library import should be on a separate line.
Series
The first core object of pandas is the series. A series is a one-dimensional array of indexed data.
The main difference between a pandas.Series and a NumPy array is that a pandas.Series has an index. Let’s see the difference:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)
<class 'numpy.ndarray'>
[-0.11699598 0.6988026 -0.02075373 0.83663288]
<class 'pandas.core.series.Series'>
0 -0.116996
1 0.698803
2 -0.020754
3 0.836633
dtype: float64
Notice that the index is printed as part of the pandas.Series. Although the NumPy array is indexable, the index is not part of that data structure. Printing the pandas.Series also shows the values and their data type.
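To make this difference concrete, here is a minimal sketch that inspects the index and values attributes of a series. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

arr = np.arange(4)   # a NumPy array with values 0..3
s = pd.Series(arr)   # a series built from it, with the default index

# The index is stored with the series as its own object
index_type = type(s.index).__name__
# The underlying values are still a NumPy array
values_type = type(s.values).__name__
# The NumPy array itself carries no index attribute
array_has_index = hasattr(arr, "index")

print(index_type, values_type, array_has_index)
```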
Creating a pandas.Series
The basic method to create a pandas.Series is to call
s = pd.Series(data, index=index)
The data parameter can be:
- a list or NumPy array,
- a Python dictionary, or
- a single number, boolean (True/False), or string.
The index parameter is optional. If we wish to include it, it must be a list of indices of the same length as data.
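As a small illustration of the length requirement on the index, the sketch below builds one series with a matching index and then tries one with an index that is too short. The data and labels are made up for the example.

```python
import pandas as pd

data = [10, 20, 30]

# Index with the same length as the data: this works
ok = pd.Series(data, index=['a', 'b', 'c'])

# Index with a different length: pandas raises a ValueError
try:
    pd.Series(data, index=['a', 'b'])
    length_error = None
except ValueError as err:
    length_error = str(err)

print(ok)
print(length_error)
```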
Example: Creating a pandas.Series from a NumPy array
Let’s create a pandas.Series from a NumPy array. To use this method we need to pass a NumPy array (or a list of objects that can be converted to NumPy types) as data. Here, we will also include the list [2023, 2024, 2025] to be used as an index:
# A series from a numpy array
pd.Series(np.arange(3), index=[2023, 2024, 2025])
2023 0
2024 1
2025 2
dtype: int64
Example: Creating a pandas.Series from a list
Here we create a pandas.Series from a list of strings. Remember that the index parameter is optional. If we don’t include it, the default is to make the index equal to [0,...,len(data)-1]. For example:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])
0 EDS 220
1 EDS 222
2 EDS 223
3 EDS 242
dtype: object
Example: Creating a pandas.Series from a dictionary
Recall that a dictionary is a set of key-value pairs. If we create a pandas.Series from a dictionary, the keys become the index and the values become the corresponding data.
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)
key_0 2
key_1 3
key_2 5
dtype: object
Notice that in this and the previous example the data type of the values in the series is object. This data type in pandas usually indicates that the series is made up of strings. However, we can see in this example that the object data type can also indicate a mix of strings and numbers.
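One way to check what is inside an object series is to look at the Python type of each element. The sketch below compares the mixed dictionary from the example with a hypothetical all-integer version, d_ints, introduced only for contrast.

```python
import pandas as pd

# Dictionary from the example above: one value is the string '3'
d_mixed = {'key_0': 2, 'key_1': '3', 'key_2': 5}
# Hypothetical variant where every value is an integer
d_ints = {'key_0': 2, 'key_1': 3, 'key_2': 5}

s_mixed = pd.Series(d_mixed)
s_ints = pd.Series(d_ints)

# Python type of each element in the mixed series
element_types = [type(v).__name__ for v in s_mixed]

print(s_mixed.dtype, element_types)
print(s_ints.dtype)
```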
Example: Creating a pandas.Series from a single value
If we provide only a single number, boolean, or string as the data for the series, we should also provide an index. The value will be repeated to match the length of the index. Here, we create a series from a single float number with an index given by a list of strings:
pd.Series(3.0, index = ['A', 'B', 'C'])
A 3.0
B 3.0
C 3.0
dtype: float64
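The following sketch shows how the length of the index controls the repetition. It also includes the case with no index at all, where the result is a series with a single element. The variable names are illustrative.

```python
import pandas as pd

# The single value is repeated once per index label
repeated = pd.Series(3.0, index=['A', 'B', 'C'])

# Without an index, the series holds just the one value
single = pd.Series(3.0)

print(len(repeated), len(single))
```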
Simple operations
Arithmetic operations work on series, and so do most NumPy functions. For example:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)
Andrea 9.8
Beth 7.3
Carolina 6.5
dtype: float64
Andrea 3.637971e+42
Beth 5.052394e+31
Carolina 1.694889e+28
dtype: float64
Andrea 98
Beth 73
Carolina 65
dtype: int64
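Operations can also be combined, and the result keeps the index of the original series. A minimal sketch, using the same values under the illustrative name scores:

```python
import numpy as np
import pandas as pd

scores = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Several arithmetic operations in one expression, applied element by element
rescaled = scores * 2 + 1

# A NumPy function applied to the series returns a new series
roots = np.sqrt(scores)

# The result carries the same index labels as the original
same_index = list(roots.index) == list(scores.index)

print(rescaled)
print(roots)
```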
We can also produce new pandas.Series with True/False values indicating whether the elements in a series satisfy a condition or not:
s > 70
Andrea True
Beth True
Carolina False
dtype: bool
These kinds of simple conditions on pandas.Series will be key when we select data from data frames.
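As a brief preview, and only as a sketch, a boolean series can be passed inside square brackets to keep the elements where the condition is True. Selection is covered properly in a later lesson.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the value is greater than 70
above_70 = s > 70

# Keep only the elements where the condition holds
selected = s[above_70]

# True counts as 1, so the sum is the number of elements that satisfy the condition
n_above = int(above_70.sum())

print(selected)
print(n_above)
```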
Identifying missing values
In pandas we can represent a missing, NULL, or NA value with the float value numpy.nan, which stands for “not a number”. Let’s construct a small series with some NA values represented this way:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s
0 1.0
1 2.0
2 NaN
3 4.0
4 NaN
dtype: float64
Notice the data type of the values in the series is still float64: np.nan is itself a float, so the integers we passed are stored as floats.
The hasnans attribute for a pandas.Series returns True if there are any NA values in it and False otherwise:
# Check if series has NAs
s.hasnans
True
After detecting there are NA values, we might be interested in knowing which elements in the series are NAs. We can do this using the isna method:
s.isna()
0 False
1 False
2 True
3 False
4 True
dtype: bool
The output is a pandas.Series of boolean values indicating whether the element at the given index is np.nan (True = is NA) or not (False = not NA).
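Because the output of isna is itself a boolean series, we can work with it further. The sketch below counts the NA values and lists the index labels where they occur. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Boolean series marking the NA positions
is_missing = s.isna()

# True counts as 1, so the sum is the number of NA values
n_missing = int(is_missing.sum())

# Index labels of the elements that are NA
missing_labels = s[is_missing].index.tolist()

print(n_missing, missing_labels)
```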
- The integer number -999 is often used to represent missing values. Create a pandas.Series named s with four integer values, two of which are -999. The index of this series should be the letters A through D.
- In the pandas.Series documentation, look for the method mask(). Use this method to update the series s so that the -999 values are replaced by NA values. HINT: check the first example in the method’s documentation.
There’s much more to say about pandas.Series, but this is enough to get us going. At this point, we mainly want to know about pandas.Series because pandas.Series are the columns of a pandas.DataFrame.
Data frames
The pandas.DataFrame is the most used pandas object. It represents tabular data and we can think of it as a spreadsheet. Each column of a pandas.DataFrame is a pandas.Series.
Creating a pandas.DataFrame
There are many ways of creating a pandas.DataFrame. We present a simple one in this section.
We already mentioned that each column of a pandas.DataFrame is a pandas.Series. In fact, we can think of the pandas.DataFrame as a dictionary of pandas.Series, with each column name being the key and the column values being the key’s value. Thus, we can create a pandas.DataFrame in this way:
# Initialize dictionary with columns' data
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df
 | col_name_1 | col_name_2 |
---|---|---|
0 | 0 | 3.1 |
1 | 1 | 3.2 |
2 | 2 | 3.3 |
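The dictionary analogy can be checked directly: selecting a column by its name, the way we would look up a key in a dictionary, returns a pandas.Series. A minimal sketch using the same data:

```python
import numpy as np
import pandas as pd

d = {'col_name_1': pd.Series(np.arange(3)),
     'col_name_2': pd.Series([3.1, 3.2, 3.3])}
df = pd.DataFrame(d)

# Look up a column by its name, as with a dictionary key
col = df['col_name_2']

# The column names play the role of the dictionary keys
keys = list(df.keys())

print(type(col))
print(keys)
```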
We can change the index by changing the index attribute in the data frame:
# Change index
df.index = ['a','b','c']
df
 | col_name_1 | col_name_2 |
---|---|---|
a | 0 | 3.1 |
b | 1 | 3.2 |
c | 2 | 3.3 |
We can access the data frame’s column names via the columns attribute. Update the column names to C1 and C2 by updating this attribute.
Next
Jump to the week 1 discussion section to practice preliminary data exploration with a real world dataset. Then, continue with the next lesson on subsetting data frames.