Python for data analysis: Getting started with Pandas

Pandas Series

Pandas series are one dimensional arrays (a list). An example one dimensional array is below:

a2
b-4
c77
d16
e19

In the below, we look at how to interact with the series. First, selecting values from a specific index and then creating our own index to replace the usual zero to n-1.

Select values from the series

Here is another quick example of selecting values using both the default index and the custom index.

Searching for an index in the series

We can check if certain index values exist in our series:

Create a series from a dictionary

We can also create a series directly from a dictionary:

In the above, the index keys will be in the same order as the dictionary. If we want to change the order, we can pass this in:

Series information

We can check for null values in our series, using the below:

The series automatically aligns index labels for arithmetic functions (similar to a join in SQL):

We can name our series too:

Pandas Dataframes

Dataframes are two dimensional arrays. Here is an example of a 2D array:

 ManufacturerTotal SalesTotal Profit
0BMW2500000600000
1Nissan55000001000000
2Renault1500000220000

Setting up a dataframe

We can declare 2D arrays using the below structure & then convert that to a dataframe, using the pd.DataFrame() function.

We can inspect our newly created dataframe, seeing only the top x rows.

Or we can use the tail function to inspect only the bottom x rows.

In the below example, I’m selecting only certain columns from my dataframe:

We may want to re-order some columns for easier reading. We can do that using the below:

We may wish to transpose our dataframe (swap the rows and columns around). We can do that as below:

We can also inspect our dataframe using the below function:

Create dataframe from a file

We can create a dataframe from a file, as below. In this script, we define the delimiter as being a comma and infer the header from the file.

Searching Dataframes

We can retrieve a specific field from the dataframe, like below:

Or we could do the same with the below notation:

We can select a specific row like this – this assumes we know the index number of that row:

Select WHERE condition

We can start interacting with our dataframes with more logic. In the below example, we’re selecting only those records where total profit is = 400,000.

Using the below, we select only those entries that are ‘male’

And in the below, we use the equivalent of an SQL %LIKE% operator & select all email addresses that include ‘oracle’.

We can add multiple where conditions, as below. Here, we’re looking for all records where gender is Female and the ID is greater than 5.

And this example does the same as above, but is OR rather than AND. Or is depicted by the pipe.

We can select all data where the value is not null, as below:

Dataframe arithmetic

We can sum a particular column within a dataframe using the below script:

We can also complete an SQL style sum & groupby clause, as below:

We can group by multiple columns, should we need to:

We can also sum and group by multiple columns, as shown below:

The below is the equivalent to count(*) in SQL. However, it counts only where the value is not null.

We can now calculate the mean, min or max of specific columns using numpy, as below:

Join & Union Dataframes

In the below, we join two dataframes (I duplicated the dataframe to make this simple). The syntax is:
df1.join(Dataframe_To_Join, field_to_join_on, Type_of_join, left_table_suffix, right_table_suffix). You can see that the suffix has been added to the end of every field name.

We may also need to stitch two dataframes together (or create a union), we can do that using the concat function, as below: