Python for data analysis: reading files

Python is great for file analysis. We’ll be looking into more complex functionality using the pandas library in subsequent articles, but for now, let’s look at the out-of-the-box functionality that we can use in Python.

Open a file & print the contents:

In the below, I’ve defined ‘path’ as Desktop/mexico.txt, which is the file I’m interested in reading. I’ve chosen to open the file using ‘r’, which is read-only; ‘r+’ is read and write. For the most part, when we’re analysing data, we don’t need to write to the file that we’re analysing, so I won’t go into the other possible modes.

Once I’ve opened the file, I use the readlines() function to read the lines into the ‘lines’ variable & then print the contents. The contents of the file are from the Mexico page on Wikipedia.
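A minimal sketch of the above (the file-handle name file is my own choice; ‘path’ and ‘lines’ follow the description):

```python
# Open the file read-only, read all lines into a list and print them
path = 'Desktop/mexico.txt'
file = open(path, 'r')

lines = file.readlines()

for line in lines:
    print(line)
```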

You’ll notice in the above that we have end-of-line characters (\n). We can use rstrip() to remove trailing characters (like \n and whitespace) as shown below:
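A sketch of that, reusing the lines list from above:

```python
# rstrip() removes trailing characters such as \n and whitespace
for line in lines:
    print(line.rstrip())
```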

Basic file analysis:

So, below we have a not-very-useful for loop. It prints either “not Mexico” or “Mexico” depending on the condition. While this isn’t a very useful statement, you could instead increment a counter by 1 every time the word ‘Mexico’ is located to achieve a count, or you could execute other functions. The example below is intended to highlight the ability to analyse files with Python, more than to provide any meaningful analysis.
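A sketch of that loop; the exact condition isn’t shown in the original, so checking each line for the word ‘Mexico’ is an assumption:

```python
# Print "Mexico" if the word appears in the line, otherwise "not Mexico"
for line in lines:
    if 'Mexico' in line:
        print('Mexico')
    else:
        print('not Mexico')
```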

Below, I’m using a for loop in conjunction with an if statement to increment a variable (called increment) by 1 every time the letter ‘M’ is found within the document:
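A sketch of that, looping over each character (referred to as ‘x’ in the next paragraph) of each line:

```python
# Increment a counter every time the letter 'M' is found
increment = 0

for line in lines:
    for x in line:
        if x == 'M':
            increment = increment + 1

print(increment)
```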

When we use ‘x’ as above, we’re counting the number of occurrences of a particular letter. We can use ‘word’, as below, to count the number of occurrences of a word in the document. The .split(" ") function splits the string at each space.
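A sketch of the word-based version; the target word is an example of my own:

```python
# Count occurrences of a whole word by splitting each line at spaces
target = 'Mexico'
word_count = 0

for line in lines:
    for word in line.split(" "):
        # rstrip() removes any trailing newline from the last word on a line
        if word.rstrip() == target:
            word_count = word_count + 1

print(word_count)
```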

We can also complete basic counts. For example, below, I am counting the number of occurrences of the word Mexico within the file:
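One compact way to sketch that; note that .count() matches the substring, so “Mexico’s” would also be counted:

```python
# Count occurrences of 'Mexico' across the whole file
mexico_count = sum(line.count('Mexico') for line in lines)
print(mexico_count)
```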

And below, I am counting the number of lines within the file:
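A sketch, again reusing the lines list:

```python
# readlines() returns a list, so its length is the number of lines
print(len(lines))
```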

Read all files in a directory

Below, I have added an example of how to read multiple files at once with Python. To do this, we use the glob module, which finds pathnames matching a pattern (e.g. below, we match Desktop/*.txt).
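A sketch of that, assuming the .txt files sit in a Desktop directory relative to where the script runs:

```python
import glob

# Loop over every .txt file on the desktop and print its contents
for filename in glob.glob('Desktop/*.txt'):
    f = open(filename, 'r')
    for line in f.readlines():
        print(line.rstrip())
    f.close()
```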

Reading CSV Files

So far, we’ve focused on reading txt files with Python. Now let’s look at a CSV file. All data has been generated with mockaroo.com for the purposes of this test. As you can see from the below, we define the file path in the same way, but we have to include an additional parameter to define the delimiter of the file.
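A sketch using Python’s built-in csv module; the file name mock_data.csv is my own placeholder, as the original doesn’t show it:

```python
import csv

# Open the CSV file and declare its delimiter explicitly
csv_file = open('Desktop/mock_data.csv', 'r')
reader = csv.reader(csv_file, delimiter=',')

for row in reader:
    print(row)

csv_file.close()
```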

We can enhance this slightly, to extract only certain information from the CSV file. In the below, we create two empty lists (id_list and gender_list). During the for loop, we refer to the column numbers that those fields relate to: in this case, id is the first column (hence index 0) and gender is the fifth column (but because the index starts at zero, we refer to it as column 4).

With each loop, we append the value of the id field & the gender field to the lists we’ve created.
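A sketch of that loop, again with the placeholder file name; skipping a header row is an assumption about the mockaroo data:

```python
import csv

id_list = []
gender_list = []

csv_file = open('Desktop/mock_data.csv', 'r')
reader = csv.reader(csv_file, delimiter=',')
next(reader)  # skip the header row (assumption)

for row in reader:
    id_list.append(row[0])      # id is the first column (index 0)
    gender_list.append(row[4])  # gender is the fifth column (index 4)

print(id_list)
print(gender_list)

csv_file.close()
```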

The above is a little more useful. But what if we’re looking for something specific? In the below, we are doing exactly the same as above (albeit with the surname field, rather than the gender field).

But we’re adding some information. We’re looking for the ID of someone with the surname Arrigo. To identify this, we have to find the index number of Arrigo in the surname list. We can do that using the below: it’ll scan the list until it finds the first entry with the surname Arrigo & it’ll return the index number for that entry.
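A sketch, assuming the surnames have been collected into a list called surname_list in the same way as gender_list above:

```python
# index() scans the list and returns the position of the first 'Arrigo' entry
index_number = surname_list.index('Arrigo')
print(index_number)
```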

We then need to find the corresponding ID number. We can easily do that by picking the value out of the ID list with the same index number as the Arrigo surname. So, we pass the index number variable into id2_list[] to select that ID.
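Continuing the sketch, where id2_list is the name for the list of IDs built alongside surname_list:

```python
# The same index position in id2_list gives the matching ID
print(id2_list[index_number])
```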

All put together, we have the below:
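A self-contained sketch of the whole lookup; the file name and the surname column position are assumptions:

```python
import csv

id2_list = []
surname_list = []

csv_file = open('Desktop/mock_data.csv', 'r')
reader = csv.reader(csv_file, delimiter=',')
next(reader)  # skip the header row (assumption)

for row in reader:
    id2_list.append(row[0])       # id column
    surname_list.append(row[2])   # surname column (position assumed)

index_number = surname_list.index('Arrigo')
print(id2_list[index_number])

csv_file.close()
```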

We must always remember to close our file once we’re done with our analysis:
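For the text file opened at the start, that’s simply:

```python
file.close()
```

Alternatively, opening files with a with statement closes them automatically once the block ends.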