Pandas Data Wrangling, Part 1

Pandas is a powerful, easy-to-use library built for Python. It is used for data manipulation, with functions that make it easier to perform data analysis on numerical tables and time series data.

This write-up is a quick tutorial on the basic functions of the Pandas library.

As a data scientist, 70%-80% of the work lies in manipulating and cleaning the dataset of choice. It is properly manipulated data that brings the results the scientist wants.

For the purpose of this tutorial, the Pokémon dataset is going to be used.

The first important step is to import the library:

import pandas as pd

There are different methods for reading data into the notebook:

pk = pd.read_csv("Pokemon.txt", delimiter="\t")
or
pk = pd.read_csv("Pokemon.csv")
or
pk = pd.read_excel("Pokemon.xlsx")
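If you want to try the reader without a file on disk, here is a minimal sketch. The tab-separated text and the name `raw_text` are made up for illustration; they only stand in for a file like Pokemon.txt.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for a file on disk
raw_text = "Name\tType 1\tHP\nBulbasaur\tGrass\t45\nCharmander\tFire\t39\n"

# read_csv accepts any file-like object; sep="\t" means the same as delimiter="\t"
sample = pd.read_csv(io.StringIO(raw_text), sep="\t")
print(sample)
```

The same call works with a path in place of the `io.StringIO` object.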

To read from a URL, pass the address as a string. Below is an example to follow:

# The URL has to be a direct link to the file itself, not a dataset page
url = "https://www.kaggle.com/rounakbanik/pokemon"
pk = pd.read_csv(url, delimiter="\t")

After reading the data, you may want to see some parts of the data, perhaps the first few rows or the last few:

# .head(10) prints the first ten rows; .head() with no argument prints the first five.
# The same goes for .tail().
pk.head(10)
pk.tail(10)
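To see the row counts in action, here is a small sketch on a made-up eight-row table (the values are only for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical eight-row frame, just to show how many rows come back
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander",
             "Charmeleon", "Charizard", "Squirtle", "Wartortle"],
    "HP": [45, 60, 80, 39, 58, 78, 44, 59],
})

first_three = df.head(3)   # first three rows
default_head = df.head()   # no argument: first five rows
last_two = df.tail(2)      # last two rows

print(len(first_three), len(default_head), len(last_two))
```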

It is important to get the general information of your dataset:

pk.info()

.describe() generates basic statistical details for quick insights:

pk.describe()
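As a rough illustration of what comes back, the sketch below runs .describe() on a tiny made-up frame. By default only the numeric columns are summarised, so the "Name" column is left out.

```python
import pandas as pd

# Hypothetical sample; only HP and Speed are numeric
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Ivysaur"],
    "HP": [45, 39, 44, 60],
    "Speed": [45, 65, 43, 80],
})

summary = df.describe()

# Rows are the statistics, columns are the numeric columns of df
print(summary.index.tolist())
print(summary.loc["mean"])
```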

To get the names of the columns, use the following:

pk.columns

From the output of pk.info(), you can tell that the “Type 2” column has only 414 non-null values. This means that, of the 800 values expected in total, 386 are NaN. Null values are of little use to us, and since they make up 48.25% of the “Type 2” column, the entire column may not be of statistical importance to the dataset. If the share of null values were lower, 10% for example, a little manipulation, such as filling in the gaps, could still make the column useful to the dataset.

pk.drop(['Type 2'],axis=1, inplace=True)
pk.head()
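The percentage of missing values does not have to be worked out by hand. Here is a minimal sketch on a made-up four-row table; the 0.4 cut-off is an arbitrary assumption for the example, not a rule.

```python
import pandas as pd

# Hypothetical sample where half of "Type 2" is missing
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Charizard"],
    "Type 1": ["Grass", "Fire", "Water", "Fire"],
    "Type 2": ["Poison", None, None, "Flying"],
})

# isna() marks missing cells; the mean of True/False is the missing share
null_share = df.isna().mean()

threshold = 0.4  # assumed cut-off, chosen only for illustration
to_drop = null_share[null_share > threshold].index.tolist()

# drop returns a new frame, so df itself is left unchanged
cleaned = df.drop(columns=to_drop)
print(null_share)
print(cleaned.columns.tolist())
```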

If you need to sort your data, sort_values will do it. To sort in descending order, pass “ascending=False” in the brackets:

pk.sort_values(['Name','Type 1'], ascending=True)
pk.sort_values(['Name'], ascending=False)
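The ascending argument can also take a list, one entry per column. The sketch below uses made-up rows to sort by type from A to Z and, within each type, by HP from highest to lowest.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "Name": ["Charmander", "Bulbasaur", "Squirtle", "Oddish"],
    "Type 1": ["Fire", "Grass", "Water", "Grass"],
    "HP": [39, 45, 44, 50],
})

# One True/False per sort column: Type 1 ascending, HP descending
by_type_then_hp = df.sort_values(["Type 1", "HP"], ascending=[True, False])

print(by_type_then_hp["Name"].tolist())
```

Note that sort_values returns a new DataFrame, so assign the result if you want to keep it.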

To get the unique values in a column, “.unique()” comes in handy, while “.nunique()” gives the number of unique values:

pk['Type 1'].unique()
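Here is a quick sketch of both on a made-up column. It also shows value_counts(), which is not used above but is a related way to see how often each value appears.

```python
import pandas as pd

# Hypothetical column of types
types = pd.Series(["Grass", "Fire", "Water", "Grass", "Fire", "Grass"], name="Type 1")

distinct = types.unique()      # distinct values, in order of first appearance
n_distinct = types.nunique()   # how many distinct values there are
counts = types.value_counts()  # occurrences of each value

print(list(distinct), n_distinct)
print(counts)
```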

Below is an example of filtering the data:

new_set = pk.loc[(pk['Type 1']=='Grass') & (pk['Speed']>75) & (pk['HP']>50)].copy()  # copy so later edits do not touch pk
new_set
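A filter like this is easier to read when each condition gets its own name. The sketch below uses made-up rows. Each condition needs its own brackets when written inline because & binds more tightly than == and > in Python. The .isin() call is an extra that the example above does not use.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Sceptile", "Charmander", "Squirtle"],
    "Type 1": ["Grass", "Grass", "Fire", "Water"],
    "Speed": [45, 120, 65, 43],
    "HP": [45, 70, 39, 44],
})

# Each comparison produces a True/False Series with one entry per row
is_grass = df["Type 1"] == "Grass"
is_fast = df["Speed"] > 75

# & keeps only the rows where both conditions are True
fast_grass = df.loc[is_grass & is_fast]

# isin() matches any value from a list
grass_or_fire = df.loc[df["Type 1"].isin(["Grass", "Fire"])]

print(fast_grass["Name"].tolist())
print(grass_or_fire["Name"].tolist())
```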

In the “Legendary” column, the values are categorical: ‘True’ and ‘False’ are not continuous variables. For ease of computation, it is better to store them as numbers (numbers can also be categorical; this just reduces the complexity of the analysis):

new_set['Legendary']= new_set['Legendary'].replace([False, True], ['0','1'])
new_set
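An alternative sketch, on made-up rows, goes straight to integers with .map() instead of .replace(). Taking a .copy() of the filtered rows first is my own suggestion: it makes the subset independent of the original frame, which avoids the SettingWithCopyWarning that pandas can raise when you assign into a filtered slice.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "Name": ["Mewtwo", "Pikachu", "Articuno", "Caterpie"],
    "HP": [106, 35, 90, 45],
    "Legendary": [True, False, True, False],
})

# Filter, then copy so the subset is independent of df
subset = df.loc[df["HP"] > 40].copy()

# map() looks each value up in the dictionary
subset["Legendary"] = subset["Legendary"].map({False: 0, True: 1})

print(subset)
```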

It is also important that you know the type of data you are working with. This would help avoid errors that could be a colossal disaster to the entire process.

new_set.dtypes

You may consider changing the data type. Here’s how:

new_set=new_set.astype({"Attack":'object', "Legendary":'int64'})
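To confirm that a conversion did what you expected, compare .dtypes before and after. Here is a minimal sketch on a made-up two-row frame in which "Legendary" holds the strings '0' and '1':

```python
import pandas as pd

# Hypothetical frame: Attack holds integers, Legendary holds digit strings
df = pd.DataFrame({"Attack": [49, 52], "Legendary": ["0", "1"]})

before = df.dtypes

# astype takes a {column: dtype} dictionary and returns a new frame
converted = df.astype({"Attack": "object", "Legendary": "int64"})

print(before)
print(converted.dtypes)
```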

You may not be satisfied with the index that pandas generated automatically, or with the one that came with the data. To change it, use set_index:

pk = pk.set_index('Name')  # set_index returns a new DataFrame, so assign the result
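A short sketch on made-up rows shows what the new index buys you: rows can be looked up by name with .loc, and reset_index() turns the index back into an ordinary column.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "HP": [45, 39]})

# set_index returns a new frame; df keeps its original 0, 1 index
indexed = df.set_index("Name")

# With names as the index, .loc looks rows up by label
hp = indexed.loc["Charmander", "HP"]

# reset_index moves "Name" back into a regular column
restored = indexed.reset_index()

print(hp)
print(restored.columns.tolist())
```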

All these pandas functions are simple and basic, yet important in manipulating the dataset of interest. There are many more methods, which will be discussed later on.

Your comments are welcome and you can also reach me on Twitter and LinkedIn.
