Introduction To Pandas - Part 1
Today, we are going to learn about a popular Python library known as Pandas: what useful functions it has and how to use them successfully. This is part 1 of a two-article series.
Table of Contents
- What is Data Analysis?
- What is Pandas Python?
- Why Pandas?
- How to install Pandas in Python?
- How to import Pandas in Python?
- How to check the version of Pandas?
- Pandas Objects
- How to read CSV files in python using pandas?
- Methods and Attributes of DataFrame
- Some important concepts in Pandas
- Summary Functions
- Aggregation Functions
- Sorting Function
- Renaming Function
- Conclusion
What is Data Analysis?
Data analysis is a way of gathering, organizing, and, if necessary, manipulating data in order to extract insightful information (i.e. trends and patterns) from it.
What is Pandas Python?
Pandas (commonly expanded as Python Data Analysis Library) is an open-source library that provides a variety of data structures and data manipulation methods that allow you to perform complex tasks with simple one-line commands. It is widely used by data scientists.
Why Pandas?
The major advantage of using Pandas is that it helps you manipulate and analyze large volumes of data (millions of rows/records) with ease and efficiency.
How to install Pandas in Python?
Run this single command in your local environment's console:
pip install pandas
How to import Pandas in Python?
Once installed, we can import it in the following way:
import pandas as pd
How to check the version of Pandas?
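The latest version at the time of writing is 1.4.4. To check which version is installed in your environment (1.2.3 in the example output below), run: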
pd.__version__
'1.2.3'
Pandas Objects
There are two fundamental data structures:
What is a Series?
It is a one-dimensional labeled array that can hold any data type, like a single column of a table together with its index. If a Series mixes elements such as numbers and strings, its data type is 'object'. By default, the index of a Series starts from 0.
a = pd.Series([10, 20, 30, 40, 50])
a
0    10
1    20
2    30
3    40
4    50
dtype: int64
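As a small illustration of the dtype and index behavior described above, here is a minimal sketch. The values and the labels "x", "y", "z" are made up for the example.

```python
import pandas as pd

# A Series of integers gets an integer dtype and a default index 0, 1, 2.
numbers = pd.Series([10, 20, 30])

# Mixing numbers and strings falls back to the generic 'object' dtype.
mixed = pd.Series([10, "twenty", 30.5])

# Custom labels can replace the default index.
labeled = pd.Series([10, 20, 30], index=["x", "y", "z"])

print(numbers.dtype)
print(mixed.dtype)
print(labeled["y"])  # look up a value by its label
```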
What is a DataFrame?
It is a two-dimensional table made up of a sequence of aligned Series, with labeled axes (rows and columns). Below is an example of creating a DataFrame from a dictionary.
df = pd.DataFrame({
"car": ['Mercedes', 'Maserati MC20', 'Ferrari'],
"speed": [420, 530, 450]
}, index=['a', 'b', 'c'])
df
| | car | speed |
|---|---|---|
| a | Mercedes | 420 |
| b | Maserati MC20 | 530 |
| c | Ferrari | 450 |
As shown above, you can supply your own row labels through the index argument.
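To make the "sequence of aligned Series" idea concrete, the sketch below reuses the car data from the example above and pulls one column out of the DataFrame. The variable name speed is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"car": ["Mercedes", "Maserati MC20", "Ferrari"], "speed": [420, 530, 450]},
    index=["a", "b", "c"],
)

# Each column is itself a Series that shares the DataFrame's row index.
speed = df["speed"]
print(type(speed).__name__)  # Series
print(list(speed.index))     # ['a', 'b', 'c']
```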
How to read CSV files in python using pandas?
CSV is a basic file format that stores comma-separated values. The Pandas read_csv() function loads such a file into a DataFrame. Pandas offers similar readers and writers for other formats, such as read_json() for JSON.
df = pd.read_csv("filename.csv")  # read from a local file
df = pd.read_csv("Link_to_file")  # or read from a URL
Writing a DataFrame to a CSV file is just as easy with the to_csv() method, which is called on the DataFrame itself.
df.to_csv("filename.csv")  # export with the index
df.to_csv("filename.csv", index=False)  # export without the index
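A quick way to try both directions without touching the disk is a round trip through an in-memory buffer. This is only a sketch: the two-row DataFrame is made up, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

df = pd.DataFrame({"car": ["Mercedes", "Ferrari"], "speed": [420, 450]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With a real file, the buffer would simply be replaced by a path such as "filename.csv" in both calls.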
Methods and Attributes of DataFrame
There are some functions and attributes that allow us to observe basic information about the data stored in a DataFrame object:
DataFrame.head():
By default, it returns the content of the first 5 rows.
df.head()
DataFrame.tail():
By default, it returns the content of the last 5 rows.
df.tail()
DataFrame.shape:
It returns a tuple of the form (number of rows, number of columns).
df.shape
DataFrame.dtypes:
It returns the data type of each column.
df.dtypes
DataFrame.info():
This method prints a concise summary of the DataFrame (index range, column names, non-null counts, data types, and memory usage).
df.info()
DataFrame.columns:
This returns the names of the columns.
df.columns
DataFrame.index:
This returns the index (row labels) of the DataFrame.
df.index
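The sketch below runs these methods and attributes on the small car DataFrame from earlier, so the results are easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame(
    {"car": ["Mercedes", "Maserati MC20", "Ferrari"], "speed": [420, 530, 450]},
    index=["a", "b", "c"],
)

print(df.head(2))        # first two rows instead of the default five
print(df.tail(1))        # last row only
print(df.shape)          # (3, 2)
print(df.dtypes)         # one data type per column
print(list(df.columns))  # ['car', 'speed']
print(list(df.index))    # ['a', 'b', 'c']
df.info()                # prints its summary directly and returns None
```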
Some important concepts in Pandas:
What is Indexing in Pandas?
Indexing lets you easily access particular rows and columns of a DataFrame.
There are two different methods of indexing in Pandas:
- loc - label-based selection
- iloc - position-based selection
Position-Based Selection - selects data based on its integer position.
df.iloc[ ]
Label-Based Selection - selects data based on the row and column labels.
df.loc[ ]
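Here is a minimal sketch of both indexers on the car DataFrame from earlier. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"car": ["Mercedes", "Maserati MC20", "Ferrari"], "speed": [420, 530, 450]},
    index=["a", "b", "c"],
)

# iloc uses integer positions: first row, second column.
by_position = df.iloc[0, 1]

# loc uses labels: row 'a', column 'speed'.
by_label = df.loc["a", "speed"]

# Slices differ: iloc excludes the end position, loc includes the end label.
rows_by_position = df.iloc[0:2]
rows_by_label = df.loc["a":"b"]

print(by_position, by_label)
print(rows_by_position)
print(rows_by_label)
```

Both slices return the same two rows here, even though one is written 0:2 and the other "a":"b".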
What is Selecting?
There are two ways to select a column:
Attribute (Dot) Based Selection
df.column_name
Dictionary (Bracket) Based Selection
df['column_name']
To select multiple columns in a DataFrame, pass a list of column names:
df[['column1_name','column2_name']]
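The two styles are not fully interchangeable. In the sketch below, the column name "top speed" is a made-up example containing a space, which dot access cannot express.

```python
import pandas as pd

df = pd.DataFrame(
    {"car": ["Mercedes", "Ferrari"], "top speed": [420, 450]}
)

# Dot access works when the column name is a valid Python identifier
# that does not clash with an existing DataFrame attribute or method.
cars_dot = df.car

# Bracket access works for any column name, including ones with spaces.
cars_bracket = df["car"]
speeds = df["top speed"]

# A list of names returns a DataFrame rather than a Series.
subset = df[["car", "top speed"]]
print(type(subset).__name__)  # DataFrame
```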
Subsetting a DataFrame
Subsetting means filtering out the rows you are interested in. Below is an example that creates a subset of df, keeping only the observations that were last updated on 2020-06-13 03:33:14.
updated_data=df[df['Last Update']=='2020-06-13 03:33:14']
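The filter above works in two steps, which the following sketch spells out. Only the 'Last Update' column name and the timestamp come from the example; the 'Country' column and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Last Update": [
        "2020-06-13 03:33:14",
        "2020-06-12 05:09:52",
        "2020-06-13 03:33:14",
    ],
})

# Step 1: the comparison yields a boolean Series, one True/False per row.
mask = df["Last Update"] == "2020-06-13 03:33:14"

# Step 2: indexing with the mask keeps only the rows marked True.
updated_data = df[mask]
print(updated_data)
```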
What is Assigning?
Assigning writes data into a DataFrame. Assigning a single value to a column sets that value in every row of the column.
df.car="Lamborghini"
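A short sketch of a few common assignment patterns follows. It uses bracket notation throughout, because attribute assignment such as df.car only updates a column that already exists, while brackets also create new columns. The "fast" column and the threshold 405 are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"car": ["Mercedes", "Maserati MC20", "Ferrari"], "speed": [420, 530, 450]}
)

# A single value is broadcast to every row of the column.
df["car"] = "Lamborghini"

# A list with one entry per row replaces the column value by value.
df["speed"] = [400, 410, 420]

# Assigning to a name that does not exist yet creates a new column.
df["fast"] = df["speed"] > 405
print(df)
```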
So far, you have learned to read and write a CSV file, to check basic information about your data, and to select data from a DataFrame. Now, we will look at some functions that summarize the data itself.
Summary Functions:
The info() method we saw earlier is one summary function. Another is describe(), which returns a brief statistical summary of the columns.
By default, the describe() method only returns a summary of numerical columns.
df.describe()
If we want to get a summary of categorical columns separately, we can use the parameter 'include'.
df.describe(include="object")
For a summary that includes both categorical and numerical columns, use include="all":
df.describe(include="all")
Let's see what information about the data appears in the table that describe() returns:
💠count - the count of non-null entries in the particular column.
💠unique - the count of unique values in a column. Only for categorical columns.
💠top - the category that occurs most often. Only for categorical columns.
💠freq - the number of occurrences of the top category. Only for categorical columns.
💠mean - the mean value of the numerical column.
💠std - the standard deviation, a measure of the variation in the numerical column.
💠min - the minimum value in the numerical column.
💠25% - the 25th percentile (or 1st quartile) value in the numerical column.
💠50% - the 50th percentile (or 2nd quartile or the median) value in the numerical column.
💠75% - the 75th percentile (or 3rd quartile) value in the numerical column.
💠max - the maximum value in the numerical column.
💠NaN values mean that a particular summary value is unavailable for a particular column.
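To see these rows on real output, the sketch below calls describe() on a small made-up DataFrame with one categorical and one numerical column.

```python
import pandas as pd

# Made-up data: 'Ferrari' appears twice so that top/freq are unambiguous.
df = pd.DataFrame({
    "car": ["Ferrari", "Mercedes", "Ferrari", "Maserati MC20"],
    "speed": [450, 420, 460, 530],
})

# Default: numerical columns only.
numeric_summary = df.describe()

# include="all": every column, with NaN where a statistic does not apply.
summary = df.describe(include="all")

print(numeric_summary)
print(summary)
```

In the second table, the mean of the car column is NaN and the unique count of the speed column is NaN, because those statistics are not defined for that kind of column.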
Aggregation Functions:
We saw that the describe() function is very useful, but we can also call individual methods. Some of them are:
df.mean(numeric_only=True) # Mean of each numeric column
df.median(numeric_only=True) # Median of each numeric column
df['column_name'].unique() # Returns the unique values in that column
df['column_name'].value_counts() # Returns the count of each unique value in the column
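The following sketch applies these methods to a small made-up DataFrame. It passes numeric_only=True to mean() so that the text column is skipped, since recent Pandas versions raise an error when asked to average strings.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Ferrari", "Mercedes", "Ferrari", "Maserati MC20"],
    "speed": [450, 420, 460, 530],
})

column_means = df.mean(numeric_only=True)  # Series with one entry per numeric column
median_speed = df["speed"].median()        # single number
unique_cars = df["car"].unique()           # unique values in order of appearance
car_counts = df["car"].value_counts()      # occurrences of each unique value

print(column_means)
print(median_speed)
print(unique_cars)
print(car_counts)
```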
Lastly, we are going to see how we can sort rows and rename columns in a DataFrame.
Sorting Function:
The sort_values() method returns the sorted result, by default in ascending order. Pass ascending=False for descending order.
df.sort_values(by="Confirmed", ascending=False)
Try this to get the top N values:
df['Confirmed'].sort_values(ascending=False).head(N)
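Here is a minimal sketch of sorting and taking the top N rows. Only the 'Confirmed' column name comes from the example above; the 'Country' column, the numbers, and N = 2 are assumptions for illustration. It also shows nlargest(), a Pandas method that gives the same top-N result in one call.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Confirmed": [120, 950, 430, 80],
})
N = 2

# sort_values returns a new, sorted DataFrame; df itself is left unchanged.
ranked = df.sort_values(by="Confirmed", ascending=False)

# Top N rows after sorting.
top_n = ranked.head(N)

# Alternative: nlargest picks the N rows with the largest values directly.
top_n_alt = df.nlargest(N, "Confirmed")

print(top_n)
```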
Renaming Function:
We can rename columns with the rename() method by passing a mapping from old names to new names.
df.rename(columns={
'car':'cars'
},inplace=True) # inplace makes changes in original dataframe
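The sketch below contrasts the default behavior of rename(), which returns a renamed copy, with inplace=True, which modifies the original. The variable name renamed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"car": ["Mercedes", "Ferrari"], "speed": [420, 450]})

# Without inplace, rename returns a new DataFrame and leaves df untouched.
renamed = df.rename(columns={"car": "cars"})
print(list(df.columns))       # ['car', 'speed']
print(list(renamed.columns))  # ['cars', 'speed']

# With inplace=True, df itself is modified and nothing is returned.
df.rename(columns={"car": "cars"}, inplace=True)
print(list(df.columns))       # ['cars', 'speed']
```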
Let’s Put It All Together:
Now you know what Pandas is and which of its functions help us analyze a dataset.
If you wish to check out more articles on the market’s most trending technologies like Big Data, Python, and Computer Vision, then you can refer here.
I will be releasing part 2 of the Pandas article series soon.
See you next time,
@TechAE