Data Science Workshop 3 (Part 1): Exploratory Data Analysis using Pandas in Python Programming
Today, I will walk through a first pass at exploring and analyzing data using Pandas in Python.
Exploratory Data Analysis
Exploratory Data Analysis refers to the initial investigation of data. “Initial investigation” is a broad term. It can be as simple as looking at the properties of the features (columns) of the data, or as involved as finding outliers, patterns, or even groups of data points.
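To make the "finding outliers" part concrete, here is a minimal sketch of one common rule of thumb, the interquartile range (IQR) rule. This is only one of many ways to flag outliers and is not taken from the video. The helper name iqr_outliers, the multiplier k=1.5, and the toy numbers are all assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up numbers for illustration only; they do not come from any real dataset.
toy = pd.Series([10, 12, 11, 13, 12, 11, 95], name="toy feature")
outliers = iqr_outliers(toy)
print(outliers)
```

With these toy values, only the last entry (95) falls outside the computed range. Whether a flagged point is a real problem or a valid extreme observation is a judgment call that depends on the dataset.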
Exploratory Data Analysis (EDA) is a process, not just one algorithm. The process leads to the generation of a good dataset for statistical and machine learning modeling.
Unfortunately, or fortunately, Exploratory Data Analysis is a process whose steps are not well defined. The steps may vary dramatically from one dataset to another.
Pecan Yield Dataset
Today, we will be looking at a tabular dataset named Pecan Yield data (Pecan Yield Data).
The Pecan data has five columns: Row ID, Water per acre, Salinity level, Fertilizer per acre, and Pecan Yield. The data has 56 rows. Each row describes one season: how much water was applied, the average salinity level of the soil in the field, the amount of fertilizer given, and the amount of pecan harvested at the end of that season.
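A quick sanity check after loading a file like this is to confirm that the columns you expect are actually present. The sketch below assumes the header names match the column names listed above exactly (the real file may spell them differently), and it uses two made-up rows in place of the real file so that it runs on its own. The helper check_columns is a hypothetical name, not part of Pandas.

```python
import io

import pandas as pd

# Assumed header names, copied from the description above.
EXPECTED_COLUMNS = [
    "Row ID",
    "Water per acre",
    "Salinity level",
    "Fertilizer per acre",
    "Pecan Yield",
]


def check_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df."""
    return [name for name in expected if name not in df.columns]


# Two made-up, tab-separated rows standing in for the real file.
sample = io.StringIO(
    "Row ID\tWater per acre\tSalinity level\tFertilizer per acre\tPecan Yield\n"
    "1\t10.0\t2.0\t100.0\t500.0\n"
    "2\t12.0\t3.0\t110.0\t450.0\n"
)
df = pd.read_csv(sample, delimiter="\t")

missing = check_columns(df)
print("Missing columns:", missing)
print("Shape (rows, columns):", df.shape)
```

On the real file, you would pass the file name to pd.read_csv instead of the sample text. Given the description above, the shape should come out as 56 rows and 5 columns.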
In the video, I use Pandas for most of the data analysis, with JupyterLab as the editor. You can use Jupyter Notebook, Google Colab, PyCharm, Spyder, or any text editor of your preference. Just make sure that Pandas, NumPy, and Matplotlib are installed in the Python environment you are using.
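If you are not sure whether these packages are installed in your environment, a small check like the following can tell you. It is a sketch using the standard library's importlib, and the helper name find_missing is my own for this example.

```python
import importlib.util

REQUIRED = ["pandas", "numpy", "matplotlib"]


def find_missing(packages):
    """Return the names in `packages` that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing_packages = find_missing(REQUIRED)
if missing_packages:
    print("Missing packages:", ", ".join(missing_packages))
else:
    print("All required packages are installed.")
```

Anything reported as missing can usually be installed with pip or conda, depending on how your environment is set up.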
Here is the YouTube video.
The Notebook File
The Jupyter notebook with the code is available in this zip file: Code.zip.
Python version of the code
The following code is a pure Python version of the code I wrote in the video.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the tab-separated Pecan Yield data
df = pd.read_csv('Pecan.csv', delimiter='\t')

print("Size of the dataset:")
print(df.shape)

print("\n\nData head:")
print(df.head())

print("\n\nData tail:")
print(df.tail())

print("\n\nData describe:")
print(df.describe())

print("\n\nHow many are null values:")
print(df.isnull().sum())

print("\n\nDescribe a feature:")
print(df['Water per acre'].describe())

print("\n\nLearn more about data types of this dataset:")
df.info()  # info() prints its report itself and returns None

print("\n\nFind relationship between a pair of features using pyplot:")
plt.scatter(df['Salinity level'], df['Pecan Yield'])
plt.xlabel('Salinity level')
plt.ylabel('Pecan Yield')
plt.show()
```
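A scatter plot shows the relationship between two features visually. If you also want a single number summarizing how linear that relationship is, one option (not covered in the video) is the Pearson correlation coefficient, available through Pandas' Series.corr. The values below are made up so that the example runs on its own. They say nothing about the actual Pecan data.

```python
import pandas as pd

# Toy values invented for illustration; they do NOT describe the Pecan data.
toy = pd.DataFrame({
    "Salinity level": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Pecan Yield": [50.0, 40.0, 30.0, 20.0, 10.0],
})

# Pearson correlation: +1 is a perfect increasing line, -1 a perfect decreasing one.
r = toy["Salinity level"].corr(toy["Pecan Yield"])
print("Correlation:", round(r, 3))
```

Because the toy values lie exactly on a decreasing straight line, the coefficient here is -1. On real data, you would replace toy with the DataFrame loaded from the file and expect a value somewhere between -1 and 1.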
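Please save the code in a file named DataAnalysis.py and then run it from the terminal, in the folder that contains the file, with a command such as python DataAnalysis.py (on some systems the interpreter is called python3 instead).
The command should run the program, provided that your terminal recognizes the Python command.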