Data Science Workshop 1 (Part 2): NumPy
Hi, I am Dr. Shahriar Hossain. Welcome back to Part 2 of Data Science Workshop 1. We had barely started to discuss NumPy in the previous video. Let us begin where we left off in the last part.
NumPy is an excellent package for keeping datasets in main memory. The package has ample mathematical functions that are basic to many machine learning algorithms.
Here is the YouTube video for Part 2.
Regular list in Python
As I was saying, I am creating a regular Python list in a variable named list1, which contains 5, 2, 10, and 3.
I can click this play symbol to execute the line. Alternatively, I can press Shift+Enter (Shift+Return on a Mac) to run a line. Again, I am using Google Colab as my Python editor. This editor on the screen is like Jupyter Notebook. I can just write the variable name, such as list1, and then press Shift+Return to see the content of the variable list1. The content is printed.
If you are using a regular Python editor, such as Spyder, then you will have to use the print function to print the content of the variable list1.
Anyway, this variable list1 is a list variable.
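The step described above can be sketched as follows. The values come from the workshop; the extra call to type() is only an illustration added here to confirm that list1 is a regular Python list.

```python
# A regular Python list holding the four integers from the workshop.
list1 = [5, 2, 10, 3]

# In a notebook, typing `list1` alone displays the value.
# In a regular editor such as Spyder, use print() instead.
print(list1)        # prints [5, 2, 10, 3]
print(type(list1))  # prints <class 'list'>
```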
NumPy array
Now, I want to create a NumPy variable. Let us import the package NumPy as np. I can then use the alias np to call NumPy functions.
I can create a NumPy array that will copy all the elements from list1. I use np.array and, in parentheses, I provide the Python list I already have. I will save this newly created NumPy array in a variable named arr1.
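A minimal sketch of this step is shown below. The last few lines are an addition for illustration only: they change the original list to show that np.array made a copy, so arr1 is not affected.

```python
import numpy as np

list1 = [5, 2, 10, 3]

# np.array copies the elements of the list into a new NumPy array.
arr1 = np.array(list1)
print(arr1)        # prints [ 5  2 10  3]
print(type(arr1))  # prints <class 'numpy.ndarray'>

# Illustration only: modifying the list afterwards leaves the array unchanged.
list1[0] = 99
print(arr1[0])     # prints 5
```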
Now, what is the benefit of using arr1 over list1? That is, what is the benefit of using a NumPy array over a regular Python list?
To demonstrate the benefit, let me create another Python list, named list2, that contains 5, 6, 20, and 31.
I will create a second NumPy array called arr2, which will copy the content of list2. So, arr2 is a NumPy array.
Very well.
list1 and list2 are regular Python lists.
arr1 and arr2 are NumPy arrays.
What does + do for two Python lists?
As you know, Python is a flexible and versatile language. Unlike programming languages like Java, you can use the plus (+) operator on two arrays or lists. What will happen if we apply the operation list1+list2?
In practice, the operation does not add the elements of list1 and list2. Instead, it creates a combined list that holds the elements of list1 followed by the elements of list2. That is, the plus symbol here acts as a concatenation operator. It is not the mathematical addition of two arrays or two vectors.
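The concatenation behavior can be checked with a short sketch. The variable name combined is a hypothetical name used only for this illustration.

```python
list1 = [5, 2, 10, 3]
list2 = [5, 6, 20, 31]

# For Python lists, + concatenates; it does not add element by element.
combined = list1 + list2  # `combined` is a hypothetical name for this example
print(combined)       # prints [5, 2, 10, 3, 5, 6, 20, 31]
print(len(combined))  # prints 8
```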
What does + do for two NumPy arrays?
Let us see what happens if we add two NumPy arrays, arr1 and arr2.
Clearly, element-wise addition is performed.
5+5 is 10
2+6 is 8
10+20 is 30
3+31 is 34
This demonstrates that NumPy is actually equipped with mathematical functions.
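The sums listed above can be reproduced with the sketch below. The variable name total is a hypothetical name chosen for this illustration.

```python
import numpy as np

arr1 = np.array([5, 2, 10, 3])
arr2 = np.array([5, 6, 20, 31])

# For NumPy arrays, + adds the values position by position.
total = arr1 + arr2  # `total` is a hypothetical name for this example
print(total)  # prints [10  8 30 34]
```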
NumPy functions
This np, which is the NumPy reference, has many built-in functions that you can use to process your data. Notice that you even have matrix multiplication operations: there is a matmul function for matrix multiplication. You also have add, subtract, and so on.
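As a small illustration of two of the functions named above, the sketch below calls np.add and np.subtract on the same arrays used earlier. The variable names sums and diffs are hypothetical.

```python
import numpy as np

arr1 = np.array([5, 2, 10, 3])
arr2 = np.array([5, 6, 20, 31])

# np.add gives the same result as arr1 + arr2.
sums = np.add(arr1, arr2)
print(sums)   # prints [10  8 30 34]

# np.subtract gives the same result as arr2 - arr1.
diffs = np.subtract(arr2, arr1)
print(diffs)  # prints [ 0  4 10 28]
```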
Why do we need matrix operations here? A data table forms a matrix (or a two-dimensional array). Many data mining and machine learning algorithms require matrix operations. NumPy has efficient matrix operations. It is more convenient to keep data in NumPy multidimensional arrays than in a Python list.
Alternatively, there is TensorFlow, which also contains matrix operations. TensorFlow also includes GPU-optimized matrix operations. We are not looking at TensorFlow now, but I just wanted to mention it as a reference for the future.
NumPy matrix multiplication
As a quick example of matrix operations, I will create two matrices.
I can create a matrix with two rows and three columns like this.
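A sketch of such a matrix is given below, using the values from the code listing at the end of this post. The shape check and the indexing line are additions for illustration.

```python
import numpy as np

# A nested list becomes a two-dimensional array: each inner list is one row.
data1 = np.array([[6, 6, 25], [15, 12, 10]])

print(data1.shape)  # prints (2, 3): two rows and three columns
print(data1[1, 2])  # prints 10: second row, third column
```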
Or I can create a matrix with random numbers using NumPy’s random number generators. I can ask the function to generate a matrix with five rows and four columns like this. Oops, something was wrong. There was a spelling mistake. After correcting it, we see five rows and four columns in the generated matrix.
Let us make sure that we save this generated matrix in a variable. We are saving the matrix in a variable named d1.
Let us create another matrix with four rows and three columns. Save it in a variable named d2.
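The two random matrices can be created as sketched below with np.random.rand, the function used in the code listing at the end of this post. The values are drawn uniformly from the interval [0, 1) and differ on every run, so only the shapes are checked here.

```python
import numpy as np

# np.random.rand takes the dimensions as separate arguments.
d1 = np.random.rand(5, 4)  # five rows, four columns
d2 = np.random.rand(4, 3)  # four rows, three columns

print(d1.shape)  # prints (5, 4)
print(d2.shape)  # prints (4, 3)
```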
I should be able to do a multiplication between these two matrices d1 and d2.
Let us use np.matmul and pass d1 and d2. Of course, the result will have five rows and three columns because d1 has five rows and four columns and d2 has four rows and three columns. That is, NumPy is doing the regular matrix multiplication.
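The shape rule described above can be verified with the sketch below. The variable name product, the comparison with the @ operator, and the swapped-operand check are additions for illustration and were not part of the workshop.

```python
import numpy as np

d1 = np.random.rand(5, 4)
d2 = np.random.rand(4, 3)

# (5, 4) times (4, 3) gives (5, 3): the inner dimensions (4 and 4) must match.
product = np.matmul(d1, d2)  # `product` is a hypothetical name for this example
print(product.shape)  # prints (5, 3)

# For two-dimensional arrays, the @ operator computes the same matrix product.
print(np.allclose(product, d1 @ d2))  # prints True

# Swapping the operands fails because (4, 3) times (5, 4) has inner dimensions 3 and 5.
try:
    np.matmul(d2, d1)
except ValueError as err:
    print("Shapes do not align:", err)
```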
Matrix multiplication is important because we can consider that a matrix is a tabular dataset. Many of the algorithms in machine learning use matrix operations.
At this point, I told the audience that I just wanted to demonstrate a few things about NumPy because it is a basic data structure quite commonly used by many of the algorithms. NumPy pops up very frequently for data science practitioners who write programs to solve analytic problems.
Code
You can download the code, for Google Colab or Jupyter Notebook, from this link 1_numpy.zip (please unzip to get the ipynb file.)
The code is also provided below (for a regular Python editor like Spyder.)
list1 = [5, 2, 10, 3]
print(list1)
import numpy as np
arr1 = np.array(list1)
print(arr1)
list2 = [5, 6, 20, 31]
arr2 = np.array(list2)
print(arr2)
print(list1 + list2)
print(arr1 + arr2)
data1 = np.array([[6, 6, 25], [15, 12, 10]])
d1 = np.random.rand(5, 4)
d2 = np.random.rand(4, 3)
print(np.matmul(d1, d2))
Now we are moving to Pandas. The Pandas library helps in reading from a CSV or an Excel file, writing to a file, manipulating tabular data, exploring the data, and doing a little bit of cleaning when required.