
An Introduction to Data
Data science is a field of study that focuses on techniques and algorithms to extract knowledge from data. The area combines data mining and machine learning with data-specific domains. This section focuses on defining "data" before moving to more complicated topics.
Data Dimensionality and Space
This section focuses on defining the common terminology widely used in data science. The video lectures in this section cover terms like objects, data points, features, dimensions, vectors, high-dimensional data, and mathematical space.
Proximity in Data Science Context
Many data mining and machine learning algorithms rely on distance or similarity between objects/data points. Video lectures in this section focus on standard proximity measures used in data science. The section also explains how to use proximity measures to examine the neighborhood of a given point.
Clustering algorithms
A large portion of data science focuses on exploratory analysis. Scientists and practitioners use statistical techniques to understand the data. One way to explore the data is to check whether there are clusters of data points. A cluster is a group of data points that have similar features. This section explains clustering algorithms.
Classification algorithms
Evaluation of clustering algorithms: Measuring the quality of a clustering outcome
Clustering evaluation refers to the task of assessing how good the generated clusters are. Rand Index, Purity, Sum of Squared Distance (SSD), and Average Silhouette Coefficient are widely used clustering evaluation metrics. All these clustering assessment techniques fall under two categories: supervised evaluation, which uses an external criterion, and unsupervised evaluation, which uses an internal criterion. This page describes both types of clustering evaluation strategies.
Supervised evaluation of clustering using an external criterion
In supervised clustering evaluation, we already know what the cluster assignments should be for all the points. For validation, we compare our clustering outcome with the known assignments. Therefore, supervised evaluation is driven by an external criterion not used in the clustering algorithm. Most of the time, humans provide the external criterion in the form of a benchmark dataset or a gold standard dataset.
A benchmark dataset or a gold standard dataset is a dataset for which the expected outcome is provided. We use the benchmark dataset to validate the correctness of an algorithm by comparing the result of an algorithm with the expected outcome.
As we know, cluster assignments computed by an algorithm like k-means simply form an array or a vector. The length of the assignment vector is equal to the number of objects/data points/rows in the data table. Each element of the assignment vector tells us the cluster ID of the corresponding row in the dataset. Additionally, the known assignments also form a vector. The known assignments, or gold set assignments, are also called labels. A supervised clustering evaluation technique attempts to measure how well the assignments derived from a clustering algorithm match the known assignments.
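As a small illustration (the numbers below are made up, not taken from any figure on this page), an assignment vector and its matching label vector for a five-row dataset might look like this in Python:

```python
# A hypothetical assignment vector for a dataset with 5 rows.
# Element i holds the cluster ID the algorithm assigned to row i.
assignments = [1, 1, 2, 2, 2]  # rows 0-1 in cluster 1, rows 2-4 in cluster 2

# The known (gold set) assignments, or labels, have the same length:
# one element per data point.
labels = [1, 1, 2, 2, 2]

print(len(assignments) == len(labels))  # True
```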
Complications associated with an external criterion
For the dataset provided in Figure 1, the cluster assignments resulting from k-means exactly match the known or expected assignments. Any supervised evaluation should identify this as a perfect match. The clusters are drawn as two ellipses on the right side of Figure 1.
While Figure 1 demonstrates a 100% assignment-wise match between the k-means outcome and the gold set, there can be another scenario for the same dataset where each of the k-means assignments differs from the known assignments, yet the cluster-wise matching is 100% perfect. How is that possible?
Consider that k-means has flipped the cluster IDs. That is, Cluster 1 is now called Cluster 2, and Cluster 2 is now called Cluster 1 in the k-means outcome. The k-means outcome is still correct, but there is no direct match with the known assignments anymore. Should we consider that k-means is producing incorrect results? The answer is no. When the k-means algorithm gives us a result, it tells us which rows in the data belong to which cluster. Whatever it calls Cluster 1 now, it may call by another ID the next time we execute the algorithm with the same data. All that matters is whether the same set of points forms one cluster as in the known assignments. Figure 2 explains the scenario further.
That is, all the points of each cluster given by a clustering algorithm should be in one cluster of the known assignments for the match to be considered perfect. Therefore, when we design our evaluation metric, we should keep in mind that a cluster ID is just a number and should not be compared with the known assignments directly.
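A quick Python sketch (with made-up vectors) illustrates the point: the two vectors below describe the same grouping even though no individual element matches.

```python
# Two assignment vectors describing the same clustering,
# with the cluster IDs swapped (1 <-> 2).
algorithm_out = [1, 1, 2, 2]
gold_set      = [2, 2, 1, 1]

# Naive element-wise comparison reports zero matches...
naive_matches = sum(a == g for a, g in zip(algorithm_out, gold_set))
print(naive_matches)  # 0

# ...but grouping row indices by cluster ID shows identical clusters.
def groups(assignment):
    result = {}
    for row, cid in enumerate(assignment):
        result.setdefault(cid, set()).add(row)
    return set(frozenset(s) for s in result.values())

print(groups(algorithm_out) == groups(gold_set))  # True
```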
We will discuss two supervised clustering evaluation metrics: the Rand index and Purity.
Rand index for evaluation of clustering
The Rand index is a measure of how similar two sets of clustering results are. We use the Rand index to evaluate the outcome of a clustering algorithm by comparing the outcome with a known or expected outcome.
As discussed, directly matching cluster IDs between an algorithmic outcome and the expected clustering is not an option when evaluating clusters. Evaluation of clustering should instead be performed using the assignments of each pair of points. For example, with a correct clustering outcome, if a pair of points is in the same cluster in the gold set, the pair should also be in the same cluster created by the algorithm. If a pair of points is in two different clusters in the gold set, then the pair should be in two different clusters created by the algorithm too. Therefore, the more agreement there is between pairs of points in the algorithmic outcome and the gold set, the higher the correctness.
Positive pair: If a pair of points is in the same cluster created by a clustering algorithm, the pair is called a positive pair.
Negative pair: If a pair of points is in two separate clusters created by a clustering algorithm, the pair is called a negative pair.
In a correct clustering outcome, we expect that the positive pairs are also positive in the gold set. Similarly, a negative pair of points is expected to be in two different clusters in the gold set.
True-positive: If a pair of points is in the same cluster both in the clustering created by the algorithm and in the gold set, the pair is called a true-positive pair.
True-negative: If a pair of points is in two different clusters both in the clustering created by the algorithm and in the gold set, the pair is called a true-negative pair.
False-positive: If a pair of points is in the same cluster in the clustering created by the algorithm but in separate clusters in the gold set, the pair is called a false-positive pair.
False-negative: If a pair of points is in two separate clusters in the clustering created by the algorithm but in the same cluster in the gold set, the pair is called a false-negative pair.
The Rand index is the ratio between all true pairs (true-positives and true-negatives) and all pairs. That is,

$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$
A Rand index of 1.0 indicates a perfect match in clustering. Smaller values indicate errors. For both clustering results of Figure 2, the Rand index will be 1.0.
Python code to compute Rand index
A Python function to compute the Rand index between a clustering outcome and the expected outcome is provided below.
def RandIndex(clusterOutcome, expected):
    # Compute pairwise true-positive, true-negative,
    # false-positive, and false-negative counts
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    for i in range(0, len(expected)):
        for j in range(i + 1, len(expected)):
            if clusterOutcome[i] == clusterOutcome[j]:
                # positive pair in clustering outcome
                if expected[i] == expected[j]:
                    # positive in expected assignments
                    tp = tp + 1
                else:
                    # negative in expected assignments
                    fp = fp + 1
            else:
                # negative pair in clustering outcome
                if expected[i] == expected[j]:
                    # positive in expected assignments
                    fn = fn + 1
                else:
                    # negative in expected assignments
                    tn = tn + 1
    rand = (tp + tn) / (tp + tn + fp + fn)
    return rand
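As a quick sanity check of the behavior discussed above, here is an equivalent pair-counting computation (a self-contained sketch using itertools; the assignment vectors are made up) showing that flipping cluster IDs does not lower the score:

```python
from itertools import combinations

def rand_index(outcome, expected):
    # A pair of points "agrees" when both clusterings place it in the
    # same cluster, or both place it in different clusters.
    agree = total = 0
    for i, j in combinations(range(len(expected)), 2):
        same_out = outcome[i] == outcome[j]
        same_exp = expected[i] == expected[j]
        agree += same_out == same_exp
        total += 1
    return agree / total

# Flipped cluster IDs still yield a perfect score:
print(rand_index([1, 1, 2, 2, 2], [2, 2, 1, 1, 1]))  # 1.0
```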
Purity for evaluation of clustering
Purity is another clustering evaluation metric that uses an external criterion. To compute purity, each cluster of the algorithmic outcome is matched to the gold set cluster that contributes the most points to it. For example, Cluster 1 of the algorithmic outcome will be considered Cluster 2 of the gold set if most of the points of algorithmic Cluster 1 are marked as Cluster 2 in the gold set. Purity is then computed as the ratio between the sum, over all algorithmic clusters, of these maximum match counts and the total number of points in the data.
If there are $k$ algorithmic clusters $c_1, c_2, \dots, c_k$ and $t$ gold set clusters $g_1, g_2, \dots, g_t$, then purity is computed using the following formula, where $N$ is the total number of data points.

$$Purity = \frac{1}{N} \sum_{i=1}^{k} \max_{j \in \{1, \dots, t\}} |c_i \cap g_j|$$
The higher the purity the better the clustering outcome is. The maximum purity value is 1.0.
Example: Assume that we have a dataset with 14 data points for which we already know the expected cluster assignments. We run a clustering algorithm, such as k-means, with k=3 and receive the assignment vector for the 14 points. Both the outcome of the clustering algorithm and the expected cluster assignments are provided in the following table.
Clustering output:   2  1  1  3  2  2  2  3  2  1  1  3  3  2
Gold set/expected:   1  2  2  2  1  1  1  3  2  2  1  3  1  2
We will compute the purity of the output as an evaluation of clustering.
Cluster 1 of the clustering output has 1 match with Cluster 1 of the gold set, 3 matches with Cluster 2 of the gold set, and zero matches with Cluster 3 of the gold set. The maximum match count is 3 for Cluster 1 of the clustering output.
Cluster 2 of the clustering output has 4 matches with Cluster 1 of the gold set, 2 matches with Cluster 2 of the gold set, and zero matches with Cluster 3 of the gold set. The maximum match count is 4 for Cluster 2 of the clustering output.
Cluster 3 of the clustering output has 1 match with Cluster 1 of the gold set, 1 match with Cluster 2 of the gold set, and 2 matches with Cluster 3 of the gold set. The maximum match count is 2 for Cluster 3 of the clustering output.
Counting the maximum matches for each cluster of the clustering output, we have 3+4+2=9 in the numerator of the purity formula. The denominator is 14 because we have 14 data points (hence, the length of each assignment vector is 14). Therefore, the purity of the clustering outcome in this example is 9/14 ≈ 0.6429.
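The hand computation above can be sketched in Python. This is a minimal implementation of the purity formula (function and variable names are illustrative):

```python
from collections import Counter

def purity(outcome, gold):
    # Group gold labels by the algorithmic cluster they fall into,
    # then sum the size of the largest overlap for each cluster.
    overlap = {}
    for out_id, gold_id in zip(outcome, gold):
        overlap.setdefault(out_id, Counter())[gold_id] += 1
    return sum(max(c.values()) for c in overlap.values()) / len(gold)

# The 14-point example from the table above:
outcome = [2, 1, 1, 3, 2, 2, 2, 3, 2, 1, 1, 3, 3, 2]
gold    = [1, 2, 2, 2, 1, 1, 1, 3, 2, 2, 1, 3, 1, 2]
print(purity(outcome, gold))  # 9/14 ≈ 0.6429
```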
Unsupervised evaluation of clustering using an internal criterion
In unsupervised clustering evaluation, we do not know what the cluster assignments should be. That is, we do not have a gold set to compare with. Therefore, we cannot directly say how accurate the clustering outcomes are. Instead, we rely on how good the structure of each cluster is. A cluster whose points are very close to each other is considered a good cluster (due to the intra-cluster distance objective). Also, in a good clustering result, a pair of points from two different clusters should have a large distance (due to the inter-cluster distance objective). Unsupervised evaluation metrics generally leverage the intra-cluster and/or inter-cluster distance objectives of a clustering outcome.
Sum of squared distance for evaluation of clustering
The sum of squared distance between each point and the centroid of the cluster it is assigned to is a simple measure of clustering quality. Let $x_i$ be the $i$th point and $c^{(i)}$ be the centroid of the cluster $x_i$ is assigned to. Then the sum of squared distance (SSD) for $N$ data points is computed using the following formula.

$$SSD = \sum_{i=1}^{N} dist(x_i, c^{(i)})^2$$
The sum of squared distance can be used to compare the quality of the clustering outcomes of different executions of the same algorithm on the same data with the same number of clusters. For example, the k-means clustering algorithm might give different clustering outcomes in different runs using the same data with the same k. While this is uncommon when the dataset has clear and well-separable clusters, with complex and overlapping groups of points there might be multiple locally optimal clustering outcomes. It is a common practice to execute the k-means clustering algorithm many times and pick the few clustering outcomes with the smallest SSD values.
Lower SSD values indicate better results; a lower SSD means the points are not far from the centroids they are assigned to. Theoretically, the best SSD value is 0.0. SSD can become zero only when all points in a cluster are exactly equal to the centroid. That is, SSD = 0.0 when the distance between each point and its corresponding cluster centroid is 0.0.
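A minimal sketch of the SSD computation, assuming points and centroids are given as plain lists of coordinates (the sample data below is made up):

```python
def squared_distance(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ssd(points, assignments, centroids):
    # Sum of squared distances between each point and the centroid
    # of the cluster it is assigned to; assignments[i] indexes
    # into centroids.
    return sum(squared_distance(p, centroids[assignments[i]])
               for i, p in enumerate(points))

points      = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
assignments = [0, 0, 1, 1]
centroids   = [[1.0, 1.5], [8.5, 8.0]]
print(ssd(points, assignments, centroids))  # 1.0
```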
Average Silhouette Coefficient for the evaluation of clustering
The Silhouette Coefficient of a data point takes into account both the intra-cluster distance and the inter-cluster distance when evaluating a clustering outcome. It is an unsupervised clustering evaluation with an internal criterion. If $a$ represents the intra-cluster distance of a data point and $b$ represents the inter-cluster distance, then the Silhouette Coefficient of the point is $\frac{b - a}{\max(a, b)}$.
Let the clusters be $C_1, C_2, \dots, C_k$. The mean intra-cluster distance $a_i$ of the $i$th data point $x_i$, residing in cluster $C_j$, is computed using the following formula.

$$a_i = \frac{1}{|C_j| - 1} \sum_{y \in C_j,\, y \neq x_i} dist(x_i, y)$$
The mean inter-cluster distance $b_i$ of the $i$th data point is computed by taking the minimum, over every cluster $C_l$ other than $C_j$, of the mean distance between $x_i$ and all the data points of $C_l$.

$$b_i = \min_{l \neq j} \frac{1}{|C_l|} \sum_{y \in C_l} dist(x_i, y)$$
The Silhouette Coefficient $s_i$ of data point $x_i$ is computed as:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
The equation gives a Silhouette Coefficient for each data point. A negative silhouette value indicates that the intra-cluster distance is larger than the inter-cluster distance, which suggests the data point is not really inside a cluster but rather in a region with no structure.
A positive Silhouette Coefficient value for a data point indicates that the point is in a cluster that is separable from other clusters; that is, the intra-cluster distance is smaller than the inter-cluster distance. Note that the inter-cluster distance ($b_i$) is actually the mean distance to the points of (possibly) the nearest other cluster. Therefore, a positive Silhouette Coefficient indicates that the point is inside a cluster.
A Silhouette Coefficient value of 0.0 indicates that the point is probably on the border of a cluster that overlaps slightly with another cluster.
Higher Silhouette Coefficient values are desirable for all data points. To compute the overall quality of a clustering outcome, the Silhouette Coefficients are averaged over all the points. That is, the Average Silhouette Coefficient (ASC) for a dataset with $N$ data points is computed using the following formula.

$$ASC = \frac{1}{N} \sum_{i=1}^{N} s_i$$
ASC may vary between -1.0 and 1.0.
A negative ASC indicates that the clustering outcome does not provide any meaningful structure.
A positive ASC indicates that the clustering outcome has some structure. The larger the ASC, the better the clusters.
An ASC value of 0.0 indicates that the points are scattered in the space such that every point seems to be on the boundary of a cluster.
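The silhouette computations described above can be sketched in plain Python as follows (the sample data is made up; for real work, a library implementation such as scikit-learn's silhouette_score is preferable):

```python
def distance(p, q):
    # Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(i, points, assignments):
    my_cluster = assignments[i]
    # a: mean distance from point i to the other points of its own cluster.
    own = [points[j] for j, c in enumerate(assignments)
           if c == my_cluster and j != i]
    a = sum(distance(points[i], p) for p in own) / len(own)
    # b: smallest mean distance from point i to the points of any
    # other cluster.
    b = float("inf")
    for cid in set(assignments) - {my_cluster}:
        other = [points[j] for j, c in enumerate(assignments) if c == cid]
        mean_d = sum(distance(points[i], p) for p in other) / len(other)
        b = min(b, mean_d)
    return (b - a) / max(a, b)

def average_silhouette(points, assignments):
    n = len(points)
    return sum(silhouette(i, points, assignments) for i in range(n)) / n

# Two well-separated clusters give an ASC close to 1.0:
points      = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
assignments = [0, 0, 1, 1]
print(round(average_silhouette(points, assignments), 3))  # 0.899
```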
25 Comments
Kindly, if you can add lectures on Python, it will be highly advantageous for students, so that they can learn Python in parallel with this course. Thanks.
Hi, Thank you for your comment and suggestion. Someday, probably I will create lectures on Python programming. For now, my available time is so limited that it is difficult to run two lecture series in parallel. I really appreciate the feedback you have provided. I will definitely keep your suggestion in mind.
Hello prof. I’m just new here and also an incoming 2nd-year college student under BS Computational and Data Science. Any tips you have for me? From humanities I became a data science student, and I find it hard to understand some lessons about data science since I barely know about computer hardware, mostly about software. By learning through this, I hope I can get the most needed knowledge about data science here. Thank you for this free course. :D
Hi Allen,
Thank you for your question. Sorry for the delay.
You will not need to know computer hardware to study Data Science. I would say that one needs some high school math background to understand data science theories. Definitely, there are advanced topics in every subject for which more advanced backgrounds are required, but we cannot learn everything at once. Learning is gradual, and skill is developed over time.
At some point, one needs a programming language (such as Python, Matlab, or R) to use existing algorithms in Data Science or to implement new algorithms to solve real-world problems. I think your BS in Computational and Data Science program will cover that. I do not see any issues with a humanities background for learning data science. My suggestion is: please go over the data science lessons I have posted on this site and see if the concepts make sense. The lessons posted so far are good for starting, and I hope they are easy to understand. These lessons should give you an idea of what basic math background is required in the beginning. Then you can move forward with more complex topics of data science.
I will keep posting more lessons in a sequence in the coming months. I hope the new lessons will help too.
I wish you all the best in your academic pursuit.
Best regards,
Shahriar
Hello Prof
What kind of educational background does someone need to start this course?
Great question! The learner would need some mathematics and statistics background. I would say that the math and stat background need not be any more extensive than 12th-grade math and stat. Additionally, knowledge of a programming language will be good for implementing the theory I explain in the lectures. I will use Python to demonstrate some of the implementations. If someone knows at least one programming language, Python will not be hard to learn.
Thank you!
I completed my course introduction to data science, is any certificate provided here?
Thank you for your interest in the course. We are not providing any certificates yet. The course is still under development. We are planning to build the rest of the videos and contents over this year. If you register for the course, you will receive emails from us when there is a new video or a new lesson.
Hello prof, I am from a Mechanical Engineering background and have no prior knowledge of programming, but my desire to transition into the field of Data science grows stronger daily. Will you advise me to enroll in this course?
That is a great question. Since you already have a STEM background, it will not be hard to learn data science. In terms of programming, it will be helpful if you at least know one programming language. To learn the basics of a programming language, you can go over the videos of our Java Programming Video Lecture Series.
For Data Science, I would recommend learning Python or/and Matlab. If you know the basics of at least one programming language (such as Java), it will not be hard to pick up Python or Matlab.
Now, to answer the question of whether it is possible to learn data science without knowing any programming language: you can learn the theories from the Introduction to Data Science course, but if you plan to implement the concepts and use them on real data, you will need to learn Python or Matlab.
The Introduction to Data Science course is still under development but it has enough materials to start learning the basics of Data Science.
Please let me know if you have any questions.
good morning prof
would you recommend data science to a person with a statistics and maths degree but no computing?
Definitely. People with statistics and math backgrounds will shine and thrive with data science expertise. Knowledge of at least one programming language will be beneficial if someone wants to apply data science concepts to real-world data.
Thank you for asking this important question.
Prof can you kindly include a Lecture on Python, as it is very relevant to Data Science. Also can you suggest some further readings for me, because I want to specialize in Data Science.
Thank you
I agree that Python is important. For the rest of the course, which I am still working on, I will include Python codes from time to time. I am planning on creating more content over this summer. Please stay tuned. I will send out a notification after publishing each new video lecture on data science.
Thank you for your comments.
I have finished this course, very educative. thank you.
Thank you for going over all the existing content. I am planning on creating more content over this summer. Please stay tuned. I will send out a notification after publishing each new video lecture on data science.
Thank you for your message and interest.
I am grateful for the knowledge provided herein. However, most of the lessons are under construction. I am hopeful that the lessons will be completed soon. I would like to see myself analyzing data using some of the software, or becoming like the Prof.
Hi, Thank you for your messages and all your efforts in completing the existing lessons. You are correct that many of the lessons are still marked as “under construction”. My plan is to create more content over the next few months. I am also planning to include some of the data science-related programming tools (preferably in Python).
Thank you for your patience, interest, and perseverance.
Hello prof, how do I enrol for the course, are there any costs I will incur
Hi, Thank you for your interest in this course. This is a free course. On this page: Data Science, you will see a button titled “Take this course”. Once you register and click the “Take this course” button, you are enrolled. Then you can enjoy the lessons and the few quizzes available. Many of the topics of the course are currently under development, but you will be able to start with some content now. Over the coming months, we will be developing more content, including exercises. Stay tuned.
Thanks for this great opportunity. I want to know if any certificate will be issued after completing this course
Hi Alex, Thank you for your message. We are not providing certificates at this point. Many of the materials of the course are still under construction. Maybe someday, when the course has enough materials and exercises, we might include certificates.
Have a wonderful week.
Good Day Prof!
I have been trying to register for the course, but each time I submit my credentials, it returns “404 forbidden”.
I really want to partake in this course.
I am sorry to hear that you are struggling with registration. It is working at my end, so I thought it was working fine. Thank you for bringing the issue to my attention. I will get back to you after fixing the issue.
I have made some changes to the registration. Would you kindly try to register now? If you still have problems registering, I should be able to add you manually. Please let me know if your registration attempt works now.