    Computing for All
      Introduction to Data Science

      With easy-to-understand video lectures, quizzes, and Python codes
      • An Introduction to Data

        Data science is a field of study that focuses on techniques and algorithms to extract knowledge from data. The area combines data mining and machine learning with data-specific domains. This section focuses on defining "data" before going to any complicated topic.

        4
        • Lecture1.1
          What is data 05 min
        • Quiz1.1
          Quiz on Data 2 questions
        • Lecture1.2
          The simplest form of data
        • Lecture1.3
          Can data speak?
      • Data Dimensionality and Space

        This section's focus is on defining the common terminology widely used in data science. The video lectures in this section focus on terms like objects, data points, features, dimensions, vectors, high-dimensional data, and mathematical space.

        6
        • Lecture2.1
          Objects and features of a data table
        • Quiz2.1
          Quiz on objects and features 1 question
        • Lecture2.2
          What are “space” and a “high-dimensional space”?
        • Lecture2.3
          What is a dimension?
        • Quiz2.2
          Quiz on space and dimensions 2 questions
        • Lecture2.4
          What is a vector?
      • Proximity in Data Science Context

        Many data mining and machine learning algorithms rely on distance or similarity between objects/data points. Video lectures in this section focus on standard proximity measures used in data science. The section also explains how to use proximity measures to examine the neighborhood of a given point.

        5
        • Lecture3.1
          Notions of “nearness”: distance and similarity
        • Lecture3.2
          Distance measures
        • Lecture3.3
          Similarity measures
        • Lecture3.4
          k-nearest neighbors: Python code
        • Lecture3.5
          Matlab code: Finding k-nearest neighbors
      • Clustering algorithms

        A large portion of data science focuses on exploratory analysis. Scientists and practitioners use statistical techniques to understand the data. One way to explore the data is to check if there are clusters of data points. A cluster is a group of data points that have similar features. This section explains the clustering algorithms.

        7
        • Lecture4.1
          What is clustering?
        • Quiz4.1
          Quiz: introduction to clustering 2 questions
        • Lecture4.2
          A few types of clustering algorithms
        • Lecture4.3
          k-means clustering algorithm
        • Lecture4.4
          Hierarchical Agglomerative Clustering (HAC) algorithm (under construction)
        • Lecture4.5
          Density-based clustering algorithm: DBSCAN (under construction)
        • Lecture4.6
          Evaluation of clustering algorithms: Measure the quality of a clustering outcome
      • Classification algorithms
        4
        • Lecture5.1
          What is a classification problem?
        • Lecture5.2
          Logistic regression-based classification and linear regression 01 hour
        • Lecture5.3
          Naive Bayes classification (under construction)
        • Lecture5.4
          Evaluation of classification results (under construction)

        k-nearest neighbors: Python code

        Since data points form a space and we are already familiar with distance and similarity measures, we can find the points near a given point. Finding the points near a point is called computing nearest neighbors. Given a data point, finding the k closest points is called the computation of k-nearest neighbors, also known as computing the knn. This article contains Python code written from scratch to compute knn. Additionally, it provides an example of computing knn using the machine learning package scikit-learn in Python.

        Problem statement for knn

        Formally, the knn problem can be written as:

        Given a vector x and a data matrix D, order all n vectors of D such that D=\{x_1, x_2, x_3, \ldots, x_n\} and \text{distance}(x, x_i)\leq\text{distance}(x, x_{i+1}). Return the first k vectors S=\{x_1, x_2, x_3, \ldots, x_k\} where k\leq n. 

        For numerical objects, the length of x must be equal to the number of features/dimensions in D to be able to use a distance or similarity measure.

        To make a program efficient, knn returns the indices (row serial number) of the top k-nearest neighbors, instead of returning k complete vectors from the data.
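The problem statement above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used later in this lesson: the function name knn_indices and the small two-feature dataset are made up for the example, and math.dist (available in Python 3.8 and later) is assumed as the Euclidean distance.

```python
import math

def knn_indices(x, D, k):
    """Return the row indices of the k rows of D that are closest to x."""
    # Order the row indices by the distance between x and each row.
    order = sorted(range(len(D)), key=lambda i: math.dist(x, D[i]))
    # Slicing also covers k >= len(D): it then returns every index.
    return order[:k]

# Hypothetical data: 5 points with 2 features each.
D = [[1, 1], [5, 5], [2, 3], [9, 9], [5, 1]]
x = [2, 1]

print(knn_indices(x, D, 3))  # [0, 2, 4]
```

Returning indices rather than copies of the rows keeps the result small, and the caller can still look up the full vectors with D[i] when needed.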

        The distance function can be any proximity function that we are already familiar with from the previous lessons. Some applications call for a distance measure; for others, a similarity measure works as well, in which case the nearest neighbors are the points with the highest similarity rather than the smallest distance.
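When a similarity measure replaces a distance, the ordering flips: the nearest neighbors are the rows with the largest similarity. The sketch below shows this with cosine similarity. The function names and the tiny dataset are hypothetical, and the code assumes that no vector is all zeros (otherwise the division would fail).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def knn_by_similarity(x, D, k):
    """Return the indices of the k rows of D most similar to x."""
    # Higher similarity means nearer, so sort in descending order.
    order = sorted(range(len(D)),
                   key=lambda i: cosine_similarity(x, D[i]),
                   reverse=True)
    return order[:k]

# Hypothetical data: 4 points with 2 features each.
D = [[1, 0], [1, 1], [0, 1], [2, 1]]
x = [3, 1]

print(knn_by_similarity(x, D, 2))  # [3, 0]
```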

        Example of knn

        The table on the left side of the following figure has eight rows (objects) and two columns (features). A point x=(13, 17) is provided. We are attempting to find the 4 nearest neighbors of x in the data table. After computing the distance between x and each of the points, we find that Row 5 is the nearest point to x. The second nearest point is Row 6 of the table. The third nearest point is Row 3. Row 8 contains the fourth nearest neighbor of x. The figure also shows a 2D depiction of the points in space and the nearest neighbors.

        Figure: Computing k-nearest neighbors. The process to compute the nearest neighbors of a given point x.

        To compute the k nearest neighbors of the given point x=(13, 17) in the table above, we first compute the Euclidean distance between x and each row of the data. The data point (or row) with the smallest distance is the first nearest neighbor; the data point with the second smallest distance is the second nearest neighbor, and so on. When k=4, we select the first four rows in ascending order of their distance from x.
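For reference, the Euclidean distance used in this step can be written out as follows. The symbol m (the number of features) and the superscript (j) for the j-th feature are notation introduced here for illustration.

```latex
\text{distance}(x, x_i) = \sqrt{\sum_{j=1}^{m} \left(x^{(j)} - x_i^{(j)}\right)^2}
```

For the two-feature example with x=(13, 17), a row (a, b) therefore has distance \sqrt{(13-a)^2 + (17-b)^2}.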

        The calculations are shown below.

        Figure: Computing k-nearest neighbors (knn).

        For the example above, knn will return an array with content [5, 6, 3, 8], which indicates that Row 5 is the first nearest neighbor, Row 6 is the second nearest neighbor, Row 3 is the third nearest neighbor, and Row 8 is the fourth nearest neighbor.

        Remarks

        We used examples of two-dimensional data above because it is easy to visualize two-dimensional space. In reality, the data can have any number of dimensions. To use Euclidean distance to compute k-nearest neighbors, the given vector x must have the same number of dimensions as the data.

        In the examples with the Python codes below, we use higher-dimensional data.

        To find the knn in string data, one can compute the edit distance between the given string and each string object in the data, and then rank the strings by that distance.
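As a rough sketch of the string case, the code below ranks strings by Levenshtein edit distance, which is one common definition of edit distance (this lesson does not specify which variant it has in mind). The function names and the word list are made up for the example.

```python
def edit_distance(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def knn_strings(query, strings, k):
    """Return the indices of the k strings closest to query."""
    order = sorted(range(len(strings)),
                   key=lambda i: edit_distance(query, strings[i]))
    return order[:k]

# Hypothetical string data.
words = ["sitting", "mitten", "kitchen", "banana"]

print(knn_strings("kitten", words, 2))  # [1, 2]
```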

        Python coding to compute k-nearest neighbors

        Given a vector, we will find the row numbers (IDs) of the k closest data points. We will compute the k-nearest neighbors (knn) using Python from scratch.

        We will create the dataset in the code and then find the nearest neighbors of a given vector. In the example, our given vector is Row 0.

        knn from scratch using Python

        Here is the code. The function name is knearest. It uses two other functions. One is euclid, which computes the Euclidean distance between two data points or vectors. The other one is sortkey, which is the key function for sorting the indices based on distance.

        #knn python code
        import numpy as np
        
        def euclid(vec1, vec2):
            # Euclidean distance between two NumPy vectors of equal length
            return np.sqrt(np.sum((vec1 - vec2) ** 2))
        
        def sortkey(item):
            # item is a [row, distance] pair; sort by the distance
            return item[1]
        
        def knearest(vec, data, k):
            # Pair every row index with its distance to vec
            result = []
            for row in range(len(data)):
                distance = euclid(vec, data[row])
                result.append([row, distance])
            # Sort the pairs in ascending order of distance
            sortedResult = sorted(result, key=sortkey)
            # Keep the row indices of the first k pairs; if k >= len(data),
            # the slice returns every row index
            indices = [pair[0] for pair in sortedResult[:k]]
            return indices
            
        
        # 7 data points, each with 5 features
        data = np.array([[10,3,3,5,10],
                          [5,4,5,3,6],
                          [10,4,6,4,9],
                          [8,6,2,6,3],
                          [10,3,3,5,8],                 
                          [9,2,1,2,11],
                          [9,3,1,2,11]])
        
        referenceVec = data[0]  # We will find knn of Row 0
        
        # Find 4 nearest neighbors of the reference vector
        k=4
        knn = knearest(referenceVec, data, k)
        print("Row IDs of", k, "nearest neighbors:")
        print(knn)
        
        

        Once the Python script is executed, the output will be the following.

        Row IDs of 4 nearest neighbors:
        [0, 4, 2, 6]

        Since we are finding the nearest neighbors of Row 0, and Row 0 is itself part of the data, the first nearest neighbor is Row 0 itself at distance zero. Row 4, Row 2, and Row 6 are the second, the third, and the fourth nearest neighbors respectively.
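If the reference row should not count as its own neighbor, it can be skipped before ranking. The sketch below is one possible variation, not part of the original script: the function name knearest_excluding_self is made up, the data is the same seven rows written as plain lists, and heapq.nsmallest with math.dist (Python 3.8 and later) is used so that only the k best candidates are kept instead of sorting every row.

```python
import heapq
import math

def knearest_excluding_self(row_id, data, k):
    """Return the k rows nearest to data[row_id], skipping row_id itself."""
    candidates = (i for i in range(len(data)) if i != row_id)
    # nsmallest returns the k candidates with the smallest distance,
    # ordered from nearest to farthest.
    return heapq.nsmallest(
        k, candidates, key=lambda i: math.dist(data[row_id], data[i]))

# The same 7 data points used above, each with 5 features.
data = [[10, 3, 3, 5, 10],
        [5, 4, 5, 3, 6],
        [10, 4, 6, 4, 9],
        [8, 6, 2, 6, 3],
        [10, 3, 3, 5, 8],
        [9, 2, 1, 2, 11],
        [9, 3, 1, 2, 11]]

print(knearest_excluding_self(0, data, 4))  # [4, 2, 6, 5]
```

With Row 0 excluded, Row 5 moves into the result as the fourth nearest neighbor.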

        knn using Python scikit-learn

        The scikit-learn package for Python already provides the NearestNeighbors class, which computes k-nearest neighbors more efficiently using special data structures such as the ball tree. Here is an example of how we may use the NearestNeighbors class to find the nearest neighbors.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors
        
        # 7 data points, each with 5 features
        data = np.array([[10,3,3,5,10],
                          [5,4,5,3,6],
                          [10,4,6,4,9],
                          [8,6,2,6,3],
                          [10,3,3,5,8],                 
                          [9,2,1,2,11],
                          [9,3,1,2,11]])
        
        # Find 4 nearest neighbors of the reference vector
        k=4
        
        # Reference vector
        ReferenceVec=data[0]
        
        ## Using sklearn to find knn
        nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(data)
        distances, indices = nbrs.kneighbors([ReferenceVec])
        
        # "distances" contains the nearest distance values for all k points
        # "indices" contains the indices of the k nearest points
        
        print("Row IDs of", k, "nearest neighbors:")
        print(indices)
        
        print("Distances of these", k, "nearest neighbors:")
        print(distances)

        The code shows how we can find the nearest neighbors of Row 0 (ReferenceVec). The indices variable provides the indices of the nearest neighbors of Row 0. The distances variable provides the distance between Row 0 and each of the rows in indices. The output of the program is provided below.

        Row IDs of 4 nearest neighbors:
        [[0 4 2 6]]
        Distances of these 4 nearest neighbors:
        [[0. 2. 3.46410162 3.87298335]]

         

         


          6 Comments

        1. Avatar
          Ekanem Mfon Kindness
          December 5, 2020

          wow! this lecture is awesome

          • Shahriar
            Shahriar
            December 6, 2020

            I am glad to know that you liked the lecture. Thank you for visiting and commenting. Have a wonderful time.

        2. Avatar
          Amin
          July 21, 2020

          There is no video presentation for this lesson. I don’t have a maths background; the only maths I did was basic maths. The video presentations help some of us. Thanks

          • Shahriar
            Shahriar
            July 23, 2020

            Thank you for your feedback. I understand that the videos are more detailed and demonstrate the techniques well. We will make more videos and new data science lessons in the coming months. Please stay tuned.

        3. Avatar
          David W. Deemie
          May 6, 2020

          Thanks for the wonderful presentation!

          • Shahriar
            Shahriar
            May 6, 2020

            Glad to know that you liked it. Have a wonderful day.


          25 Comments

        1. Avatar
          Sohrab ul haq
          October 19, 2020

          Kindly, if you can add lectures on Python, it will be highly advantageous for students, so that they can learn Python in parallel with this course. Thanks

          • Shahriar
            Shahriar
            October 19, 2020

            Hi, Thank you for your comment and suggestion. Someday, probably I will create lectures on Python programming. For now, my available time is so limited that it is difficult to run two lecture series in parallel. I really appreciate the feedback you have provided. I will definitely keep your suggestion in mind.

        2. Avatar
          Allen
          August 12, 2020

          Hello prof. I’m new here and an incoming 2nd-year college student in BS Computational and Data Science. Do you have any tips for me? I moved from the humanities to data science, and I find it hard to understand some data science lessons since I barely know about computer hardware, mostly about software. By learning through this, I hope I can get the most needed knowledge about data science here. Thank you for this free course. :D

          • Shahriar
            Shahriar
            August 17, 2020

            Hi Allen,
            Thank you for your question. Sorry for the delay.

            You will not need to know computer hardware to study Data Science. I would say that one needs some high school math background to understand data science theories. Definitely, there are advanced topics in every subject for which a more advanced background is required, but we cannot learn everything at once. Learning is gradual and skill is developed over time.

            At some point, one needs a programming language (such as Python, Matlab, or R) to use existing algorithms in Data Science or to implement new algorithms to solve real-world problems. I think your BS in Computation and Data Science program will cover that. I do not see any issues with a humanities background for learning data science. My suggestion is — please go over the data science lessons I have posted on this site and see if the concepts make sense. The lessons posted so far are good for starting and I hope, they are easy to understand. These lessons should give you an idea about what basic math backgrounds are required in the beginning. Then you can move forward with more complex topics of data science.

            I will keep posting more lessons in a sequence in the coming months. I hope the new lessons will help too.

            I wish you all the best in your academic pursuit.

            Best regards,
            Shahriar

        3. Avatar
          Kanu
          June 14, 2020

          Hello Prof
          What kind of educational background does someone need to start this course?

          • Shahriar
            Shahriar
            June 15, 2020

            Great question! The learner would need some sort of mathematics and statistics background. I would say that the math and stat backgrounds need not be any more extensive than 12th-grade math and stat. Additionally, knowledge of a programming language will be good for the implementation of the theory I explain in the lectures. I will use Python to demonstrate some of the implementations. If someone knows at least one programming language, Python will not be hard to learn.

            Thank you!

        4. Avatar
          Mohammad sameena
          June 11, 2020

          I completed my course introduction to data science, is any certificate provided here?

          • Shahriar
            Shahriar
            June 11, 2020

            Thank you for your interest in the course. We are still not providing any certificate. The course is still under development. We are planning to build the rest of the videos and contents over this year. If you registered for the course, you would receive emails from us when there is a new video or a new lesson.

        5. Avatar
          Temitayo Aworo
          June 5, 2020

          Hello prof, I am from a Mechanical Engineering background and have no prior knowledge of programming, but my desire to transition into the field of Data science grows stronger daily. Will you advise me to enroll in this course?

          • Shahriar
            Shahriar
            June 7, 2020

            That is a great question. Since you already have a STEM background, it will not be hard to learn data science. In terms of programming, it will be helpful if you at least know one programming language. To learn the basics of a programming language, you can go over the videos of our Java Programming Video Lecture Series.

            For Data Science, I would recommend learning Python or/and Matlab. If you know the basics of at least one programming language (such as Java), it will not be hard to pick up Python or Matlab.

            Now, to answer the question if it is possible to learn data science without knowing any programming language — you can learn the theories from the Introduction to Data Science course but if you plan to implement the concepts and utilize them on real data, you will need to learn Python or Matlab.

            The Introduction to Data Science course is still under development but it has enough materials to start learning the basics of Data Science.

            Please let me know if you have any questions.

        6. Avatar
          Austin
          May 26, 2020

          good morning prof

          Would you recommend data science to a person with a statistics and maths degree but no computing?

          • Shahriar
            Shahriar
            May 28, 2020

            Definitely. People with statistics and math backgrounds will shine and thrive with data science expertise. Knowledge of at least one programming language will be beneficial if someone wants to use data science concepts with real-world data.

            Thank you for asking this important question.

        7. Avatar
          Promopj
          May 23, 2020

          Prof, can you kindly include a lecture on Python, as it is very relevant to Data Science? Also, can you suggest some further readings for me, because I want to specialize in Data Science.
          Thank you

          • Shahriar
            Shahriar
            May 23, 2020

            I agree that Python is important. For the rest of the course, which I am still working on, I will include Python codes from time to time. I am planning on creating more content over this summer. Please stay tuned. I will send out a notification after publishing each new video lecture on data science.

            Thank you for your comments.

        8. Avatar
          Promopj
          May 23, 2020

          I have finished this course, very educative. thank you.

          • Shahriar
            Shahriar
            May 23, 2020

            Thank you for going over all the existing content. I am planning on creating more content over this summer. Please stay tuned. I will send out a notification after publishing each new video lecture on data science.

            Thank you for your message and interest.

        9. Avatar
          David W. Deemie
          May 6, 2020

          I am grateful for the knowledge provided herein. However, most of the lessons are under construction. I am hopeful that the lessons will be completed soon. I would like to see myself analyzing data now using some of the software, or becoming like the Prof.

          • Shahriar
            Shahriar
            May 7, 2020

            Hi, Thank you for your messages and all your efforts in completing the existing lessons. You are correct that many of the lessons are still marked as “under construction”. My plan is to create more content over the next few months. I am also planning to include some of the data science-related programming tools (preferably in Python).

            Thank you for your patience, interest, and perseverance.

        10. Avatar
          Stehen ebapu
          May 6, 2020

          Hello prof, how do I enrol for the course, are there any costs I will incur

          • Shahriar
            Shahriar
            May 7, 2020

            Hi, Thank you for your interest in this course. This is a free course. On this page: Data Science, you will see a button titled “Take this course”. Once you register and click the “Take this course” button, you are enrolled. Then you can enjoy the lessons and the few quizzes available. Many of the topics of the course are under development right now, but you will be able to start with some content now. Over the coming months, we will be developing more content including exercises. Stay tuned.

        11. Avatar
          Alex
          April 27, 2020

          Thanks for this great opportunity. I want to know if any certificate will be issued after completing this course

          • Shahriar
            Shahriar
            April 27, 2020

            Hi Alex, Thank you for your message. We are not providing certificates at this point. Many of the materials of the course are still under construction. Maybe someday, when the course has enough materials and exercises, we might include certificates.

            Have a wonderful week.

        12. Avatar
          Ebeiyamba Okon
          April 8, 2020

          Good Day Prof!
          I have been trying to register for the course but each time i submit my credentials, it returns “404 forbidden”.
          I really want to partake in this course.

          • Shahriar
            Shahriar
            April 8, 2020

            I am sorry to hear that you are struggling with registration. It is working at my end, so I thought it is working fine. Thank you for bringing the issue to my attention. I will get back to you after fixing the issue.

          • Shahriar
            Shahriar
            April 8, 2020

            I have made some changes to the registration. Would you kindly try to register now? If you still have a problem registering, I should be able to add you manually. Please let me know if your attempt to register works now.


        Computing For All by Computing4All.
