Linear Regression Model for Prediction

In the next section, we will learn about Logistic Regression-based classification. The logistic regression model builds on top of the Linear regression-based prediction model. This is why in this section, we are going to discuss linear regression-based prediction.

Videos on Linear Regression-based Prediction

Linear Regression Prediction

Linear regression-based prediction is a statistical technique used to predict a particular outcome based on a set of input variables. It is based on the assumption of a linear relationship between the input variables and the outcome. By using linear regression, one can predict the outcome by applying a linear equation to the data, thereby making it possible to predict the outcome with a certain degree of accuracy. Linear regression-based prediction is a powerful technique in many areas, such as finance, marketing, and engineering. It can identify relationships between variables, estimate the impact of changes, and forecast future trends.

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$

To find the coefficients $\beta_0$ , $\beta_1$ , $\beta_2$ , …, $\beta_n$ in the linear regression equation, we need to use a technique called Ordinary Least Squares (OLS) regression. The goal of OLS regression is to minimize the sum of the squared errors between the predicted values of y and the actual values of y in the training data.

The calculations to find the coefficients are as follows:

Calculate the mean of each predictor variable ( $x_i$ ) and the response variable ( $y$ ).

$\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}$

$\bar{y} = \frac{1}{n} \sum_{j=1}^{n} y_j$

where n is the number of observations in the training data, and $x_{ij}$ and $y_j$ are the values of the i-th predictor variable and the response variable, respectively, for the j-th observation.

Calculate the sample covariance between each predictor variable ( $x_i$ ) and the response variable ( $y$ ).

$s_{xyi} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)(y_j - \bar{y})$

Calculate the sample variance of each predictor variable (x_i).

$s_{xi}^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2$

Calculate the sample correlation coefficient between each predictor variable ( $x_i$ ) and the response variable ( $y$ ).

$r_{yi} = \frac{s_{xyi}} {(s_{xi} \times s_y)}$

where $s_y$ is the sample standard deviation of $y$ .

Calculate the coefficients using the following formula:

$\beta_i = \frac{s_{xyi}}{ s_{xi}^2}$

$\beta_0 = \bar{y} - \sum_{i=1}^{n} \beta_i\bar{x}_i$

These calculations assume that the predictor variables are not linearly dependent, and that the errors are normally distributed with constant variance. If these assumptions are not met, the OLS regression results may not be reliable.

Example

Let us work through a simple example to illustrate the calculations for finding the coefficients of a linear regression model using the Ordinary Least Squares method.

Suppose we have a dataset with four predictor variables (x1, x2, x3, x4) and one response variable (y), and 6 observations:

x1 x2 x3 x4 y
2.3 3.5 1.2 0.5 23
1.9 2.7 3.1 2.2 15
3.5 1.1 2.5 1.8 31
2.7 1.9 2.8 1.1 20
2.9 2.2 1.7 1.6 25
1.5 2.6 3.2 2.8 18

We want to fit a linear regression model to predict y from the four predictor variables. That is, we want to calculate the five coefficients (beta values $\beta_0$ , $\beta_1$ , $\beta_2$ , $\beta_3$ and $\beta_4$ ).

Step 1: Calculate the mean of each predictor variable and the response variable.

$\bar{x1} = (2.3 + 1.9 + 3.5 + 2.7 + 2.9 + 1.5) / 6 = 2.4$

$\bar{x2} = (3.5 + 2.7 + 1.1 + 1.9 + 2.2 + 2.6) / 6 = 2.57$

$\bar{x3} = (1.2 + 3.1 + 2.5 + 2.8 + 1.7 + 3.2) / 6 = 2.25$

$\bar{x4} = (0.5 + 2.2 + 1.8 + 1.1 + 1.6 + 2.8) / 6 = 1.6667$

$\bar{y} = (23 + 15 + 31 + 20 + 25 + 18) / 6 = 22$

Step 2: Calculate the sample covariance between each predictor variable and the response variable.

$s_{x1y} = ((2.3 - 2.4)*(23 - 22) + (1.9 - 2.4)*(15 - 22) + (3.5 - 2.4)*(31 - 22) + (2.7 - 2.4)*(20 - 22) + (2.9 - 2.4)*(25 - 22) + (1.5 - 2.4)*(18 - 22)) / 5 = -1.86$

$s_{x2y} = ((3.5 - 2.57)*(23 - 22) + (2.7 - 2.57)*(15 - 22) + (1.1 - 2.57)*(31 - 22) + (1.9 - 2.57)*(20 - 22) + (2.2 - 2.57)*(25 - 22) + (2.6 - 2.57)*(18 - 22)) / 5 = -1.21$

$s_{x3y} = ((1.2 - 2.25)*(23 - 22) + (3.1 - 2.25)*(15 - 22) + (2.5 - 2.25)*(31 - 22) + (2.8 - 2.25)*(20 - 22) + (1.7 - 2.25)(25 - 22) + (3.2 - 2.25)(18 - 22)) / 5 = 1.215$

$s_{x4y} = ((0.5 - 1.6667)(23 - 22) + (2.2 - 1.6667)(15 - 22) + (1.8 - 1.6667)(31 - 22) + (1.1 - 1.6667)(20 - 22) + (1.6 - 1.6667)(25 - 22) + (2.8 - 1.6667)*(18 - 22)) / 5 = -1.03$

Step 3: Calculate the sample variance of each predictor variable.

$s_{x1}^2 = ((2.3 - 2.4)^2 + (1.9 - 2.4)^2 + (3.5 - 2.4)^2 + (2.7 - 2.4)^2 + (2.9 - 2.4)^2 + (1.5 - 2.4)^2) / 5 = 0.352$

$s_{x2}^2 = ((3.5 - 2.57)^2 + (2.7 - 2.57)^2 + (1.1 - 2.57)^2 + (1.9 - 2.57)^2 + (2.2 - 2.57)^2 + (2.6 - 2.57)^2) / 5 = 0.571$

$s_{x3}^2 = ((1.2 - 2.25)^2 + (3.1 - 2.25)^2 + (2.5 - 2.25)^2 + (2.8 - 2.25)^2 + (1.7 - 2.25)^2 + (3.2 - 2.25)^2) / 5 = 0.849$

$s_{x4}^2 = ((0.5 - 1.6667)^2 + (2.2 - 1.6667)^2 + (1.8 - 1.6667)^2 + (1.1 - 1.6667)^2 + (1.6 - 1.6667)^2 + (2.8 - 1.6667)^2) / 5 = 0.671$

Step 4: Calculate the sample correlation coefficient between each predictor variable and the response variable.

$r_{y1} = s_{x1y} / (s_{x1} * s_y) = -1.86 / (0.5932 * 2.607) = -0.698$

$r_{y2} = s_{x2y} / (s_{x2} * s_y) = -1.21 / (0.7685 * 2.607) = -0.614$

$r_{y3} = s_{x3y} / (s_{x3} * s_y) = 1.215 / (0.9215 * 2.607) = 0.493$

$r_{y4} = s_{x4y} / (s_{x4} * s_y) = -1.03 / (0.671 * 2.607) = -0.560$

Step 5: Calculate the coefficients using the following formula.

$\beta_1 = s_{x1y} / s_{x1}^2 = -1.86 / 0.352 = -5.284$

$\beta_2 = s_{x2y} / s_{x2}^2 = -1.21 / 0.571 = -2.121$

$\beta_3 = s_{x3y} / s_{x3}^2 = 1.215 / 0.849 = 1.431$

$\beta_4 = s_{x4y} / s_{x4}^2 = -1.03 / 0.671 = -1.539$

$\beta_0 = \bar{y} - \beta_1\bar{x}_1 - \beta_2\bar{x}_2 - \beta_3\bar{x}_3 - \beta_4\bar{x}_4 \\ = 22 - (-5.284 * 2.4) - (-2.121 * 2.57) + (1.431 * 2.25) - (-1.539 * 1.6667)\\ = 33.625$

Therefore, the linear regression equation that best fits this dataset is:

$y = 33.625 - 5.284x1 - 2.121x2 + 1.431x3 - 1.539x4$

The equation can be used to predict y values for new observations.

For example, suppose we want to predict the value of the response variable y for a new observation with the predictor variables

x1=2.1, x2=3.0, x3=1.8, x4=2.3

We can use the linear regression equation we just calculated to make the prediction:

$y = 33.625 - 5.284(2.1) - 2.121(3.0) + 1.431(1.8) - 1.539(2.3) = 17.79$

So the predicted value of y for this new observation is 17.79.

Note that the linear regression model assumes that the relationship between the predictor variables and the response variable is linear and that the errors are normally distributed with constant variance. These assumptions should be checked using diagnostic plots and other techniques to ensure the model’s reliability.

Linear Regression Model for Prediction

Videos on Linear Regression-based Prediction

Linear Regression Prediction

Example

Step 1: Calculate the mean of each predictor variable and the response variable.

Step 2: Calculate the sample covariance between each predictor variable and the response variable.

Step 3: Calculate the sample variance of each predictor variable.

Step 4: Calculate the sample correlation coefficient between each predictor variable and the response variable.

Step 5: Calculate the coefficients using the following formula.

Leave A Reply Cancel reply

Login with your site account

Register a new account

Modal title