20 Linear Regression
Regression models are a class of statistical models that let us explore the relationship between a response variable and one or more explanatory variables. If such a relationship exists and is detected, we can predict the value of the response variable from the explanatory variables. For example, if a relationship is established between the number of employees in an office and its monthly salary expenses, we can predict the monthly salary expenses of an office once we know its number of employees. That lets us run thought experiments, such as asking how much a new company would have to pay in monthly salary expenses for, say, 10 employees.
Explanatory variables are sometimes also referred to as regressors, independent variables or predictor variables, whereas the response variable is also called the dependent or outcome variable. When the response variable is numeric and the relationship is linear, the regression is called linear regression. Of course, there are certain other assumptions, which we will discuss later in the chapter.
20.1 Basic concepts
Before we start creating regression models, it is always good practice to visualize the data. It helps us ascertain the nature of the relationship, if any, between the predictors and the response variable. To visualize the relationship between two numeric variables, we can use a scatter plot. See the two examples in Figure 20.1. In the first plot we can see a near-perfect linear relationship between GNP and Population of the countries, whereas the second plot shows a moderate but negative relationship between the two variables.


Figure 20.1: Linear Regression - Intuition
Linear regression is an appropriate model for predicting the outcome or response variable when it has a linear relationship with the explanatory variables. In that case, ideally, our predicted values will lie on the assumed line. The mathematics behind estimating or predicting outcomes thus amounts to finding the following algebraic equation (20.1).
\[\begin{equation} y = mx + c \tag{20.1} \end{equation}\]
where -
- \(m\) is the slope of the line (in the linear relationship)
- \(c\) is the intercept on the Y-axis (the value of \(y\) when \(x\) is \(0\))
Interpreting this equation in the real world amounts to estimating the coefficients, i.e. both \(m\) and \(c\) in the above equation. The goal of our exercise will thus be to estimate the best values of \(m\) and \(c\). These two parameters, \(m\) and \(c\), are sometimes also referred to as \(\beta_1\) and \(\beta_0\) respectively. Those familiar with the mathematics of fitting the equation of a line (in two-dimensional space) may know that only two data points (i.e. \(x\) and \(y\) value pairs) are required to find these parameters. So if we have a fair number of data points available (which will only in the rarest of circumstances be collinear, i.e. lying on one and the same line), we can actually get many such regression lines. Our goal is thus to find the best of these lines. But how?
To answer this question, let us also note that in actual practice the \(x\) and \(y\) pairs don't lie on the regression line, because these points will rarely be collinear (see the plots in Figure 20.1 and note that none of the data points actually lies on the line). So for each data point of the response variable there is an actual value and a fitted value (the one falling on the regression line). The difference between these values is called the error term or residual. Technically these are not errors but random noise that the model is not able to explain. Mathematically, if \(\hat{y}\) is the fitted value and \(y\) is the actual value, the difference, also called the error (of prediction) term, say \(\epsilon\), can be denoted as -
\[\begin{equation} \epsilon = y - \hat{y} \tag{20.2} \end{equation}\]
Or, equivalently, we can say that
\[\begin{equation} y = \beta_0 + \beta_1x + \epsilon \tag{20.3} \end{equation}\]
Now, one method to find the best-fit regression line is to minimize the error terms. Theoretically, it means capturing the pattern/relationship between the data points as much as possible so that what's left behind is true random noise. One way could be to minimize the mean of these error terms. But these error terms can be both positive and negative; see Figure 20.2. So, to ensure that they do not cancel out when taking the mean, we can minimize either the mean of their absolute values or the mean of their squares. The most commonly accepted method is to take the mean of squares and minimize it. One benefit of this choice is that, when squaring the errors or residuals, large residuals get a higher weight than small residuals. That is why this linear regression technique is also referred to as Ordinary Least Squares or OLS regression.

Figure 20.2: Residuals - Intuition and Concept
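To build intuition for this least-squares idea, we can minimise the mean of squared errors numerically ourselves and recover essentially the line that lm (introduced in the next section) would fit. A minimal sketch, using small hypothetical data x and y created only for illustration:
# Hypothetical data with a roughly linear relationship
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
# Mean of squared errors for a candidate intercept (par[1]) and slope (par[2])
mse <- function(par) mean((y - (par[1] + par[2] * x))^2)
# Numerically minimise the MSE; the result is close to coef(lm(y ~ x))
optim(c(0, 0), mse)$par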
20.2 Simple Linear Regression in R
Don’t worry, in R we do not have to do this minimisation job ourselves. In fact, base R has a fantastic function lm which can fit the best regression line, for a given set of variables, for us. The syntax is simple.
lm(formula, data, ...)
where -
- formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. For our example above, we can write y ~ x
- data: an optional data frame.
It actually returns a special object, which can be printed directly like other R objects. However, it is best printed with the summary function. First, we will see an example of simple linear regression, which is a linear regression with only one independent variable.
Example-1: Problem Statement: iris, a famous (Fisher's or Anderson's) data set loaded by default in R, gives the measurements in centimeters of the variables sepal length (Sepal.Length) and width (Sepal.Width) and petal length (Petal.Length) and width (Petal.Width), respectively, for 50 flowers from each of 3 species (Species) of iris. The species are Iris setosa, versicolor, and virginica. Let us try to establish a relationship between the Sepal.Length and Sepal.Width variables of the setosa species, i.e. the first 50 records of the iris data-set.
As already stated above, it is always a good practice to visualize the data, if possible. So let’s make a scatter-plot, as seen in Figure 20.3.
iris %>%
  # Extract the first 50 records
  head(50) %>%
  ggplot(aes(Sepal.Length, Sepal.Width)) +
  # Scatter plot
  geom_point() +
  # Add a trend line without error bands
  geom_smooth(method = 'lm', se = FALSE, formula = "y~x") +
  # Theme black and white
  theme_bw()

Figure 20.3: Relationship between sepal widths and lengths in setosa species
The relationship seems fairly linear (Figure 20.3), so let's build the model.
lin_reg1 <- lm(formula = Sepal.Width ~ Sepal.Length, data = iris[1:50,])
# Let us print the object directly
lin_reg1
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris[1:50, ])
##
## Coefficients:
## (Intercept) Sepal.Length
## -0.5694 0.7985
# Try printing it with summary()
summary(lin_reg1)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris[1:50, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72394 -0.18273 -0.00306 0.15738 0.51709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5694 0.5217 -1.091 0.281
## Sepal.Length 0.7985 0.1040 7.681 6.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2565 on 48 degrees of freedom
## Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
## F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
Observing the outputs above, we can notice that simply printing the object returns the coefficients, whereas printing with summary gives us a lot of other information. But how do we interpret this information? Before proceeding to interpret the output, let us understand a few more concepts which are essential here. These concepts are basically some assumptions that we have made while finding the best-fit line, or in other words, while estimating the parameters statistically.
20.3 Assumptions of Linear Regression
Linear regression makes several assumptions about the data, such as:
- Linearity of the data. The relationship between the predictor (\(x\)) and the outcome (\(y\)) is assumed to be linear. If we try to fit a non-linear relationship through linear regression, our results won't be correct. Also, when we use multiple predictors, as we will see shortly, we make the additional assumption that the model is additive in nature besides being linear. Refer to Figure 20.4: it is clear that if we try to establish a linear relationship (red line) when the relationship is actually cubic (green dashed line), our model will give erroneous results.

Figure 20.4: Is the relationship linear?
- Normality of residuals. The residuals are assumed to be normally distributed. This assumption follows from the assumption that our dependent variable is normally distributed and not concentrated anywhere in particular. Thus, if the dependent variable is normally distributed, and if we have been able to capture the available relationship, then what is left must be true noise and should be normally distributed with a mean of \(0\).
- Homogeneity of residual variance. The residuals are assumed to have a constant variance, statistically known as homoskedasticity. It shows that the residuals left out of the regression model are true noise and not related to the fitted values, which would otherwise mean that the model was insufficient to capture the actual relationship. Heteroskedasticity (the violation of homoskedasticity) is present when the size of the error term differs across values of an independent variable. This can best be understood from the plots in Figure 20.5, where the residuals in the left plot indicate equal variance and thus homoskedasticity, whereas the right plot clearly indicates heteroskedasticity.


Figure 20.5: Homoskedasticity (left) Vs. Heteroskedasticity (right) Sample data created by author for demonstration only
- Independence of residuals (error terms). This assumption follows from the original assumption that our dependent variable is independent in itself, i.e. any \(y\) value does not depend on another set of \(y\) values. In our example, the sepal width of one sample does not affect the width of another.
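These assumptions can also be probed numerically once a model is fitted; a minimal sketch using the lin_reg1 model built in the previous section (the graphical checks are covered later in this chapter):
# Normality of residuals: Shapiro-Wilk test (null hypothesis: residuals are normal)
shapiro.test(residuals(lin_reg1))
# Rough check for constant variance: correlation between absolute residuals
# and fitted values should be close to zero
cor(abs(residuals(lin_reg1)), fitted(lin_reg1))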
20.4 Interpreting the output of lm
Let us discuss each component or section of the lm output, hereinafter referred to as the model.
20.4.1 call
The call section shows us the formula that we have used to fit the regression model. So Sepal.Width is our dependent or response variable, and Sepal.Length is the predictor or independent variable. These variables refer to our data-set, which is the first 50 rows of iris, i.e. iris[1:50,].
20.4.2 Residuals
This depicts the quantile or five-point summary of the error terms or residuals, which, as discussed, are the differences between the actual values and the predicted values. We can generate these same values by taking the actual values of the \(y\) variable and subtracting the model's fitted values from them.
summary(iris$Sepal.Width[1:50] - lin_reg1$fitted.values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.723945 -0.182730 -0.003062 0.000000 0.157380 0.517085
Ideally, the median of the error values/residuals should be centered around 0, telling us that they are somewhat symmetrical and that our model predicts fairly on both the positive/higher and negative/lower sides. Any skewness will thus show that the errors are not normally distributed and that our model may be biased towards that side.

Figure 20.6: Histogram of Residuals
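A histogram like the one in Figure 20.6 can be drawn directly from the residuals stored in the model object; a minimal base-R sketch:
# Histogram of the residuals of lin_reg1
hist(lin_reg1$residuals, breaks = 10,
     main = "Histogram of Residuals", xlab = "Residuals")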
20.4.3 Coefficients
Remember that estimating these was the goal of our exercise. The output reports one coefficient for the predictor and one intercept term, i.e. our \(\beta_1\) and \(\beta_0\) respectively. A positive coefficient means a positive relation, whereas a negative coefficient means the value of the outcome variable will decrease as the corresponding independent variable increases. So, in our example, we can deduce that -
- for a sepal length of 0, the sepal width would be -0.5694 (though only mathematically, as zero or negative widths and lengths are not physically possible);
- for every unit (i.e. 1) increase in sepal length, the sepal width increases by 0.7985.
Thus our regression line equation is -
\[\begin{equation} Sepal.Width = -0.5694 + 0.7985 * Sepal.Length \tag{20.4} \end{equation}\]
Now, one thing to note here is that, since we are adopting the OLS approach to find the (estimated) equation of the best line, the coefficients we have arrived at are only estimates of the mean values of these coefficients. Actually, we started (behind the scenes) with a null hypothesis that there is no linear relationship or, in other words, that the coefficients are zero. The alternate hypothesis, as you may have guessed by now, is that these coefficients are not zero. The coefficients follow a statistical/probabilistic distribution, and thus we may infer only their estimated (mean) values.
We can see the confidence interval of each coefficient using the function confint. By default, 95% confidence intervals are generated. See the following.
confint(lin_reg1)
## 2.5 % 97.5 %
## (Intercept) -1.6184048 0.4795395
## Sepal.Length 0.5894925 1.0075641
Now, these estimated distributions must also come with a standard error, a test statistic and a p-value.
- The standard error of the coefficient is an estimate of the standard deviation of the coefficient. It tells us how much uncertainty there is with our coefficient. We can build confidence intervals of coefficients using this statistic, as shown above.
- The t-statistic is simply the estimated coefficient divided by the standard error. By now you may have understood that we are applying student’s t-distribution while estimating the parameters.
- Finally, the p-value, i.e. Pr(>|t|), gives us the probability value and tells us how significant our coefficient estimate is.
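These quantities can be reproduced from the coefficient table itself; a small sketch using the summary of lin_reg1:
cf <- summary(lin_reg1)$coefficients
# t value = Estimate / Std. Error
cf[, "Estimate"] / cf[, "Std. Error"]
# Two-sided p-value from the t distribution on the residual degrees of freedom
2 * pt(-abs(cf[, "t value"]), df = lin_reg1$df.residual)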
20.4.4 Signif. codes
These are nothing but code legends, simply telling us how significant our p-value may be in each case. Notice the three asterisks in front of the coefficient estimate of Sepal.Length, which indicate that the coefficient is extremely significant and we can reject the null hypothesis that \(\beta_1\) is \(0\).
These codes give us a quick way to visually see which coefficients are significant to the model.
20.4.5 Residual standard error
The residual standard error is one of the metrics telling us how well the model fits the data. It is essentially the standard deviation of the error terms, with the difference that instead of dividing by the number of observations n we divide by the degrees of freedom.
\[\begin{equation} RSE = \sqrt{\frac{1}{(n-2)} \sum_{i=1}^n (y_i - \hat{y_i})^2} \tag{20.5} \end{equation}\]
Obviously \(df\) is \(n-2\), as there is one regressor and one intercept. We can verify equation (20.5) by calculation.
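A minimal sketch of that check, using the residuals and residual degrees of freedom stored in lin_reg1:
# RSE: square root of (sum of squared residuals / residual degrees of freedom)
sqrt(sum(lin_reg1$residuals^2) / lin_reg1$df.residual)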
## [1] 0.2565263
20.4.6 R-Squared: both Multiple and Adjusted
Multiple R-Squared is also called the coefficient of determination. It is often the most cited measure of how well the model fits the data. It tells us what percentage of the variation in our dependent variable the independent variable explains through the model. Looking at the output, we can say that about 55% of the variation is explained by the model. We will discuss it in detail in the next section.
Adjusted R-squared, on the other hand, shows what percentage of the variation in our dependent variable all the predictors together explain, after adjusting for the number of predictors. Thus, it is helpful when there are multiple regressors, i.e. in multiple linear regression. The difference between the two metrics can be subtle: we adjust for the variance that is attributable merely to adding more variables.
20.4.7 F-Statistic and p-value
So why a p-value again? Is there a hypothesis again? Yes. When running a regression model, a hypothesis test is run on the global model: the null hypothesis is that there is no relationship between the dependent variable and the independent variable(s). The alternative hypothesis is that there is a relationship; in other words, that at least one regression coefficient is non-zero. This hypothesis is tested with the F-statistic, hence the two values. The p-value in our example is very small, which leads us to reject the null hypothesis and conclude that there is strong evidence that a relationship exists between Sepal.Length and Sepal.Width.
The reason for this global test is that if we run multiple individual hypothesis tests on our coefficients, it becomes likely that some variable appears significant purely by chance even though it isn't.
20.5 Model Evaluation Metrics
To evaluate the model's performance and accuracy, evaluation metrics are needed. There are several types of metrics used to evaluate the performance of the model we have built. We will discuss a few of them here -
20.5.1 MAE - Mean Absolute Error
As the name suggests, it is the mean of the absolute values of the errors or residuals. The formula can thus be written as equation (20.6).
\[\begin{equation} {MAE} = \frac{1}{N}\sum_{i = 1}^{N}{\lvert}{y_i - \hat{y_i}}{\rvert} \tag{20.6} \end{equation}\]
Clearly, it is the average size of the residuals, and a larger value denotes a less accurate model. In isolation the MAE is not very useful; however, while fitting the best regression model we can use this metric to compare the performance of several candidate models and choose the better one.
Moreover, once we extract $residuals from the model, calculating the metric is easy. In our example -
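A one-line sketch of that calculation, from the residuals stored in lin_reg1:
# Mean Absolute Error
lin_reg1$residuals |> abs() |> mean()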
## [1] 0.199243
20.5.2 MSE - Mean Square Error and RMSE - Root Mean Square Error
Again, as the name suggests, the mean of the squares of all residuals is the mean square error or MSE. The formula may be written as in equation (20.7).
\[\begin{equation} {MSE} = \frac{1}{N}\sum_{i = 1}^{N}({y_i - \hat{y_i}})^2 \tag{20.7} \end{equation}\]
MSE penalizes the larger residuals by squaring them. It may thus be thought of as a weighted average in which more weight is allocated as the residual value rises. Similar to MAE, we can use this metric to choose the better model out of several candidate models.
Interestingly, by definition it is also the cost function in OLS regression: while finding the parameters, we are actually minimising the MSE. As with MAE, calculating it requires no special skills.
# Mean Square Error
lin_reg1$residuals^2 |> mean()
## [1] 0.0631735
RMSE or root mean square error is square root of MSE.
\[\begin{equation} {RMSE} = \sqrt{\frac{1}{N}\sum_{i = 1}^{N}({y_i - \hat{y_i}})^2} \tag{20.8} \end{equation}\]
In our Example-
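A sketch of the calculation behind the output below:
# Root Mean Square Error
lin_reg1$residuals^2 |> mean() |> sqrt()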
## [1] 0.2513434
20.5.3 MAPE - Mean Absolute Percentage Error
This metric, instead of taking the residual values in isolation, expresses each residual as a percentage of the corresponding actual value. The formula is thus,
\[\begin{equation} {MAPE} = \frac{1}{N}\sum_{i = 1}^{N}{\lvert}\frac{({y_i - \hat{y_i}})}{y_i}\cdot{100}\%{\rvert} \tag{20.9} \end{equation}\]
Clearly, MAPE is independent of the scale of the variables since its error estimates are in terms of percentage. In our example, we can calculate MAPE-
# Mean Absolute Percentage Error
{lin_reg1$residuals/iris$Sepal.Width[1:50]} %>%
{abs(.)*100} %>%
mean(.) %>%
sprintf("MAPE is %1.2f%%", .)
## [1] "MAPE is 5.95%"
20.5.4 R-Squared and adjusted R-Squared
As already discussed, this is the coefficient of determination, or goodness of fit, of the regression. It can be written as equation (20.10).
\[\begin{equation} R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y_i})^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} \tag{20.10} \end{equation}\]
The numerator of the fraction above is also called \(SSE\) or Sum of Squares of Errors; the denominator is also called \(TSS\) or Total Sum of Squares. The ratio \(\frac{SSE}{TSS}\) denotes the ratio of the variance in the errors to the variance (about the mean) in the actual values. Thus \(R^2\) denotes how much variance is explained by the model; clearly, as the error variance or \(SSE\) shrinks towards \(0\), \(R^2\) increases and approaches \(1\), i.e. a perfect model.
We can easily verify the formula from the results obtained.
## Sum of Squares of Errors
y_bar <- mean(iris[1:50,"Sepal.Width"])
(SSE <- sum(lin_reg1$residuals^2))
## [1] 3.158675
(TSS <- sum((iris[1:50,"Sepal.Width"] - y_bar)^2))
## [1] 7.0408
(r_sq <- 1 - SSE/TSS)
## [1] 0.5513756
Now, we can think of this \(R^2\) in one more way, as it is simply the square of the correlation between the actual and predicted values. We can verify once again
(cor(iris[1:50, 'Sepal.Width'], lin_reg1$fitted.values))^2
## [1] 0.5513756
Usually, as we keep adding independent variables to our regression model, \(R^2\) increases, which can easily lead to over-fitting. For this reason, adjusted R-squared is sometimes used, which penalizes R-squared for the number of predictors used in the model.
\[\begin{equation} {Adjusted}\;{ R^2} = 1 - \frac{(1-R^2)(N - 1)}{(N - p - 1)} \tag{20.11} \end{equation}\]
where \(p\) is the number of predictors included in the model. The summary function returns both of these metrics. We can verify the calculation -
1 - ((1-r_sq)*(50-1)/(50-1-1))
## [1] 0.5420292
20.6 Plotting the results and their interpretation
The output of lm can be plotted with the plot command to see six diagnostic plots, one by one, which can be chosen using the which argument. These six plots are -
- Residuals Vs. Fitted Values
- Normal Q-Q
- Scale-Location
- Cook’s distance
- Residuals vs. leverage
- Cook’s distance vs. leverage
Let us see these, for the example above.
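The pairs of plots in Figures 20.7 to 20.9 can be generated with base R's plot method for lm objects; a minimal sketch for the first pair (the remaining pairs use which = 3:4 and which = 5:6):
# First pair of diagnostic plots: Residuals vs Fitted and Normal Q-Q
par(mfrow = c(1, 2))
plot(lin_reg1, which = 1:2)
par(mfrow = c(1, 1))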

Figure 20.7: First two diagnostic plots
- Residuals Vs. Fitted Values: This plot is used to determine whether the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal, then we can assume that the residuals follow a linear pattern. In our example we can see that the red line deviates from a perfectly horizontal line, but not severely. We would likely declare that the residuals follow a roughly linear pattern and that a linear regression model is appropriate for this data-set. This plot is useful to check the first assumption of linear regression, i.e. linearity of the data.
- Normal Q-Q: This plot is used to determine if the residuals of the regression model are normally distributed, which was our another assumption. If the points in this plot fall roughly along a straight diagonal line, then we can assume the residuals are normally distributed.
Moreover, notice that the extreme outlier values impacting our modelling are labeled. We can see that the values from rows 23, 33 and 42 are labeled.

Figure 20.8: Next two diagnostic plots
- Scale-Location: This plot is used to check the assumption of equal variance, i.e. “homoskedasticity” among the residuals in our regression model. If the red line is roughly horizontal across the plot, then the assumption of equal variance is likely met.
- Cook’s Distance: An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is usually associated with a large residual. Not all outliers (or extreme data points) are influential in linear regression analysis. A metric called Cook's distance is used to determine the influence of a value; it defines influence as a combination of leverage and residual size.

Figure 20.9: Last two diagnostic plots
- Residuals vs. leverage: This plot is used to identify influential observations. If any point in this plot falls outside of Cook's distance (the dashed lines), it is an influential observation. A data point has high leverage if it has extreme predictor (x) values.
- The sixth plot, i.e. Cook's distance vs. leverage, is also used to identify extreme values that may be impacting our model.
Though base R's plot command can generate all these plots, we may make use of the performance package to generate all the relevant diagnostic plots beautifully, each with a short interpretation. See -
library(performance)
check_model(lin_reg1)

Figure 20.10: Output from package-performance
The warning is obvious: we cannot have a multi-collinearity problem, as there is only one regressor. Multi-collinearity concerns another assumption we would have made had there been more than one independent variable, namely that the independent variables are not mutually collinear. We will see a detailed explanation in the context of multiple linear regression in a subsequent section.
20.7 Using lm for predictions
The output of lm is actually a list which contains much more information than we saw above. Let us see what it contains -
names(lin_reg1) |>
as.data.frame() |>
setNames('Objects')
## Objects
## 1 coefficients
## 2 residuals
## 3 effects
## 4 rank
## 5 fitted.values
## 6 assign
## 7 qr
## 8 df.residual
## 9 xlevels
## 10 call
## 11 terms
## 12 model
We may extract any of these as per requirement. E.g.
lin_reg1$coefficients
## (Intercept) Sepal.Length
## -0.5694327 0.7985283
There is a package broom, which uses tidy fundamentals to return all this useful information through a single function, augment.
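The tibble shown below was presumably produced along these lines:
library(broom)
# augment() returns actuals, fitted values, residuals and influence measures
augment(lin_reg1)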
## # A tibble: 50 × 8
## Sepal.Width Sepal.Length .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.5 5.1 3.50 -0.00306 0.0215 0.259 0.00000160 -0.0121
## 2 3 4.9 3.34 -0.343 0.0218 0.254 0.0205 -1.35
## 3 3.2 4.7 3.18 0.0163 0.0354 0.259 0.0000772 0.0649
## 4 3.1 4.6 3.10 -0.00380 0.0471 0.259 0.00000568 -0.0152
## 5 3.6 5 3.42 0.177 0.0200 0.258 0.00495 0.696
## 6 3.9 5.4 3.74 0.157 0.0455 0.258 0.00940 0.628
## 7 3.4 4.6 3.10 0.296 0.0471 0.255 0.0346 1.18
## 8 3.4 5 3.42 -0.0232 0.0200 0.259 0.0000853 -0.0914
## 9 2.9 4.4 2.94 -0.0441 0.0803 0.259 0.00140 -0.179
## 10 3.1 4.9 3.34 -0.243 0.0218 0.257 0.0103 -0.959
## # ℹ 40 more rows
Let’s also visualise the predicted values vis-a-vis actual values/residuals in Figure 20.11.


Figure 20.11: Predicted Vs. Actual Values (Left) and Residuals (Right)
So if we have to predict the output for new data, we just need to ensure that the data is in exactly the same format as the regressors; then we can use predict from base R directly. See this example.
new_vals <- rnorm(10, 5, 1) |>
as.data.frame() |>
setNames('Sepal.Length')
predict(lin_reg1, new_vals)
## 1 2 3 4 5 6 7 8
## 3.416249 3.552228 2.691731 5.185508 2.801470 3.523949 4.037588 2.183525
## 9 10
## 4.190880 3.667629
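If we also want uncertainty around these predictions, predict() for lm objects accepts an interval argument; for example:
# Point predictions together with 95% prediction intervals
predict(lin_reg1, new_vals, interval = "prediction")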
20.8 Multiple Linear Regression
As the name suggests, multiple linear regression is the model in which multiple independent variables may have a linear relationship with the dependent variable. In this case, the regression equation will be -
\[\begin{equation} y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + ... + \epsilon \tag{20.12} \end{equation}\]
where \(x_1\), \(x_2\), \(x_3\), …, \(x_n\) are the \(n\) independent variables and \(y\) is the dependent variable as usual.
It may be clear that this equation represents a plane if there are two regressors. If there are \(n\) regressors, the equation represents a hyperplane in \((n+1)\)-dimensional space. However, visualizing the variables and the relationships between them can be a bit tricky when there are multiple variables; we have to decide which type of visualization suits the requirement in that case.
Now, as already stated, there is an additional assumption that all the independent variables are mutually independent too, i.e. there is no multi-collinearity between them. Let's build an example model again. We just have to add the predictors (independent variables) using the + operator in the formula call.
Problem Statement: Example-2. Let's try to establish the relationship between cars' mileage (the mpg variable in the mtcars data-set) and engine displacement disp, horse power hp and weight wt. First of all, let's visualise the individual relationships between the variables through three plots, as in Figure 20.12.



Figure 20.12: Relationship between regressors and outcome variables
Let’s also visualise the correlation between all these variables. See Figure 20.13.

Figure 20.13: Correlation between the variables in Example-2
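The numeric correlations behind a plot like Figure 20.13 can also be obtained directly with cor(); a minimal sketch:
# Pairwise correlations among the outcome and the three candidate predictors
round(cor(mtcars[, c("mpg", "disp", "hp", "wt")]), 2)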
Now let’s build the model.
data <- mtcars[, c("mpg", "disp", "hp", "wt")]
lin_reg2 <- lm(mpg ~ .,
data = data)
summary(lin_reg2)
##
## Call:
## lm(formula = mpg ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## disp -0.000937 0.010350 -0.091 0.92851
## hp -0.031157 0.011436 -2.724 0.01097 *
## wt -3.800891 1.066191 -3.565 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
In the formula call, please note that I have used . instead of naming all the variables. In fact, . is shorthand for saying that all variables except the one mentioned as the y variable will be our independent variables/predictors. In the results, there is one slope coefficient for each predictor and one intercept term. The output equation can be written as equation (20.13).
\[\begin{equation} {mpg} = -0.000937\cdot{disp} - 0.031157\cdot{hp} - 3.800891\cdot{wt} + 37.105505 \tag{20.13} \end{equation}\]
Interpretation: Notice that all slopes are negative, meaning that mileage drops with an increase in each of weight, displacement and horsepower. Interpreting the equation works along similar lines: keeping the other variables constant, an increase of 1 unit in disp will result in a decrease in mileage of 0.000937 miles per gallon. Keeping the other variables constant is important here; of course, from the above equation we may deduce that if the other factors change too, the effect on the response variable would be different.
Results: Analysing the results, we may notice that our model explains about 83% of the variance in the data. Of course, the adjusted R-squared is lower, which suggests that adding extra variables may have inflated the R-squared. The global p-value is highly significant, which means that at least one of the coefficients is non-zero. The least significant coefficient is that of disp, which we can remove before re-running the model to check the parameters again, as sketched below.
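A minimal sketch of that refit, using update() to drop disp (the object name lin_reg2a is ours):
# Drop disp and refit the model with the remaining predictors
lin_reg2a <- update(lin_reg2, . ~ . - disp)
summary(lin_reg2a)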
20.9 Including categorical or factor variables in lm
By now we have seen that linear regression is useful for predicting numerical output and through the examples we have seen that our regressors were numerical too. But what if there’s an input variable which is categorical or nominal?
In such a case, we have to ensure that the categorical variable is of type factor before proceeding to build the model.
Problem Statement: Example-3. Let's try to predict \({Sepal.Width}\) from \({Species}\) in the iris data-set. This time we will take the complete data-set. Since we know that Species is already of factor type, we need not convert it. Before moving on, let's visualize the relation between the two variables using ggplot2; box-plots are best suited here.
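A plot like Figure 20.14 can be produced along these lines:
ggplot(iris, aes(Species, Sepal.Width)) +
  # One box per species
  geom_boxplot() +
  theme_bw()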

Figure 20.14: Species Vs. Sepal Width
Now let’s build the model.
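The formula matches the Call line in the output below; the hidden chunk was presumably equivalent to:
summary(lm(Sepal.Width ~ Species, data = iris))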
##
## Call:
## lm(formula = Sepal.Width ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.128 -0.228 0.026 0.226 0.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42800 0.04804 71.359 < 2e-16 ***
## Speciesversicolor -0.65800 0.06794 -9.685 < 2e-16 ***
## Speciesvirginica -0.45400 0.06794 -6.683 4.54e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3397 on 147 degrees of freedom
## Multiple R-squared: 0.4008, Adjusted R-squared: 0.3926
## F-statistic: 49.16 on 2 and 147 DF, p-value: < 2.2e-16
In the results we now get one intercept and two slopes for the single regressor Species. So what happened?
Interpretation: The factor data type requirement for a categorical regressor arises because the factor variable is encoded as a dummy variable for each of the categories available in it. A dummy variable is numeric, so we can run linear regression as before. The first category becomes the baseline level. Since Species has three levels, it has been encoded into two dummy variables. To see what happened behind the scenes, we may use the contrasts function.
contrasts(iris$Species)
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1
(The full dummy-coded design matrix can also be inspected with model.matrix.)
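A small sketch showing one row from each species, so the dummy coding is visible:
# Design matrix for the Species factor (rows 1, 51, 101: one row per species)
model.matrix(~ Species, data = iris)[c(1, 51, 101), ]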
It is clear that when versicolor is 1, the other variable is 0, and vice versa. Obviously, when both are 0 it means that the Species is setosa, which is why no separate slope for it is present in the output. Now we can write our regression line equation as (20.14)
\[\begin{equation} {Sepal.Width} = 3.428 + (-0.658)\cdot{Speciesversicolor} + (-0.454)\cdot{Speciesvirginica} \tag{20.14} \end{equation}\]
Interpreting the above equation is now easy. For each versicolor, the predicted Sepal.Width is 3.428 - 0.658, or 2.77. Obviously, when both dummy variables are zero, the predicted Sepal.Width equals the intercept; thus, we can conclude that the intercept is nothing but the prediction for the baseline level. In fact, we can subtract a 1 in the formula (removing the intercept) to obtain these interpreted results directly.
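The Call line in the output below confirms the formula used; the code was presumably:
summary(lm(Sepal.Width ~ Species - 1, data = iris))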
##
## Call:
## lm(formula = Sepal.Width ~ Species - 1, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.128 -0.228 0.026 0.226 0.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Speciessetosa 3.42800 0.04804 71.36 <2e-16 ***
## Speciesversicolor 2.77000 0.04804 57.66 <2e-16 ***
## Speciesvirginica 2.97400 0.04804 61.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3397 on 147 degrees of freedom
## Multiple R-squared: 0.9881, Adjusted R-squared: 0.9879
## F-statistic: 4083 on 3 and 147 DF, p-value: < 2.2e-16
Notice that other two coefficients have now been adjusted automatically.
Results: Observing Figure 20.14, we may notice that the coefficients are nothing but the mean sepal widths for each species, which is obvious and logical too. Those interested in seeing the equation (without intercept) can also write one, as in equation (20.15).
\[\begin{equation} {Sepal.Width} = 3.428\cdot{Speciessetosa} + (2.77)\cdot{Speciesversicolor} + (2.974)\cdot{Speciesvirginica} \tag{20.15} \end{equation}\]
Since these coefficients are not slopes in the true sense, we will refer to them as intercepts in the next examples/sections.
20.9.1 Parallel slopes regression
To understand how a categorical regressor behaves when there are other numerical variables in the regression model, let us build a model step by step.
Problem statement: Example-4. We are taking the mpg data-set included by default with the ggplot2 package. This data-set contains fuel economy data from 1999 to 2008 for 38 popular models of cars. Let us predict highway mileage hwy from engine displacement displ and the year of the model year. Let us first visualize the variables. From Figure 20.15 it is clear that highway mileage is linearly associated with displacement.
ggplot(mpg, aes(hwy, displ)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
facet_wrap(.~ year) +
theme_bw()

Figure 20.15: Highway Mileage Vs. Displacement over cars manufactured in 1998 Vs. 2008
Step-1: Let us first include a single numerical variable, displ, to predict highway mileage hwy, and examine the coefficients.
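The hidden chunk behind the coefficients below was presumably along these lines:
# Simple regression of highway mileage on displacement
coef(lm(hwy ~ displ, data = mpg))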
## (Intercept) displ
## 35.697651 -3.530589
We got one intercept and one slope coefficient. The related visualisation is in Figure 20.16 (left).
Step-2: Now let us try to predict mileage on the basis of the year of manufacture only. As already stated, we have to convert it into a factor, which we can do directly in the formula. Also note that we are subtracting \(1\) in the formula, which removes the overall intercept and instead reports an explicit coefficient for each level, starting with the baseline. Now see the coefficients -
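A sketch of the corresponding call:
# Year treated as a factor; -1 drops the intercept so each year gets its own coefficient
coef(lm(hwy ~ factor(year) - 1, data = mpg))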
## factor(year)1999 factor(year)2008
## 23.42735 23.45299
Notice that we got an intercept for each of the categories available in the factor variable. From the plot in Figure 20.16 (right) we can see that these intercepts are nothing but the means for each category.


Figure 20.16: Mileage vs. Displacement (Left) and Year (right)
Step-3: Now we will build the complete model by including both variables together; and examine the coefficients.
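A sketch of the corresponding call:
# Parallel slopes model: common slope for displ, separate intercept per year
coef(lm(hwy ~ displ + factor(year) - 1, data = mpg))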
## displ factor(year)1999 factor(year)2008
## -3.610986 35.275706 36.677842
We can see one slope coefficient for the numerical variable and a different intercept for each of the years. Try visualising these: in fact there are two parallel lines (same slope). Refer to Figure 20.17.

Figure 20.17: Parallel Slopes
This is also evident, if we write out the equation (20.16).
\[\begin{equation} {hwy} = -3.610986\cdot{displ} + 35.275706\cdot{year1999} + 36.677842\cdot{year2008} \tag{20.16} \end{equation}\]
One of the dummy variables will be 0 and the other 1, so the variable with a 1 will act as the intercept term, but the slope term remains the same. In other words, whatever the year, the mileage varies with displacement at the same rate. In actual circumstances this is rare: the rate of change of the response variable usually changes as the factor variable (regressor) changes its value. So how do we incorporate these changes into our model? The answer is interaction, which is discussed in the next section.
20.9.2 Extending multiple linear regression by including interactions
The parallel slopes model we saw in the previous section enforces a common slope for every category. That's not always the best option.
Problem Statement: In the same Example-4 (earlier section), we can introduce an interaction between the two predictors using the shorthand operator : in the formula call. See the following example -
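The model summarised later as new_model was presumably fitted along these lines (the formula matches the Call shown further below):
# Add an interaction between displ and the year factor; -1 drops the overall intercept
new_model <- lm(hwy ~ displ + factor(year) + displ:factor(year) - 1, data = mpg)
coef(new_model)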
## displ factor(year)1999 factor(year)2008
## -3.768406 35.792231 36.136730
## displ:factor(year)2008
## 0.305168
Now notice an extra coefficient which, though small, is the change in slope when moving from the baseline category to the year-2008 category. This means, in fact, that the model has a different slope for each category; refer to Figure 20.18.

Figure 20.18: Changing Slopes with interaction
Interpreting these models is now not as difficult as it seemed earlier. Let's generate the summary first.
summary(new_model)
##
## Call:
## lm(formula = hwy ~ displ + factor(year) + displ:factor(year) -
## 1, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8595 -2.4360 -0.2103 1.6037 15.3677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## displ -3.7684 0.2788 -13.517 <2e-16 ***
## factor(year)1999 35.7922 0.9794 36.546 <2e-16 ***
## factor(year)2008 36.1367 1.0492 34.442 <2e-16 ***
## displ:factor(year)2008 0.3052 0.3882 0.786 0.433
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.784 on 230 degrees of freedom
## Multiple R-squared: 0.9759, Adjusted R-squared: 0.9755
## F-statistic: 2332 on 4 and 230 DF, p-value: < 2.2e-16
We may now write our equation as equation (20.17).
\[\begin{equation} {hwy} = -3.7684\cdot{displ} + 35.7922\cdot{year1999} + 36.1367\cdot{year2008} + 0.3052\cdot{displ}\cdot{year2008} \tag{20.17} \end{equation}\]
Clearly, when the year is 2008, the slope for displ changes by 0.3052.
We can add as many interactions as we like using the shorthand :; however, when there are many interactions we may make use of another shorthand operator, *. So x*z means x + z + x:z, and x*z*w means x + z + w + x:z + x:w + z:w + x:z:w. This obviously doesn't change the model we could have spelled out in full, but it saves us a lot of typing.
20.10 Multi-collinearity and Variance Inflation Factor (VIF)
Multi-collinearity indicates a strong linear relationship among the predictor variables. This can create challenges in the regression analysis because it becomes difficult to determine the individual effects of each independent variable on the dependent variable accurately. Multi-collinearity can lead to unstable and unreliable coefficient estimates, making it harder to interpret the results and draw meaningful conclusions from the model. It is essential to detect and address multi-collinearity to ensure the validity and robustness of regression models.
But why does it pose a problem in regression analysis? Multi-collinearity means that one independent variable can be predicted from another, which in turn means that the independent variables are no longer independent.
Multi-collinearity can be detected using several different methods. One method is to use correlation plots, as shown in the next section. Another is the Variance Inflation Factor or VIF. VIF measures the strength of the correlation between the independent variables. It is computed by taking a variable and regressing it against every other predictor. In other words, the VIF score of an independent variable represents how well that variable is explained by the other independent variables.
\[\begin{equation} {VIF} = \frac{1}{1-R^2} \tag{20.18} \end{equation}\]
So the closer the \(R^2\) value is to \(1\), the higher the \({VIF}\).
- \(VIF\) starts at \(1\) and has no upper limit
- \({VIF} = 1\) means there is no correlation between the independent variable and the other variables
- \({VIF}\) exceeding \(5\) indicates high multi-collinearity between that independent variable and others.
Again, in R we do not have to calculate \({VIF}\) manually for each variable. We can calculate it using the vif() function of the car package. Let's use it on another model built on the mtcars data, where the mpg variable is predicted using all the other variables. Instead of naming all the other variables, we will again use the shorthand operator . (see the sketch below).
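The model object mod used below was presumably created along these lines:
library(car)
# mpg regressed on every other variable in mtcars via the . shorthand
mod <- lm(mpg ~ ., data = mtcars)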
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
car::vif(mod)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
We may notice high collinearity among some of these predictors, as indicated in the output above.
20.11 One Complete Example
Problem Statement: Example-5. The data pertain to inter-country life-cycle savings over 1960-1970. Under the life-cycle savings hypothesis as developed by Franco Modigliani, the savings ratio (aggregate personal saving divided by disposable income) sr is explained by per-capita disposable income dpi, the percentage rate of change in per-capita disposable income ddpi, and two demographic variables: the percentage of the population less than 15 years old pop15 and the percentage of the population over 75 years old pop75. The data are averaged over the decade 1960–1970 to remove the business cycle or other short-term fluctuations.
Let's try linear regression on the LifeCycleSavings data.
# Visualise first 6 rows
head(LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
# Build a model
ex1 <- lm(sr ~ . , data = LifeCycleSavings)
In the formula call above, notice the shorthand . operator, which here means that all variables other than the y variable are treated as input variables. See its output -
summary(ex1)
##
## Call:
## lm(formula = sr ~ ., data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
Multiple R-squared is about 0.34, which means nearly 34% of the variability is explained by the linear model. Let us see some diagnostic plots (Figure 20.19).
performance::check_model(ex1)

Figure 20.19: Diagnostic Plots
We may notice some multi-collinearity between the two population variables. See Figures 20.19 and 20.20.
LifeCycleSavings %>%
ggplot(aes(pop15, pop75)) +
geom_point()+
geom_smooth(method = 'lm', formula = 'y~x') +
theme_bw()

Figure 20.20: Are x and y correlated?
This can also be verified by corrplots in figure 20.21.

Figure 20.21: Correlation Plots
Let us see the VIF among the predictors.
car::vif(ex1)
## pop15 pop75 dpi ddpi
## 5.937661 6.629105 2.884369 1.074309
Thus, the model could be tuned better by removing this multi-collinear variable. Let us try removing pop75 and re-running the model.
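The Call line in the output below confirms the refitted formula; a sketch:
summary(lm(sr ~ pop15 + dpi + ddpi, data = LifeCycleSavings))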
##
## Call:
## lm(formula = sr ~ pop15 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6889 -2.8813 0.0296 1.7989 10.4330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.2771687 4.3888974 4.392 6.53e-05 ***
## pop15 -0.2883861 0.0945354 -3.051 0.00378 **
## dpi -0.0008704 0.0008795 -0.990 0.32755
## ddpi 0.3929355 0.1989390 1.975 0.05427 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.862 on 46 degrees of freedom
## Multiple R-squared: 0.3026, Adjusted R-squared: 0.2572
## F-statistic: 6.654 on 3 and 46 DF, p-value: 0.0007941
Notice that the multiple R-squared has now reduced to about 30%. Let us also try removing the variable dpi, whose coefficient is not significant.
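Again, a sketch matching the Call shown below:
summary(lm(sr ~ pop15 + ddpi, data = LifeCycleSavings))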
##
## Call:
## lm(formula = sr ~ pop15 + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5831 -2.8632 0.0453 2.2273 10.4753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.59958 2.33439 6.682 2.48e-08 ***
## pop15 -0.21638 0.06033 -3.586 0.000796 ***
## ddpi 0.44283 0.19240 2.302 0.025837 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.861 on 47 degrees of freedom
## Multiple R-squared: 0.2878, Adjusted R-squared: 0.2575
## F-statistic: 9.496 on 2 and 47 DF, p-value: 0.0003438
The multiple R-squared has dropped slightly, to around 29%, while the adjusted R-squared is essentially unchanged. Let us try to visualise the relationship through a scatter plot.
LifeCycleSavings %>%
ggplot(aes(pop15, sr)) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y ~ x', se = FALSE) +
theme_bw()

Figure 20.22: Ascertaining linear relationship
Clearly, the relationship between the predictor and the response variable is not very strong, hence these modest results.
20.12 Simpson’s paradox
Edward Hugh Simpson, a statistician and former cryptanalyst at Bletchley Park, described this statistical phenomenon in a paper in 1951. It is a classic example of how regressions that omit necessary terms can be misleading. At its core, the paradox arises when a trend that appears in different subgroups of data is either reversed or disappears when the subgroups are combined. This seemingly counter-intuitive occurrence can lead to misleading conclusions and thus underscores the importance of careful analysis and interpretation of data.
Problem Statement: Example-6. Here we use data from palmerpenguins, where we try to establish the relationship between the penguins' bill depths and lengths, i.e. bill_depth_mm and bill_length_mm. See the plots in Figure 20.23.
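The two panels of Figure 20.23 can be reproduced along these lines (axis choice assumed; rows with missing bill measurements are dropped with a warning):
library(palmerpenguins)
# Aggregated view: a single regression line across all species (left panel)
ggplot(penguins, aes(bill_length_mm, bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  theme_bw()
# Species exposed: one regression line per species (right panel)
ggplot(penguins, aes(bill_length_mm, bill_depth_mm, colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  theme_bw()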


Figure 20.23: Regression having variable hidden (left) and exposed (right)
In the left plot a negative relationship is seen, but when the lurking variable is exposed (right), we can see a positive relationship within each of the species. Thus, species is a significant confounding variable when assessing the linear relationship here, and before finalizing a model we have to be sure that we are not missing any important variable. To avoid incorrect results due to an underlying Simpson's Paradox, we must:
Identify Confounding Variables: Be vigilant in identifying potential confounding variables that could affect the relationship between the variables under study.
Consider All Levels: Analyze data at different levels, including subgroup and aggregate levels, to gain a comprehensive understanding of the relationship.
Utilise Statistical Techniques: such as regression analysis or propensity score matching, to control for confounding variables and obtain more accurate insights.
Transparent Reporting: Clearly report the methodology, assumptions, and limitations of the analysis to ensure that others can critically evaluate the findings.
Simpson’s Paradox, thus, serves as a powerful reminder that data analysis is an intricate process that requires careful consideration of underlying factors.
20.13 Conclusion and Final thoughts
In the sections above, we learned techniques of regression analysis, which is a powerful and useful tool in data analytics while auditing. We may use regression analysis, inter alia, for -
Detection of Anomalies and Outliers: Regression analysis can help auditors in identifying anomalies, outliers, or unexpected patterns in financial data. Unusual relationships between variables can signal potential errors, fraud, or irregularities that require further investigation.
Risk Assessment: By analyzing the relationships between various financial or operational variables, regression analysis can assist auditors in assessing the level of risk associated with different aspects of an organization’s operations. This helps auditors prioritize their efforts and allocate resources effectively.
Control Testing: Regression analysis can aid us in testing the effectiveness of internal controls within an organization. By examining the relationship between control variables and outcomes, auditors can assess whether controls are functioning as intended. We can also use regression analysis to compare an organization’s financial performance against industry benchmarks or similar companies. Deviations from expected relationships can highlight areas that warrant closer examination.