
Statistics | Regression Diagnostics

Posted on: February 22, 2025


1. For Linear Regression

1.1. Assumptions for LR

1.2. Residuals-Fitted Values Plot

A “healthy” plot should meet the following criteria:

1) Residuals should bounce randomly above and below 0 (i.e., symmetrically) >>> Linearity. A systematic pattern, such as a curve, indicates a violation and suggests that the relationship between Y and the Xs might not be linear.

2) Residuals should fall within a horizontal band centered around 0 >>> Equal variances. A band that widens or narrows as the fitted values change (e.g., a funnel shape) indicates a violation and suggests that the error variance might not be constant.

3) No residuals deviate drastically from the basic pattern >>> No outliers.

Remark: You can also plot a series of scatterplots with the residuals on the y-axis and each predictor on the x-axis (i.e., Residuals-Predictors Plots). These plots work like the Residuals-Fitted Values Plot, and the judging criteria are the same. A violation in at least one plot suggests that a transformation may be needed.

In R, we can use plot(model, which = 1) to generate the Residuals-Fitted Values Plot, where model is a fitted lm object.
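As a minimal sketch of what a linearity violation looks like, the snippet below fits a straight line to simulated data with a quadratic trend. The data, the seed, and the variable names are illustrative assumptions, not part of any real analysis.

```r
# Simulated data (hypothetical): y depends on x quadratically,
# so a straight-line fit violates the linearity assumption.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100)

model <- lm(y ~ x)

# Residuals-Fitted Values Plot drawn by plot.lm()
plot(model, which = 1)

# The same plot built by hand
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With these data the residuals should trace a U shape around the zero line rather than bouncing randomly.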

1.3. Plot Residuals Only

Plot a histogram, boxplot, or normal probability plot of the residuals to check for Normality.

Normal probability plot: If the data follow a normal distribution, then a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles should be approximately linear. This plot is also called a normal “Q-Q plot.”

Some indications for violations:

1) Skewed residuals

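On the Q-Q plot, right-skewed residuals mainly appear as sample quantiles that are larger than expected at both ends, so the points curve upward away from the reference line. Left-skewed residuals show the mirror image, with smaller sample quantiles at both ends.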

2) Heavy-tailed residuals

On the Q-Q plot, heavy tails mainly appear as sample quantiles that are smaller than expected on the far left and larger than expected on the far right, giving the points an S shape around the reference line.

3) Normal residuals but with one outlier

In R, assuming model is a fitted lm object:

# Q-Q plot of residuals
plot(model, which = 2)

# Histogram of residuals
hist(residuals(model),
     main = "Histogram of Residuals", xlab = "Residuals")

# Boxplot of residuals
boxplot(residuals(model),
        main = "Boxplot of Residuals", ylab = "Residuals", horizontal = TRUE)

1.4. Identify Influential Points

Influential data points can affect any part of the regression analysis, leading to significantly different results depending on whether they are included. There are two potential sources of influence: 1) high-leverage points (i.e., points with extreme X values) and 2) outliers (i.e., points whose Y values do not follow the general trend of the rest of the data).

High-leverage points and outliers are not necessarily influential; they only have the potential to be. Therefore, further analysis is needed, such as comparing regressions with and without these points. This requires a two-step process: (1) identifying high-leverage points and outliers, and (2) assessing their actual influence.

Identify High-Leverage Points
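One common way to screen for high-leverage points in R is through the hat values returned by hatvalues(). The sketch below uses simulated data, and the cutoff of 3p/n (with p the number of coefficients) is a frequently quoted rule of thumb that is assumed here, not a strict threshold.

```r
# Simulated data (hypothetical): the last observation has an extreme x
set.seed(3)
x <- c(rnorm(29), 10)
y <- 1 + 2 * x + rnorm(30)
model <- lm(y ~ x)

h <- hatvalues(model)        # leverage h_ii of each observation
p <- length(coef(model))     # number of coefficients, intercept included
n <- nobs(model)

cutoff <- 3 * p / n          # rule-of-thumb cutoff (an assumption)
which(h > cutoff)            # indices of flagged observations
```

The hat values always sum to p, so the cutoff can be read as three times the average leverage.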

Identify Outliers

$$
r_i = \frac{e_i}{se(e_i)} = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}
$$
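The formula above can be checked against R's rstandard(), which returns these (internally) studentized residuals. The data are simulated, and the |r_i| > 3 flag is a rule of thumb assumed for this sketch.

```r
# Simulated data (hypothetical) with one shifted response value
set.seed(4)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
y[5] <- y[5] + 10            # make observation 5 an outlier
model <- lm(y ~ x)

e <- residuals(model)        # raw residuals e_i
h <- hatvalues(model)        # leverages h_ii
sigma_hat <- sigma(model)    # residual standard error

# Studentized residuals computed from the formula
r_manual <- e / (sigma_hat * sqrt(1 - h))

# Should agree with the built-in function
all.equal(unname(r_manual), unname(rstandard(model)))

# Flag observations with large studentized residuals (assumed cutoff)
which(abs(r_manual) > 3)
```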

Cook’s Distance
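A minimal sketch of the second step (assessing actual influence), using cooks.distance() from base R on simulated data. The cutoff D_i > 1 is a commonly cited rule of thumb and is an assumption here. The refit without the flagged points follows the "with and without" comparison described earlier.

```r
# Simulated data (hypothetical): observation 30 has an extreme x
# and a y value far off the trend
set.seed(5)
dat <- data.frame(x = c(rnorm(29), 8))
dat$y <- 1 + 2 * dat$x + rnorm(30)
dat$y[30] <- dat$y[30] - 15

model <- lm(y ~ x, data = dat)

d <- cooks.distance(model)   # Cook's distance of each observation
plot(model, which = 4)       # Cook's distance plot from plot.lm()

flagged <- d > 1             # rule-of-thumb cutoff (an assumption)
which(flagged)

# Compare the fits with and without the flagged observations
model_wo <- lm(y ~ x, data = dat[!flagged, ])
rbind(with_all = coef(model), without_flagged = coef(model_wo))
```

A large change in the coefficients between the two fits indicates that the flagged points are actually influential.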

1.5. Residuals-Time Plot (Optional)

The plot is used to detect non-independence. It is only appropriate if you know the order in which the data were collected. In the plot, the y-axis shows the residuals and the x-axis shows time (or another indicator of ordering). Here is a plot suggesting independence (i.e., residuals bounce randomly around the residual = 0 line):

The following plot shows a time trend:
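The sketch below draws a Residuals-Time Plot for simulated data whose errors are serially correlated. The AR(1) coefficient of 0.8 and all variable names are assumptions chosen for illustration.

```r
# Simulated data (hypothetical) with AR(1) errors, in collection order
set.seed(6)
n <- 100
time <- 1:n
x <- rnorm(n)
err <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 1 + 2 * x + err

model <- lm(y ~ x)
res <- residuals(model)

# Residuals against collection order
plot(time, res, type = "b",
     xlab = "Time order", ylab = "Residuals")
abline(h = 0, lty = 2)

# Lag-1 correlation of the residuals as a rough numeric check
lag1_cor <- cor(res[-1], res[-n])
lag1_cor
```

Long runs of residuals on the same side of zero, together with a clearly positive lag-1 correlation, point to non-independence.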

2. For Logistic Regression

2.1. Residuals

In logistic regression diagnostics, residuals also play a central role. However, the calculation of residuals in logistic regression differs from that in linear regression. Below, we introduce two commonly used residuals in logistic regression.

Deviance residuals:

$$
d_i = s_i\sqrt{-2\bigl(y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\bigr)}
$$

Pearson residuals:

$$
r_i = \frac{y_i-\hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}
$$
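Both residuals can be computed by hand and compared with residuals() for a fitted glm object. The sketch uses simulated binary data and takes s_i to be +1 when y_i = 1 and -1 when y_i = 0, the usual sign convention, which is stated here as an assumption since it is not defined above.

```r
# Simulated binary data (hypothetical)
set.seed(7)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)               # estimated probabilities

# Deviance residuals from the formula; s is the assumed sign term
s <- ifelse(y == 1, 1, -1)
d_manual <- s * sqrt(-2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat)))

# Pearson residuals from the formula
r_manual <- (y - p_hat) / sqrt(p_hat * (1 - p_hat))

# Both should agree with the built-in residuals
all.equal(unname(d_manual), unname(residuals(fit, type = "deviance")))
all.equal(unname(r_manual), unname(residuals(fit, type = "pearson")))
```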

2.2. For Goodness-Of-Fit

1) GOF Test with Deviance

$$
H_0: \text{Model fits data well} \\
H_1: \text{Saturated Model}
$$
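A sketch of how this test could be run in R, assuming grouped binomial data with reasonably large group sizes so that the residual deviance is approximately chi-square distributed under H_0. The dose levels, group sizes, and coefficients are made up for the simulation.

```r
# Simulated grouped data (hypothetical): 8 dose levels, 50 trials each
set.seed(8)
dose <- 1:8
n_i <- rep(50, 8)
successes <- rbinom(8, n_i, plogis(-2 + 0.5 * dose))

fit <- glm(cbind(successes, n_i - successes) ~ dose, family = binomial)

G2 <- deviance(fit)          # residual deviance (test statistic)
df <- df.residual(fit)       # number of groups minus number of coefficients

# Upper-tail chi-square probability; a small value is evidence
# against H0 (the fitted model) in favor of the saturated model
p_value <- pchisq(G2, df = df, lower.tail = FALSE)
c(deviance = G2, df = df, p_value = p_value)
```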

3) GOF Test with Hosmer & Lemeshow statistic
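The Hosmer-Lemeshow statistic can be sketched in base R as follows. This version groups observations into g = 10 bins by deciles of the fitted probabilities and refers the statistic to a chi-square distribution with g - 2 degrees of freedom. The data are simulated, and the choice of g and the variable names are assumptions.

```r
# Simulated ungrouped binary data (hypothetical)
set.seed(9)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Group observations by deciles of the fitted probabilities
g <- 10
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = g + 1))
group <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

obs  <- tapply(y, group, sum)        # observed events per group
expd <- tapply(p_hat, group, sum)    # expected events per group
n_g  <- tapply(y, group, length)     # group sizes

# Hosmer-Lemeshow statistic and its approximate p-value
hl_stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
p_value <- pchisq(hl_stat, df = g - 2, lower.tail = FALSE)
c(statistic = hl_stat, df = g - 2, p_value = p_value)
```

With tied fitted probabilities (for example, only categorical predictors) the decile breaks may not be unique, so this sketch assumes a continuous predictor.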

2.3. Residual Plotting

Some say that residual plots are nearly useless when the $n_i$ are small or when the data are ungrouped (source), while others say that residual plots are still useful in this case (source).

Pearson residuals/Deviance residuals vs. Index
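A minimal sketch of these index plots, using simulated ungrouped data and the residual types available from residuals() for a glm object. Variable names are illustrative.

```r
# Simulated binary data (hypothetical)
set.seed(10)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

res_pearson  <- residuals(fit, type = "pearson")
res_deviance <- residuals(fit, type = "deviance")

# Residuals against observation index, one panel per residual type
op <- par(mfrow = c(1, 2))
plot(res_pearson, xlab = "Index", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
plot(res_deviance, xlab = "Index", ylab = "Deviance residuals")
abline(h = 0, lty = 2)
par(op)
```

Points that sit far from the bulk of the residuals in either panel are candidates for a closer look.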

3. References