# Multicollinearity

**Multicollinearity** is a phenomenon in which two or more predictor variables in a multiple linear regression model are highly correlated. It can make the estimated regression coefficients unreliable and unstable, and can lead to incorrect conclusions and inferences.

The presence of multicollinearity can be detected in a few ways:

- By inspecting the correlation matrix of the predictor variables: a pair with a correlation coefficient above roughly 0.8 in absolute value is a warning sign.
- By computing the variance inflation factor (VIF), which measures how much the variance of an estimated regression coefficient is inflated by multicollinearity. A VIF above 10 is a common rule-of-thumb threshold.
- By examining the condition index, which is computed from the singular values of the design matrix and measures near-linear dependence among the predictors. A condition index above 30 is commonly taken to indicate multicollinearity.
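As a sketch of the VIF diagnostic, each predictor can be regressed on the remaining ones using plain NumPy; the data below are synthetic and the `vif` helper is illustrative, not a standard library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: x2 is almost a linear function of x1, so the
# first two columns are nearly collinear; x3 is independent.
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

vifs = vif(X)
print(vifs)  # the first two VIFs are large, the third is near 1
```

Because the first two predictors are nearly collinear by construction, their VIFs come out far above the rule-of-thumb threshold of 10, while the independent third predictor stays close to 1.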

Multicollinearity can be addressed by dropping one of the correlated predictors, by combining them into a single predictor, or by regularization techniques such as ridge regression.
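The ridge-regression remedy mentioned above can be sketched in closed form; the data here are synthetic and the penalty value `lam = 10.0` is an arbitrary illustration, not a recommended setting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly collinear predictors and a response y ≈ x1 + x2 + noise.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)      # ordinary least squares (lam = 0)
beta_ridge = ridge(X, y, 10.0)   # shrunken, more stable coefficients

print(beta_ols, beta_ridge)
```

With nearly duplicate predictors the OLS solution splits the effect between the two coefficients almost arbitrarily; the ridge penalty shrinks the solution toward zero, so the ridge coefficient vector always has a smaller norm than the OLS one.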

## Example of Multicollinearity

An example of multicollinearity is when predicting house prices with square footage and number of bedrooms. In this case, the two predictors are highly correlated since larger houses tend to have more bedrooms. This can lead to unreliable and unstable estimates of regression coefficients and incorrect conclusions.

The presence of multicollinearity in this example can be detected by looking at the correlation coefficient between square footage and number of bedrooms, which would likely be very high. It can also be detected by looking at the VIF and condition index, which would both be high.

To address this multicollinearity, one could drop one of the predictors, combine them into a single predictor, or use regularization techniques such as ridge regression.
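The detection step in this example can be illustrated on hypothetical house data; the numbers below (square footage, bedrooms scaling roughly with size) are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical house data: bedroom count scales roughly with square
# footage, so the two predictors are strongly correlated.
sqft = rng.uniform(800, 3500, size=150)
bedrooms = np.round(sqft / 700 + rng.normal(scale=0.4, size=150))

r = np.corrcoef(sqft, bedrooms)[0, 1]
print(r)  # well above the 0.8 warning threshold
```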

## Formula of Multicollinearity

The degree of multicollinearity between two variables is commonly measured with the correlation coefficient, given by the following equation:

\[\rho = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}\]

Where \(\rho\) is the correlation coefficient, \(x_i\) and \(y_i\) are the values of the two variables, and \(\bar{x}\) and \(\bar{y}\) are their means. This equation measures the correlation between two variables; a value of \(\rho\) close to 1 (or −1) indicates strong multicollinearity.
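The formula above translates directly into code; the small data set here is made up purely to exercise it, and the result agrees with NumPy's built-in `corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Correlation coefficient exactly as in the formula above."""
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2x, so rho is near 1

rho = pearson(x, y)
print(round(rho, 4))
```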

## When to check for Multicollinearity

Multicollinearity should be checked for whenever a multiple linear regression model is being fitted. Diagnostics such as the correlation matrix, VIF, and condition index detect relationships between the predictor variables and help ensure that the estimates of the regression coefficients are reliable and stable. They can also identify potential sources of bias in a model.

## Types of Multicollinearity

There are three types of multicollinearity: perfect multicollinearity, near multicollinearity, and spurious multicollinearity.

- Perfect multicollinearity is when two or more of the predictor variables are perfectly correlated, meaning that they are linearly dependent. The design matrix then does not have full rank, so the ordinary least squares estimates are not uniquely defined.
- Near multicollinearity is when two or more of the predictor variables are highly, but not perfectly, correlated. This can lead to unreliable estimates of the regression coefficients and to incorrect conclusions and inferences.
- Spurious multicollinearity is when two or more predictor variables are correlated by chance or through a common cause rather than through any structural relationship between them. This can lead to inaccurate estimates of the regression coefficients and to incorrect inferences.
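The perfect case is easy to demonstrate numerically: with an exact linear dependence, \(X^\top X\) loses rank and the normal equations have no unique solution. The tiny data set below is contrived for the illustration.

```python
import numpy as np

# Perfect multicollinearity: x2 is an exact linear function of x1,
# so X'X is rank-deficient and OLS has no unique solution.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                        # exact linear dependence
X = np.column_stack([np.ones(4), x1, x2])

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)  # 2, not 3: one column is redundant
```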

## Advantages of Multicollinearity

Although multicollinearity can make the estimates of individual regression coefficients unreliable and unstable, it is not always harmful.

- One advantage is that it does not necessarily reduce the predictive accuracy of the model as a whole: even when individual coefficients are unstable, the fitted values can remain accurate as long as the correlation structure of the predictors persists in new data.
- Another advantage is that the correlated predictors taken together can still provide useful information about the response variable, increasing the predictive power of the model.

## Limitations of Multicollinearity

Multicollinearity can lead to several problems that can affect the accuracy of the model, including:

- It inflates the standard errors of the regression coefficients, which can mask genuinely significant effects and lead to incorrect conclusions about the significance of the coefficients.
- It can increase the prediction error of the model on new data, particularly when the correlation structure of the predictors changes, making the model less accurate than it should be.
- It increases the variance of the estimated coefficients, meaning that the coefficients are volatile and likely to change substantially when the model is re-estimated on new data.
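The variance inflation in the last point can be seen in a small simulation: repeatedly fitting the same true model with uncorrelated versus highly correlated predictors and comparing how much the first coefficient varies. The setup is synthetic and the `coef_sd` helper is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def coef_sd(corr, n=100, reps=500):
    """Empirical std. dev. of the first OLS coefficient when the two
    predictors have the given correlation (true model: y = x1 + x2 + e)."""
    betas = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = corr * x1 + np.sqrt(1 - corr ** 2) * rng.normal(size=n)
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(size=n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        betas.append(beta[0])
    return np.std(betas)

sd_low = coef_sd(0.0)    # uncorrelated predictors
sd_high = coef_sd(0.99)  # highly collinear predictors
print(sd_low, sd_high)   # the collinear case is far more variable
```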

In addition to the methods mentioned above, there are a few other approaches that can be used to address multicollinearity.

- Principal Components Analysis (PCA) is a technique used to reduce the number of variables in a model by combining them into a smaller set of uncorrelated variables. This can help reduce the effects of multicollinearity.
- Orthogonal polynomials can be used to create polynomial terms for predictor variables, which can help reduce the effects of multicollinearity.
- Regularization techniques such as ridge regression can be used to reduce the magnitude of estimated regression coefficients, which can help reduce the effects of multicollinearity.
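The PCA remedy from the list above can be sketched with an SVD of the centered design matrix; the data are synthetic, and the point is only that the resulting component scores are uncorrelated by construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three predictors, two of them highly correlated.
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# PCA via SVD of the centered design matrix: projecting onto the
# right singular vectors yields uncorrelated component scores.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T               # rotated, uncorrelated predictors

corr = np.corrcoef(scores, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())    # ~0: the components are uncorrelated
```

Regressing the response on a leading subset of these scores (rather than on the original predictors) is the usual principal-components-regression workflow.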


## Suggested literature

- Alin, A. (2010). *Multicollinearity*. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3), 370-374.
- Mansfield, E. R., & Helms, B. P. (1982). *Detecting multicollinearity*. The American Statistician, 36(3a), 158-160.