Heteroskedasticity
Latest revision as of 08:35, 18 November 2023
Heteroskedasticity is present when homoskedasticity, one of the most important assumptions of Ordinary Least Squares (OLS) regression, does not hold. Homoskedasticity means that the error terms are independently and identically distributed with mean zero and a variance that stays constant across all observations or periods [1]:
ε ~ i.i.d.(0, σ²)
Definition of Heteroskedasticity
Heteroskedasticity means that the error terms of the model do not all have the same variance: the spread of the errors around the regression line differs from one observation or period to the next. Typically the error variance depends on the independent variables. As a result, the usual formula for the variance of the OLS estimators is wrong, and so are their standard errors [2].
Var(ε_i) = σ_i² ≠ σ²
Consequences of heteroskedasticity
If you run an OLS regression in the presence of heteroskedasticity, you still get unbiased estimates of the beta coefficients, because unbiasedness only requires that the explanatory variables are uncorrelated with the error term. Consistency and unbiasedness are therefore preserved when only the homoskedasticity assumption is violated, and there is no impact on the model fit [3]. Other properties are affected, however:
- The coefficient estimates are no longer efficient.
- The standard errors are biased, and so are the test statistics built on them.
Because the standard errors are wrong, the t-statistics are wrong, and we cannot make valid statements about the significance of the coefficients. For example, if the standard errors are too small, the t-statistics are too large and the null hypothesis is rejected too often. Both inference and efficiency are thus affected: OLS no longer has the minimum variance among linear unbiased estimators.
It is therefore important to correct for heteroskedasticity in order to interpret the model and the results of statistical tests correctly [4].
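The consequences described above can be illustrated with a small simulation. The following is a minimal sketch, not taken from the cited sources: the function names `ols_slope_and_se` and `simulate`, the true coefficients, and the assumed variance structure (error standard deviation proportional to x squared) are all illustrative assumptions. It uses only the Python standard library.

```python
import random
import statistics


def ols_slope_and_se(x, y):
    """Return the OLS slope and its conventional (homoskedasticity-based) standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    conventional_se = (rss / (n - 2) / sxx) ** 0.5
    return slope, conventional_se


def simulate(n_reps=2000, n=100, true_slope=2.0, seed=0):
    """Monte Carlo: compare the actual spread of the slope with the reported SE."""
    rng = random.Random(seed)
    x = [10.0 * (i + 1) / n for i in range(n)]  # fixed regressor values in (0, 10]
    slopes, reported_ses = [], []
    for _ in range(n_reps):
        # Assumed variance structure: the error standard deviation grows with x squared.
        y = [1.0 + true_slope * xi + rng.gauss(0.0, 0.1 * xi ** 2) for xi in x]
        slope, se = ols_slope_and_se(x, y)
        slopes.append(slope)
        reported_ses.append(se)
    return statistics.mean(slopes), statistics.stdev(slopes), statistics.mean(reported_ses)


if __name__ == "__main__":
    mean_slope, actual_sd, mean_se = simulate()
    print(f"average slope estimate:     {mean_slope:.3f}")
    print(f"actual spread of the slope: {actual_sd:.3f}")
    print(f"average conventional SE:    {mean_se:.3f}")
```

Under these assumptions the average slope estimate should stay close to the true value of 2 (unbiasedness), while the actual spread of the estimates across replications should be noticeably larger than the average conventional standard error, which is why the usual t-statistics are unreliable.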
Reasons for Heteroskedasticity
Heteroskedasticity is often found in both time series and cross-sectional data. Possible causes are omitted variables, outliers in the data, or an incorrectly specified model equation [5].
How to find out if there is heteroskedasticity?
When in doubt, assume that heteroskedasticity may be present in your regression and check whether the data support this.
Plot in R
There are several ways to detect heteroskedasticity. The fastest is a graphical check in statistical software, for example R in RStudio: plot the residuals against the fitted values. If the spread of the residuals shows a trend or pattern, for example a funnel shape that widens as the fitted values grow, the homoskedasticity assumption is very likely violated. If the residuals scatter randomly with a roughly constant spread, there is no sign of a violation.
Breusch-Pagan Test
Another option is the Breusch-Pagan test (available in R in the lmtest package). It checks whether the independent variables affect the error variance by regressing the squared residuals (a convenient proxy for the variance of the error term u) on the regressors and testing whether they are jointly significant.
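The auxiliary regression behind the Breusch-Pagan test can be sketched in a few lines. This is a simplified illustration for a model with a single regressor, written with the Python standard library only; the function names (`fit_line`, `r_squared`, `breusch_pagan`) and the simulated data are assumptions made for this example. It uses the n·R² form of the Lagrange multiplier statistic, which is compared with a chi-square distribution with one degree of freedom when there is one regressor.

```python
import math
import random


def fit_line(x, y):
    """OLS with an intercept and one regressor; returns (intercept, slope)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """R-squared of the simple regression of y on x."""
    intercept, slope = fit_line(x, y)
    y_bar = sum(y) / len(y)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    if tss == 0.0:
        return 0.0
    return max(0.0, 1.0 - rss / tss)  # guard against tiny negative rounding errors


def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for H0: the errors are homoskedastic."""
    intercept, slope = fit_line(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the regressor.
    lm = len(x) * r_squared(x, squared_resid)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


if __name__ == "__main__":
    rng = random.Random(42)
    x = [1.0 + 0.1 * i for i in range(200)]
    # Assumed data: the error standard deviation is proportional to x.
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]
    lm, p_value = breusch_pagan(x, y)
    print(f"LM statistic = {lm:.2f}, p-value = {p_value:.4g}")
```

With several regressors the auxiliary regression would include all of them and the degrees of freedom would equal their number; in practice a tested implementation such as the one in the lmtest package is preferable to a hand-written version.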
White Test
The White test is more general: it detects not only heteroskedasticity but also other forms of misspecification. It adds the squares and interaction terms of the regressors to the auxiliary regression in order to capture any dependence between the variance of the residuals and the independent variables. A simpler variant that uses fewer degrees of freedom regresses the squared residuals on the fitted values and their squares (het.test in R).
In both procedures (White and Breusch-Pagan) the null hypothesis is that the residuals are homoskedastic, and the alternative hypothesis is that they are not. If the p-value is lower than the chosen significance level, the null hypothesis is rejected and the residuals are treated as heteroskedastic. If you fail to reject the null hypothesis, there is no evidence against homoskedasticity.
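For a model with a single regressor, the White test reduces to regressing the squared residuals on x and x squared, since there are no interaction terms. The sketch below illustrates this special case with the Python standard library only; the helper names (`solve`, `ols`, `white_test`) and the example data are assumptions for illustration, and the n·R² statistic is compared with a chi-square distribution with two degrees of freedom.

```python
import math


def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta


def ols(columns, y):
    """OLS of y on the given regressor columns plus an intercept; returns residuals."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [yi - sum(bj * rj for bj, rj in zip(beta, r)) for r, yi in zip(rows, y)]


def white_test(x, y):
    """Return (LM statistic, p-value) for H0: homoskedasticity, one-regressor case."""
    squared_resid = [e * e for e in ols([x], y)]
    # Auxiliary regression: squared residuals on x and x squared.
    aux_resid = ols([x, [xi * xi for xi in x]], squared_resid)
    mean_sq = sum(squared_resid) / len(squared_resid)
    tss = sum((v - mean_sq) ** 2 for v in squared_resid)
    r2 = max(0.0, 1.0 - sum(v * v for v in aux_resid) / tss) if tss > 0 else 0.0
    lm = len(x) * r2
    p_value = math.exp(-lm / 2.0)  # upper tail of chi-square with 2 degrees of freedom
    return lm, p_value


if __name__ == "__main__":
    x = [float(i) for i in range(1, 41)]
    # Assumed data: deviations from the line grow in proportion to x.
    y = [1 + 2 * xi + (-1) ** i * 0.5 * xi for i, xi in enumerate(x, start=1)]
    print(white_test(x, y))
```

A small p-value leads to rejecting homoskedasticity, exactly as with the Breusch-Pagan test. With more regressors the auxiliary regression also needs the cross products, which is where the loss of degrees of freedom mentioned above comes from.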
Goldfeld-Quandt Test
The idea here is to compare the error variances of two subsamples. You split the observations into two groups and run a separate regression for each group. The null hypothesis says that both groups have the same error variance, which means homoskedasticity. If the two variances differ significantly, there is heteroskedasticity [6].
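The comparison of two subsample variances can be sketched as follows. This is a simplified illustration with one regressor and the Python standard library only; the function names, the choice to order the observations by x, and the share of middle observations that is dropped (`drop_fraction`) are assumptions made for this example. The sketch returns only the F statistic and its degrees of freedom, because the F distribution needed for a p-value is not part of the standard library.

```python
def residual_sum_of_squares(x, y):
    """Residual sum of squares of the simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt(x, y, drop_fraction=0.2):
    """Return (F statistic, degrees of freedom per group) for the two outer subsamples."""
    pairs = sorted(zip(x, y))  # order the observations by the regressor
    n = len(pairs)
    n_drop = int(n * drop_fraction)  # middle observations left out of both groups
    n_group = (n - n_drop) // 2
    low, high = pairs[:n_group], pairs[n - n_group:]
    rss_low = residual_sum_of_squares(*zip(*low))
    rss_high = residual_sum_of_squares(*zip(*high))
    df = n_group - 2  # two estimated coefficients per regression
    f_stat = (rss_high / df) / (rss_low / df)
    return f_stat, df


if __name__ == "__main__":
    x = [float(i) for i in range(1, 41)]
    # Assumed data: deviations from the line grow in proportion to x.
    y = [1 + 2 * xi + (-1) ** i * 0.5 * xi for i, xi in enumerate(x, start=1)]
    f_stat, df = goldfeld_quandt(x, y)
    print(f"F = {f_stat:.2f} with ({df}, {df}) degrees of freedom")
```

An F statistic close to 1 is in line with equal variances in both groups, while a value far above 1 suggests that the error variance is larger in the group with high values of x. The statistic would then be compared with a critical value of the F distribution with the reported degrees of freedom.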
You can also use the Levene test, the Glejser test, or the RESET test [7].
What to do against heteroskedasticity?
Several approaches to correct for heteroskedasticity can be found in the literature. One possibility is to switch to weighted least squares (WLS) regression, in which each observation is weighted according to its error variance, so that observations with a larger variance count less. The choice of weights depends on the structure of the data. This solution requires the error variance of every observation to be known; very often it is unknown, which makes the approach impractical. Another alternative is to redesign the model by transforming the dependent variable (and, where needed, other variables) so that the variance is stabilized. One example is to square the values; logarithms and square roots are other common choices. The literature also shows how to handle heteroskedasticity with bootstrapping, which has the advantage of requiring no strong assumptions. It has become a popular alternative for calculating p-values and confidence intervals when assumptions are violated. The idea is to treat the sample as a stand-in for the unknown data-generating process, with its unknown parameters and probability distributions, and to resample from it repeatedly [8].
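The weighted least squares idea can be made concrete with a short sketch. It assumes a single regressor and, as discussed above, error variances that are known for every observation, so that the weights can be set to the inverse of the variance. The function name `weighted_least_squares` and the simulated data are illustrative assumptions; only the Python standard library is used.

```python
import random


def weighted_least_squares(x, y, weights):
    """Fit y = intercept + slope * x by minimizing sum(w_i * residual_i ** 2)."""
    w_total = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_total
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


if __name__ == "__main__":
    rng = random.Random(1)
    x = [0.5 + 0.1 * i for i in range(100)]
    sigma = [0.3 * xi for xi in x]  # assumed known error standard deviations
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, s) for xi, s in zip(x, sigma)]
    weights = [1.0 / s ** 2 for s in sigma]  # inverse-variance weights
    print("WLS:", weighted_least_squares(x, y, weights))
    print("OLS:", weighted_least_squares(x, y, [1.0] * len(x)))  # equal weights give OLS
```

With equal weights the function reproduces ordinary least squares. When the error variance is proportional to x squared, weighting by 1/x² is equivalent to running OLS on the transformed model in which y/x is regressed on 1/x, which shows how WLS and a variance-stabilizing transformation are connected.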
Examples of Heteroskedasticity
- A first example is income and expenditure. Households with low income have little room to vary their spending, while households with high income can choose how much to spend, so the variance of expenditure grows with income.
- Another example is the effect of price on demand. When the price of a good or service increases, demand tends to decrease, and the spread of demand around this relationship may differ across price ranges.
- A third example is the effect of age on wages. Wages tend to increase with age, and they usually also become more dispersed, so the variance of wages differs across age groups.
- A fourth example is the effect of education on income. People with higher levels of education tend to earn more than those with lower levels, and their incomes often vary more widely, so the variance of income differs across educational backgrounds.
Advantages of Heteroskedasticity
Heteroskedasticity itself is a violation of an OLS assumption, but modelling it explicitly can have advantages:
- It allows more flexibility in estimation, because the variance of the errors can be modelled separately for each observation as a function of its explanatory variables.
- This is useful in cases where homogeneity of variance across the observations cannot be assumed.
- A pattern in the residual variance can point to nonlinearity in the relationship between the dependent and independent variables, and correcting the specification may lead to more accurate predictions.
- Inspecting the residual variance can also help detect outliers, since outlying observations tend to produce unusually large residuals.
Limitations of Heteroskedasticity
Heteroskedasticity causes several problems that should be considered when using OLS regression:
- It makes the OLS estimates of the regression parameters inefficient, even though they remain unbiased and consistent.
- It causes the standard errors to be incorrect, leading to incorrect p-values and confidence intervals.
- It can lead to incorrect prediction intervals, resulting in misleading forecasts.
- It invalidates the usual t- and F-tests, so conclusions about the significance of individual variables or groups of variables can be wrong.
- It does not change R-squared itself, but significance tests of the overall fit become unreliable.
- It can make outliers harder to identify, because large residuals may simply reflect a high error variance.
Other approaches related to Heteroskedasticity
Heteroskedasticity can be addressed with a variety of alternative methods, such as:
- Weighted Least Squares (WLS): WLS gives less weight to observations with a high error variance and thereby corrects for heteroskedasticity.
- Robust Regression: robust fitting techniques minimize the effects of outliers, which are one possible cause of heteroskedasticity.
- Generalized Least Squares (GLS): GLS takes both unequal error variances and correlation between errors into account.
- Nonlinear Regression: nonlinear regression techniques model nonlinear relationships between variables, which can remove heteroskedasticity that stems from a misspecified functional form.
Which method is appropriate depends on the cause of the heteroskedasticity and on how much is known about the error variance.
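Footnotes
[1] Wooldridge, J. (2005), pg. 13
[2] Wooldridge, J. (2005), pg. 13
[3] Kaufman, R. (2013), pg. 2-5
[4] Astivia, O., & Zumbo, B. (2019), pg. 2-4
[5] Klein et al. (2015), pg. 543
[6] Astivia, O., & Zumbo, B. (2019), pg. 4-7
[7] Godfrey, L. G., & Orme, C. D. (1999), pg. 173-176
[8] Astivia, O., & Zumbo, B. (2019), pg. 34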
References
- Astivia, O., & Zumbo, B. (2019) Heteroskedasticity in Multiple Regression Analysis: What it is, How to Detect it and How to Solve it with Applications in R and SPSS, Practical Assessment, Research, and Evaluation.
- Godfrey, L. G., & Orme, C. D. (1999) The robustness, reliability and power of heteroskedasticity tests, Econometric Reviews.
- Kaufman, R. (2013) Heteroskedasticity in Regression: Detection and Correction, SAGE Publications.
- Klein, A. G., Gerhard, C., Büchner, R., Diestel, S., & Schermelleh-Engel, K. (2015) The Detection of Heteroscedasticity in Regression Models for Psychological Data, Psychological Test and Assessment Modeling.
- Wooldridge, J. (2005) Introductory Econometrics: A Modern Approach, Cengage Learning, 2015.
Author: Annamarie Dietz