Descriptive statistics
Descriptive statistics refers to methods for summarizing and organizing data through numerical measures and graphical representations. Unlike inferential statistics, which draw conclusions about populations from samples, descriptive statistics simply characterize the data at hand[7]. The field emerged from 19th-century work by Francis Galton and was formalized by Karl Pearson, who founded the world's first university statistics department at University College London in 1911[1].
Historical development
Statistical description has ancient roots in census-taking and record-keeping. However, modern descriptive statistics developed primarily through biological research. Galton, studying hereditary traits in the 1870s and 1880s, developed methods for summarizing distributions and measuring relationships between variables. His 1889 book "Natural Inheritance" introduced regression concepts and stimulated further mathematical development.
Karl Pearson built upon Galton's foundations. In 1893, he coined the term "standard deviation" to describe spread around the mean. He borrowed the concept of moments from physics to describe distribution shapes. Pearson also developed the chi-squared test and refined correlation coefficients. His collaboration with Galton and biologist W.F.R. Weldon led to founding the journal Biometrika in 1901[2].
Measures of central tendency
Central tendency measures identify typical or representative values within a dataset. Three primary measures exist:
Mean
The arithmetic mean equals the sum of all values divided by the count of observations. Calculating the mean for exam scores of 72, 85, 91, and 68 yields (72+85+91+68)/4 = 79. The mean incorporates every data point, making it sensitive to extreme values. An outlier can substantially shift the mean away from what most observations suggest[3].
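A minimal sketch of the calculation above, assuming Python's standard statistics module; the exam scores come from this paragraph, and the added score of 200 is a hypothetical outlier for illustration.

    from statistics import mean

    scores = [72, 85, 91, 68]
    print(mean(scores))            # 79

    # A single extreme value pulls the mean upward,
    # illustrating its sensitivity to outliers.
    print(mean(scores + [200]))    # 103.2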
Median
The median is the middle value when data are arranged in order. With an odd number of observations, it's the central value; with an even number, it's the average of the two middle values. Professional athlete salaries are typically reported as medians because a few extremely high earners would distort the mean upward.
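A short illustration of both rules, again assuming the standard statistics module; the salary figures are hypothetical.

    from statistics import mean, median

    # Hypothetical salaries: one very high earner distorts the mean,
    # while the median stays near the typical value.
    salaries = [48_000, 52_000, 55_000, 60_000, 2_000_000]
    print(median(salaries))   # 55000  (odd count: the middle value)
    print(mean(salaries))     # 443000 (pulled up by the outlier)

    # With an even number of observations, the median averages
    # the two middle values.
    print(median([48_000, 52_000, 55_000, 60_000]))  # 53500.0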
Mode
The mode identifies the most frequently occurring value. A clothing retailer analyzing shirt sales would find the mode useful for stocking decisions. Some datasets have multiple modes (bimodal or multimodal), while others have no mode because every value occurs with equal frequency.
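A brief sketch using statistics.multimode (Python 3.8+); the shirt sizes are made-up values.

    from statistics import multimode

    # Hypothetical shirt sales: "M" occurs most often.
    sizes = ["S", "M", "M", "L", "M", "XL", "L"]
    print(multimode(sizes))            # ['M']

    # Bimodal data return both modes.
    print(multimode([1, 1, 2, 2, 3]))  # [1, 2]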
When data follow a normal distribution, mean, median, and mode coincide. Skewed distributions pull them apart. Right-skewed data (with a long tail of high values) produce means higher than medians[4].
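A quick check of the relationship between mean and median on a small right-skewed dataset; the values are hypothetical.

    from statistics import mean, median

    # Right-skewed: most values are small, a few are large.
    right_skewed = [1, 2, 2, 3, 3, 4, 5, 20]
    print(mean(right_skewed))    # 5.0
    print(median(right_skewed))  # 3.0  (mean > median)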
Measures of variability
Variability measures describe how spread out data points are:
Range
The simplest variability measure subtracts the minimum from the maximum. Temperatures ranging from 15 to 28 degrees have a range of 13. This measure depends entirely on two extreme values and ignores all other observations.
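The same calculation in Python; only the 15 and 28 come from this paragraph, the remaining temperatures are hypothetical.

    temps = [15, 22, 19, 28, 17]
    print(max(temps) - min(temps))  # 13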
Variance
Variance calculates the average squared deviation from the mean. Each data point's distance from the mean is squared, and these squared distances are averaged. Squaring prevents negative and positive deviations from canceling each other. Population variance divides by N; sample variance divides by N-1 to correct for bias[5].
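A minimal sketch contrasting the two divisors with the standard library, reusing the exam scores from the mean example: pvariance divides by N, variance by N-1.

    from statistics import pvariance, variance

    data = [72, 85, 91, 68]

    # Population variance: mean squared deviation, dividing by N.
    print(pvariance(data))   # 87.5

    # Sample variance: divides by N-1 (Bessel's correction).
    print(variance(data))    # 116.666...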
Standard deviation
The standard deviation, Pearson's term, is the square root of the variance, which returns the measure to the original units. If heights are measured in centimeters, the standard deviation is also in centimeters (unlike the variance, which would be in squared centimeters). In a normal distribution, approximately 68% of observations fall within one standard deviation of the mean, and about 95% fall within two standard deviations.
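A small sketch of the units point and the 68% rule, assuming normally distributed heights generated with random.gauss; the mean of 170 cm and spread of 8 cm are arbitrary choices.

    import random
    from statistics import mean, pstdev

    random.seed(0)
    # Hypothetical heights in centimeters, drawn from a normal distribution.
    heights = [random.gauss(mu=170, sigma=8) for _ in range(10_000)]

    m, sd = mean(heights), pstdev(heights)   # sd is in cm, not cm squared
    within_one_sd = sum(m - sd <= h <= m + sd for h in heights) / len(heights)
    print(round(sd, 1))    # close to 8
    print(within_one_sd)   # roughly 0.68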
Interquartile range
The interquartile range (IQR) measures the spread of the middle 50% of data. It equals the difference between the 75th percentile (Q3) and 25th percentile (Q1). Outliers affect this measure less than they affect range or standard deviation.
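One way to compute it with the standard library (statistics.quantiles, Python 3.8+); the data are hypothetical.

    from statistics import quantiles

    data = [3, 5, 7, 8, 9, 11, 13, 15, 40]   # 40 is an outlier
    q1, q2, q3 = quantiles(data, n=4)        # quartile cut points
    print(q3 - q1)   # 8.0: far less affected by the 40 than the range (37)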
Graphical representations
Visual displays communicate distributional properties quickly:
- Histograms show frequency distributions through contiguous bars
- Box plots display median, quartiles, and potential outliers
- Scatter plots reveal relationships between two variables
- Bar charts compare categorical frequencies
- Pie charts show proportions of a whole
Each visualization serves different purposes. Histograms work well for continuous data. Bar charts suit categorical comparisons. Box plots excel at comparing distributions across groups[6].
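A minimal plotting sketch, assuming matplotlib is available; the measurements are randomly generated, hypothetical values.

    import random
    import matplotlib.pyplot as plt

    random.seed(0)
    values = [random.gauss(50, 10) for _ in range(500)]

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(values, bins=20)     # histogram: frequency distribution
    ax1.set_title("Histogram")
    ax2.boxplot(values)           # box plot: median, quartiles, outliers
    ax2.set_title("Box plot")
    plt.show()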
Applications in management
Managers rely on descriptive statistics for operational decisions. Quality control uses means and standard deviations to monitor production consistency. Human resources analyzes salary distributions to ensure competitive compensation. Marketing examines customer spending patterns through frequency analysis.
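As an illustration of the quality-control use, a sketch that flags measurements falling outside the mean plus or minus three standard deviations, a common rule of thumb; the fill weights and the three-sigma limits are assumptions, not from the source.

    from statistics import mean, pstdev

    # Hypothetical fill weights from a production line (grams).
    weights = [500.2, 499.8, 500.5, 499.9, 500.1, 503.9, 500.0, 499.7]

    m, sd = mean(weights), pstdev(weights)
    lower, upper = m - 3 * sd, m + 3 * sd
    flagged = [w for w in weights if not lower <= w <= upper]
    print(flagged)   # values outside the control limits, if any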
References
- Pearson K. (1894). Contributions to the Mathematical Theory of Evolution, Philosophical Transactions
- Galton F. (1889). Natural Inheritance, Macmillan
- Triola M.F. (2022). Elementary Statistics, Pearson
- Moore D.S., McCabe G.P., Craig B.A. (2021). Introduction to the Practice of Statistics, W.H. Freeman
- Agresti A., Franklin C. (2018). Statistics: The Art and Science of Learning from Data, Pearson
Footnotes
[1] Pearson founded UCL Department of Applied Statistics in 1911
[2] Historical development through Galton and Pearson
[3] Mean calculation and properties
[4] Relationship between measures in normal and skewed distributions
[5] Variance formula and Bessel's correction for samples
[6] Common graphical methods
[7] Distinction from inferential statistics