In statistics, the distribution of the data can be expressed as a function that gives the possible values of a variable and how often they occur. The distribution of a categorical variable can be described using frequency tables and visuals such as bar charts. As with categorical variables, the characteristics of numerical variables can be summarized in two ways: numerically and graphically.
A numerical summary of the distribution should report its center and spread or variability.
Measures of central tendency are numbers or words that attempt to describe the middle or typical value of a distribution. Fundamentally, they are used to describe the center of the data.
The mean is calculated by dividing the sum of the observed values of the variable by the number of observations.
Example: The following table shows the amount of financial aid provided by Ankara Municipality to families for school transportation in December 2021.
District | Financial Aid (TL) |
---|---|
MAMAK | 119.760 |
KEÇİÖREN | 100.169 |
ALTINDAĞ | 55.685 |
YENİMAHALLE | 55.275 |
SİNCAN | 54.985 |
KAHRAMANKAZAN | 40.100 |
ÇANKAYA | 38.830 |
ETİMESGUT | 29.337 |
GÖLBAŞI | 23.250 |
AKYURT | 21.250 |
PURSAKLAR | 17.925 |
ÇUBUK | 17.060 |
POLATLI | 14.620 |
ŞEREFLİKOÇHİSAR | 11.456 |
ELMADAĞ | 8.280 |
BEYPAZARI | 3.620 |
AYAŞ | 1.775 |
NALLIHAN | 1.500 |
KALECİK | 800 |
KIZILCAHAMAM | 275 |
\(\bar X = \frac{119.760+\dots+275}{20} = 30.797,625\)
The average amount of financial aid for school transportation in Ankara in December 2021 is 30.797,625 Turkish Liras.
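The calculation can be reproduced with a short Python sketch. The list `aid` is typed in from the table above, reading the dot as a thousands separator; the variable names are illustrative and not part of the original material. Since the table entries appear to be rounded to whole liras, the result should agree with the reported mean only up to rounding.

```python
# Financial aid (TL) per district, copied from the table above.
# Assumption: the dot in the table is a thousands separator (119.760 = 119760 TL).
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

# Mean = sum of the observations divided by the number of observations.
aid_total = sum(aid)
aid_mean = aid_total / len(aid)

print(f"n = {len(aid)}, sum = {aid_total}, mean = {aid_mean:.1f} TL")
```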
The median is the middle data point. Half of the data is below the median and half is above the median.
Example: Consider the table given above. The median of financial aid for school transportation in Ankara in December 2021 is 19.587 TL. This means that half of the financial aid amounts are below 19.587 TL, and half of them are above 19.587 TL.
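As a minimal sketch, the median of the same table can be found by sorting the values and, because the number of districts is even, averaging the two middle ones. The list `aid` is typed in from the financial aid table and the names are illustrative. From the table as printed the result is 19587.5, which the text reports as 19.587 TL.

```python
from statistics import median

# Financial aid (TL) per district, copied from the financial aid table.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

ordered = sorted(aid)
n = len(ordered)

# With an even number of observations, the median is the average
# of the two values in the middle of the sorted list.
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
aid_median = median(aid)

print("middle pair:", middle_pair)
print("median:", aid_median)
```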
The mode is the most commonly occurring value. There may be more than one mode. It is seldom used, but sometimes useful.
Example: A study in an office examined the retirement ages of its past members. Here are the data for a sample of 11 staff members.
60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.
The mode is 63 since it is the value that occurs most often (4 times). We can say that the most common retirement age among the past members of the office was 63.
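The count behind this statement can be checked with a small Python sketch. It uses the eleven retirement ages listed above; `multimode` is used instead of `mode` so that ties, if there were any, would all be reported.

```python
from collections import Counter
from statistics import multimode

# Retirement ages of the 11 sampled staff members, from the example above.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

# Frequency of each distinct age.
counts = Counter(ages)

# multimode returns every value that shares the highest frequency.
modes = multimode(ages)

print("frequencies:", counts.most_common())
print("mode(s):", modes)
```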
Right Skewed / Positively Skewed | Symmetric / Bell-Shaped | Left Skewed / Negatively Skewed |
---|---|---|
Mean > Median > Mode | Mode=Median=Mean | Mode > Median > Mean |
If the distribution is not skewed, the values of the mode, median, and mean are similar, and any of them can be used to describe the central tendency of the distribution.
When a distribution is skewed, the ideal choice is to report both the mean and the median.
Note that the mean is heavily influenced by the extreme observations in the data. When such observations are present, the median is a more reasonable choice to illustrate the center of the data.
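The sensitivity of the mean can be illustrated with the financial aid table from the earlier example. The sketch below is a hypothetical what-if, not an analysis from the original material: it drops the two largest districts and compares how far the mean and the median move.

```python
from statistics import mean, median

# Financial aid (TL) per district, copied from the financial aid table.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

# Hypothetical what-if: remove the two largest observations.
trimmed = sorted(aid)[:-2]

# How much each measure of center changes when the extremes are removed.
mean_shift = mean(aid) - mean(trimmed)
median_shift = median(aid) - median(trimmed)

print(f"mean:   {mean(aid):.1f} -> {mean(trimmed):.1f} (shift {mean_shift:.1f})")
print(f"median: {median(aid):.1f} -> {median(trimmed):.1f} (shift {median_shift:.1f})")
```

In this sketch the mean moves by several thousand liras more than the median does, which is the behavior the note above describes.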
Indicate whether the following skewed distributions are positively skewed because the mean exceeds the median or negatively skewed because the median exceeds the mean.
a. a distribution of test scores on an easy test, with most students scoring high and a few students scoring low
b. a distribution of ages of college students, with most students in their late teens or early twenties and a few students in their fifties or sixties
c. a distribution of loose change carried by classmates, with most carrying less than \(\$1\) and with some carrying \(\$3\) or \(\$4\) worth of loose change
d. a distribution of the sizes of crowds in attendance at a popular movie theater, with most audiences at or near capacity
a. Negatively skewed because the median exceeds the mean
b. Positively skewed because the mean exceeds the median
c. Positively skewed
d. Negatively skewed
Averages are important, but they tell only part of the story. Statistics flourishes because we live in a world of variability; no two people are identical, and a few are really far out. When summarizing a set of data, we specify not only measures of central tendency, such as the mean, but also measures of variability, that is, measures of the amount by which scores are dispersed or scattered in a distribution.
The range is the difference between the maximum and minimum values. It is not commonly used since the maximum or the minimum of the data may be outliers, which may distort the actual spread of the data.
Example: Consider the financial aid table again. The range is \(119.760-275 = 119.485\). The difference between maximum and minimum financial aids for school transportation in Ankara in December 2021 is 119.485 TL.
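The same subtraction as a minimal Python sketch, with the list `aid` typed in from the financial aid table (names are illustrative). It also makes visible that the range depends on only two of the twenty observations.

```python
# Financial aid (TL) per district, copied from the financial aid table.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

# Range = maximum - minimum; every other observation is ignored.
aid_range = max(aid) - min(aid)

print(f"max = {max(aid)}, min = {min(aid)}, range = {aid_range} TL")
```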
The variance is simply described as the mean of all squared deviation scores. Although it has great theoretical properties, it is seldom used as a measure of variability.
Why do we not prefer the variance for interpreting the variability in the data?
Because it is expressed in squared units. Check the formula in the lecture note.
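For reference, the sample variance and standard deviation of observations \(x_1, \dots, x_n\) with mean \(\bar X\) are commonly written as follows. The \(n-1\) divisor is the usual sample convention and is an assumption here, since the lecture note's own formula is not reproduced in this text; dividing by \(n\) instead gives the literal mean of the squared deviations.

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar X\right)^2, \qquad s = \sqrt{s^2}
\]

Because every deviation is squared, \(s^2\) is measured in squared units (for example TL\(^2\)), while \(s\) is back in the units of the data.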
The standard deviation is the square root of the variance and measures how spread out the data are from the mean. It is never negative, and it is zero only when all values are identical. Larger values mean the data are highly variable. Smaller values mean the data are consistent and not as variable.
Example: The standard deviation of the retirement age in the office is 6.5 while the mean is 61. If the distribution is roughly bell-shaped, we can say that about 68% of the retirement ages of the personnel in the office lie between (61-6.5=54.5) and (61+6.5=67.5), that is, within mean \(\pm\) one standard deviation (the one-sigma part of the empirical rule).
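The figures in this example can be checked against the eleven retirement ages from the mode example. The sketch below assumes the sample standard deviation (divisor \(n-1\)), which is the version that reproduces the reported 6.5; it also counts how many ages actually fall within one standard deviation of the mean. With only 11 observations, the observed share need not be close to 68%.

```python
from statistics import mean, stdev

# Retirement ages of the 11 sampled staff members, from the mode example.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

age_mean = mean(ages)
age_sd = stdev(ages)  # sample standard deviation (divides by n - 1)

# One-standard-deviation interval around the mean.
low, high = age_mean - age_sd, age_mean + age_sd
inside = [a for a in ages if low <= a <= high]

print(f"mean = {age_mean:.1f}, sd = {age_sd:.1f}")
print(f"interval = ({low:.1f}, {high:.1f})")
print(f"{len(inside)} of {len(ages)} ages fall inside the interval")
```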
Employees of Corporation A earn annual salaries described by a mean of \(\$90,000\) and a standard deviation of \(\$10,000.\)
The majority of all salaries fall between what two values?
A small minority of all salaries are less than what value?
A small minority of all salaries are more than what value?
Assume that the distribution of IQ scores for all college students has a mean of 120, with a standard deviation of 15. These two bits of information imply which of the following?
The percentile of a data point is the percent of the data that is equal to or less than that point. It is useful for describing the relative position of a data point within a data set. If the percentile is close to 100, then the observation is one of the largest. If it is close to zero, then the observation is one of the smallest.
Example: The 30th percentile of the financial aid for school transportation in Ankara in December 2021 is 10503.2. This means that 30% of the financial aid amounts are 10503.2 TL or below.
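A percentile usually falls between two observations, so software interpolates. The sketch below uses Python's `statistics.quantiles` with `method="inclusive"`; that choice is an assumption, made because it reproduces the reported value from the financial aid table, and other interpolation conventions give slightly different numbers.

```python
from statistics import quantiles

# Financial aid (TL) per district, copied from the financial aid table.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

# n=10 returns the nine cut points 10%, 20%, ..., 90%;
# index 2 is therefore the 30th percentile.
deciles = quantiles(aid, n=10, method="inclusive")
p30 = deciles[2]

# Share of districts whose aid is at or below the 30th percentile.
share_at_or_below = sum(1 for x in aid if x <= p30) / len(aid)

print(f"30th percentile = {p30:.1f} TL")
print(f"share at or below = {share_at_or_below:.0%}")
```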
Quartiles are specific percentile values. The first quartile (Q1 or lower quartile) is the 25th percentile of the data, the second quartile is the median, which is the 50th percentile of the data, and the third quartile (Q3 or upper quartile) is the 75th percentile of the data. The maximum of the data, which is the 100th percentile, is sometimes counted as the fourth quartile.
Quartiles are used to describe the spread of the middle half of the data through the interquartile range (IQR), which is the difference between Q3 and Q1.
When you use the median to indicate the center of the distribution, describe its spread by giving the IQR.
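As a minimal sketch, the quartiles and the IQR of the financial aid table can be computed as below. The `method="inclusive"` interpolation is an assumption (it matches the convention that reproduces the percentile reported earlier); the variable names are illustrative.

```python
from statistics import quantiles

# Financial aid (TL) per district, copied from the financial aid table.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(aid, n=4, method="inclusive")

# The IQR is the width of the middle half of the data.
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {iqr} TL")
```

The IQR is far smaller than the range of the same data, because it ignores the lowest and highest quarters where the extreme districts sit.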
Many different graphics are available for numerical data to illustrate the data distribution, and they all have the same aim: to display the important features of the data. Make sure that you get plenty of practice in interpreting every kind of plot you use.
Graphics are the appropriate tools for displaying the features that make up the shapes of the data distributions. They can provide more and different kinds of information than a set of summary statistics.
Obviously, it is best to use both approaches.
Asymmetry: The distribution is skewed to the left or right.
Outliers: There are one or more values that are far from the rest of the data.
Multimodality: The distribution has more than one peak.
Gaps: There are ranges of values within the data where no cases are recorded.
Heaping: Some values occur unexpectedly often.
These features can be identified using
Box Plots
Histograms
Dot Plots
Line Plots
Rug Plots
Density or Distribution Estimation
QQ-Plot
Minimum
Q1
Median (Q2)
Q3
Maximum
A convenient way to describe both the center and spread of a data set is to give the median to measure the center, and the quartiles and the smallest and largest individual observations to show the spread. This five-number summary of a distribution leads to a new graph, the box plot.
https://www.simplypsychology.org/boxplot-outliers.png
Box plots are the best tools for identifying outliers and for comparing distributions across the groups. They are also used for describing the shape of the variable.
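To make the link between the five-number summary and outliers concrete, the sketch below applies the widely used 1.5 × IQR rule to the financial aid table from the earlier example. The rule itself is an assumption here, since the text does not state which rule the plotted box plots use, and the `method="inclusive"` quartile convention is likewise a choice made for illustration.

```python
from statistics import quantiles

# Financial aid (TL) per district, copied from the financial aid table.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

q1, q2, q3 = quantiles(aid, n=4, method="inclusive")
five_number = (min(aid), q1, q2, q3, max(aid))

# Assumed convention: points beyond 1.5 * IQR from the box are flagged.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = sorted(x for x in aid if x < lower_fence or x > upper_fence)

print("five-number summary:", five_number)
print(f"fences: ({lower_fence}, {upper_fence})")
print("flagged as outliers:", flagged)
```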
Example: The following data set contains information on 28819 movies and their user ratings from IMDB.com.
The following figure shows the box plot and the five-number summary of the average rating data.
The plot shows that the average ratings of the movies have an approximately left-skewed distribution. The median rating is between 5 and 7. The data have outliers, which are below 2.5 points.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.100 5.933 7.000 10.000
The average of the average ratings of the movies is 5.93.
What about others?
Interpretation: Documentaries have the highest median rating, while action movies have the lowest. Drama and romance movies have the same median rating. All movie types except romance have outlier ratings. Also, we can say that the subgroups may have approximately negatively skewed distributions.
A histogram is a visual tool used for displaying the distribution of a single quantitative variable, especially when the sample size is large. It looks like a bar chart, but there are some important differences.
In a histogram, the number of bins is the key component. Luckily, natural bin widths based on context are usually a good choice for a histogram. They are good at emphasizing features of the raw data.
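The effect of the bin width can be seen without any plotting library. The sketch below is a plain-Python illustration, not the tool that produced the figures in these notes: the helper `bin_counts` is hypothetical, and it reuses the eleven retirement ages from the mode example with two different bin widths.

```python
from collections import Counter

def bin_counts(values, binwidth):
    """Count values per bin [start, start + binwidth), keyed by bin start."""
    counts = Counter((v // binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))

# Retirement ages of the 11 sampled staff members, from the mode example.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

# Print a text histogram for each bin width.
for width in (5, 10):
    print(f"binwidth = {width}")
    for start, count in bin_counts(ages, width).items():
        print(f"  [{start}, {start + width}): {'#' * count}")
```

With a width of 5 the ages 60 to 64 and 65 to 69 form separate bars; with a width of 10 they merge into one tall bar, so the same data can look more or less concentrated depending on the choice.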
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpretation: The ratings are spread between 0 and 10, and most of the ratings are between 5 and 7.5. Although the distribution is roughly left skewed, we can say that it is multimodal since the histogram has more than one peak. Outliers cannot be detected reliably from the histogram; the box plot of the variable is needed for that.
Consider the movie data again. The following figure displays the histogram of the length of the movies. What is wrong with the output? Which tools can you suggest to solve it?
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram is hopeless, except that it implies there must be at least one very high value at the extreme right, even though the resolution of the plot is not good enough to see it.
The boxplot does what boxplots do best, telling us something more about the outliers.
Clearly, we have extreme outliers, and they should be set aside first. After that, you can talk about the distribution of the length variable.
Line plots are generally used for visualizing how one continuous variable, on the y-axis, changes in relation to another continuous variable, on the x-axis. The variable on the x-axis generally denotes time.
Interpretation: Between 1890 and 1920, the average ratings of the movies show a sharp increasing trend over time, followed by a decreasing trend until the beginning of the 2000s. After that, the average rating of the movies shows an increasing pattern.
The following chart shows the chocolate price in Turkey between 2008 and April 2022. How do you interpret this?
## Warning: Removed 36 row(s) containing missing values (geom_path).
## Warning: Removed 36 rows containing missing values (geom_point).
The chocolate price in Turkey increases over time, with an especially sharp increase after 2020.
In the previous part, we mainly discussed the plots used for displaying a single numerical variable. However, there are also visual tools for plotting more than one numerical variable.
The scatter plot is used to display the relationship between two numerical variables. It works best for two continuous variables but can also be applied to one discrete and one continuous variable.
In a scatter plot, the y-axis denotes the dependent or target variable, while the x-axis denotes the independent or explanatory variable. For example,
independent: the monthly income, dependent: the chance of buying iPhone 14.
independent: the number of hours for studying, dependent: the midterm grades of STAT 112.
independent: the amount of alcohol, dependent: the body temperature
Drawing a scatter plot is one of the first things statisticians do when looking at a dataset. Scatter plots can reveal structure that is not readily apparent from summary statistics. They can reveal a lot of information.
The major role of scatter plots lies in revealing association between variables, and not only linear association but any kind of association. They are also useful for detecting outliers and spotting distributional features.
Causal Relationship (Linear or Nonlinear)
Associations
Outliers or groups of outliers
Clusters
Gaps
Example: Consider a data set containing information on over ten thousand athletes at the London Summer Olympics of 2012.
Name | Country | Age | Height | Weight | Sex | DOB | PlaceOB | Gold | Silver | Bronze | Total | Sport | Event |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lamusi A | People’s Republic of China | 23 | 1.70 | 60 | M | 1989-02-06 | NEIMONGGOL (CHN) | 0 | 0 | 0 | 0 | Judo | Men’s -60kg |
A G Kruger | United States of America | 33 | 1.93 | 125 | M | NA | Sheldon (USA) | 0 | 0 | 0 | 0 | Athletics | Men’s Hammer Throw |
Jamale Aarrass | France | 30 | 1.87 | 76 | M | NA | BEZONS (FRA) | 0 | 0 | 0 | 0 | Athletics | Men’s 1500m |
Abdelhak Aatakni | Morocco | 24 | NA | NA | M | 1988-09-02 | AIN SEBAA (MAR) | 0 | 0 | 0 | 0 | Boxing | Men’s Light Welter (64kg) |
Maria Abakumova | Russian Federation | 26 | 1.78 | 85 | F | NA | STAVROPOL REGION (RUS) | 0 | 0 | 0 | 0 | Athletics | Women’s Javelin Throw |
Luc Abalo | France | 27 | 1.82 | 80 | M | 1984-06-09 | | 0 | 0 | 0 | 0 | Handball | Men’s Handball |
Let’s look at the relationship between height and weight of athletes at the London Olympics.
## Warning: Removed 1346 rows containing missing values (geom_point).
Interpretation: The expected relationship between weight and height can be clearly seen, which is an approximately linear positive association, although it is a little obscured by some outliers, which distort the scales. There is evidence of discretization in the height measurements, and the same effect would be visible for the weight measurements but for the outliers. Given the large number of points, there is also a lot of overplotting, and most of the points in the middle of the plot represent more than one case.
The following data set shows the father’s height and his son’s height.
fheight | sheight |
---|---|
65.04851 | 59.77827 |
63.25094 | 63.21404 |
64.95532 | 63.34242 |
65.75250 | 62.79238 |
61.13723 | 64.28113 |
63.02254 | 64.24221 |
What are the independent and dependent variables?
The plot given below displays the relationship between fheight and sheight. Interpret the output.
Independent: Father’s Height, Dependent: Son’s Height
As expected, we can observe a positive relationship between fathers' and sons' heights. Tall fathers tend to have tall sons, and short fathers tend to have short sons. However, some sons are not as short or as tall as their fathers, or they are shorter or taller than their fathers.
Consider the IMDB data set. The following plot shows the association between votes and ratings.
Which one of the following cannot be concluded from this plot?
a. There are no films with lots of votes and a low average rating.
b. For films with more than a small number of votes, the average rating increases with the number of votes.
c. There are many films with lots of votes and an average rating close to the maximum possible.
d. The only films with very high average ratings are films with relatively few votes.
The answer is c. No film with lots of votes has an average rating close to the maximum possible.
Correlation: If you calculate a correlation coefficient, you should always draw a scatter plot to learn what the coefficient might mean. It measures linear association and it is a rare scatter plot where that is all you can see.
The correlation gives you the strength and the direction of the linear relationship on the plot you have.
Do not forget: a zero correlation does not necessarily imply that the variables are unrelated. Why?
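One way to see why is that the correlation coefficient measures only linear association. The sketch below uses made-up data, not data from these notes: `ys_curved` is completely determined by `xs` through \(y = x^2\), yet the correlation is zero because the relationship is symmetric rather than linear. The helper `pearson` is written out by hand for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: x values symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs]      # perfect but nonlinear relationship
ys_linear = [2 * x + 1 for x in xs]   # perfect linear relationship

r_curved = pearson(xs, ys_curved)
r_linear = pearson(xs, ys_linear)

print(f"r for y = x^2:    {r_curved:.3f}")
print(f"r for y = 2x + 1: {r_linear:.3f}")
```

A scatter plot of the first pair would show a clear U shape, which is why the coefficient should always be read together with the plot.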
The dataset is primarily centered around the housing market of Ankara in 2021. It was scraped by Ozancan Ozdemir from a real estate page.
## id fiyat oda_salon_sayisi net_m2 bina_yasi isinma_tipi krediye_uygunluk
## 1 3566 2099000 5 178 12 Kombi Uygun
## 2 2620 535000 4 170 24 Kombi Uygun
## 3 4446 345500 4 135 0 Kombi Uygun
## 4 2447 455000 4 111 5 Kombi Uygun
## 5 3163 790000 4 140 12 Kombi Uygun
## 6 1644 580000 4 115 35 Merkezi Uygun
## bulundugu_kat banyo_sayisi ilce nüfus eğitim okuma_yazma_bilmeyen
## 1 2. Kat 2 Keçiören 938565 Lise 1.43
## 2 2. Kat 1 Keçiören 938565 Lise 1.43
## 3 1. Kat 2 Sincan 549108 Lise 1.44
## 4 2. Kat 1 Çankaya 925828 Lisans 0.70
## 5 8. Kat 2 Keçiören 938565 Lise 1.43
## 6 2. Kat 1 Keçiören 938565 Lise 1.43
ID | price | livroom | area | age | heating | credibility | floor | bathroom | district | population | education | illiterate_population |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3566 | 2099000 | 5 | 178 | 12 | Kombi | Uygun | 2. Kat | 2 | Keçiören | 938565 | Lise | 1.43 |
2620 | 535000 | 4 | 170 | 24 | Kombi | Uygun | 2. Kat | 1 | Keçiören | 938565 | Lise | 1.43 |
4446 | 345500 | 4 | 135 | 0 | Kombi | Uygun | 1. Kat | 2 | Sincan | 549108 | Lise | 1.44 |
2447 | 455000 | 4 | 111 | 5 | Kombi | Uygun | 2. Kat | 1 | Çankaya | 925828 | Lisans | 0.70 |
3163 | 790000 | 4 | 140 | 12 | Kombi | Uygun | 8. Kat | 2 | Keçiören | 938565 | Lise | 1.43 |
1644 | 580000 | 4 | 115 | 35 | Merkezi | Uygun | 2. Kat | 1 | Keçiören | 938565 | Lise | 1.43 |
Interpret the descriptive statistics for house price in Ankara. Can you guess the shape of the distribution?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 850 335000 460000 609462 695000 8850000
The average house price in Ankara is 609462 TL. Half of the prices are below 460000 TL and half are above. The minimum and the maximum prices are 850 TL and 8850000 TL, respectively. The range is quite large, i.e. the variation is high, which suggests the existence of outliers.
Since the median is closer to the first quartile than to the third quartile and the mean is greater than the median, the price may have a right-skewed distribution.
The 40th percentile of price is 409.000 TL. This means that 40% of the house prices in Ankara are at or below 409.000 TL.
Is the given statement true or false?
True.
The following table shows the average housing price for 4 districts in Ankara with their standard deviations (the vari column). Assume that you are an investor who would like to invest in two houses, one expensive and one cheap. Which districts do you prefer? Why?
district | avg | vari |
---|---|---|
Çankaya | 668481.9 | 567281.1 |
Keçiören | 665860.6 | 558948.3 |
Mamak | 289654.8 | 116876.3 |
Sincan | 298055.9 | 124895.6 |
The investor may prefer Keçiören for the expensive house and Mamak for the cheap one, since each has less variation than the district with a comparable average price.
The following table shows the average housing price by number of bathrooms. Please interpret the output briefly, then explain whether the number of bathrooms has an effect on the house price or not.
bathroom | avg_price |
---|---|
1 | 431551.6 |
2 | 819756.5 |
3 | 1250869.0 |
4 | 1856255.3 |
5 | 1899385.3 |
6 | 8850000.0 |
It is seen that the average house price increases as the number of bathrooms increases. Thus, it would be reasonable to say that the number of bathrooms can have an effect on the house price in Ankara.
For what purpose could this box plot be drawn? Can we interpret this output? If there is a problem, please suggest a solution.
The box plot shows that there are extreme outliers in the data, especially around 750000 TL. In this case, we can set these outliers aside and impose some kind of upper limit to explore the distribution of the variable. Over 90% of the prices are less than 1100000 TL, so we will restrict ourselves to them.
Please interpret the following output.
It is seen that the area of the houses has a right-skewed distribution. The area of most of the houses in Ankara is between 105 m2 and 140 m2. Also, there are some outliers around 400 m2 and 600 m2.
What is the association between area and price for houses in Ankara?
The plot shows that there is a positive association between area and price. This means that a larger area is associated with a higher price. However, there are some rare cases at 400 m2 and 600 m2. Moreover, the plot shows a gap in price when the area is between 400 and 600 m2, and overplotting up to 200 m2.
The correlation between age and price is \(-0.589.\) Does this value verify what you see in the previously drawn plot or not?
No. The plot above shows area against price and indicates an obvious positive association, whereas this correlation is negative and concerns age and price. A scatter plot of age against price would be needed to check it.
The analyst would like to examine the distribution of the age of the buildings. Is the given plot appropriate or not?
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Yes, histograms can be used for displaying the distribution of numerical variables.