Descriptive Analysis of Numerical Data

In statistics, the distribution of the data can be expressed as a function that gives the possible values of a variable and how often they occur. The distribution of a categorical variable can be described with frequency tables and visuals such as bar charts. As with categorical variables, the characteristics of numerical variables can be summarized in two ways: numerically and graphically.

A numerical summary of the distribution should report its center and spread or variability.

Numerical Summary of Numerical Data

Measure of Central Tendency

Measures of central tendency are numbers or words that attempt to describe, in the most general terms, the middle or typical value of a distribution. Fundamentally, they are used to describe the center of the data.

Mean

The mean is calculated by dividing the sum of the observed values of the variable by the number of observations.

Example: The following table shows the amount of financial aid given by Ankara Municipality to families for school transportation in December 2021.

District Financial Aid (TL)
MAMAK 119.760
KEÇİÖREN 100.169
ALTINDAĞ 55.685
YENİMAHALLE 55.275
SİNCAN 54.985
KAHRAMANKAZAN 40.100
ÇANKAYA 38.830
ETİMESGUT 29.337
GÖLBAŞI 23.250
AKYURT 21.250
PURSAKLAR 17.925
ÇUBUK 17.060
POLATLI 14.620
ŞEREFLİKOÇHİSAR 11.456
ELMADAĞ 8.280
BEYPAZARI 3.620
AYAŞ 1.775
NALLIHAN 1.500
KALECİK 800
KIZILCAHAMAM 275

\(\bar X = \frac{119.760+\dots+275}{20} = 30.797,625\)

The average amount of financial aid for school transportation in Ankara in December 2021 is 30.797,625 Turkish Liras.

Median

It is the middle data point. Half of the data is below the median and half is above the median.

Example: Consider the financial aid table above. The median financial aid for school transportation in Ankara in December 2021 is 19.587 TL. This means that half of the aid amounts are below 19.587 TL and half are above it.
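The two examples above can be checked with a short sketch. It assumes the amounts exactly as printed in the table, with the Turkish thousands separator dropped (119.760 TL becomes 119760), and uses Python's standard `statistics` module; the variable names are illustrative only. With the printed amounts the mean comes out as 30797.60 TL, marginally below the 30.797,625 TL quoted above, presumably because the printed amounts are rounded.

```python
import statistics

# Financial aid per district (TL), copied from the table above.
# The Turkish thousands separator is dropped: 119.760 TL -> 119760.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

mean_aid = statistics.mean(aid)      # sum of the values divided by their count
median_aid = statistics.median(aid)  # average of the 10th and 11th sorted values

print(f"mean   = {mean_aid:.2f} TL")
print(f"median = {median_aid:.2f} TL")
```

The mean is well above the median here, which is what we expect when a few districts receive much larger amounts than the rest.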

Mode

It is the most frequently occurring value. There may be more than one mode. It is seldom used, but sometimes useful.

Example: A study in an office examined the retirement ages of its former staff. Here are the data for a sample of 11 staff members.

60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.

The mode is 63 since it is the most frequently occurring value (4 times). In other words, the most common retirement age among the former staff was 63.
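As a quick check of this example, the following sketch counts how often each retirement age occurs. It uses `statistics.multimode`, which returns every value tied for the highest frequency, so it also covers the case of more than one mode.

```python
import statistics
from collections import Counter

# Retirement ages of the 11 sampled staff members (from the example above).
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

modes = statistics.multimode(ages)  # list of all most frequent values
counts = Counter(ages)              # frequency of every distinct age

for mode in modes:
    print(f"mode = {mode}, occurs {counts[mode]} times")
```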

The relationship between central tendency and distribution

Right Skewed / Positively Skewed: Mean > Median > Mode
Symmetric / Bell-Shaped: Mode = Median = Mean
Left Skewed / Negatively Skewed: Mode > Median > Mean

  • If the distribution is not skewed, the values of the mode, median, and mean are similar, and any of them can be used to describe the central tendency of the distribution.

  • When a distribution is skewed, the ideal choice is to report both the mean and the median.

Note that the mean is heavily influenced by extreme observations in the data. In such cases, the median is a more reasonable choice for describing the center of the data.
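The sensitivity of the mean to extreme observations can be illustrated with a small sketch. The income figures below are hypothetical and chosen only for illustration; they do not come from any data set in these notes.

```python
import statistics

# Hypothetical monthly incomes (TL); not taken from the notes.
incomes = [4000, 4200, 4500, 4700, 5000]
with_outlier = incomes + [60000]  # add one extreme observation

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean = {mean_before:.1f}, median = {median_before:.1f}")
print(f"with outlier:    mean = {mean_after:.1f}, median = {median_after:.1f}")
```

A single extreme value roughly triples the mean, while the median moves by only 100 TL.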

Exercise 1

Indicate whether the following skewed distributions are positively skewed because the mean exceeds the median or negatively skewed because the median exceeds the mean.

  1. a distribution of test scores on an easy test, with most students scoring high and a few students scoring low

  2. a distribution of ages of college students, with most students in their late teens or early twenties and a few students in their fifties or sixties

  3. a distribution of loose change carried by classmates, with most carrying less than \(\$1\) and with some carrying \(\$3\) or \(\$4\) worth of loose change

  4. a distribution of the sizes of crowds in attendance at a popular movie theater, with most audiences at or near capacity

Answers

  1. Negatively Skewed because the median exceeds the mean

  2. Positively Skewed because the mean exceeds the median

  3. Positively Skewed

  4. Negatively Skewed

Measure of Variability

Averages are important, but they tell only part of the story. Statistics flourishes because we live in a world of variability; no two people are identical, and a few are really far out. When summarizing a set of data, we specify not only measures of central tendency, such as the mean, but also measures of variability, that is, measures of the amount by which scores are dispersed or scattered in a distribution.

Range

The range is the difference between the maximum and minimum values. It is not commonly used, since the maximum or the minimum may be an outlier that distorts the actual spread of the data.

Example: Consider the financial aid table again. The range is \(119.760-275 = 119.485\). The difference between maximum and minimum financial aids for school transportation in Ankara in December 2021 is 119.485 TL.

Variance

The variance is simply the mean of the squared deviations from the mean. Although it has great theoretical properties, it is seldom used as a descriptive measure of variability.

Exercise 2

Why do we not prefer the variance when interpreting the variability in the data?

Answers

Because it is expressed in squared units. Check the formula in the lecture notes.

Standard Deviation

It is the square root of the variance and measures how spread out the data are from the mean. It is never negative and typically not zero. Larger values mean the data are highly variable. Smaller values mean the data are consistent and not as variable.

Example: The standard deviation of the retirement ages in the office is 6.5, while the mean is 61. If the ages are roughly bell-shaped, we can say that about 68% of the staff retired between (61-6.5=54.5) and (61+6.5=67.5) years of age. (mean \(\pm\) one standard deviation, according to the one-sigma part of the empirical rule)
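The figures in this example can be reproduced from the retirement ages listed in the mode example. The sketch below assumes the sample form of the variance (dividing by n - 1), which is the version that gives the 6.5 quoted above; the notes themselves do not say which divisor was used.

```python
import math

# Retirement ages from the mode example.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
n = len(ages)
mean = sum(ages) / n

# Sample variance: squared deviations from the mean, summed, divided by n - 1.
variance = sum((x - mean) ** 2 for x in ages) / (n - 1)  # unit: years squared
sd = math.sqrt(variance)                                  # unit: years

# Interval of one standard deviation around the mean.
low, high = mean - sd, mean + sd
inside = sum(low <= x <= high for x in ages)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {sd:.2f}")
print(f"{inside} of {n} ages lie between {low:.1f} and {high:.1f}")
```

In this small sample 9 of the 11 ages fall inside the interval, more than 68%, which is a reminder that the empirical rule is only an approximation for bell-shaped data.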

Exercise 3

Employees of Corporation A earn annual salaries described by a mean of \(\$90,000\) and a standard deviation of \(\$10,000.\)

  1. The majority of all salaries fall between what two values?

  2. A small minority of all salaries are less than what value?

  3. A small minority of all salaries are more than what value?

Answers

  1. \(\$80,000\) to \(\$100,000\)
  2. \(\$70,000\)
  3. \(\$110,000\)

Exercise 4

Assume that the distribution of IQ scores for all college students has a mean of 120, with a standard deviation of 15. These two bits of information imply which of the following?

  1. All students have an IQ of either 105 or 135 because everybody in the distribution is either one standard deviation above or below the mean. True or false?
  2. All students score between 105 and 135 because everybody is within one standard deviation on either side of the mean. True or false?
  3. On the average, students deviate approximately 15 points on either side of the mean. True or false?
  4. Some students deviate more than one standard deviation above or below the mean. True or false?
  5. All students deviate more than one standard deviation above or below the mean. True or false?
  6. Scott’s IQ score of 150 deviates two standard deviations above the mean. True or false?

Answers

  1. False. Relatively few students will score exactly one standard deviation from the mean.
  2. False. Students will score both within and beyond one standard deviation from the mean.
  3. True
  4. True
  5. False. See 2.
  6. True

Percentile

The percentile of a data point is the percentage of the data that is equal to or less than that point. It is useful for describing the relative position of a data point within a data set. If the percentile is close to 100, then the observation is one of the largest. If it is close to zero, then the observation is one of the smallest.

Example: The 30th percentile of the financial aid for school transportation in Ankara in December 2021 is 10503.2. This means that 30% of the aid amounts are 10503.2 TL or below.

Quartile and Interquartile Range

Quartiles are specific percentile values. The first quartile (Q1 or lower quartile) is the 25th percentile of the data, the second quartile (Q2) is the median, which is the 50th percentile, and the third quartile (Q3 or upper quartile) is the 75th percentile. The maximum of the data, which is the 100th percentile, is sometimes counted as a fourth quartile.

Quartiles are used to describe the spread of the middle half of the data through the interquartile range (IQR), which is the difference between Q3 and Q1.

  • When you use the mean to indicate the center of the distribution, describe its spread by giving the standard deviation.

  • When you use the median to indicate the center of the distribution, describe its spread by giving the IQR.
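The percentile and quartile definitions above can be tried on the financial aid table. This is a sketch under one assumption: it uses the "inclusive" interpolation method of `statistics.quantiles`, chosen because it reproduces the 30th percentile of 10503.2 quoted earlier. Other interpolation conventions exist and give slightly different values on small data sets.

```python
import statistics

# Financial aid per district (TL), thousands separators dropped.
aid = [119760, 100169, 55685, 55275, 54985, 40100, 38830, 29337, 23250, 21250,
       17925, 17060, 14620, 11456, 8280, 3620, 1775, 1500, 800, 275]

# n=100 returns the 99 cut points between percentiles; index 29 is the 30th.
percentiles = statistics.quantiles(aid, n=100, method="inclusive")
p30 = percentiles[29]

# n=4 returns the three quartiles Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(aid, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle half of the data

print(f"30th percentile = {p30}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```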

Graphical Summary of Numerical Data

Many different graphics are available for illustrating the distribution of numerical data, and they all have the same aim: to display the important features of the data. Make sure that you get plenty of practice in interpreting whichever kinds of plots you use.

Graphics are the appropriate tools for displaying the features that make up the shape of a data distribution. They can provide more, and different kinds of, information than a set of summary statistics.

Obviously, it is best to use both approaches.

What features might continuous variables have?

  • Asymmetry: The distribution is skewed to the left or right.

  • Outliers: There are one or more values that are far from the rest of the data.

  • Multimodality: The distribution has more than one peak.

  • Gaps: There are ranges of values within the data where no cases are recorded.

  • Heaping: Some values occur unexpectedly often.

These features can be identified using

  • Box Plots

  • Histograms

  • Dot Plots

  • Line Plot

  • Rug Plots

  • Density or Distribution Estimation

  • QQ-Plot

Box Plot (1 numeric, 1 numeric + 1 cat. (+2 levels))

Tukey’s Five Number Summary

  • Minimum

  • Q1

  • Median (Q2)

  • Q3

  • Maximum

A convenient way to describe both the center and spread of a data set is to give the median to measure center and the quartiles and the smallest and largest individual observations to show the spread. This five-number summary of a distribution leads to a new graph, the box plot.

https://www.simplypsychology.org/boxplot-outliers.png

Box plots are among the best tools for identifying outliers and for comparing distributions across groups. They are also used for describing the shape of a variable's distribution.

Example: The following data set contains information on 28819 movies and their user ratings from IMDB.com.

The following figure shows the box plot and the five-number summary of the average rating data.

The plot shows that the average ratings of the movies have an approximately left-skewed distribution. The median rating is between 5 and 7. The data have outliers below 2.5 points.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.100   5.933   7.000  10.000

The average of the average ratings of the movies is 5.93.
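A box plot usually marks outliers with a fence rule. The sketch below applies the common convention of 1.5 times the IQR beyond the quartiles to the quartiles in the summary above. The notes do not state which rule the plotted figure used, so the 1.5 multiplier is an assumption, and `is_outlier` is a hypothetical helper name.

```python
# Quartiles taken from the summary output above.
q1, q3 = 5.0, 7.0
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(rating):
    """Return True if a rating lies outside the fences."""
    return rating < lower_fence or rating > upper_fence


print(f"fences: {lower_fence} to {upper_fence}")
print("rating 1.0 is an outlier:", is_outlier(1.0))
print("rating 9.5 is an outlier:", is_outlier(9.5))
```

Under this convention only very low ratings are flagged, because the upper fence coincides with the maximum possible rating of 10. This agrees with the outliers appearing at the low end of the plot.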

What about the ratings by movie type?

Interpretation: Documentaries have the highest median rating, while action movies have the lowest. Dramas and romances have the same median rating. All movie types except romance have outlier ratings. We can also say that the subgroups appear to have approximately negatively skewed distributions.

Histogram (1 numeric var.)

It is a visual tool used for displaying the distribution of a single quantitative variable, especially when the sample size is large. It looks like a bar chart, but there are some important differences.

In a histogram, the number of bins is the key component. Luckily, natural bin widths based on the context are usually a good choice. They are good at emphasizing features of the raw data.
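To see why the choice of bins matters, the sketch below counts the same values with two different bin widths. The ratings are hypothetical and `bin_counts` is an illustrative helper, not part of any plotting library.

```python
import math
from collections import Counter

# Hypothetical ratings, for illustration only.
ratings = [5.1, 6.3, 6.8, 7.0, 7.4, 5.9, 6.1, 2.2, 8.5, 6.6, 7.9, 4.8]


def bin_counts(values, binwidth):
    """Count values per bin [k * binwidth, (k + 1) * binwidth), keyed by left edge."""
    counts = Counter(math.floor(v / binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))


for width in (1, 2.5):
    print(f"binwidth = {width}: {bin_counts(ratings, width)}")
```

With a width of 1 the peak between 6 and 7 is visible; with a width of 2.5 most values collapse into a single bin and that detail is lost.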

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The ratings are spread between 0 and 10, and most of them are between 5 and 7.5. Although the distribution is roughly left skewed, it can be described as multimodal, since the histogram has more than one peak. Outliers cannot be reliably detected without a box plot of the variable.

Exercise 5

Consider the movie data again. The following figure displays the histogram of the length of the movies. What is wrong with the output? Which tools can you suggest to solve the problem?

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answers

The histogram is hopeless, except that it implies there must be at least one very high value to the extreme right, even if the resolution of the plot is not good enough to see it.

The boxplot does what boxplots do best, telling us something more about the outliers.

Clearly, we have extreme outliers, and they should be set aside first. After that, you can talk about the distribution of the length variable.

Line Plot

Line plots are generally used for visualizing how one continuous variable, on the y-axis, changes in relation to another continuous variable, on the x-axis. The variable on the x-axis generally denotes time.

Interpretation: Between 1890 and 1920, the average ratings of the movies show a sharp increasing trend, followed by a decreasing trend until the beginning of the 2000s. After that, the average ratings show an increasing pattern again.

Exercise 6

The following chart shows the chocolate price in Turkey between 2008 and April 2022. How do you interpret this?

## Warning: Removed 36 row(s) containing missing values (geom_path).
## Warning: Removed 36 rows containing missing values (geom_point).

Answers

The chocolate price in Turkey increases over time, with an especially sharp increase after 2020.

Displaying Multiple Numerical Variables

In the previous part, we mainly discussed plots used for displaying a single numerical variable. However, there are also visual tools for plotting more than one numerical variable.

Scatter Plot

The scatter plot is used to display the relationship between two numerical variables. It works best for two continuous variables, but can also be used for one discrete and one continuous variable.

In a scatter plot, the y-axis denotes the dependent or target variable, while the x-axis denotes the independent or explanatory variable. For example,

  • independent: the monthly income, dependent: the chance of buying iPhone 14.

  • independent: the number of hours for studying, dependent: the midterm grades of STAT 112.

  • independent: the amount of alcohol, dependent: the body temperature

Drawing a scatter plot is one of the first things statisticians do when looking at a data set. Scatter plots can reveal structure that is not readily apparent from summary statistics. They can reveal a lot of information.

The major role of scatter plots lies in revealing associations between variables: not only linear association, but any kind of association. They are also useful for detecting outliers and spotting distributional features.

What features might be visible in scatter plots?

  • Causal Relationship (Linear or Nonlinear)

  • Associations

  • Outliers or groups of outliers

  • Clusters

  • Gaps

Example: Consider a data set containing information on over ten thousand athletes at the London Summer Olympics of 2012.

Name Country Age Height Weight Sex DOB PlaceOB Gold Silver Bronze Total Sport Event
Lamusi A People’s Republic of China 23 1.70 60 M 1989-02-06 NEIMONGGOL (CHN) 0 0 0 0 Judo Men’s -60kg
A G Kruger United States of America 33 1.93 125 M NA Sheldon (USA) 0 0 0 0 Athletics Men’s Hammer Throw
Jamale Aarrass France 30 1.87 76 M NA BEZONS (FRA) 0 0 0 0 Athletics Men’s 1500m
Abdelhak Aatakni Morocco 24 NA NA M 1988-09-02 AIN SEBAA (MAR) 0 0 0 0 Boxing Men’s Light Welter (64kg)
Maria Abakumova Russian Federation 26 1.78 85 F NA STAVROPOL REGION (RUS) 0 0 0 0 Athletics Women’s Javelin Throw
Luc Abalo France 27 1.82 80 M 1984-06-09 0 0 0 0 Handball Men’s Handball

Let’s look at the relationship between height and weight of athletes at the London Olympics.

## Warning: Removed 1346 rows containing missing values (geom_point).

Interpretation: The expected relationship between weight and height can be clearly seen: an approximately linear positive association, although it is a little obscured by some outliers, which distort the scales. There is evidence of discretization in the height measurements, and the same effect would be visible for the weight measurements were it not for the outliers. Given the large number of points, there is also a lot of overplotting, and most of the points in the middle of the plot represent more than one case.

Exercise 7

The following data set shows the father’s height and his son’s height.

fheight sheight
65.04851 59.77827
63.25094 63.21404
64.95532 63.34242
65.75250 62.79238
61.13723 64.28113
63.02254 64.24221
  1. What are the independent and dependent variables?

  2. The plot given below displays the relationship between fheight and sheight. Interpret the output.

Answers

  1. Independent: Father’s Height, Dependent: Son’s Height

  2. As expected, we can observe a positive relationship between fathers' and sons' heights. Tall fathers tend to have tall sons, and short fathers tend to have short sons. However, some sons are not as short or as tall as their fathers, and some are shorter or taller than their fathers.

Exercise 8

Consider the IMDB data set. The following plot shows the association between votes and ratings.

Which one of the following cannot be concluded from this plot?

  1. There are no films with lots of votes and a low average rating.

  2. For films with more than a small number of votes, the average rating increases with the number of votes.

  3. Many films with lots of votes have an average rating close to the maximum possible.

  4. The only films with very high average ratings are films with relatively few votes.

Answers

The answer is 3. No film with lots of votes has an average rating close to the maximum possible.

Correlation: If you calculate a correlation coefficient, you should always draw a scatter plot to learn what the coefficient might mean. The coefficient measures only linear association, and it is a rare scatter plot where that is all you can see.

The correlation gives you the strength and the direction of the linear relationship on the plot you have.

Do not forget: a zero correlation does not necessarily imply that the variables are unrelated. Why?
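One way to see the point is with a relationship that is exact but not linear. The sketch below computes the Pearson correlation by hand for y = x² on values placed symmetrically around zero; the data are made up for illustration and `pearson` is an illustrative helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys_quadratic = [x ** 2 for x in xs]   # y is completely determined by x
ys_linear = [2 * x + 1 for x in xs]   # perfect linear relationship

r_quadratic = pearson(xs, ys_quadratic)
r_linear = pearson(xs, ys_linear)

print(f"correlation for y = x^2:    {r_quadratic:.3f}")
print(f"correlation for y = 2x + 1: {r_linear:.3f}")
```

Here y depends entirely on x, yet the correlation is zero, because the increasing and decreasing halves of the curve cancel out. Only a scatter plot would reveal the relationship.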

Exercise 9: Case Study

The dataset is primarily centered around the housing market of Ankara in 2021. It was scraped by Ozancan Ozdemir from a real estate page.

##     id   fiyat oda_salon_sayisi net_m2 bina_yasi isinma_tipi krediye_uygunluk
## 1 3566 2099000                5    178        12       Kombi            Uygun
## 2 2620  535000                4    170        24       Kombi            Uygun
## 3 4446  345500                4    135         0       Kombi            Uygun
## 4 2447  455000                4    111         5       Kombi            Uygun
## 5 3163  790000                4    140        12       Kombi            Uygun
## 6 1644  580000                4    115        35     Merkezi            Uygun
##   bulundugu_kat banyo_sayisi     ilce  nüfus eğitim okuma_yazma_bilmeyen
## 1        2. Kat            2 Keçiören 938565   Lise                 1.43
## 2        2. Kat            1 Keçiören 938565   Lise                 1.43
## 3        1. Kat            2   Sincan 549108   Lise                 1.44
## 4        2. Kat            1  Çankaya 925828 Lisans                 0.70
## 5        8. Kat            2 Keçiören 938565   Lise                 1.43
## 6        2. Kat            1 Keçiören 938565   Lise                 1.43
ID price livroom area age heating credibility floor bathroom district population education illiterate_population
3566 2099000 5 178 12 Kombi Uygun 2. Kat 2 Keçiören 938565 Lise 1.43
2620 535000 4 170 24 Kombi Uygun 2. Kat 1 Keçiören 938565 Lise 1.43
4446 345500 4 135 0 Kombi Uygun 1. Kat 2 Sincan 549108 Lise 1.44
2447 455000 4 111 5 Kombi Uygun 2. Kat 1 Çankaya 925828 Lisans 0.70
3163 790000 4 140 12 Kombi Uygun 8. Kat 2 Keçiören 938565 Lise 1.43
1644 580000 4 115 35 Merkezi Uygun 2. Kat 1 Keçiören 938565 Lise 1.43

9.1

Interpret the descriptive statistics for house price in Ankara. Can you guess the shape of the distribution?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     850  335000  460000  609462  695000 8850000

Answers

The average house price in Ankara is 609462 TL. Half of the prices are below 460000 TL and half are above it. The minimum and maximum prices are 850 TL and 8850000 TL, respectively. The range is quite large, i.e. the variation is high, which suggests the existence of outliers.

Since the median is closer to the first quartile than to the third and the mean is greater than the median, the price may have a right-skewed distribution.

9.2

The 40th percentile of price is 409.000 TL. This means that 40% of the house prices in Ankara are below 409.000 TL.

Is the given statement true or false?

Answers

True.

9.3

The following table shows the average housing price for 4 districts in Ankara with their standard deviations. Assume that you are an investor who would like to invest in two houses, one expensive and one cheap. Which districts would you prefer? Why?

district avg vari
Çankaya 668481.9 567281.1
Keçiören 665860.6 558948.3
Mamak 289654.8 116876.3
Sincan 298055.9 124895.6

Answers

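The investor may prefer Keçiören for the expensive house and Mamak for the cheap one, since each has the smaller standard deviation, i.e. less variation, within its price group.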

9.4

The following table shows the average housing price by number of bathrooms. Please interpret the output briefly, then explain whether the number of bathrooms is an influential factor in the house price or not.

bathroom avg_price
1 431551.6
2 819756.5
3 1250869.0
4 1856255.3
5 1899385.3
6 8850000.0

Answers

The average house price increases as the number of bathrooms increases. Thus, it would be reasonable to say that the number of bathrooms can be an influential factor in the house price in Ankara.

9.5

For what purpose could this box plot be drawn? Can we interpret this output? If you see a problem, please suggest a solution.

Answers

The box plot shows that there are extreme outliers in the data, especially around 750000 TL. In this case, we can set these outliers aside and impose an upper limit in order to explore the distribution of the variable. Over 90% of the prices are less than 1100000 TL, so we will restrict ourselves to them.

9.6

Please interpret the following output.

Answers

The area of the houses has a right-skewed distribution. The area of most of the houses in Ankara is between 105 m2 and 140 m2. Also, there are some outliers around 400 m2 and 600 m2.

9.7

What is the association between area and price for houses in Ankara?

Answers

The plot shows that there is a positive association between area and price. This means that a larger area tends to come with a higher price. However, there are some rare cases occurring for 400 m2 and 600 m2 houses. Moreover, the plot shows a gap in price when the area is between 400 and 600 m2, and overplotting up to 200 m2.

9.8

The correlation between age and price is \(-0.589\). Does this value verify what you see in the previously drawn plot?

Answers

Yes. The negative coefficient indicates a moderate negative linear association between age and price, so it agrees with a plot in which price tends to fall as building age rises.

9.9

The analyst would like to examine the distribution of the age of the buildings. Is the given plot appropriate or not?

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answers

Yes, histograms can be used for displaying the distribution of the numerical variables.