17 Generating descriptive statistics

Exploratory Data Analysis, often abbreviated as EDA, is usually the first step in any data analytics task. It is used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1960s, EDA techniques continue to be a widely used method in the data discovery process today.

17.1 Using base R

Base R provides two functions for inspecting the structure and summary statistics of a data frame. The first is str, short for structure (not to be confused with string), which, as its full name suggests, shows the structure of the data. Its usage is simple:

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

As can be seen, it gives the number of observations (rows) and variables (columns) in the data. It then lists the names of all the columns along with their types, followed by the first few values of each column. For factor columns it also shows the available levels.
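The individual pieces of information printed by str() can also be retrieved one at a time, which is handy when they are needed programmatically. The following is a small sketch using only base R functions on the built-in iris data; the object names are chosen here for illustration.

```r
# Dimensions: number of rows (observations) and columns (variables)
iris_dim <- dim(iris)
iris_dim

# Class of each column, returned as a named character vector
col_types <- sapply(iris, class)
col_types

# Levels of the only factor column
species_levels <- levels(iris$Species)
species_levels

# First few rows, similar to the value preview printed by str()
head(iris, 3)
```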

Another function from base R is summary which can be used to generate some summary statistics from the given data frame. Let’s see what we can get from this function.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

We can see that it gives the five-number summary for every numeric variable and the count of each level for factor variables. In addition to the five-number summary, i.e. (1) minimum, (2) first quartile, (3) median, (4) third quartile and (5) maximum, we also get the arithmetic mean of each numeric variable.
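The same numbers can be obtained for a single vector. A minimal sketch with base R follows: quantile() returns the quartiles that summary() reports, while fivenum() returns Tukey's five-number summary, whose hinges may differ slightly from the quartiles for some sample sizes.

```r
x <- iris$Sepal.Length

# Minimum, first quartile, median, third quartile and maximum
q <- quantile(x)
q

# Tukey's five-number summary (based on hinges rather than quantiles)
f <- fivenum(x)
f

# Spread measures derived from these values
iqr_x <- IQR(x)            # interquartile range: third minus first quartile
range_x <- diff(range(x))  # maximum minus minimum
c(IQR = iqr_x, range = range_x)
```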

Before moving forward, recall the table() function, which is used to generate counts of the values of factor/character variable(s) in base R.

with(iris, table(Species))

## Species
##     setosa versicolor  virginica 
##         50         50         50
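table() also accepts more than one variable, and its result can be passed to prop.table() to obtain proportions. A brief sketch follows; the built-in mtcars data is used for the two-way case only as an illustration and is not part of the example above.

```r
# One-way counts converted to proportions
species_counts <- table(iris$Species)
species_props <- prop.table(species_counts)
species_props

# Two-way contingency table: cylinders (rows) by gears (columns)
cyl_gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)
cyl_gear

# The same table with row and column totals added
addmargins(cyl_gear)
```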

17.2 dplyr functions

To calculate other statistics we can use dplyr::summarise in combination with across. For example, to calculate the mean, standard deviation and variance of all numeric variables in the iris data, we can do:

library(dplyr)
iris %>%
  summarise(across(where(is.numeric),
                   .fns = list(
                     Mean = ~ mean(.),
                     SD = ~ sd(.),
                     Var = ~ var(.)
                   )))

##   Sepal.Length_Mean Sepal.Length_SD Sepal.Length_Var Sepal.Width_Mean
## 1          5.843333       0.8280661        0.6856935         3.057333
##   Sepal.Width_SD Sepal.Width_Var Petal.Length_Mean Petal.Length_SD
## 1      0.4358663       0.1899794             3.758        1.765298
##   Petal.Length_Var Petal.Width_Mean Petal.Width_SD Petal.Width_Var
## 1         3.116278         1.199333      0.7622377       0.5810063

Before interpreting the output, let us look at dplyr::across. It is used inside dplyr verbs, mostly mutate or summarise, to transform or summarise multiple variables (columns) at once. It needs at least two arguments. The first selects the columns, either by name with selection helpers such as starts_with() or by a predicate function wrapped in where(). The second is either a single function or a named list of functions. In the example above, where(is.numeric) applies is.numeric to each column (to the column's contents, not its name) and keeps the columns for which it returns TRUE; the second argument is a named list of three functions written in lambda-style notation. Each output column is named by joining the input column name and the function name with an underscore. With 4 numeric columns and 3 aggregating functions, we get 12 columns in the output.
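Two further ways of calling across() are sketched below, under the assumption that dplyr version 1.0.0 or later is installed. The first selects columns by name with the starts_with() helper and applies a single function. The second combines across() with group_by() and uses the .names argument to control the output column names. The object names are illustrative.

```r
library(dplyr)

# Select columns by name with a selection helper and apply one function
petal_means <- iris %>%
  summarise(across(starts_with("Petal"), mean))
petal_means

# One row of summaries per group; .names sets the output column labels
species_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean, .names = "mean_{.col}"))
species_means
```

Grouping first and summarising afterwards is usually the quickest way to compare the same statistic across the levels of a factor.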

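We can further reshape the result using tidyr::pivot_longer. The special name ".value" in names_to tells pivot_longer that the corresponding part of each column name should become an output column name rather than a value. Depending on where ".value" is placed, we get either one row per function or one row per variable. See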

library(tidyr)
iris %>%
  summarise(across(where(is.numeric),
                   .fns = list(
                     Mean = ~ mean(.),
                     SD = ~ sd(.),
                     Var = ~ var(.)
                   ))) %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c(".value", "Function"))

## # A tibble: 3 × 5
##   Function Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <chr>           <dbl>       <dbl>        <dbl>       <dbl>
## 1 Mean            5.84        3.06          3.76       1.20 
## 2 SD              0.828       0.436         1.77       0.762
## 3 Var             0.686       0.190         3.12       0.581

iris %>%
  summarise(across(where(is.numeric),
                   .fns = list(
                     Mean = ~ mean(.),
                     SD = ~ sd(.),
                     Var = ~ var(.)
                   ))) %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("Variable", ".value"))

## # A tibble: 4 × 4
##   Variable      Mean    SD   Var
##   <chr>        <dbl> <dbl> <dbl>
## 1 Sepal.Length  5.84 0.828 0.686
## 2 Sepal.Width   3.06 0.436 0.190
## 3 Petal.Length  3.76 1.77  3.12 
## 4 Petal.Width   1.20 0.762 0.581
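Both calls rely on names_sep = "_", which works here because none of the original column names contains an underscore. If they did, one option is names_pattern with a regular expression that splits at the last underscore only. This is a sketch under that assumption; the renamed column Sepal_Length is hypothetical and introduced purely to demonstrate the situation.

```r
library(dplyr)
library(tidyr)

long_stats <- iris %>%
  rename(Sepal_Length = Sepal.Length) %>%   # hypothetical name containing "_"
  summarise(across(where(is.numeric),
                   .fns = list(Mean = ~ mean(.), SD = ~ sd(.)))) %>%
  pivot_longer(everything(),
               # the greedy first group makes the split happen at the LAST "_"
               names_pattern = "^(.*)_([^_]+)$",
               names_to = c("Variable", ".value"))
long_stats
```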

Let us also discuss one more dplyr function for inspecting data, glimpse. It is basically a pipe-friendly version of str(). See

iris %>% 
  glimpse()

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

To calculate counts of a factor variable (as generated by table in base R), we can use dplyr::count, a pipe-friendly function.

iris %>% 
  count(Species)

##      Species  n
## 1     setosa 50
## 2 versicolor 50
## 3  virginica 50

We can also generate counts for combinations of multiple variables:

ggplot2::diamonds %>% 
  count(cut, color, name = "count")

## # A tibble: 35 × 3
##    cut   color count
##    <ord> <ord> <int>
##  1 Fair  D       163
##  2 Fair  E       224
##  3 Fair  F       312
##  4 Fair  G       314
##  5 Fair  H       303
##  6 Fair  I       175
##  7 Fair  J       119
##  8 Good  D       662
##  9 Good  E       933
## 10 Good  F       909
## # ℹ 25 more rows
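count() can also order its result and be followed by mutate() to turn counts into proportions. The sketch below uses the built-in mtcars data so that it runs without ggplot2; the choice of data set and the column name prop are illustrative assumptions.

```r
library(dplyr)

combo_counts <- mtcars %>%
  count(cyl, gear, sort = TRUE, name = "count") %>%  # largest groups first
  mutate(prop = count / sum(count))                  # share of all rows
combo_counts
```

Combinations that never occur in the data do not appear in the result, so the number of rows can be smaller than the product of the numbers of distinct values.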

17.3 Using `psych`

There are indeed some beautiful packages in R which create EDA summaries for us without much ado. Package psych is one of these.

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describe(USArrests)

##          vars  n   mean    sd median trimmed    mad  min   max range  skew
## Murder      1 50   7.79  4.36   7.25    7.53   5.41  0.8  17.4  16.6  0.37
## Assault     2 50 170.76 83.34 159.00  168.48 110.45 45.0 337.0 292.0  0.22
## UrbanPop    3 50  65.54 14.47  66.00   65.88  17.79 32.0  91.0  59.0 -0.21
## Rape        4 50  21.23  9.37  20.10   20.36   8.60  7.3  46.0  38.7  0.75
##          kurtosis    se
## Murder      -0.95  0.62
## Assault     -1.15 11.79
## UrbanPop    -0.87  2.05
## Rape         0.08  1.32

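Note that the output is in data.frame format, ready to use. Another function in psych is describeBy, which creates grouped summaries. In the output below, the ordered factors cut, color and clarity are summarised through their integer codes, which is why the groups are labelled 1 to 5 rather than Fair to Ideal.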

describeBy(ggplot2::diamonds, group = "cut")

## 
##  Descriptive statistics by group 
## cut: 1
##         vars    n    mean      sd  median trimmed     mad    min      max
## carat      1 1610    1.05    0.52    1.00    0.98    0.43   0.22     5.01
## cut        2 1610    1.00    0.00    1.00    1.00    0.00   1.00     1.00
## color      3 1610    3.85    1.71    4.00    3.85    1.48   1.00     7.00
## clarity    4 1610    3.02    1.45    3.00    2.93    1.48   1.00     8.00
## depth      5 1610   64.04    3.64   65.00   64.48    1.33  43.00    79.00
## table      6 1610   59.05    3.95   58.00   58.64    2.97  49.00    95.00
## price      7 1610 4358.76 3560.39 3282.00 3695.65 2183.13 337.00 18574.00
## x          8 1610    6.25    0.96    6.18    6.21    0.81   0.00    10.74
## y          9 1610    6.18    0.96    6.10    6.14    0.79   0.00    10.54
## z         10 1610    3.98    0.65    3.97    3.95    0.52   0.00     6.98
##            range  skew kurtosis    se
## carat       4.79  1.68     5.31  0.01
## cut         0.00   NaN      NaN  0.00
## color       6.00  0.06    -0.86  0.04
## clarity     7.00  0.68     0.14  0.04
## depth      36.00 -1.17     2.20  0.09
## table      46.00  1.34     4.83  0.10
## price   18237.00  1.78     3.07 88.73
## x          10.74  0.36     1.58  0.02
## y          10.54  0.36     1.53  0.02
## z           6.98  0.34     1.43  0.02
## ------------------------------------------------------------ 
## cut: 2
##         vars    n    mean      sd  median trimmed     mad    min      max
## carat      1 4906    0.85    0.45    0.82    0.80    0.43   0.23     3.01
## cut        2 4906    2.00    0.00    2.00    2.00    0.00   2.00     2.00
## color      3 4906    3.57    1.76    3.00    3.51    1.48   1.00     7.00
## clarity    4 4906    3.60    1.47    3.00    3.44    1.48   1.00     8.00
## depth      5 4906   62.37    2.17   63.40   62.70    0.74  54.30    67.00
## table      6 4906   58.69    2.85   58.00   58.57    2.97  51.00    66.00
## price      7 4906 3928.86 3681.59 3050.50 3251.51 2853.26 327.00 18788.00
## x          8 4906    5.84    1.06    5.98    5.80    1.10   0.00     9.44
## y          9 4906    5.85    1.05    5.99    5.82    1.08   0.00     9.38
## z         10 4906    3.64    0.65    3.70    3.62    0.68   0.00     5.79
##            range  skew kurtosis    se
## carat       2.78  1.03     1.22  0.01
## cut         0.00   NaN      NaN  0.00
## color       6.00  0.25    -0.93  0.03
## clarity     7.00  0.81     0.29  0.02
## depth      12.70 -1.20     0.17  0.03
## table      15.00  0.31    -0.64  0.04
## price   18461.00  1.72     3.04 52.56
## x           9.44  0.15    -0.15  0.02
## y           9.38  0.14    -0.17  0.02
## z           5.79  0.09     0.12  0.01
## ------------------------------------------------------------ 
## cut: 3
##         vars     n    mean      sd  median trimmed     mad   min      max
## carat      1 12082    0.81    0.46    0.71    0.75    0.46   0.2     4.00
## cut        2 12082    3.00    0.00    3.00    3.00    0.00   3.0     3.00
## color      3 12082    3.57    1.72    3.00    3.51    1.48   1.0     7.00
## clarity    4 12082    4.00    1.59    4.00    3.87    1.48   1.0     8.00
## depth      5 12082   61.82    1.38   62.10   61.95    1.48  56.8    64.90
## table      6 12082   57.96    2.12   58.00   57.88    1.48  44.0    66.00
## price      7 12082 3981.76 3935.86 2648.00 3243.22 2855.49 336.0 18818.00
## x          8 12082    5.74    1.10    5.74    5.69    1.25   0.0    10.01
## y          9 12082    5.77    1.10    5.77    5.72    1.25   0.0     9.94
## z         10 12082    3.56    0.73    3.56    3.53    0.76   0.0    31.80
##            range  skew kurtosis    se
## carat       3.80  0.99     0.89  0.00
## cut         0.00   NaN      NaN  0.00
## color       6.00  0.25    -0.89  0.02
## clarity     7.00  0.57    -0.43  0.01
## depth       8.10 -0.71    -0.30  0.01
## table      22.00  0.28     0.04  0.02
## price   18482.00  1.60     2.24 35.81
## x          10.01  0.23    -0.65  0.01
## y           9.94  0.23    -0.66  0.01
## z          31.80  4.96   183.94  0.01
## ------------------------------------------------------------ 
## cut: 4
##         vars     n    mean      sd  median trimmed     mad   min      max
## carat      1 13791    0.89    0.52    0.86    0.83    0.56   0.2     4.01
## cut        2 13791    4.00    0.00    4.00    4.00    0.00   4.0     4.00
## color      3 13791    3.70    1.71    4.00    3.67    1.48   1.0     7.00
## clarity    4 13791    3.74    1.50    4.00    3.60    1.48   1.0     8.00
## depth      5 13791   61.26    1.16   61.40   61.36    1.19  58.0    63.00
## table      6 13791   58.75    1.48   59.00   58.77    1.48  51.0    62.00
## price      7 13791 4584.26 4349.20 3185.00 3822.23 3371.43 326.0 18823.00
## x          8 13791    5.97    1.19    6.11    5.92    1.42   0.0    10.14
## y          9 13791    5.94    1.26    6.06    5.89    1.41   0.0    58.90
## z         10 13791    3.65    0.73    3.72    3.62    0.86   0.0     8.06
##            range  skew kurtosis    se
## carat       3.81  0.86     0.43  0.00
## cut         0.00   NaN      NaN  0.00
## color       6.00  0.12    -0.88  0.01
## clarity     7.00  0.69     0.06  0.01
## depth       5.00 -0.61    -0.37  0.01
## table      11.00 -0.37     1.33  0.01
## price   18497.00  1.33     1.07 37.03
## x          10.14  0.17    -0.85  0.01
## y          58.90  5.53   225.05  0.01
## z           8.06  0.11    -0.44  0.01
## ------------------------------------------------------------ 
## cut: 5
##         vars     n    mean      sd  median trimmed     mad   min      max
## carat      1 21551    0.70    0.43    0.54    0.64    0.33   0.2     3.50
## cut        2 21551    5.00    0.00    5.00    5.00    0.00   5.0     5.00
## color      3 21551    3.53    1.66    4.00    3.48    1.48   1.0     7.00
## clarity    4 21551    4.46    1.71    4.00    4.39    1.48   1.0     8.00
## depth      5 21551   61.71    0.72   61.80   61.76    0.59  43.0    66.70
## table      6 21551   55.95    1.25   56.00   55.97    1.48  43.0    63.00
## price      7 21551 3457.54 3808.40 1810.00 2656.14 1630.86 326.0 18806.00
## x          8 21551    5.51    1.06    5.25    5.41    1.19   0.0     9.65
## y          9 21551    5.52    1.07    5.26    5.42    1.19   0.0    31.80
## z         10 21551    3.40    0.66    3.23    3.34    0.73   0.0     6.03
##            range  skew kurtosis    se
## carat       3.30  1.34     1.63  0.00
## cut         0.00   NaN      NaN  0.00
## color       6.00  0.19    -0.82  0.01
## clarity     7.00  0.36    -0.71  0.01
## depth      23.70 -1.44    22.33  0.00
## table      20.00  0.20     1.70  0.01
## price   18480.00  1.84     2.98 25.94
## x           9.65  0.66    -0.42  0.01
## y          31.80  1.30    15.99  0.01
## z           6.03  0.65    -0.36  0.00
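The grouped output above is printed as one table per group, which gets long quickly. describeBy() also has a mat argument that, when set to TRUE, returns all groups in a single data frame. The following sketch uses the iris data to keep the result small; the object name grouped is illustrative.

```r
library(psych)

# mat = TRUE returns one data frame instead of a list of per-group tables
grouped <- describeBy(iris[, 1:4], group = iris$Species, mat = TRUE)

# Keep a few columns; group1 holds the level of the grouping variable
grouped[, c("group1", "n", "mean", "sd", "median")]
```

A single data frame like this is easier to filter, sort or export than the printed list.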

There is one more function in this package, describeData, which also shows the first and last four (by default) values of each variable.

describeData(ggplot2::diamonds)

## n.obs =  53940 of which  53940   are complete cases.   Number of variables =  10  of which all are numeric  TRUE  
##          variable # n.obs type    H1      H2   H3      H4   T1        T2
## carat*            1 53940    4  0.23    0.21 0.23    0.29 0.72      0.70
## cut*              2 53940    4 Ideal Premium Good Premium Good Very Good
## color*            3 53940    4     E       E    E       I    D         D
## clarity*          4 53940    4   SI2     SI1  VS1     VS2  SI1       SI1
## depth*            5 53940    4  61.5    59.8 56.9    62.4 63.1      62.8
## table*            6 53940    4    55      61   65      58   55        60
## price*            7 53940    4   326     326  327     334 2757      2757
## x*                8 53940    4  3.95    3.89 4.05    4.20 5.69      5.66
## y*                9 53940    4  3.98    3.84 4.07    4.23 5.75      5.68
## z*               10 53940    4  2.43    2.31 2.31    2.63 3.61      3.56
##               T3    T4
## carat*      0.86  0.75
## cut*     Premium Ideal
## color*         H     D
## clarity*     SI2   SI2
## depth*      61.0  62.2
## table*        58    55
## price*      2757  2757
## x*          6.15  5.83
## y*          6.12  5.87
## z*          3.74  3.64

17.4 Using `skimr`

Package skimr generates beautiful EDA summary reports which can be customised to one's taste. Full details are available in the package documentation. For basic purposes, we can use the function skim from this package to get an EDA summary report.

library(skimr)
skim(iris)
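A few further uses of skim() are sketched below, assuming skimr version 2 or later. The result of skim() is itself a data frame with one row per variable, yank() extracts the part for a single column type, and skim() respects groups created with dplyr::group_by(). The object name sk is illustrative.

```r
library(dplyr)
library(skimr)

# One row per variable, with statistics chosen according to column type
sk <- skim(iris)

# Keep only the summary of the numeric columns
yank(sk, "numeric")

# Grouped summaries: one block of statistics per species
iris %>%
  group_by(Species) %>%
  skim()
```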

17.5 Viewing relationships between different variables

We can use package PerformanceAnalytics to generate and view relationships between different variables in the data. For this purpose function PerformanceAnalytics::chart.Correlation() may be used as shown below.

suppressMessages(library(PerformanceAnalytics))

USArrests %>% 
  select(where(is.numeric)) %>% 
  PerformanceAnalytics::chart.Correlation()

Figure 17.1: Viewing relationships with PerformanceAnalytics

As can be seen, it generates a visualization of the correlation matrix of the numeric variables in the given data.
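When the numbers are needed rather than the picture, the correlation matrix can be computed directly with base R's cor(). This is a minimal sketch; the Spearman variant is added only to show the method argument.

```r
# Pearson correlation matrix (the default method)
cor_pearson <- cor(USArrests)
round(cor_pearson, 2)

# Rank-based alternative, less sensitive to outliers
cor_spearman <- cor(USArrests, method = "spearman")
round(cor_spearman, 2)
```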

There is one more package, GGally, which also creates beautiful charts for viewing relationships. Two functions in this package are particularly useful.

The ggpairs() function of the GGally package builds a scatterplot matrix. Scatterplots of each pair of numeric variables are drawn on the left part of the figure, Pearson correlations are displayed on the right, and the distribution of each variable is shown on the diagonal.
The ggcorr() function visualizes the correlation of each pair of variables as a square. Note that the method argument lets you pick the type of correlation you want.

See the following example:

suppressMessages(library(GGally))
USArrests %>% 
  select(where(is.numeric)) %>% 
  ggcorr(label = TRUE)

USArrests %>% 
  select(where(is.numeric)) %>%
  ggpairs()

Figure 17.2: Scatterplot Matrix (Left) and Correlation plot (Right) produced in GGally
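The method argument of ggcorr() mentioned earlier takes two values: how missing values are handled and which correlation coefficient to compute. The sketch below requests Spearman correlations on pairwise-complete observations; the object name p_spearman is illustrative.

```r
library(GGally)

# Spearman correlations computed on pairwise-complete observations
p_spearman <- ggcorr(USArrests,
                     method = c("pairwise", "spearman"),
                     label = TRUE,      # print the coefficient in each square
                     label_round = 2)   # round the labels to two decimals
p_spearman
```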
