Data Analysis

Using R for Data Analysis

Summarizing Data and Basic Statistical Tests

Iris dataframe

In this tutorial we use the dataframe iris, which ships with every installation of R:

knitr::kable(iris[1:3,], format = "html", digits = 2)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa

This dataframe consists of four numeric variables and one categorical factor. The aim of this tutorial is to summarize the content of this dataset. As a first task, determine the mean of each variable. In general, to get any statistic from a column of a dataframe, you have to:

  1. Select the column
  2. Apply the statistic
  3. Print or store the result

Calculate the mean column-wise

In practice, this is done in R with the following code:

X1      <- iris[,1]  # Select the first column
X1.mean <- mean(X1)  # Apply the statistic
X1.mean              # Print the result
## [1] 5.843333

This can be repeated on the second column:

X2 <- iris[,2]  # Select all the rows of the second column and store them in the variable "X2"
mean(X2)        # Perform the statistic
## [1] 3.057333

… on the third column:

X3 <- iris[,3]  # Select all the rows of the third column and store them in the variable "X3"
mean(X3)        # Perform the statistic
## [1] 3.758

… and so on:

X4 <- iris[,4]  # Select all the rows of the fourth column and store them in the variable "X4"
mean(X4)        # Perform the statistic
## [1] 1.199333

And you might collect all the means in a dataframe:

data.frame(X1.mean = mean(X1), X2.mean = mean(X2), X3.mean = mean(X3), X4.mean = mean(X4))
##    X1.mean  X2.mean X3.mean  X4.mean
## 1 5.843333 3.057333   3.758 1.199333

Although this procedure is simple, it is time-consuming. Consider what happens when you have 1,000 columns… Moreover, other software, such as MS Excel, can do it in a more attractive and visual way. However, R is a great tool, especially when you have to repeat things.

Apply function

One way R repeats things is with a special looping function called apply(). This function is able to:

  1. Split the dataset into separate columns
  2. Apply the statistic to each column
  3. Combine the results

With the apply() function you can repeat a function on your data column-wise or row-wise. For instance, what is the maximum value of each variable?

apply(iris[,1:4], # this is where the data are stored
      2,          # this specifies column-wise mode (1 would be row-wise)
      max         # this is the function to apply
      ) 
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5
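
The text above mentions that apply() can also work row-wise, but only the column-wise mode is shown. As a minimal sketch, assuming you want the largest of the four measurements for each of the first five flowers, the second argument is set to 1 instead of 2 (the name `row_max` is only illustrative):

```r
# Row-wise mode: the second argument is 1, so the function is applied
# once per row instead of once per column.
row_max <- apply(iris[1:5, 1:4], 1, max)

# One value per flower (row); the names are the row names "1" to "5".
row_max
```

In iris the largest measurement in each row is the sepal length, so these values match the first five entries of Sepal.Length.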

Calculate the mean column-wise with apply()

Now, let’s use the apply() function to determine the mean of each variable:

apply(iris[,1:4], 2,  mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
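
For the specific case of column means, base R also provides colMeans(), which should return the same values as the apply() call above. A short sketch for comparison (the variable names are illustrative, not part of the original tutorial):

```r
# Column means via the general-purpose apply() ...
means_apply <- apply(iris[, 1:4], 2, mean)

# ... and via the dedicated base R helper colMeans().
means_direct <- colMeans(iris[, 1:4])

# Both are named numeric vectors with one entry per variable.
all.equal(means_apply, means_direct)
```

apply() remains the more general tool, since it accepts any function, whereas colMeans() only computes means.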

Calculate the standard deviation column-wise

As with the mean, let’s now determine the standard deviation:

apply(iris[,1:4], # Select the first 4 variables, where the data are stored
      2,          # Tell R that the analysis must be performed column-wise
      sd)         # Apply the statistical function to each column
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377

Combine mean and standard deviation in a table

Now, let’s put the mean and standard deviation together:

rbind(
apply(iris[,1:4], 2, mean),
apply(iris[,1:4], 2, sd)
)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]    5.8433333   3.0573333     3.758000   1.1993333
## [2,]    0.8280661   0.4358663     1.765298   0.7622377
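
In the output above the rows are labelled only [1,] and [2,], so the reader has to remember which statistic is which. One possible refinement, sketched here with the illustrative labels `Mean` and `SD`, is to name the arguments of rbind() so that the labels become row names:

```r
# Naming the arguments of rbind() turns the names into row names.
stats <- rbind(
  Mean = apply(iris[, 1:4], 2, mean),
  SD   = apply(iris[, 1:4], 2, sd)
)

# Round only for display; the stored values keep full precision.
round(stats, 2)
```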

User defined function

Instead of calling the apply function twice, you may build a user-defined function:

descriptive <- function(x) {rbind(M = mean(x),
                          SD = sd(x),
                          CV = 100*sd(x) / mean(x),
                          N = length(x))
                    }

And use it within the apply function:

X <- apply(iris[,1:4], 2, descriptive)
knitr::kable(X, format = "html", digits = 2)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
        5.84         3.06          3.76         1.20
        0.83         0.44          1.77         0.76
       14.17        14.26         46.97        63.56
      150.00       150.00        150.00       150.00

Or, alternatively, just place the function directly in the apply statement:

X <- data.frame(apply(iris[,1:4], 2, function(x) rbind(M =  mean(x),
                                    SD = sd(x),
                                    CV = 100*sd(x) / mean(x),
                                    N = length(x))))
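
Note that the table printed above has no row labels: because the function returns a one-column matrix built with rbind(), apply() does not carry the names M, SD, CV and N over to the result. A possible variation, sketched under the assumption that labelled rows are wanted, is to return a named vector with c() instead (the names `descriptive_vec` and `X_named` are hypothetical, introduced only for this example):

```r
# Variant of the descriptive function that returns a named vector.
# apply() uses the names of the returned vector as row names.
descriptive_vec <- function(x) {
  c(M  = mean(x),                # mean
    SD = sd(x),                  # standard deviation
    CV = 100 * sd(x) / mean(x),  # coefficient of variation, in percent
    N  = length(x))              # number of observations
}

X_named <- apply(iris[, 1:4], 2, descriptive_vec)
round(X_named, 2)
```

The numbers are the same as before; only the row names M, SD, CV and N are added.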

Tapply function

Each variable is further divided by the categorical factor Species. It is now interesting to repeat the previous statistics for each level of the factor:

levels(iris$Species) 
## [1] "setosa"     "versicolor" "virginica"

The factor Species has three levels. To determine the previous statistics for each level, you can extend the previous approach:

  1. Split the dataset into as many groups as there are levels
  2. Apply the statistic
  3. Combine the results

This Split-Apply-Combine procedure can be applied to each group of one variable with the function tapply():

tapply(iris[,1], iris$Species, mean) 
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
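
To make the three steps visible, the same result can be reproduced by hand with split() and sapply(). This is only an illustrative sketch of the Split-Apply-Combine idea, not a description of how tapply() is implemented; `groups` and `group_means` are names chosen for the example:

```r
# 1. Split: one numeric vector of sepal lengths per species.
groups <- split(iris[, 1], iris$Species)

# 2. Apply the statistic to each group, and
# 3. Combine the results into a single named vector.
group_means <- sapply(groups, mean)
group_means
```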

… and on the second variable:

tapply(iris[,2], iris$Species, mean) 
##     setosa versicolor  virginica 
##      3.428      2.770      2.974

… the third:

tapply(iris[,3], iris$Species, mean) 
##     setosa versicolor  virginica 
##      1.462      4.260      5.552

… and so on:

tapply(iris[,4], iris$Species, mean) 
##     setosa versicolor  virginica 
##      0.246      1.326      2.026

Nesting Tapply and Apply functions

Clearly, it would be nice if the code could do all of this in a single run. R can do it by combining the apply() and tapply() functions:

X <- apply(iris[,1:4], 2, function(x) tapply(x, iris$Species, mean))
knitr::kable(X, format = "html", digits = 2)
            Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
setosa              5.01         3.43          1.46         0.25
versicolor          5.94         2.77          4.26         1.33
virginica           6.59         2.97          5.55         2.03
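
For a single statistic such as the mean, base R's aggregate() with a formula offers an alternative to nesting tapply() inside apply(). The sketch below is a suggestion rather than part of the original workflow, and `by_species` is an illustrative name:

```r
# "." stands for all columns other than Species; each of them is
# averaged within each level of Species.
by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
by_species
```

Unlike the matrix returned by apply(), the result is a dataframe in which Species is an ordinary column, which can be convenient for later merging or plotting.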

We might also extend this procedure by adding more statistics to a new function:

X <- apply(iris[,1:4], 2, function(x) tapply(x, iris$Species, function(x) rbind(M =  mean(x),
                                    SD = sd(x),
                                    CV = 100*sd(x) / mean(x),
                                    N = length(x))))

The result is a list of four objects, one for each variable (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width). Each object is split into three levels. Thus, there are 12 elements in total, each holding the computed statistics.
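
To see this nested structure for yourself, you can inspect the object before flattening it. The sketch below rebuilds X so that it runs on its own; the inner argument is renamed to `g` purely for readability:

```r
# Rebuild the nested result: variables on the outside, species inside.
X <- apply(iris[, 1:4], 2, function(x) {
  tapply(x, iris$Species, function(g) {
    rbind(M = mean(g), SD = sd(g), CV = 100 * sd(g) / mean(g), N = length(g))
  })
})

names(X)                          # the four variables
names(X[["Sepal.Length"]])        # the three species within one variable
X[["Sepal.Length"]][["setosa"]]   # one 4 x 1 matrix of statistics
```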

Let’s now see how to print all this information in a compact way:

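library(data.table)
H <- rbindlist(X, use.names=TRUE, fill=TRUE, idcol="X")  # Stack the four variables; idcol stores the variable name
L <- rep(c("M", "SD", "CV", "N"), times = 4)             # Statistic labels, repeated for each variable
M <- data.frame(H, L)
M <- M[c(1, 5, 2:4)]                                     # Reorder: variable, statistic, then the three species
knitr::kable(M, format = "html", digits = 2, caption = "Descriptive statistic")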
X             L   setosa  versicolor  virginica
Sepal.Length  M     5.01        5.94       6.59
Sepal.Length  SD    0.35        0.52       0.64
Sepal.Length  CV    7.04        8.70       9.65
Sepal.Length  N    50.00       50.00      50.00
Sepal.Width   M     3.43        2.77       2.97
Sepal.Width   SD    0.38        0.31       0.32
Sepal.Width   CV   11.06       11.33      10.84
Sepal.Width   N    50.00       50.00      50.00
Petal.Length  M     1.46        4.26       5.55
Petal.Length  SD    0.17        0.47       0.55
Petal.Length  CV   11.88       11.03       9.94
Petal.Length  N    50.00       50.00      50.00
Petal.Width   M     0.25        1.33       2.03
Petal.Width   SD    0.11        0.20       0.27
Petal.Width   CV   42.84       14.91      13.56
Petal.Width   N    50.00       50.00      50.00

Reshape data

Finally, we can reshape the data so that each statistic becomes a column:

library(tidyr)
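N <- gather(M, species, measurement, setosa:virginica, factor_key=TRUE)  # Wide to long: one row per variable, statistic and species
P <- spread(N, L, measurement)                                           # Long to wide: one column per statistic
Q <- P[c(1,2,4,6,3,5)]                                                   # Reorder the columns as X, species, M, SD, CV, N
knitr::kable(Q, format = "html", digits = 2, caption = "Descriptive statistic")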
X             species         M    SD     CV   N
Petal.Length  setosa       1.46  0.17  11.88  50
Petal.Length  versicolor   4.26  0.47  11.03  50
Petal.Length  virginica    5.55  0.55   9.94  50
Petal.Width   setosa       0.25  0.11  42.84  50
Petal.Width   versicolor   1.33  0.20  14.91  50
Petal.Width   virginica    2.03  0.27  13.56  50
Sepal.Length  setosa       5.01  0.35   7.04  50
Sepal.Length  versicolor   5.94  0.52   8.70  50
Sepal.Length  virginica    6.59  0.64   9.65  50
Sepal.Width   setosa       3.43  0.38  11.06  50
Sepal.Width   versicolor   2.77  0.31  11.33  50
Sepal.Width   virginica    2.97  0.32  10.84  50
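
In recent versions of tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following sketch shows the same two reshaping steps with the newer functions, assuming tidyr 1.0.0 or later is installed. To keep it self-contained it uses `M_small`, a hypothetical two-row stand-in whose values are copied from the first two rows of the table M printed earlier:

```r
library(tidyr)

# Hypothetical stand-in for M: the mean and SD rows of Sepal.Length.
M_small <- data.frame(
  X          = c("Sepal.Length", "Sepal.Length"),
  L          = c("M", "SD"),
  setosa     = c(5.01, 0.35),
  versicolor = c(5.94, 0.52),
  virginica  = c(6.59, 0.64)
)

# Wide to long: the three species columns become rows.
long <- pivot_longer(M_small, cols = setosa:virginica,
                     names_to = "species", values_to = "measurement")

# Long to wide: each statistic in L becomes its own column.
wide <- pivot_wider(long, names_from = L, values_from = measurement)
wide
```

The result has one row per species and one column per statistic, which is the same layout as Q above.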

Lastly, you might decide to sort the dataframe according to a specific column, in ascending order:

knitr::kable(Q[order(Q$M), ], format = "html", digits = 2, caption = "Descriptive statistic")
    X             species         M    SD     CV   N
4   Petal.Width   setosa       0.25  0.11  42.84  50
5   Petal.Width   versicolor   1.33  0.20  14.91  50
1   Petal.Length  setosa       1.46  0.17  11.88  50
6   Petal.Width   virginica    2.03  0.27  13.56  50
11  Sepal.Width   versicolor   2.77  0.31  11.33  50
12  Sepal.Width   virginica    2.97  0.32  10.84  50
10  Sepal.Width   setosa       3.43  0.38  11.06  50
2   Petal.Length  versicolor   4.26  0.47  11.03  50
7   Sepal.Length  setosa       5.01  0.35   7.04  50
3   Petal.Length  virginica    5.55  0.55   9.94  50
8   Sepal.Length  versicolor   5.94  0.52   8.70  50
9   Sepal.Length  virginica    6.59  0.64   9.65  50

or in descending order:

knitr::kable(Q[order(-Q$M), ], format = "html", digits = 2, caption = "Descriptive statistic")
    X             species         M    SD     CV   N
9   Sepal.Length  virginica    6.59  0.64   9.65  50
8   Sepal.Length  versicolor   5.94  0.52   8.70  50
3   Petal.Length  virginica    5.55  0.55   9.94  50
7   Sepal.Length  setosa       5.01  0.35   7.04  50
2   Petal.Length  versicolor   4.26  0.47  11.03  50
10  Sepal.Width   setosa       3.43  0.38  11.06  50
12  Sepal.Width   virginica    2.97  0.32  10.84  50
11  Sepal.Width   versicolor   2.77  0.31  11.33  50
6   Petal.Width   virginica    2.03  0.27  13.56  50
1   Petal.Length  setosa       1.46  0.17  11.88  50
5   Petal.Width   versicolor   1.33  0.20  14.91  50
4   Petal.Width   setosa       0.25  0.11  42.84  50
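
order() also accepts several keys, so you can sort by one column and break ties with another. The sketch below assumes you want the rows grouped by variable name and, within each variable, sorted by descending mean. `Q_small` is a hypothetical four-row excerpt with values copied from the table above:

```r
# Hypothetical excerpt of Q, small enough to check by eye.
Q_small <- data.frame(
  X       = c("Petal.Width", "Petal.Width", "Sepal.Length", "Sepal.Length"),
  species = c("setosa", "versicolor", "setosa", "versicolor"),
  M       = c(0.25, 1.33, 5.01, 5.94)
)

# First key: variable name (ascending).
# Second key: mean, negated so that larger means come first.
sorted <- Q_small[order(Q_small$X, -Q_small$M), ]
sorted
```

Negating a column only works for numeric keys; to reverse the order on a single key of any type, order(..., decreasing = TRUE) can be used instead.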
