In this tutorial we use the dataframe iris, which ships with the base installation of R:
knitr::kable(iris[1:3,], format = "html", digits = 2)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
This dataframe is composed of four numeric variables and one categorical factor. The aim of this tutorial is to summarize the content of this dataset. As a first task, determine the mean of each variable. In general, to get any statistic from a column of a dataframe, you have to select the column, apply the statistic to it, and print the result.
In practice, this is done in R with the following code:
X1 <- iris[,1] # Select the first column
X1.mean <- mean(X1) # Apply the statistic
X1.mean # Print the result
## [1] 5.843333
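Columns can also be selected by name rather than by position, which tends to be more robust if the column order ever changes. A minimal sketch, using only base R and the built-in iris data (the variable names by_position, by_name and by_dollar are introduced here for illustration):

```r
# Three equivalent ways to select the first column of iris
by_position <- iris[, 1]
by_name     <- iris[, "Sepal.Length"]
by_dollar   <- iris$Sepal.Length

# All three return the same numeric vector, hence the same mean
identical(by_position, by_name)
identical(by_position, by_dollar)
mean(by_dollar)
```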
This can be repeated on the second column:
X2 <- iris[,2] # Select all the rows of the second column and store them in the variable "X2"
mean(X2) # Perform the statistic
## [1] 3.057333
… on the third column:
X3 <- iris[,3] # Select all the rows of the third column and store them in the variable "X3"
mean(X3) # Perform the statistic
## [1] 3.758
… and so on:
X4 <- iris[,4] # Select all the rows of the fourth column and store them in the variable "X4"
mean(X4) # Perform the statistic
## [1] 1.199333
You can then collect all the means in a dataframe:
data.frame(X1.mean = mean(X1), X2.mean = mean(X2), X3.mean = mean(X3), X4.mean = mean(X4))
## X1.mean X2.mean X3.mean X4.mean
## 1 5.843333 3.057333 3.758 1.199333
Although this procedure is simple, it is time consuming. Imagine having 1,000 columns… Moreover, other software, such as MS Excel, can do it in a more attractive and visual way. However, R is a great tool, especially when you have to repeat things.
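To make the repetition explicit, the manual steps above could be written as a for loop. This is only a sketch for comparison, assuming the four numeric columns of iris; the names num_cols and col_means are introduced here for illustration:

```r
num_cols  <- iris[, 1:4]               # the four numeric columns
col_means <- numeric(ncol(num_cols))   # pre-allocate one slot per column
names(col_means) <- names(num_cols)

for (j in seq_along(num_cols)) {
  col_means[j] <- mean(num_cols[[j]])  # mean of the j-th column
}
col_means
```

The loop works, but the bookkeeping (pre-allocation, indexing, naming) has to be written by hand every time.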
R repeats things with a special looping function called apply(). With apply() you can run a function on your data column-wise or row-wise. For instance, what is the maximum value of each variable?
apply(iris[,1:4], # this is where the data are stored
2, # this specifies column-wise mode (1 would be row-wise)
max # this is the function to apply
)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 7.9 4.4 6.9 2.5
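The row-wise mode mentioned above is selected by passing 1 instead of 2 as the second argument. A short sketch, with row_max as an illustrative name, that finds the largest measurement of each flower:

```r
# MARGIN = 1 applies the function to each row instead of each column
row_max <- apply(iris[, 1:4], 1, max)

length(row_max)    # one value per row of iris
head(row_max, 3)   # largest measurement of the first three flowers
```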
Now, let’s use the apply() function to determine the mean of each variable:
apply(iris[,1:4], 2, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
Similarly, let’s now determine the standard deviation:
apply(iris[,1:4], # Select the first 4 variables, where the data are stored
2, # Tell R that the analysis must be performed column-wise
sd) # Apply the statistical function column-wise
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377
Now, let’s put the mean and standard deviation together:
rbind(
apply(iris[,1:4], 2, mean),
apply(iris[,1:4], 2, sd)
)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 5.8433333 3.0573333 3.758000 1.1993333
## [2,] 0.8280661 0.4358663 1.765298 0.7622377
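The rows of this result are labelled only [1,] and [2,]. One way to make the output self-explanatory, sketched here with the illustrative name stats, is to name the arguments passed to rbind():

```r
# Naming the arguments of rbind() labels the rows of the result
stats <- rbind(
  Mean = apply(iris[, 1:4], 2, mean),
  SD   = apply(iris[, 1:4], 2, sd)
)
rownames(stats)
round(stats, 2)
```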
Instead of calling apply() twice, you can build a user-defined function:
descriptive <- function(x) {rbind(M = mean(x),
SD = sd(x),
CV = 100*sd(x) / mean(x),
N = length(x))
}
Then use it within apply():
X <- apply(iris[,1:4], 2, descriptive)
knitr::kable(X, format = "html", digits = 2)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|
5.84 | 3.06 | 3.76 | 1.20 |
0.83 | 0.44 | 1.77 | 0.76 |
14.17 | 14.26 | 46.97 | 63.56 |
150.00 | 150.00 | 150.00 | 150.00 |
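Notice that the table above no longer shows which row holds which statistic. One possible variation, if you want to keep the labels, is to return a named vector built with c() instead of rbind(); apply() then uses those names as row names. The names descriptive_named and X_named are hypothetical and are not used in the rest of the tutorial:

```r
descriptive_named <- function(x) {
  c(M  = mean(x),
    SD = sd(x),
    CV = 100 * sd(x) / mean(x),  # coefficient of variation, in percent
    N  = length(x))
}

X_named <- apply(iris[, 1:4], 2, descriptive_named)
rownames(X_named)
round(X_named, 2)
```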
Alternatively, define the function inline, directly in the apply() call:
X <- data.frame(apply(iris[,1:4], 2, function(x) rbind(M = mean(x),
SD = sd(x),
CV = 100*sd(x) / mean(x),
N = length(x))))
Now, let’s focus a little on printing the data in a nicer way. First, it might be more convenient to have the variables row-wise and the calculated statistics column-wise. You can easily get this with the transpose function t():
X <- t(X)
Then, change the column names and specify the rounding rule:
colnames(X) <- c("Mean", "SD", "CV", "N")
knitr::kable(X, format = "html", digits = 2)
| | Mean | SD | CV | N |
---|---|---|---|---|
Sepal.Length | 5.84 | 0.83 | 14.17 | 150 |
Sepal.Width | 3.06 | 0.44 | 14.26 | 150 |
Petal.Length | 3.76 | 1.77 | 46.97 | 150 |
Petal.Width | 1.20 | 0.76 | 63.56 | 150 |
Each variable is further divided by the categorical factor Species. It is now interesting to repeat the previous statistics for each level of the factor:
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
The factor Species has three levels. To compute the previous statistics for each level, you can extend the previous approach: split the data by level, apply the statistic to each group, and combine the results.
This Split-Apply-Combine procedure can be performed on each group of one variable with the function tapply():
tapply(iris[,1], iris$Species, mean)
## setosa versicolor virginica
## 5.006 5.936 6.588
… and on the second variable:
tapply(iris[,2], iris$Species, mean)
## setosa versicolor virginica
## 3.428 2.770 2.974
… the third:
tapply(iris[,3], iris$Species, mean)
## setosa versicolor virginica
## 1.462 4.260 5.552
… and so on:
tapply(iris[,4], iris$Species, mean)
## setosa versicolor virginica
## 0.246 1.326 2.026
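The three steps that tapply() performs in one call can also be spelled out separately. The following is a sketch using base R's split() and sapply() on the first variable; the names groups and group_means are illustrative:

```r
# 1. Split: one vector of sepal lengths per species
groups <- split(iris$Sepal.Length, iris$Species)

# 2. Apply: compute the mean within each group
# 3. Combine: sapply() collects the results into one named vector
group_means <- sapply(groups, mean)
group_means
```

The values should match those returned by tapply(iris[,1], iris$Species, mean) above.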
Clearly, it would be nice if the code could do all of this in a single run. R can do it by combining the apply() and tapply() functions:
X <- apply(iris[,1:4], 2, function(x) tapply(x, iris$Species, mean))
knitr::kable(X, format = "html", digits = 2)
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
setosa | 5.01 | 3.43 | 1.46 | 0.25 |
versicolor | 5.94 | 2.77 | 4.26 | 1.33 |
virginica | 6.59 | 2.97 | 5.55 | 2.03 |
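For comparison, base R also offers aggregate(), which can produce the same group means in a single call. This is an alternative sketch, not the approach followed in the rest of this tutorial; agg is an illustrative name:

```r
# The formula reads: "all other columns, grouped by Species"
agg <- aggregate(. ~ Species, data = iris, FUN = mean)
agg
```

Unlike the apply()/tapply() combination, the result is a dataframe with Species as its first column rather than a matrix with species as row names.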
We might also extend this procedure by adding more statistics to a new function:
X <- apply(iris[,1:4], 2, function(x) tapply(x, iris$Species, function(x) rbind(M = mean(x),
SD = sd(x),
CV = 100*sd(x) / mean(x),
N = length(x))))
The result is a list of four objects, one for each variable (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width). Each object is split into three levels, one per species. Thus, there are 12 elements in total, each holding the computed statistics.
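To see this nested structure for yourself, you can index into the result step by step. The sketch below rebuilds the same object under the illustrative name stats_by_species, so that it runs on its own, and renames the inner argument to g to avoid shadowing x:

```r
stats_by_species <- apply(iris[, 1:4], 2, function(x) {
  tapply(x, iris$Species, function(g) {
    rbind(M = mean(g), SD = sd(g), CV = 100 * sd(g) / mean(g), N = length(g))
  })
})

names(stats_by_species)                          # the four variables
names(stats_by_species[["Sepal.Length"]])        # the three species
stats_by_species[["Sepal.Length"]][["setosa"]]   # a 4 x 1 matrix: M, SD, CV, N
```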
Let’s see now how to print all such information in a compact way:
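library(data.table)
H <- rbindlist(X, use.names = TRUE, fill = TRUE, idcol = "X") # Stack the four list elements; column "X" stores the variable name
L <- rep(c("M", "SD", "CV", "N"), times = 4) # Label the statistic held in each row
M <- data.frame(H, L)
M <- M[c(1, 5, 2:4)] # Reorder the columns: variable, statistic, then the three species
knitr::kable(M, format = "html", digits = 2, caption = "Descriptive statistic")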
X | L | setosa | versicolor | virginica |
---|---|---|---|---|
Sepal.Length | M | 5.01 | 5.94 | 6.59 |
Sepal.Length | SD | 0.35 | 0.52 | 0.64 |
Sepal.Length | CV | 7.04 | 8.70 | 9.65 |
Sepal.Length | N | 50.00 | 50.00 | 50.00 |
Sepal.Width | M | 3.43 | 2.77 | 2.97 |
Sepal.Width | SD | 0.38 | 0.31 | 0.32 |
Sepal.Width | CV | 11.06 | 11.33 | 10.84 |
Sepal.Width | N | 50.00 | 50.00 | 50.00 |
Petal.Length | M | 1.46 | 4.26 | 5.55 |
Petal.Length | SD | 0.17 | 0.47 | 0.55 |
Petal.Length | CV | 11.88 | 11.03 | 9.94 |
Petal.Length | N | 50.00 | 50.00 | 50.00 |
Petal.Width | M | 0.25 | 1.33 | 2.03 |
Petal.Width | SD | 0.11 | 0.20 | 0.27 |
Petal.Width | CV | 42.84 | 14.91 | 13.56 |
Petal.Width | N | 50.00 | 50.00 | 50.00 |
Finally, we can reshape the data so that each statistic gets its own column:
library(tidyr)
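N <- gather(M, species, measurement, setosa:virginica, factor_key=TRUE) # Long format: one row per variable, statistic and species
P <- spread(N, L, measurement) # Wide format: one column per statistic
Q <- P[c(1,2,4,6,3,5)] # Reorder the columns: X, species, M, SD, CV, N
knitr::kable(Q, format = "html", digits = 2, caption = "Descriptive statistic")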
X | species | M | SD | CV | N |
---|---|---|---|---|---|
Petal.Length | setosa | 1.46 | 0.17 | 11.88 | 50 |
Petal.Length | versicolor | 4.26 | 0.47 | 11.03 | 50 |
Petal.Length | virginica | 5.55 | 0.55 | 9.94 | 50 |
Petal.Width | setosa | 0.25 | 0.11 | 42.84 | 50 |
Petal.Width | versicolor | 1.33 | 0.20 | 14.91 | 50 |
Petal.Width | virginica | 2.03 | 0.27 | 13.56 | 50 |
Sepal.Length | setosa | 5.01 | 0.35 | 7.04 | 50 |
Sepal.Length | versicolor | 5.94 | 0.52 | 8.70 | 50 |
Sepal.Length | virginica | 6.59 | 0.64 | 9.65 | 50 |
Sepal.Width | setosa | 3.43 | 0.38 | 11.06 | 50 |
Sepal.Width | versicolor | 2.77 | 0.31 | 11.33 | 50 |
Sepal.Width | virginica | 2.97 | 0.32 | 10.84 | 50 |
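If data.table and tidyr are not available, a table with the same columns can be assembled with base R alone. The following is a sketch under that assumption, not the method used above; the names vars, rows and Q2 are hypothetical, and the row order follows the column order of iris rather than the alphabetical order shown in the table:

```r
vars <- names(iris)[1:4]   # the four numeric variables

# For each variable, split by species and compute the four statistics
rows <- lapply(vars, function(v) {
  groups <- split(iris[[v]], iris$Species)
  data.frame(
    X       = v,
    species = names(groups),
    M       = sapply(groups, mean),
    SD      = sapply(groups, sd),
    CV      = 100 * sapply(groups, sd) / sapply(groups, mean),
    N       = sapply(groups, length),
    row.names = NULL
  )
})

Q2 <- do.call(rbind, rows)   # combine the four pieces into one dataframe
Q2
```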
Last, you might decide to sort the dataframe according to a specific column, ascending:
knitr::kable(Q[order(Q$M), ], format = "html", digits = 2, caption = "Descriptive statistic")
| | X | species | M | SD | CV | N |
---|---|---|---|---|---|---|
4 | Petal.Width | setosa | 0.25 | 0.11 | 42.84 | 50 |
5 | Petal.Width | versicolor | 1.33 | 0.20 | 14.91 | 50 |
1 | Petal.Length | setosa | 1.46 | 0.17 | 11.88 | 50 |
6 | Petal.Width | virginica | 2.03 | 0.27 | 13.56 | 50 |
11 | Sepal.Width | versicolor | 2.77 | 0.31 | 11.33 | 50 |
12 | Sepal.Width | virginica | 2.97 | 0.32 | 10.84 | 50 |
10 | Sepal.Width | setosa | 3.43 | 0.38 | 11.06 | 50 |
2 | Petal.Length | versicolor | 4.26 | 0.47 | 11.03 | 50 |
7 | Sepal.Length | setosa | 5.01 | 0.35 | 7.04 | 50 |
3 | Petal.Length | virginica | 5.55 | 0.55 | 9.94 | 50 |
8 | Sepal.Length | versicolor | 5.94 | 0.52 | 8.70 | 50 |
9 | Sepal.Length | virginica | 6.59 | 0.64 | 9.65 | 50 |
or descending:
knitr::kable(Q[order(-Q$M), ], format = "html", digits = 2, caption = "Descriptive statistic")
| | X | species | M | SD | CV | N |
---|---|---|---|---|---|---|
9 | Sepal.Length | virginica | 6.59 | 0.64 | 9.65 | 50 |
8 | Sepal.Length | versicolor | 5.94 | 0.52 | 8.70 | 50 |
3 | Petal.Length | virginica | 5.55 | 0.55 | 9.94 | 50 |
7 | Sepal.Length | setosa | 5.01 | 0.35 | 7.04 | 50 |
2 | Petal.Length | versicolor | 4.26 | 0.47 | 11.03 | 50 |
10 | Sepal.Width | setosa | 3.43 | 0.38 | 11.06 | 50 |
12 | Sepal.Width | virginica | 2.97 | 0.32 | 10.84 | 50 |
11 | Sepal.Width | versicolor | 2.77 | 0.31 | 11.33 | 50 |
6 | Petal.Width | virginica | 2.03 | 0.27 | 13.56 | 50 |
1 | Petal.Length | setosa | 1.46 | 0.17 | 11.88 | 50 |
5 | Petal.Width | versicolor | 1.33 | 0.20 | 14.91 | 50 |
4 | Petal.Width | setosa | 0.25 | 0.11 | 42.84 | 50 |
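The minus sign in order() only works for numeric columns. As a sketch of two related options, order() also accepts decreasing = TRUE and more than one sorting key. Q_demo is a hypothetical stand-in for Q, with four rows copied from the table above, so that the example runs on its own:

```r
Q_demo <- data.frame(
  X       = c("Petal.Width", "Petal.Width", "Sepal.Length", "Sepal.Length"),
  species = c("setosa", "virginica", "virginica", "setosa"),
  M       = c(0.25, 2.03, 6.59, 5.01)
)

# decreasing = TRUE sorts from largest to smallest mean
by_mean <- Q_demo[order(Q_demo$M, decreasing = TRUE), ]
by_mean

# Two keys: variable name first, then descending mean within each variable
by_var_then_mean <- Q_demo[order(Q_demo$X, -Q_demo$M), ]
by_var_then_mean
```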