# Aim of the Unit

This Unit aims to introduce some built-in functions of R that are used to repeat functions over a number of objects.

# Repeating things

One of the most interesting features of R is the ability to repeat operations. For instance, with a data-frame, it is possible to apply a function:

• over each element
• over columns (column-wise)
• over rows (row-wise)
• over rows/columns by groups

R functions used to repeat things are:

| Function | Arguments | Description |
|---|---|---|
| apply() | | Applies a function row-wise or column-wise. |
| | X | The matrix (or data-frame) to work on. |
| | MARGIN | 1 stands for row-wise, 2 for column-wise. |
| | FUN | The function to be applied. |
| lapply() | | Works with lists and returns a list. |
| sapply() | | Works as lapply() but simplifies the result to a vector or matrix when possible. |
| tapply() | | Works as sapply() but by groups. |
| | INDEX | A list of one or more factors. |
| | FUN | The function to be applied. |
| mapply() | | Works as sapply() but over the corresponding elements of several lists or vectors. |
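
Of the functions listed above, mapply() is the only one not demonstrated later in this Unit, so a minimal sketch may help. The vectors `values` and `powers` are hypothetical names chosen for illustration; the i-th call of the function receives the i-th element of each vector.

```r
# Hypothetical inputs: bases and the exponent to use for each base
values <- c(2, 3, 4)
powers <- c(1, 2, 3)

# mapply() walks over both vectors in parallel:
# call 1 gets (2, 1), call 2 gets (3, 2), call 3 gets (4, 3)
res <- mapply(function(x, p) x^p, values, powers)
res
```

The result is a numeric vector with one element per pair of inputs.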

# Apply to each element

Applying a function to each element of a vector is the default behaviour in R, because arithmetic operators and many functions are vectorized.

Consider the vector:

X <- c(2,3,4)

If you want the square of each element, just type:

X^2
## [1]  4  9 16

If you have a data-frame:

X <- data.frame(A = c(2,3,4), B = c(3,4,5))
X
##   A B
## 1 2 3
## 2 3 4
## 3 4 5

the square of each element is given by:

X^2
##    A  B
## 1  4  9
## 2  9 16
## 3 16 25
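
To make the element-wise behaviour more concrete, the sketch below compares the vectorized operator with an explicit element-by-element call. The object names (`v`, `D`, and so on) are hypothetical and chosen only for this illustration.

```r
# Vector case: the operator and an explicit per-element call agree
v <- c(2, 3, 4)
vectorized <- v^2
looped <- sapply(v, function(x) x^2)

# Data-frame case: the operator squares every cell;
# lapply() does the same column by column
D <- data.frame(A = c(2, 3, 4), B = c(3, 4, 5))
sq_operator <- D^2
sq_lapply <- as.data.frame(lapply(D, function(col) col^2))
```

In both cases the vectorized form is shorter and is usually the one to prefer.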

# Apply

Often, it is more important to calculate statistics row-wise or column-wise. This can be done with the function apply():

Given a dataset:

A <- rnorm(n = 4, mean = 10 , sd = 1)
B <- rnorm(n = 4, mean = 20 , sd = 2)
C <- rnorm(n = 4, mean = 30 , sd = 3)
X <- round(data.frame(A, B, C), digits = 2); X
##       A     B     C
## 1  8.37 21.26 25.45
## 2 10.02 19.38 35.35
## 3  9.23 18.65 26.12
## 4 10.62 18.64 29.51

it is possible to calculate the marginal row means by:

apply(X, MARGIN = 1, mean)
## [1] 18.36000 21.58333 18.00000 19.59000

or the marginal column means by:

apply(X, MARGIN = 2, mean)
##       A       B       C
##  9.5600 19.4825 29.1075
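
For the specific case of means, base R also provides rowMeans() and colMeans(), which should give the same values as the apply() calls above. The sketch below uses a small hypothetical data-frame `M` with fixed values so that the comparison is reproducible.

```r
# Hypothetical data with known values
M <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6))

# Row means: generic apply() versus the dedicated function
row_apply <- apply(M, MARGIN = 1, mean)
row_fast <- rowMeans(M)

# Column means: generic apply() versus the dedicated function
col_apply <- apply(M, MARGIN = 2, mean)
col_fast <- colMeans(M)
```

The dedicated functions are limited to means (and sums, with rowSums() and colSums()), whereas apply() accepts any function.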

The function apply() can also be used with user-defined functions, here the coefficient of variation (standard deviation divided by the mean):

apply(X, MARGIN = 2,
      function(x) round(
        sd(x)/mean(x),
        digits = 3))
##     A     B     C
## 0.102 0.063 0.155
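
Besides user-defined functions, apply() forwards any additional named arguments to the function being applied. The sketch below assumes a hypothetical data-frame `scores` containing one missing value and passes `na.rm = TRUE` through to mean().

```r
# Hypothetical data with a missing value in column A
scores <- data.frame(A = c(1, NA, 3), B = c(4, 5, 6))

# Without extra arguments, the mean of column A is NA
with_na <- apply(scores, MARGIN = 2, mean)

# Extra arguments after FUN are handed on to mean()
without_na <- apply(scores, MARGIN = 2, mean, na.rm = TRUE)
```

This avoids writing an anonymous function when the only goal is to set an option of an existing function.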

# Lapply

lapply() executes a function over the elements of a list and returns a list. Since data-frames are a kind of column-wise list, it also works on data-frames. In that case, the result is the same as apply(X, MARGIN = 2, FUN = ...), but the output is a list with as many elements as there are columns in the data-frame:

LPLY <- lapply(X, function(x) round(
  sd(x)/mean(x),
  digits = 3))
str(LPLY)
## List of 3
##  $ A: num 0.102
##  $ B: num 0.063
##  $ C: num 0.155
LPLY
## $A
## [1] 0.102
##
## $B
## [1] 0.063
##
## $C
## [1] 0.155

It is often more convenient to use the function sapply() instead, which simplifies the result to a vector or a matrix when possible (here, a named vector):

SPLY <- sapply(X, function(x) round(
  sd(x)/mean(x),
  digits = 3))
str(SPLY)
##  Named num [1:3] 0.102 0.063 0.155
##  - attr(*, "names")= chr [1:3] "A" "B" "C"
SPLY
##     A     B     C
## 0.102 0.063 0.155
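
The shape of the sapply() output depends on what the function returns. The sketch below, using a hypothetical data-frame `D`, shows a vector result, a matrix result, and the stricter vapply(), which requires the expected type and length of each result to be declared.

```r
# Hypothetical data with known values
D <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6))

# One value per column: sapply() simplifies to a named vector
one_value <- sapply(D, mean)

# Two values per column (min and max): sapply() simplifies to a matrix
# with one column per input column
two_values <- sapply(D, range)

# vapply() does the same as sapply(), but checks that every result
# is a single number
checked <- vapply(D, mean, FUN.VALUE = numeric(1))
```

vapply() can be a safer choice inside scripts, because it raises an error instead of silently changing the output shape.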

The same result can be obtained with the function apply():

apply(X,
      MARGIN = 2,
      function(x) round(
        sd(x)/mean(x),
        digits = 3))
##     A     B     C
## 0.102 0.063 0.155

# Tapply

The function tapply() works as sapply(), but the result is grouped by a factor. This is especially useful in factorial experiments, where one is interested in the averages over one or more factors.

Given a factorial design:

FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorB <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 2,
               times = 2)
Response1 <- 5*FactorA -
  2*FactorB +
  FactorA*FactorB +
  rnorm(n = length(FactorA),
        mean = 0,
        sd = 1)
df <- data.frame(FactorA = as.factor(FactorA),
                 FactorB = as.factor(FactorB),
                 Response1 = Response1)
df
##   FactorA FactorB Response1
## 1       1       1  4.456599
## 2       1       1  2.906168
## 3       1       2  3.506017
## 4       1       2  3.736970
## 5       2       1 10.219816
## 6       2       1  9.484414
## 7       2       2  9.882215
## 8       2       2 12.331149

it is possible to calculate the averages of the measured variable Response1 grouped by FactorA:

tapply(df$Response1, df$FactorA, mean)
##         1         2
##  3.651439 10.479398

or by FactorB:

tapply(df$Response1, df$FactorB, mean)
##        1        2
## 6.766749 7.364088
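
Since INDEX can be a list of several factors, tapply() can also return the mean of every combination of two factors as a matrix. The sketch below uses hypothetical factors `g1` and `g2` and a hypothetical response `y` with fixed values, so the result does not depend on random numbers.

```r
# Hypothetical 2 x 2 design with two observations per cell
g1 <- factor(c("a", "a", "b", "b", "a", "a", "b", "b"))
g2 <- factor(c("x", "x", "x", "x", "y", "y", "y", "y"))
y  <- c(1, 3, 5, 7, 2, 4, 6, 8)

# Passing a list of factors gives one cell per combination:
# rows follow the levels of g1, columns the levels of g2
cell_means <- tapply(y, list(g1, g2), mean)
cell_means
```

Each entry of `cell_means` is the mean of the observations that share the corresponding levels of both factors.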

# Aggregate

A convenient version of tapply() is aggregate(). It splits the data into subsets according to one or more factors, applies a function to each subset, and returns the result as a data-frame.

aggregate(df$Response1, by = list(df$FactorA),
mean)
##   Group.1         x
## 1       1  3.651439
## 2       2 10.479398

The function can also combine several factors:

aggregate(df$Response1,
          by = list(FactorA = df$FactorA,
                    FactorB = df$FactorB),
          FUN = mean)
##   FactorA FactorB         x
## 1       1       1  3.681384
## 2       2       1  9.852115
## 3       1       2  3.621494
## 4       2       2 11.106682

It also accepts formula notation:

aggregate(Response1 ~ FactorA*FactorB,
data = df,
mean)
##   FactorA FactorB Response1
## 1       1       1  3.681384
## 2       2       1  9.852115
## 3       1       2  3.621494
## 4       2       2 11.106682
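
As a quick check of the relationship between the two functions, the sketch below computes the same group means with tapply() and with the formula interface of aggregate(). The data-frame `toy` and its columns are hypothetical and use fixed values.

```r
# Hypothetical data: two groups with known means (2 and 15)
toy <- data.frame(
  group = factor(c("a", "a", "b", "b")),
  y = c(1, 3, 10, 20)
)

# tapply() returns a named array of group means
by_tapply <- tapply(toy$y, toy$group, mean)

# aggregate() returns a data-frame with one row per group
by_aggregate <- aggregate(y ~ group, data = toy, FUN = mean)
```

The values are the same; the difference is the container, which makes aggregate() easier to merge with other data-frames or to export.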

# Problem n.2: import multiple .csv files simultaneously

Suppose we have a folder containing multiple .csv files with the same columns. It is possible to import them all at once with the following script:

# Get the file names (pattern is a regular expression, not a wildcard)
files <- list.files(pattern = "\\.csv$")
# First apply read.csv to each file, then stack the results with rbind
myfiles <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
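
The script above can be tried without any existing data by first writing two small files to a temporary folder. This is only a sketch: the folder, the file names `part1.csv` and `part2.csv`, and their contents are hypothetical. It also assumes that the files are not in the working directory, which is why `full.names = TRUE` is used.

```r
# Create a temporary folder with two hypothetical .csv files
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write.csv(data.frame(id = 1:2, value = c(0.5, 1.5)),
          file.path(tmp_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(2.5, 3.5)),
          file.path(tmp_dir, "part2.csv"), row.names = FALSE)

# List the files with their full paths, since they are not in
# the working directory
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a data-frame, then stack them row-wise
tables <- lapply(files, function(f) read.csv(f, stringsAsFactors = FALSE))
combined <- do.call(rbind, tables)

# Remove the temporary folder
unlink(tmp_dir, recursive = TRUE)
```

Stacking with rbind() only works when all files share the same column names, so it may be worth checking this before combining real data.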