# Aim of the Unit

This Unit aims to build a dataset from scratch in R.

# Functions for data-frames

A data-frame is a fundamental data structure in R. It is a collection of variables (stored as columns) and samples (stored as rows).

Data-frames can be built with the following functions:

| Function | Arguments | Description |
|---|---|---|
| c() | | Combine values into a vector or list (e.g. c(1:5) gives: 1, 2, 3, 4, 5) |
| rbind() | | Combine vectors or data-frames by rows |
| cbind() | | Combine vectors or data-frames by columns |
| rep() | | Replicate the elements of a vector or list |
| | times | Number of times to repeat the whole vector |
| | length.out | Exact number of elements in the output |
| | each | Number of times to repeat each element |
| seq() | | Create regular sequences of numbers |
| | from | Starting value of the sequence |
| | to | End value of the sequence |
| | by | Increment of the sequence |
| | length.out | Exact length of the sequence |
| sample() | | Take a random sample from a vector or list |
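
As a quick illustration of the functions listed above, the following sketch builds a few small vectors and combines them. The object names (a, b, s, r_times, and so on) are chosen only for this example, and the call to set.seed(1) is an assumption added so that sample() gives a repeatable result.

```r
# Combine values into a vector
a <- c(1:5)                                   # 1 2 3 4 5

# Regular sequences: fixed increment, or fixed output length
b <- seq(from = 2, to = 10, by = 2)           # 2 4 6 8 10
s <- seq(from = 0, to = 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# Replication: whole vector, each element, or up to a given length
r_times <- rep(c(1, 2), times = 3)            # 1 2 1 2 1 2
r_each  <- rep(c(1, 2), each = 3)             # 1 1 1 2 2 2
r_len   <- rep(c(1, 2, 3), length.out = 5)    # 1 2 3 1 2

# Binding two vectors of equal length
m_rows <- rbind(a, b)                         # matrix with 2 rows and 5 columns
m_cols <- cbind(a, b)                         # matrix with 5 rows and 2 columns

# Random permutation of a (the seed is arbitrary)
set.seed(1)
shuffled <- sample(a)
```

Note that rbind() and cbind() return a matrix when given plain vectors; they return a data-frame only when at least one argument is a data-frame.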

In this Unit, you will learn how to build a $2^3$ experimental design, with three categorical factors (two levels each) and three responses, to which some experimental noise can be added. The noise-free design looks like the following table:

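| FactorA | FactorB | FactorC | Response1 | Response2 | Response3 |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 3 | 6 | 9 |
| 1 | 1 | 2 | 5 | 10 | 15 |
| 1 | 2 | 1 | 2 | 5 | 8 |
| 1 | 2 | 2 | 4 | 9 | 14 |
| 2 | 1 | 1 | 6 | 10 | 14 |
| 2 | 1 | 2 | 8 | 14 | 20 |
| 2 | 2 | 1 | 6 | 10 | 14 |
| 2 | 2 | 2 | 8 | 14 | 20 |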

# Factors

An experimenter aims to understand the effect of a factor on a measurable variable. Factors may be fixed at certain levels (e.g. “all the experiments are performed at three different temperatures”), or taken randomly (e.g. the measurable variable is recorded at randomly chosen temperatures). In R, factors are handled with the following functions:

| Function | Arguments | Description |
|---|---|---|
| factor() | | Encode a vector as a factor |
| | x | A vector of data |
| | levels | An optional vector of the unique values that x might take |
| | labels | A character vector of labels for the levels |
| | ordered | Logical: should the levels be regarded as ordered, in the order given? |
| as.factor() | | Coerce its argument to a factor |
| is.factor() | | Return TRUE if its argument is a factor, FALSE otherwise |
| as.ordered() | | Coerce its argument to an ordered factor |
| is.ordered() | | Return TRUE if its argument is an ordered factor, FALSE otherwise |
| levels() | | levels(x) returns the levels of x; levels(x) <- c("a", ...) sets them |
| labels() | | Find a suitable set of labels from an object, for use in printing or plotting |
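
The sketch below illustrates the difference between an unordered and an ordered factor, using the functions from the table. The temperature values and the object names (temp, f, f_ord) are invented for this example only.

```r
# Temperatures recorded as text (illustrative data)
temp <- c("low", "high", "medium", "low", "high")

# Without 'levels', the levels are sorted alphabetically
f <- factor(temp)
levels(f)                    # "high" "low" "medium"

# Explicit levels fix the order; ordered = TRUE makes comparisons meaningful
f_ord <- factor(temp,
                levels = c("low", "medium", "high"),
                ordered = TRUE)
is.factor(f_ord)             # TRUE
is.ordered(f_ord)            # TRUE
is.ordered(f)                # FALSE
below_high <- f_ord < "high" # TRUE FALSE TRUE TRUE FALSE

# as.ordered() coerces an existing factor, keeping its current level order
f_coerced <- as.ordered(f)
is.ordered(f_coerced)        # TRUE

# The assignment form of levels() renames the levels in place
levels(f_ord) <- c("L", "M", "H")
as.character(f_ord)          # "L" "H" "M" "L" "H"
```

Setting the levels explicitly is usually the safer choice whenever the alphabetical order differs from the natural order of the categories.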

## Function factor()

For instance, the sequence in the column named FactorA can be reproduced with:

FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorA
## [1] 1 1 1 1 2 2 2 2

However, the object FactorA is still a plain numeric vector, not a factor:

is.factor(FactorA)
## [1] FALSE
is.vector(FactorA)
## [1] TRUE

To encode such an object as a factor, use the function factor() or as.factor():

FactorA <- factor(FactorA)
str(FactorA)
##  Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2

## Labels

It is possible to use more expressive level names with the argument labels:

FactorA <- factor(FactorA,
                  levels = c(1, 2),
                  labels = c("red", "green"))
str(FactorA)
##  Factor w/ 2 levels "red","green": 1 1 1 1 2 2 2 2

Although the levels are now displayed as character labels, the underlying integer codes of the factor are the same as before.
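
A minimal sketch of how to inspect this: as.integer() exposes the integer codes stored inside a factor, while as.character() returns the labels. The factor is rebuilt here so that the example runs on its own.

```r
# Rebuild the labelled factor from scratch
FactorA <- factor(rep(c(1, 2), each = 4),
                  levels = c(1, 2),
                  labels = c("red", "green"))

levels(FactorA)                  # "red" "green"
codes <- as.integer(FactorA)     # 1 1 1 1 2 2 2 2
labs  <- as.character(FactorA)   # "red" four times, then "green" four times
```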

# Responses

Experimental responses can be simulated as a linear combination of the factor levels. Moreover, some random noise can be added to simulate experimental uncertainty.

## Random numbers

Responses can be simulated with random numbers. In R, there are several functions for the generation of random numbers. Two of them are listed below:

| Function | Arguments | Description |
|---|---|---|
| rnorm() | | Normally distributed random numbers |
| | n | Number of values to be drawn |
| | mean | Mean of the Gaussian population |
| | sd | Standard deviation of the Gaussian population |
| runif() | | Uniformly distributed random numbers |
| | min | Lower limit of the distribution |
| | max | Upper limit of the distribution |

For instance, to draw one normally distributed value for each element of FactorA (the values change at every run):

Response1 <- rnorm(n = length(FactorA),
                   mean = 10,
                   sd = 1)
Response1
## [1] 10.594103 10.087919  9.918756 10.982557  9.927781  9.877853 10.521454
## [8]  9.820152
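
Because these functions return different values at every call, it can help to fix the state of the random number generator when a result has to be repeated. The sketch below uses set.seed(), which is not covered in the tables above; the seed value 123 and the object names are arbitrary choices for this example.

```r
n_runs <- 8

# Fix the generator state (arbitrary seed), then draw the numbers
set.seed(123)
noise_norm <- rnorm(n = n_runs, mean = 10, sd = 1)
noise_unif <- runif(n = n_runs, min = 0, max = 1)

# Resetting the same seed reproduces the same normal draws
set.seed(123)
noise_again <- rnorm(n = n_runs, mean = 10, sd = 1)
same_draws <- identical(noise_norm, noise_again)   # TRUE
```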

# Assemble a data-frame

Any number of responses and factors can be set as shown before. Here is the code for a $2^3$ factorial experiment:

FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorB <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 2,
               times = 2)
FactorC <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               times = 4)
Response1 <- 2*FactorA -
             2*FactorB +
             2*FactorC +
             FactorA*FactorB
Response2 <- 3*FactorA -
             2*FactorB +
             4*FactorC +
             FactorA*FactorB
Response3 <- 4*FactorA -
             2*FactorB +
             6*FactorC +
             FactorA*FactorB
df <- data.frame(A = as.factor(FactorA),
                 B = as.factor(FactorB),
                 C = as.factor(FactorC),
                 R1 = Response1,
                 R2 = Response2,
                 R3 = Response3)
df
##   A B C R1 R2 R3
## 1 1 1 1  3  6  9
## 2 1 1 2  5 10 15
## 3 1 2 1  2  5  8
## 4 1 2 2  4  9 14
## 5 2 1 1  6 10 14
## 6 2 1 2  8 14 20
## 7 2 2 1  6 10 14
## 8 2 2 2  8 14 20
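
The responses above are deterministic. One possible way to add experimental noise, sketched here for the first response only, is to sum a zero-mean Gaussian term generated with rnorm(). The standard deviation (noise_sd = 0.5), the seed, and the names Response1_noisy and df_noisy are assumptions made for this example, not values taken from the design.

```r
set.seed(42)   # arbitrary seed, only to make the example repeatable

# Same factor layout as in the 2^3 design
FactorA <- rep(c(1, 2), each = 4)
FactorB <- rep(c(1, 2), each = 2, times = 2)
FactorC <- rep(c(1, 2), times = 4)

# Deterministic part: same formula as Response1
Response1 <- 2*FactorA - 2*FactorB + 2*FactorC + FactorA*FactorB

# Additive Gaussian noise with an assumed standard deviation
noise_sd <- 0.5
Response1_noisy <- Response1 +
  rnorm(n = length(Response1), mean = 0, sd = noise_sd)

df_noisy <- data.frame(A = as.factor(FactorA),
                       B = as.factor(FactorB),
                       C = as.factor(FactorC),
                       R1 = Response1_noisy)
```

The same pattern applies to the other responses; a larger noise_sd makes the factor effects harder to distinguish from the noise.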

# Data overview

A preliminary overview of the data is given by the following functions:

| Function | Description |
|---|---|
| head() | Shows only the first rows of the dataset (six by default) |
| tail() | Shows only the last rows of the dataset (six by default) |
| class() | Shows the type of data |
| dim() | Shows the dimensions of the dataset (rows x columns) |
| ncol() | Number of columns |
| nrow() | Number of rows |
| length() | Number of elements in a vector |
| str() | Shows the structure of the dataset |
| names() | Shows the column names of a dataset |
| summary.default() | Shows some basic information on the dataset, such as the names of the columns, their length and their class |

For instance, to check the first few rows of the data-frame:

head(df)
##   A B C R1 R2 R3
## 1 1 1 1  3  6  9
## 2 1 1 2  5 10 15
## 3 1 2 1  2  5  8
## 4 1 2 2  4  9 14
## 5 2 1 1  6 10 14
## 6 2 1 2  8 14 20

To check the data structure:

str(df)
## 'data.frame':    8 obs. of  6 variables:
##  $ A : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
##  $ B : Factor w/ 2 levels "1","2": 1 1 2 2 1 1 2 2
##  $ C : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2
##  $ R1: num  3 5 2 4 6 8 6 8
##  $ R2: num  6 10 5 9 10 14 10 14
##  $ R3: num  9 15 8 14 14 20 14 20

To get a summary of the data-frame:

summary.default(df)
##    Length Class  Mode
## A  8      factor numeric
## B  8      factor numeric
## C  8      factor numeric
## R1 8      -none- numeric
## R2 8      -none- numeric
## R3 8      -none- numeric
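
The remaining functions of the table can be tried in the same way. The sketch below rebuilds a reduced version of the data-frame (factors and first response only) so that it runs on its own; the name df_small is chosen for this example.

```r
# Reduced copy of the design: three factors and one response
df_small <- data.frame(A = as.factor(rep(c(1, 2), each = 4)),
                       B = as.factor(rep(c(1, 2), each = 2, times = 2)),
                       C = as.factor(rep(c(1, 2), times = 4)),
                       R1 = c(3, 5, 2, 4, 6, 8, 6, 8))

class(df_small)          # "data.frame"
dim(df_small)            # 8 4  (rows, columns)
nrow(df_small)           # 8
ncol(df_small)           # 4
length(df_small$R1)      # 8 elements in the column R1
names(df_small)          # "A" "B" "C" "R1"
last_rows <- tail(df_small, n = 3)   # last three rows only
```

Note that length() applied to the whole data-frame returns the number of columns, not the number of rows.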

To change the column names:

names(df) <- c("FactorA",
               "FactorB",
               "FactorC",
               "Resp1",
               "Resp2",
               "Resp3")
df
##   FactorA FactorB FactorC Resp1 Resp2 Resp3
## 1       1       1       1     3     6     9
## 2       1       1       2     5    10    15
## 3       1       2       1     2     5     8
## 4       1       2       2     4     9    14
## 5       2       1       1     6    10    14
## 6       2       1       2     8    14    20
## 7       2       2       1     6    10    14
## 8       2       2       2     8    14    20
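
After renaming, the columns are accessed through their new names. The sketch below also shows one way to rename a single column by matching its current name; the small data-frame df_demo and the replacement name "Yield" are hypothetical and serve only as an illustration.

```r
# Small stand-in data-frame (hypothetical values)
df_demo <- data.frame(A = as.factor(c(1, 1, 2, 2)),
                      R1 = c(3, 5, 6, 8))

# Rename all columns at once
names(df_demo) <- c("FactorA", "Resp1")
resp <- df_demo$Resp1            # 3 5 6 8
lev  <- levels(df_demo$FactorA)  # "1" "2"

# Rename one column only, selected by its current name
names(df_demo)[names(df_demo) == "Resp1"] <- "Yield"
names(df_demo)                   # "FactorA" "Yield"
```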