This Unit shows how to build a dataset from scratch.
A dataframe is a fundamental data structure in R. It is a collection of variables (as columns) and samples (as rows).
Dataframes can be built by the following functions:
Function  Arguments   Description
c()                   Combine values into a vector or list (e.g. c(1:5) gives 1, 2, 3, 4, 5)
rbind()               Combine vectors or dataframes by rows
cbind()               Combine vectors or dataframes by columns
rep()                 Replicate the elements of a vector or list
          times       Number of times to repeat the whole vector
          length.out  Exact number of elements in the output
          each        Number of times to repeat each element
seq()                 Create regular sequences of numbers
          from        Starting value of the sequence
          to          End value of the sequence
          by          Increment of the sequence
          length.out  Exact length of the sequence
sample()              Take a random sample from a vector or list
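As a quick illustration of the functions in the table above, the following sketch applies each of them to two small vectors. The names v1, v2, by_rows, by_cols and shuffled, and the seed value, are arbitrary choices made only for this example.

```r
# Combine values into vectors
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

# Stack the vectors as rows (2 x 3 matrix) or as columns (3 x 2 matrix)
by_rows <- rbind(v1, v2)
by_cols <- cbind(v1, v2)

# Repeat the whole vector twice, or each element twice
rep(v1, times = 2)   # 1 2 3 1 2 3
rep(v1, each = 2)    # 1 1 2 2 3 3

# Regular sequences: fixed increment, or fixed length
seq(from = 0, to = 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
seq(from = 0, to = 1, length.out = 3)   # 0.0 0.5 1.0

# Random permutation of v1; the result changes from run to run
# unless the seed is fixed (the seed value here is arbitrary)
set.seed(1)
shuffled <- sample(v1)
```

Note that rbind() and cbind() return a matrix when they are given plain vectors; they return a dataframe only when at least one argument is a dataframe.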
In this Unit, you will learn how to build a ${2}^{3}$ experimental design, with three categorical factors and three responses, each affected by some experimental noise, like the following table:
FactorA  FactorB  FactorC  Response1  Response2  Response3 

1  1  1  3  6  9 
1  1  2  5  10  15 
1  2  1  2  5  8 
1  2  2  4  9  14 
2  1  1  6  10  14 
2  1  2  8  14  20 
2  2  1  6  10  14 
2  2  2  8  14  20 
An experimenter aims to understand the effect of a factor on a measurable variable. Factors may be fixed at certain levels (e.g. “all the experiments are performed at three different temperatures”), or taken randomly (the measurable variable is recorded at randomly chosen temperatures). In R, factors are handled with the following functions:
Function      Arguments  Description
factor()                 Encode a vector as a factor.
              x          A vector of data.
              levels     A vector of the unique values that x may take.
              labels     A character vector of labels for the levels.
              ordered    Logical: should the levels be treated as ordered, in the order given?
as.factor()              Coerces its argument to a factor.
is.factor()              Returns TRUE if its argument is a factor, FALSE otherwise.
as.ordered()             Coerces its argument to an ordered factor.
is.ordered()             Returns TRUE if its argument is an ordered factor, FALSE otherwise.
levels()                 levels(x) returns the levels of x; levels(x) <- c("a", ...) sets them.
labels()                 Finds a suitable set of labels from an object for use in printing or plotting.
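The difference between an unordered and an ordered factor can be sketched as follows. The temperature data and the names temperature, temp_f and temp_o are hypothetical and serve only as an illustration.

```r
# Hypothetical temperature settings, used only for illustration
temperature <- c("low", "high", "medium", "low", "high")

# Unordered factor: levels are categories without a ranking
temp_f <- factor(temperature, levels = c("low", "medium", "high"))
is.factor(temp_f)    # TRUE
is.ordered(temp_f)   # FALSE

# Ordered factor: levels are ranked in the order given
temp_o <- factor(temperature,
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)
is.ordered(temp_o)      # TRUE
temp_o[1] < temp_o[2]   # TRUE, because "low" ranks below "high"

# Rename the levels of the unordered factor in place
levels(temp_f) <- c("L", "M", "H")
levels(temp_f)          # "L" "M" "H"
```

Comparison operators such as < are only meaningful for ordered factors; applied to an unordered factor they return NA with a warning.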
factor()
For instance, the sequence in the column named FactorA can be reproduced with:
FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorA
## [1] 1 1 1 1 2 2 2 2
However, the object FactorA is a plain numeric vector, not a factor:
is.factor(FactorA)
## [1] FALSE
is.vector(FactorA)
## [1] TRUE
To encode such an object as a factor, use the function factor or as.factor:
FactorA <- factor(FactorA)
str(FactorA)
## Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
It is possible to use more expressive level names with the argument labels:
FactorA <- factor(FactorA,
                  levels = c(1, 2),
                  labels = c("red", "green"))
str(FactorA)
## Factor w/ 2 levels "red","green": 1 1 1 1 2 2 2 2
Although the levels are now displayed as character labels, the underlying integer codes are the same as before.
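The relation between labels and underlying codes can be checked directly. This is a minimal sketch that rebuilds the factor in one step, using the shorthand 1:2 in place of seq(from = 1, to = 2, by = 1).

```r
# Rebuild the labelled factor in a single call
FactorA <- factor(rep(1:2, each = 4),
                  levels = c(1, 2),
                  labels = c("red", "green"))

levels(FactorA)         # "red" "green"
nlevels(FactorA)        # 2
as.integer(FactorA)     # 1 1 1 1 2 2 2 2  (the internal codes)
as.character(FactorA)   # "red" "red" "red" "red" "green" "green" "green" "green"
```

as.integer() returns the position of each value within levels(FactorA), not the original data, so it recovers the original numbers here only because the levels were 1 and 2 to begin with.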
Experimental responses can be simulated as a linear combination of the factor levels. Moreover, random noise can be added to the responses to simulate experimental uncertainty.
Responses can be simulated with random numbers. In R, there are several functions for the generation of random numbers. A couple of these functions are the following:
Function  Arguments  Description
rnorm()              Normally distributed random numbers.
          n          Number of values to be drawn.
          mean       Mean of the Gaussian population.
          sd         Standard deviation of the Gaussian population.
runif()              Uniformly distributed random numbers.
          min        Lower limit of the distribution.
          max        Upper limit of the distribution.
Response1 <- rnorm(n = length(FactorA),
                   mean = 10,
                   sd = 1)
Response1
## [1] 10.594103 10.087919 9.918756 10.982557 9.927781 9.877853 10.521454
## [8] 9.820152
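Because these numbers are random, every run produces different values. A common way to make a simulation reproducible is to fix the seed of the random number generator with set.seed() before drawing. The sketch below uses runif(); the seed value 42 and the names a and b are arbitrary.

```r
# Fix the seed, then draw 8 uniform random numbers between 0 and 1
set.seed(42)
a <- runif(n = 8, min = 0, max = 1)

# Resetting the same seed reproduces exactly the same draw
set.seed(42)
b <- runif(n = 8, min = 0, max = 1)

identical(a, b)        # TRUE
all(a >= 0 & a <= 1)   # TRUE: all values lie within [min, max]
```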
Any number of responses and factors can be set as shown before. Here is the code for a ${2}^{3}$ factorial experiment:
# Factor columns: A changes slowest, C changes fastest
FactorA <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 4)
FactorB <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               each = 2,
               times = 2)
FactorC <- rep(seq(from = 1,
                   to = 2,
                   by = 1),
               times = 4)
# Responses: linear terms plus an A x B interaction
Response1 <- 2*FactorA -
  2*FactorB +
  2*FactorC +
  FactorA*FactorB
Response2 <- 3*FactorA -
  2*FactorB +
  4*FactorC +
  FactorA*FactorB
Response3 <- 4*FactorA -
  2*FactorB +
  6*FactorC +
  FactorA*FactorB
# Assemble the dataframe, encoding the factor columns as factors
df <- data.frame(A = as.factor(FactorA),
                 B = as.factor(FactorB),
                 C = as.factor(FactorC),
                 R1 = Response1,
                 R2 = Response2,
                 R3 = Response3)
df
## A B C R1 R2 R3
## 1 1 1 1 3 6 9
## 2 1 1 2 5 10 15
## 3 1 2 1 2 5 8
## 4 1 2 2 4 9 14
## 5 2 1 1 6 10 14
## 6 2 1 2 8 14 20
## 7 2 2 1 6 10 14
## 8 2 2 2 8 14 20
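The responses in this table are deterministic. One possible way to add experimental noise, sketched here for the first response only, is to add zero-mean Gaussian noise generated with rnorm(). The coefficients are the ones that reproduce the R1 column above; the noise level (sd = 0.5), the seed, and the names ideal and noisy are assumptions made for this illustration.

```r
set.seed(123)   # arbitrary seed, for reproducibility only

# Factor levels of the 2^3 design (1:2 is shorthand for seq(1, 2, by = 1))
FactorA <- rep(1:2, each = 4)
FactorB <- rep(1:2, each = 2, times = 2)
FactorC <- rep(1:2, times = 4)

# Noise-free response, matching the R1 column: 3 5 2 4 6 8 6 8
ideal <- 2*FactorA - 2*FactorB + 2*FactorC + FactorA*FactorB

# Add zero-mean Gaussian noise; sd = 0.5 is an assumed noise level
noisy <- ideal + rnorm(n = length(ideal), mean = 0, sd = 0.5)

# Deviation of each simulated measurement from its ideal value
round(noisy - ideal, 2)
```

A larger sd makes the factor effects harder to distinguish from the noise, which is often the point of simulating it.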
A preliminary overview of data is given by the following functions:
Function           Description
head()             Shows the first rows of the dataset (six by default)
tail()             Shows the last rows of the dataset (six by default)
class()            Shows the type of data
dim()              Shows the dimensions of the dataset (rows x columns)
ncol()             Number of columns
nrow()             Number of rows
length()           Number of elements in a vector
str()              Shows the structure of the dataset
names()            Shows the column names of a dataset
summary.default()  Shows some basic information on the dataset, such as column names, length and class
For instance, to check the first few rows of the dataframe:
head(df)
## A B C R1 R2 R3
## 1 1 1 1 3 6 9
## 2 1 1 2 5 10 15
## 3 1 2 1 2 5 8
## 4 1 2 2 4 9 14
## 5 2 1 1 6 10 14
## 6 2 1 2 8 14 20
To check data structure:
str(df)
## 'data.frame': 8 obs. of 6 variables:
## $ A : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
## $ B : Factor w/ 2 levels "1","2": 1 1 2 2 1 1 2 2
## $ C : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2
## $ R1: num 3 5 2 4 6 8 6 8
## $ R2: num 6 10 5 9 10 14 10 14
## $ R3: num 9 15 8 14 14 20 14 20
To have a summary of the dataframe:
summary.default(df)
## Length Class Mode
## A 8 factor numeric
## B 8 factor numeric
## C 8 factor numeric
## R1 8 none numeric
## R2 8 none numeric
## R3 8 none numeric
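The remaining overview functions work in the same way. The sketch below applies them to a reduced, four-row dataframe (the rows of the design with C = 1), so that it can be run on its own; the dataframe is rebuilt here only for illustration.

```r
# Reduced example: the four runs of the design with C = 1
df <- data.frame(A = factor(c(1, 1, 2, 2)),
                 B = factor(c(1, 2, 1, 2)),
                 R1 = c(3, 2, 6, 6))

class(df)           # "data.frame"
dim(df)             # 4 3  (rows, columns)
nrow(df)            # 4
ncol(df)            # 3
length(df$R1)       # 4
names(df)           # "A" "B" "R1"
tail(df, n = 2)     # last two rows only
sapply(df, class)   # class of each column: factor, factor, numeric
```

Note that length() applied to a whole dataframe returns the number of columns, not the number of rows.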
To change the column names:
names(df) <- c("FactorA",
               "FactorB",
               "FactorC",
               "Resp1",
               "Resp2",
               "Resp3")
df
## FactorA FactorB FactorC Resp1 Resp2 Resp3
## 1 1 1 1 3 6 9
## 2 1 1 2 5 10 15
## 3 1 2 1 2 5 8
## 4 1 2 2 4 9 14
## 5 2 1 1 6 10 14
## 6 2 1 2 8 14 20
## 7 2 2 1 6 10 14
## 8 2 2 2 8 14 20
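When only one column needs a new name, it is not necessary to retype the whole vector of names. A minimal sketch, using a hypothetical two-row dataframe, selects the column by its current name:

```r
# Hypothetical two-row dataframe, for illustration only
df <- data.frame(A = factor(c(1, 2)), R1 = c(3, 6))

# Rename only the column currently called "R1"
names(df)[names(df) == "R1"] <- "Resp1"
names(df)   # "A" "Resp1"
```

Selecting by name rather than by position keeps the code correct even if the column order changes later.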