Manage data

Select

Portions of a dataset can be selected with the synthax X[row,column]. For instance:

Synthax Result
iris[,] Select all rows and all columns
iris[1,1] Select only the first row and the first column
iris[1,] Select the first row
iris[,1] Select the first column
iris[1:10,1] Select the first ten elements, in the first column
iris[,1:3] Select all the rows, in the first three columns
iris[,c(1,3)] Select all the rows, in the 1st and 3rd column

Filter

Selection of the elements stored in a dataset can be performed according to specific creteria.

Synthax Result
iris[iris[,1]>6,] Select values that in the first column show values HIGHER THAN 6.
iris[iris[,1]>6 & iris[,1]<4,] Select values that in the first column show values higher than 6 AND lower than 4.
iris[iris[,1]==6,] Select values that in the first column show values EQUAL to 6.
iris[iris[,1]>7.5 | iris[,1]<4.5,] Select values that in the first column show values higher than … OR lower than …
iris[iris[,5] != "virginica",] Select values that in the fourth column show values NOT EQUAL to virginica.
iris[,grep("S", names(iris), value=TRUE)] Select variables whose names contain the letter “S”.

It follows a list of operators:

Synthax Result
!= x NOT equal
x & y AND
x | y OR
xor(x, y) Result
> x higher than
>= x equal or higher than
< x lower than
<= x equal or lower than

It follows some metacharacters used by the grep() function to find characters:

Metacharacter Meaning
. matches any single character (i.e.: grep("Sep.", names(iris), value=TRUE))
^ start with (i.e.: grep("^S", ...))
\s space characters
\d number characters
[:upper:] upper case letters
[:lower:] lower case letters
[a-f] matches any letter between … and …
[^a] matches any letter except …
| or …

Subset

Filtering can be also obtained with the subset function:

subset(x = iris, subset = iris[,1]>7.5, select = 1:5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 106          7.6         3.0          6.6         2.1 virginica
## 118          7.7         3.8          6.7         2.2 virginica
## 119          7.7         2.6          6.9         2.3 virginica
## 123          7.7         2.8          6.7         2.0 virginica
## 132          7.9         3.8          6.4         2.0 virginica
## 136          7.7         3.0          6.1         2.3 virginica

Replace

The replacement of a value with an other use the squared bracket syntax. Given the vector:

d <- c(4, 1, 6, 7, 2, 5); d
## [1] 4 1 6 7 2 5

Replace the second element with the value “4”:

d[2] <- 4; d
## [1] 4 4 6 7 2 5

Replace all the element higher than “5” with the value “4”:

d[d[]>5] <- 4; d
## [1] 4 4 4 4 2 5

Given a data.frame:

A <- c(1:5)
B <- A^2
C <- B-A
D <- data.frame(cbind(A, B, C)); D
##   A  B  C
## 1 1  1  0
## 2 2  4  2
## 3 3  9  6
## 4 4 16 12
## 5 5 25 20

Replace all the value higher than “9” with the character “NA”:

D[D[]>9] <- NA; D
##   A  B  C
## 1 1  1  0
## 2 2  4  2
## 3 3  9  6
## 4 4 NA NA
## 5 5 NA NA

Sort

Dataset can be sorted ascending or descending by the function order():

head(iris[order(iris[,1]),])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 14          4.3         3.0          1.1         0.1  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 4           4.6         3.1          1.5         0.2  setosa

Transform

Add new columns with the function transform(). The new columns can be a combination of other columns:

# Add new variables
head(transform(iris, A = cumsum(iris[,1]), 
                     B = cumsum(iris[,2]), 
                     C = cumsum(iris[,1])))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    A    B    C
## 1          5.1         3.5          1.4         0.2  setosa  5.1  3.5  5.1
## 2          4.9         3.0          1.4         0.2  setosa 10.0  6.5 10.0
## 3          4.7         3.2          1.3         0.2  setosa 14.7  9.7 14.7
## 4          4.6         3.1          1.5         0.2  setosa 19.3 12.8 19.3
## 5          5.0         3.6          1.4         0.2  setosa 24.3 16.4 24.3
## 6          5.4         3.9          1.7         0.4  setosa 29.7 20.3 29.7

Stack

A wide-format dataset:

A <- rnorm(n = 4, mean = 10 , sd = 1)
B <- rnorm(n = 4, mean = 20 , sd = 2)
C <- rnorm(n = 4, mean = 30 , sd = 3)
X <- round(data.frame(A, B, C), digits = 2); X
##       A     B     C
## 1 10.47 18.21 32.19
## 2 10.08 19.09 31.08
## 3  7.32 22.94 27.08
## 4 10.70 24.09 31.01

can be reshaped in long-format by the function stack():

Xst <- stack(X); Xst
##    values ind
## 1   10.47   A
## 2   10.08   A
## 3    7.32   A
## 4   10.70   A
## 5   18.21   B
## 6   19.09   B
## 7   22.94   B
## 8   24.09   B
## 9   32.19   C
## 10  31.08   C
## 11  27.08   C
## 12  31.01   C

Reshape it back is possible with the function unstack():

unstack(Xst)
##       A     B     C
## 1 10.47 18.21 32.19
## 2 10.08 19.09 31.08
## 3  7.32 22.94 27.08
## 4 10.70 24.09 31.01

We can also resape only a selection of columns:

stack(X, select = c("A", "B"))
##   values ind
## 1  10.47   A
## 2  10.08   A
## 3   7.32   A
## 4  10.70   A
## 5  18.21   B
## 6  19.09   B
## 7  22.94   B
## 8  24.09   B

The stack() function automatically names the new variables values and ind. Such names can be changed with the function names():

names(Xst) <- c("Colore", "Sapore"); Xst
##    Colore Sapore
## 1   10.47      A
## 2   10.08      A
## 3    7.32      A
## 4   10.70      A
## 5   18.21      B
## 6   19.09      B
## 7   22.94      B
## 8   24.09      B
## 9   32.19      C
## 10  31.08      C
## 11  27.08      C
## 12  31.01      C

More advanced functions for reshaping data can be found in the base-R with reshape() or in other external packages, such as data.table and dplyr.