Portions of a dataset can be selected with the synthax X[row,column]
. For instance:
Synthax | Result |
---|---|
iris[,] |
Select all rows and all columns |
iris[1,1] |
Select only the first row and the first column |
iris[1,] |
Select the first row |
iris[,1] |
Select the first column |
iris[1:10,1] |
Select the first ten elements, in the first column |
iris[,1:3] |
Select all the rows, in the first three columns |
iris[,c(1,3)] |
Select all the rows, in the 1st and 3rd column |
Selection of the elements stored in a dataset can be performed according to specific creteria.
Synthax | Result |
---|---|
iris[iris[,1]>6,] |
Select values that in the first column show values HIGHER THAN 6. |
iris[iris[,1]>6 & iris[,1]<4,] |
Select values that in the first column show values higher than 6 AND lower than 4. |
iris[iris[,1]==6,] |
Select values that in the first column show values EQUAL to 6. |
iris[iris[,1]>7.5 | iris[,1]<4.5,] |
Select values that in the first column show values higher than … OR lower than … |
iris[iris[,5] != "virginica",] |
Select values that in the fourth column show values NOT EQUAL to virginica. |
iris[,grep("S", names(iris), value=TRUE)] |
Select variables whose names contain the letter “S”. |
It follows a list of operators:
Synthax | Result |
---|---|
!= x |
NOT equal |
x & y |
AND |
x | y |
OR |
xor(x, y) |
Result |
> x |
higher than |
>= x |
equal or higher than |
< x |
lower than |
<= x |
equal or lower than |
It follows some metacharacters used by the grep()
function to find characters:
Metacharacter | Meaning |
---|---|
. |
matches any single character (i.e.: grep("Sep.", names(iris), value=TRUE) ) |
^ |
start with (i.e.: grep("^S", ...) ) |
\s |
space characters |
\d |
number characters |
[:upper:] |
upper case letters |
[:lower:] |
lower case letters |
[a-f] |
matches any letter between … and … |
[^a] |
matches any letter except … |
| |
or … |
Filtering can be also obtained with the subset
function:
subset(x = iris, subset = iris[,1]>7.5, select = 1:5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 106 7.6 3.0 6.6 2.1 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 136 7.7 3.0 6.1 2.3 virginica
The replacement of a value with an other use the squared bracket syntax. Given the vector:
d <- c(4, 1, 6, 7, 2, 5); d
## [1] 4 1 6 7 2 5
Replace the second element with the value “4”:
d[2] <- 4; d
## [1] 4 4 6 7 2 5
Replace all the element higher than “5” with the value “4”:
d[d[]>5] <- 4; d
## [1] 4 4 4 4 2 5
Given a data.frame:
A <- c(1:5)
B <- A^2
C <- B-A
D <- data.frame(cbind(A, B, C)); D
## A B C
## 1 1 1 0
## 2 2 4 2
## 3 3 9 6
## 4 4 16 12
## 5 5 25 20
Replace all the value higher than “9” with the character “NA”:
D[D[]>9] <- NA; D
## A B C
## 1 1 1 0
## 2 2 4 2
## 3 3 9 6
## 4 4 NA NA
## 5 5 NA NA
Dataset can be sorted ascending or descending by the function order()
:
head(iris[order(iris[,1]),])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 14 4.3 3.0 1.1 0.1 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 4 4.6 3.1 1.5 0.2 setosa
Add new columns with the function transform()
. The new columns can be a combination of other columns:
# Add new variables
head(transform(iris, A = cumsum(iris[,1]),
B = cumsum(iris[,2]),
C = cumsum(iris[,1])))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species A B C
## 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 5.1
## 2 4.9 3.0 1.4 0.2 setosa 10.0 6.5 10.0
## 3 4.7 3.2 1.3 0.2 setosa 14.7 9.7 14.7
## 4 4.6 3.1 1.5 0.2 setosa 19.3 12.8 19.3
## 5 5.0 3.6 1.4 0.2 setosa 24.3 16.4 24.3
## 6 5.4 3.9 1.7 0.4 setosa 29.7 20.3 29.7
A wide-format dataset:
A <- rnorm(n = 4, mean = 10 , sd = 1)
B <- rnorm(n = 4, mean = 20 , sd = 2)
C <- rnorm(n = 4, mean = 30 , sd = 3)
X <- round(data.frame(A, B, C), digits = 2); X
## A B C
## 1 10.47 18.21 32.19
## 2 10.08 19.09 31.08
## 3 7.32 22.94 27.08
## 4 10.70 24.09 31.01
can be reshaped in long-format by the function stack()
:
Xst <- stack(X); Xst
## values ind
## 1 10.47 A
## 2 10.08 A
## 3 7.32 A
## 4 10.70 A
## 5 18.21 B
## 6 19.09 B
## 7 22.94 B
## 8 24.09 B
## 9 32.19 C
## 10 31.08 C
## 11 27.08 C
## 12 31.01 C
Reshape it back is possible with the function unstack()
:
unstack(Xst)
## A B C
## 1 10.47 18.21 32.19
## 2 10.08 19.09 31.08
## 3 7.32 22.94 27.08
## 4 10.70 24.09 31.01
We can also resape only a selection of columns:
stack(X, select = c("A", "B"))
## values ind
## 1 10.47 A
## 2 10.08 A
## 3 7.32 A
## 4 10.70 A
## 5 18.21 B
## 6 19.09 B
## 7 22.94 B
## 8 24.09 B
The stack()
function automatically names the new variables values
and ind
. Such names can be changed with the function names()
:
names(Xst) <- c("Colore", "Sapore"); Xst
## Colore Sapore
## 1 10.47 A
## 2 10.08 A
## 3 7.32 A
## 4 10.70 A
## 5 18.21 B
## 6 19.09 B
## 7 22.94 B
## 8 24.09 B
## 9 32.19 C
## 10 31.08 C
## 11 27.08 C
## 12 31.01 C
More advanced functions for reshaping data can be found in the base-R with reshape()
or in other external packages, such as data.table
and dplyr
.