Author: Karl Brand, Elizabeth Ribble and Sten Willemsen

# Manipulating / Selecting Data
Often we want to calculate something only for a specific subgroup of our patients. For example, we may want the average age and weight for the women in the data set but not for the men. To do this we need to select specific variables and observations. In R, making these selections is called indexing. We start by showing how to make selections within a vector. Afterwards we will see how to select variables and observations in more complex data structures such as matrices, data.frames and lists.
There are three ways in which we can make a selection: by position, by name, and with a logical vector.
The easiest way of selecting elements in a vector is to specify the positions (indices) `i` of the elements we wish to select between square brackets, which are the 'extract' function, i.e., `object[i]`.
v <- c(11, 13, 17, 15, 9)
v[2]
[1] 13
We can also do this for multiple elements at the same time:
v[c(2, 3)]
[1] 13 17
i <- c(1, 2)
v[i]
[1] 11 13
The vector of indices does not have to be sorted and indices may be duplicated. For example:
v[c(3, 2, 2)]
[1] 17 13 13
When we use a vector of negative integers we select all elements except those at the specified positions:
v[-2]
[1] 11 17 15  9
Note that positive and negative indices cannot be combined in the same selection.
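As a small illustration of this rule, the sketch below wraps a mixed selection in tryCatch so that the error is caught instead of stopping the script. The name `mixed` is introduced here only for illustration.

```r
v <- c(11, 13, 17, 15, 9)

# Mixing a negative and a positive index raises an error;
# tryCatch turns it into a value we can inspect.
mixed <- tryCatch(v[c(-1, 2)], error = function(e) "error")
mixed

# To drop element 1 and then take the second remaining element,
# apply the two selections one after the other.
v[-1][2]
```

Chaining two selections in this way is usually the simplest workaround when both kinds of index are needed.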
When the elements in a vector all have a name we can use these names to select the elements.
bp_with_name <- c(sys=135, dia=85)
bp_with_name['dia']
dia 
 85 
We can also give names to a vector afterwards by using the names function, which sets the names attribute:
ages <- c(12, 3, 45)
names(ages) <- c("Kim" , 'Arthur', 'Mark')
ages
   Kim Arthur   Mark 
    12      3     45 
str(ages)
 Named num [1:3] 12 3 45
 - attr(*, "names")= chr [1:3] "Kim" "Arthur" "Mark"
ages["Mark"]
Mark 
  45 
names(ages)
[1] "Kim"    "Arthur" "Mark"
Note that we can also change the names of a data.frame in the same way we would do so for a list using the names attribute:
mygenes <- data.frame(samp1 = c(33, 22, 12),
                      samp2 = c(44, 111, 13),
                      samp3 = c(33, 53, 65))
names(mygenes) <- c("samp10", "samp20", "samp30")
mygenes
  samp10 samp20 samp30
1     33     44     33
2     22    111     53
3     12     13     65
## but let's change it back...
names(mygenes) <- c("samp1", "samp2", "samp3")
Note that the colnames function also performs the same job for data frames.
The third way to select elements from a vector is to put a vector of logical values (i.e., TRUE/FALSE values) between the square brackets. This selects all elements for which the corresponding logical value is TRUE.
some_values <- c('foo', 'bar', 'baz')
some_values[c(TRUE, FALSE, TRUE)]
[1] "foo" "baz"
This way of selecting is especially useful when we use a comparison (using <, <=, !=, etc.) or another condition to select elements.
ages <- c(55, 78, 92, 44)
sex <- factor(c('Male', 'Female', 'Male', 'Female'))
ages[ages > 65]
[1] 78 92
# When we use a factor as filter we may compare to a character value
ages[sex == 'Female']
[1] 78 44
To understand why this works, look at where the condition between the brackets is TRUE.
ages > 65
[1] FALSE  TRUE  TRUE FALSE
which(ages > 65) # gives the TRUE indices
[1] 2 3
Note that when we apply filtering to a factor variable, the set of possible levels is not changed:
levels(sex[sex == 'Female'])
[1] "Female" "Male"
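If the unused levels are unwanted, for example in a later table, they can be removed explicitly. The following is a minimal sketch using the base R function droplevels; the names `females` and `females_only` are chosen here for illustration.

```r
sex <- factor(c('Male', 'Female', 'Male', 'Female'))

females <- sex[sex == 'Female']
levels(females)        # both levels are still present
table(females)         # 'Male' is listed with a count of 0

females_only <- droplevels(females)
levels(females_only)   # only the level that actually occurs
```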
Making selections in a matrix works more or less the same as for vectors. But because we have two dimensions we need two indices: the first for the rows and the second for the columns.
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), ncol = 3)
m[1, ] # first row
[1] 1 5 9
m[ , 2] # second column
[1] 5 6 7 8
m[c(2, 4), 1:2] # combination of rows and columns
     [,1] [,2]
[1,]    2    6
[2,]    4    8
m[, -c(1, 3)] # negative indices
[1] 5 6 7 8
m[m[, 1] >= 3, ] # logical indices
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
Note that when a single row or column is selected the result is converted to a vector, which is a frequent source of errors. If you want to prevent this you can use drop=FALSE.
m[1, , drop=FALSE]
     [,1] [,2] [,3]
[1,]    1    5    9
We can also use names; note that a matrix can have rownames as well as colnames:
rownames(m) <- LETTERS[1:4]
m
  [,1] [,2] [,3]
A    1    5    9
B    2    6   10
C    3    7   11
D    4    8   12
m["A", ]
[1] 1 5 9
There is also a special way of selecting individual elements of a matrix by their two-dimensional coordinates: index with a matrix that has two columns, where each row holds the row and column position of one element.
a <- matrix(c(2, 3, 3, 2), ncol = 2)
a
     [,1] [,2]
[1,]    2    3
[2,]    3    2
m[a]
[1] 10  7
Indexing arrays works in the same way as indexing matrices, with one index per dimension.
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),dim = c(2,3,4))
dimnames(a) <- list(NULL, c('A', 'B', 'C'), c('a', 'b', 'c', 'd'))
a
, , a

     A B C
[1,] 1 3 5
[2,] 2 4 6

, , b

     A  B  C
[1,] 7  9 11
[2,] 8 10 12

, , c

     A B C
[1,] 1 3 5
[2,] 2 4 6

, , d

     A  B  C
[1,] 7  9 11
[2,] 8 10 12

colnames(a) # still 2nd dimension
[1] "A" "B" "C"
a[1, 2, ] # 3-dim array requires 3 positions between the brackets
a b c d 
3 9 3 9 
a[,,'a']
     A B C
[1,] 1 3 5
[2,] 2 4 6
To make selections in a list, we can use single square brackets, double square brackets and the dollar sign. The use of single square brackets works in the same way as it does for vectors.
mylist <- list(
  foo=c(1,2, 3),
  bar=c('a', 'b'),
  baz=list(TRUE, c(2, 4))
)
mylist[c(1,3)]
$foo
[1] 1 2 3

$baz
$baz[[1]]
[1] TRUE

$baz[[2]]
[1] 2 4

mylist[c(1,2,3)==2]
$bar
[1] "a" "b"

mylist['bar']
$bar
[1] "a" "b"
We can also use double square brackets and the dollar sign for a list. There are two important differences between using single and double square brackets:
1. Using double square brackets only allows us to select a single element from the list.
2. The result of a selection with double brackets is the element itself, while a selection with single brackets returns a list consisting of the selected elements.
mylist[[1]]
[1] 1 2 3
mode(mylist[[1]]) # a vector
[1] "numeric"
mode(mylist[1]) # a list with a numeric vector as its single element
[1] "list"
There is another way to select an element from a list, which is by using the dollar sign ($). This is an alternative to using double square brackets in combination with a name. As with double square brackets, the result is the element itself (here a vector) rather than a list.
mylist$foo # result is a vector
[1] 1 2 3
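One difference between the two notations is worth keeping in mind: for lists, the dollar sign accepts an unambiguous abbreviation of a name, whereas double square brackets match names exactly unless told otherwise. A minimal sketch, reusing the list defined above:

```r
mylist <- list(
  foo = c(1, 2, 3),
  bar = c('a', 'b'),
  baz = list(TRUE, c(2, 4))
)

mylist$f                      # 'f' uniquely abbreviates 'foo'
mylist$ba                     # ambiguous ('bar' or 'baz'), so NULL
mylist[["f"]]                 # exact matching by default, so NULL
mylist[["f", exact = FALSE]]  # partial matching on request
```

Because a mistyped or shortened name can silently return NULL or the wrong element, writing names out in full is the safer habit.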
Instead of the dollar sign we can use double square brackets. The name must then be put between quotes, just as for single square brackets. With double square brackets we can also select a variable by its position.
mydata <- data.frame(id=c(1, 2, 3, 4, 5),
                     sex=factor(c('M', 'F', 'M', 'F', 'F')),
                     weight=c(77, 44, 56, 88, 49),
                     treatm=c('A', 'A', 'B', 'B', 'A'))
mydata[['treatm']]
[1] "A" "A" "B" "B" "A"
mydata[[1]]
[1] 1 2 3 4 5
It is not possible to select more than one variable using double brackets. For a data.frame the result is always a single column, returned as a vector (instead of a data.frame).
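A related detail: for a list, an index vector of length greater than one inside double brackets is not an error. It is applied recursively, one level at a time, and so still yields a single element. A minimal sketch, reusing the list from the earlier examples:

```r
mylist <- list(foo = c(1, 2, 3),
               bar = c('a', 'b'),
               baz = list(TRUE, c(2, 4)))

mylist[[c(3, 2)]]   # recursive: same as mylist[[3]][[2]]
mylist[[3]][[2]]
mylist[c(3, 2)]     # single brackets: a list holding elements 3 and 2
```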
## data.frame

Selecting observations and variables in a data.frame works more or less the same as it does for lists. However, because a data.frame is two-dimensional we can use two indices between the square brackets. The first one corresponds to the observations (rows) and the second to the variables (columns). So, as an example, we can select the first two observations of the third variable in the data.frame using the syntax:
mydata <- data.frame(id=c(1, 2, 3, 4, 5),
                     sex=factor(c('M', 'F', 'M', 'F', 'F')),
                     weight=c(77, 44, 56, 88, 49))
mydata[c(1, 2), 3]
[1] 77 44
When the first or second position is left blank all rows or columns are selected. For example:
mydata[, 3] # weight (3rd variable) for all patients
[1] 77 44 56 88 49
mydata[c(1,2), ] # all variables for the first two patients
  id sex weight
1  1   M     77
2  2   F     44
mydata[c(-3,-4), 'sex'] # Negative numbers and names can also be used
[1] M F F
Levels: F M
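Logical indices work for the rows of a data.frame as well, which gives us the subgroup calculation described in the introduction. The sketch below computes the mean weight of the women only; the names `is_female` and `mean_weight_f` are chosen here for illustration.

```r
mydata <- data.frame(id = c(1, 2, 3, 4, 5),
                     sex = factor(c('M', 'F', 'M', 'F', 'F')),
                     weight = c(77, 44, 56, 88, 49))

is_female <- mydata$sex == 'F'   # logical vector, one value per row
mydata[is_female, ]              # all variables for the women

# Mean of the weight column, restricted to the selected rows
mean_weight_f <- mean(mydata[is_female, 'weight'])
mean_weight_f
```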
We have to be careful when we select a single variable from a data.frame, as we do above. The result is then no longer a data.frame but is converted to a vector. To prevent this we can use drop=FALSE, as follows:
mydata[ , 3]
[1] 77 44 56 88 49
mydata[ , 3, drop=FALSE]
  weight
1     77
2     44
3     56
4     88
5     49
We can also use a single index between the square brackets. This works as if the data.frame were a list of variables (its columns).
mydata[1]
  id
1  1
2  2
3  3
4  4
5  5
mydata[[1]]
[1] 1 2 3 4 5
A data.frame has row.names (note the dot) as well as variable names we can use for selection. Let’s look at an example where row names are gene symbols and column names are sample IDs:
mygenes <- data.frame(samp1 = c(33, 22, 12),
                      samp2 = c(44, 111, 13),
                      samp3 = c(33, 53, 65))
row.names(mygenes) <- c("CRP", "BRCA1", "HOXA")
names(mygenes)
[1] "samp1" "samp2" "samp3"
mygenes
      samp1 samp2 samp3
CRP      33    44    33
BRCA1    22   111    53
HOXA     12    13    65
mygenes["CRP", ]
    samp1 samp2 samp3
CRP    33    44    33
mygenes[, "samp1"]
[1] 33 22 12
mygenes[, c("samp1", "samp3")]
      samp1 samp3
CRP      33    33
BRCA1    22    53
HOXA     12    65
mygenes["HOXA", "samp2"]
[1] 13
Besides square brackets ([, [[), other useful functions exist for selecting data: duplicated, match, %in%, grep, is.na and $.
To select e.g. rows that are not duplicated:
mm <- matrix(c(1, 1, 2, 2), nrow = 4, byrow = TRUE)
mm
     [,1]
[1,]    1
[2,]    1
[3,]    2
[4,]    2
mm[!duplicated(mm), ]
[1] 1 2
The above can also be done with unique, but the use of duplicated might be more appropriate in more complex situations:
unique(mm)
     [,1]
[1,]    1
[2,]    2
Calling match returns, for each element of its first argument, the position of its first match in the second argument:
match(c("a", "b"), c("a", "c", "a", "b", "a", "b"))
[1] 1 4
whereas %in% tells you whether the elements of the first argument appear in the second argument:
c("a", "b", "d") %in% c("a", "c", "a", "b", "a", "b")
[1]  TRUE  TRUE FALSE
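The two functions are closely related: an element is "in" the second argument exactly when match finds a position for it. The sketch below shows this, along with the nomatch argument of match; the names `x` and `tab` are chosen here for illustration.

```r
x <- c("a", "b", "d")
tab <- c("a", "c", "a", "b", "a", "b")

match(x, tab)                    # NA where there is no match
!is.na(match(x, tab))            # same result as x %in% tab
x %in% tab

# With nomatch = 0 the unmatched elements get index 0,
# and an index of 0 selects nothing.
tab[match(x, tab, nomatch = 0)]
```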
Recall our data frame mygenes:
mygenes
      samp1 samp2 samp3
CRP      33    44    33
BRCA1    22   111    53
HOXA     12    13    65
is.data.frame(mygenes)
[1] TRUE
Note that since mygenes is a data frame, it is also a list (of columns), which means we can select columns by the position or name of the elements in the list:
mygenes[match(c("samp1", "samp3"), colnames(mygenes))]
      samp1 samp3
CRP      33    33
BRCA1    22    53
HOXA     12    65
mygenes[colnames(mygenes) %in% c("samp1", "samp4")]
      samp1
CRP      33
BRCA1    22
HOXA     12
However, in this case we could just use the names…
mygenes[c("samp1", "samp3")]
      samp1 samp3
CRP      33    33
BRCA1    22    53
HOXA     12    65
But this gives an error:
mygenes[c("samp1", "samp30")] ## not run
where this does not:
mygenes[colnames(mygenes) %in% c("samp1", "samp30")]
      samp1
CRP      33
BRCA1    22
HOXA     12
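To see the difference without stopping a script, the failing selection can be wrapped in tryCatch. The sketch below also shows intersect as one possible way to keep only the requested names that exist; the names `wanted`, `res` and `present` are chosen here for illustration.

```r
mygenes <- data.frame(samp1 = c(33, 22, 12),
                      samp2 = c(44, 111, 13),
                      samp3 = c(33, 53, 65))
row.names(mygenes) <- c("CRP", "BRCA1", "HOXA")

wanted <- c("samp1", "samp30")

# Selecting a name that does not exist stops with an error;
# here the error message is captured instead.
res <- tryCatch(mygenes[wanted], error = function(e) conditionMessage(e))
res

# Keep only the requested names that are actually present.
present <- intersect(wanted, names(mygenes))
mygenes[present]
```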
We can also use functions like grep to search for the names we are interested in:
mygenes[grep("2", names(mygenes))]
      samp2
CRP      44
BRCA1   111
HOXA     13
mygenes[grep("A", row.names(mygenes)), ]
      samp1 samp2 samp3
BRCA1    22   111    53
HOXA     12    13    65
If we want to find or exclude data with missing values, we can use is.na:
z <- c(1:4, NA, 5:10)
z
 [1]  1  2  3  4 NA  5  6  7  8  9 10
is.na(z)
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
which(is.na(z))
[1] 5
z[!is.na(z)]
 [1]  1  2  3  4  5  6  7  8  9 10
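Missing values also matter when a comparison is used as a logical index, because the comparison itself is NA wherever the data are NA. A minimal sketch of the effect and two ways around it, using the same vector z:

```r
z <- c(1:4, NA, 5:10)

z > 3                  # the comparison is NA where z is NA
z[z > 3]               # so an NA is carried into the selection
z[which(z > 3)]        # which() keeps only the TRUE positions
z[!is.na(z) & z > 3]   # explicit alternative
```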