Author: Karl Brand, Elizabeth Ribble and Sten Willemsen
# Manipulating / Selecting Data
Often we want to calculate something only for a specific subgroup of our patient group. For example, we may want to calculate the average of the variables age and weight, but only for the women in the data set and not for the men. It is therefore important to be able to select specific variables and observations. In R, making these selections is called indexing. We start by showing how we can make selections within a vector. Afterwards we will see how to select variables and observations in more complex data structures such as matrices, data.frames and lists.
There are three ways in which we can make a selection: by position, by name, and with a vector of logical values.
The easiest way of selecting elements in a vector is to specify the positions (indices) i of the elements we wish to select between square brackets, which are the 'extract' function, i.e., object[i].
v <- c(11, 13, 17, 15, 9)
v[2]
[1] 13
We can also do this for multiple elements at the same time:
v[c(2, 3)]
[1] 13 17
i <- c(1, 2)
v[i]
[1] 11 13
The vector of indices does not have to be sorted and indices may be duplicated. For example:
v[c(3, 2, 2)]
[1] 17 13 13
When we use a vector with negative integers we will select all observations except those on the specified positions:
v[-2]
[1] 11 17 15 9
Note that positive and negative indexes cannot be combined.
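A minimal sketch of what happens when the two are mixed. The call to tryCatch is only our way of capturing the error so that the example keeps running; the replacement message is our own wording, since the text of R's error differs between versions.

```r
v <- c(11, 13, 17, 15, 9)

# Mixing a negative and a positive index is an error
result <- tryCatch(
  v[c(-1, 2)],
  error = function(e) "error: cannot mix positive and negative indices"
)
result

# Dropping several positions at once works fine
v[-c(1, 2)]
```

If we want everything except a few positions, all indices in the vector should therefore be negative.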
When all elements of a vector have a name, we can use these names to select elements.
bp_with_name <- c(sys=135, dia=85)
bp_with_name['dia']
dia
85
We can also give names to an existing vector with the names function, which sets the names attribute:
ages <- c(12, 3, 45)
names(ages) <- c("Kim", 'Arthur', 'Mark')
ages
Kim Arthur Mark
12 3 45
str(ages)
Named num [1:3] 12 3 45
 - attr(*, "names")= chr [1:3] "Kim" "Arthur" "Mark"
ages["Mark"]
Mark
45
names(ages)
[1] "Kim" "Arthur" "Mark"
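As a small sketch of two details that the examples above do not show: several names can be given at once, and asking for a name that does not exist gives a missing value rather than an error. The name "Bob" is made up for the illustration.

```r
ages <- c(Kim = 12, Arthur = 3, Mark = 45)

# Several names at once; the result follows the order of the index
ages[c("Mark", "Kim")]

# A name that does not occur yields a missing value instead of an error
is.na(ages["Bob"])
```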
Note that we can also change the names of a data.frame in the same way as we would for a list, using the names attribute:
mygenes <- data.frame(samp1 = c(33, 22, 12),
                      samp2 = c(44, 111, 13),
                      samp3 = c(33, 53, 65))
names(mygenes) <- c("samp10", "samp20", "samp30")
mygenes
samp10 samp20 samp30
1 33 44 33
2 22 111 53
3 12 13 65
## but let's change it back...
names(mygenes) <- c("samp1", "samp2", "samp3")
Note that the colnames function performs the same job for data frames.
The third way to select elements from a vector is by using a vector of logical values (i.e., TRUE/FALSE values) between the square brackets. In this way we select all elements for which the value between the brackets is TRUE.
some_values <- c('foo', 'bar', 'baz')
some_values[c(TRUE, FALSE, TRUE)]
[1] "foo" "baz"
This way of selecting is especially useful when we use a comparison (using <, <=, !=, etc.) or another condition to select elements.
ages <- c(55, 78, 92, 44)
sex <- factor(c('Male', 'Female', 'Male', 'Female'))
ages[ages > 65]
[1] 78 92
# When we use a factor as filter we may compare to a character value
ages[sex == 'Female']
[1] 78 44
To understand why this works, look at where the condition between the brackets is TRUE.
ages > 65
[1] FALSE TRUE TRUE FALSE
which(ages > 65) # gives the TRUE indices
[1] 2 3
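Conditions can also be combined. The following is a small sketch using the element-wise operators & (and) and | (or) on the same ages and sex vectors; the particular cut-offs are chosen only for illustration.

```r
ages <- c(55, 78, 92, 44)
sex <- factor(c('Male', 'Female', 'Male', 'Female'))

# Both conditions must hold (element-wise AND)
ages[ages > 65 & sex == 'Male']

# At least one condition must hold (element-wise OR)
ages[ages < 50 | ages > 90]

# Indexing with the positions from which() selects the same elements
identical(ages[which(ages > 65)], ages[ages > 65])
```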
Note that when we apply filtering to a factor variable the possible levels are not changed:
levels(sex[sex == 'Female'])
[1] "Female" "Male"
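If the unused levels get in the way (for example in tables or plots), the base R function droplevels removes them. A small sketch, reusing the sex factor from above; the variable name females is ours:

```r
sex <- factor(c('Male', 'Female', 'Male', 'Female'))
females <- sex[sex == 'Female']

levels(females)              # both levels are still present
levels(droplevels(females))  # only the level that actually occurs
```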
Making selections on a matrix works more or less the same as for vectors. But because we have two dimensions we need two indices, the first for the rows and the second for the columns:
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), ncol = 3)
m[1, ] # first row
[1] 1 5 9
m[ , 2] # second column
[1] 5 6 7 8
m[c(2, 4), 1:2] # combination of rows and columns
[,1] [,2]
[1,] 2 6
[2,] 4 8
m[, -c(1, 3)] # negative indices
[1] 5 6 7 8
m[m[, 1] >= 3, ] # logical indices
[,1] [,2] [,3]
[1,] 3 7 11
[2,] 4 8 12
Note that when a single row or column is selected the object is converted to a vector, which is a frequent source of errors. If you want to prevent this you can use drop=FALSE.
m[1, , drop=FALSE]
[,1] [,2] [,3]
[1,] 1 5 9
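One way to see the difference is to inspect the dimensions of the result. The sketch below builds the same matrix with the shorthand 1:12 (an assumption of convenience; the values are the same as above).

```r
m <- matrix(1:12, ncol = 3)

dim(m[1, ])                # NULL: the row was simplified to a plain vector
dim(m[1, , drop = FALSE])  # still a matrix with 1 row and 3 columns
dim(m[, 2, drop = FALSE])  # a matrix with 4 rows and 1 column
```

This matters mostly in code where the number of selected rows or columns is not known in advance, because later steps may assume a matrix.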
We can also use names but a matrix can have rownames as well as colnames:
rownames(m) <- LETTERS[1:4]
m
[,1] [,2] [,3]
A 1 5 9
B 2 6 10
C 3 7 11
D 4 8 12
m["A", ]
[1] 1 5 9
There exists a special way of using a matrix with two columns to select individual elements of a matrix based on their two-dimensional coordinates. Each row of the index matrix gives the row and column number of one element.
a <- matrix(c(2,3,3,2), ncol = 2)
a
[,1] [,2]
[1,] 2 3
[2,] 3 2
m[a]
[1] 10 7
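Such a two-column index matrix does not have to be typed by hand. As a sketch, which() with the argument arr.ind = TRUE returns exactly this kind of matrix; the condition m > 10 is chosen only for illustration.

```r
m <- matrix(1:12, ncol = 3)

# which() with arr.ind = TRUE returns one (row, column) pair per match
idx <- which(m > 10, arr.ind = TRUE)
idx

# The two-column matrix can be used directly as an index
m[idx]
```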
Indexing for arrays is similar to indexing for matrices.
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), dim = c(2, 3, 4))
dimnames(a) <- list(NULL, c('A', 'B', 'C'), c('a', 'b', 'c', 'd'))
a
, , a
A B C
[1,] 1 3 5
[2,] 2 4 6
, , b
A B C
[1,] 7 9 11
[2,] 8 10 12
, , c
A B C
[1,] 1 3 5
[2,] 2 4 6
, , d
A B C
[1,] 7 9 11
[2,] 8 10 12
colnames(a) # still 2nd dimension
[1] "A" "B" "C"
a[1, 2, ] # 3-dim array requires 3 positions between the brackets
a b c d
3 9 3 9
a[, , 'a']
A B C
[1,] 1 3 5
[2,] 2 4 6
To make selections in a list, we can use single square brackets, double square brackets and the dollar sign. The use of single square brackets works in the same way as it does for vectors.
mylist <- list(
  foo=c(1, 2, 3),
  bar=c('a', 'b'),
  baz=list(TRUE, c(2, 4))
)
mylist[c(1,3)]
$foo
[1] 1 2 3
$baz
$baz[[1]]
[1] TRUE
$baz[[2]]
[1] 2 4
mylist[c(1,2,3)==2]
$bar
[1] "a" "b"
mylist['bar']
$bar
[1] "a" "b"
We can also use double square brackets and the dollar sign for a list. There are two important differences between using single and double square brackets:
1. Using double square brackets only allows us to select a single element from the list.
2. The result of a selection with double brackets is the element itself, while if we make the selection with single brackets the result is a list consisting of the selected elements.
mylist[[1]]
[1] 1 2 3
mode(mylist[[1]]) # a vector
[1] "numeric"
mode(mylist[1]) # a list with a numeric vector as its single element
[1] "list"
There is another way to select a variable from a list, which is by using the dollar sign (‘$’). This is an alternative to using double square brackets in combination with a name. As with double square brackets, the result is the element itself (here a vector).
mylist$foo # result is a vector
[1] 1 2 3
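To make the difference between the three forms concrete, the sketch below checks with is.list what each of them returns, and then chains two selections to reach an element inside the nested list baz.

```r
mylist <- list(
  foo = c(1, 2, 3),
  bar = c('a', 'b'),
  baz = list(TRUE, c(2, 4))
)

is.list(mylist['foo'])    # TRUE: single brackets keep the list wrapper
is.list(mylist[['foo']])  # FALSE: double brackets return the element itself
is.list(mylist$baz)       # TRUE: this element happens to be a list

# Selections can be chained to reach nested elements
mylist$baz[[2]]
```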
Instead of using the dollar sign we can use double square brackets. We now need to put the name between quotes, as with single square brackets. We can also use the position of the variable with these double square brackets.
mydata <- data.frame(id=c(1, 2, 3, 4, 5),
                     sex=factor(c('M', 'F', 'M', 'F', 'F')),
                     weight=c(77, 44, 56, 88, 49),
                     treatm=c('A', 'A', 'B', 'B', 'A'))
mydata[['treatm']]
[1] "A" "A" "B" "B" "A"
mydata[[1]]
[1] 1 2 3 4 5
It is not possible to select more than one element using double brackets; the result will always be a vector (instead of a data.frame).
data.frame
Selecting observations and variables in a data.frame works more or less the same as it does for lists. However, because a data.frame is two-dimensional we can use two indices between the square brackets. The first one corresponds to the observations (rows) and the second corresponds to the variables (columns). So, as an example, we can select the first two observations of the third variable in the data.frame using the syntax:
mydata <- data.frame(id=c(1, 2, 3, 4, 5),
                     sex=factor(c('M', 'F', 'M', 'F', 'F')),
                     weight=c(77, 44, 56, 88, 49))
mydata[c(1, 2), 3]
[1] 77 44
When the first or second position is left blank all rows or columns are selected. For example:
mydata[, 3] # weight (3rd variable) for all patients
[1] 77 44 56 88 49
mydata[c(1,2), ] # all variables for the first two patients
id sex weight
1 1 M 77
2 2 F 44
mydata[c(-3,-4), 'sex'] # Negative numbers and names can also be used
[1] M F F
Levels: F M
We have to be careful when we select a single variable from a data.frame, as we do above. The result is no longer a data.frame but is simplified to a vector. When we want to prevent this we can use drop=FALSE, as follows:
mydata[ , 3]
[1] 77 44 56 88 49
mydata[ , 3, drop=FALSE]
weight
1 77
2 44
3 56
4 88
5 49
We can also use a single index between the square brackets. This works as if the data.frame were a list of variables (its columns).
mydata[1]
id
1 1
2 2
3 3
4 4
5 5
mydata[[1]]
[1] 1 2 3 4 5
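Returning to the example from the introduction, logical indices on the rows let us compute a summary for a subgroup only. The sketch below uses the weight variable of mydata (there is no age variable in this small data set), and the name women is ours.

```r
mydata <- data.frame(id = c(1, 2, 3, 4, 5),
                     sex = factor(c('M', 'F', 'M', 'F', 'F')),
                     weight = c(77, 44, 56, 88, 49))

# All variables for the women only
women <- mydata[mydata$sex == 'F', ]
women

# Average weight of the women
mean(women$weight)

# Two conditions combined: ids of the women heavier than 45
mydata[mydata$sex == 'F' & mydata$weight > 45, 'id']
```

Note the comma after the condition: the logical vector selects rows, and the empty second position keeps all columns.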
A data.frame has row.names (note the dot) as well as variable names that we can use for selection. Let’s look at an example where row names are gene symbols and column names are sample IDs:
mygenes <- data.frame(samp1 = c(33, 22, 12),
                      samp2 = c(44, 111, 13),
                      samp3 = c(33, 53, 65))
row.names(mygenes) <- c("CRP", "BRCA1", "HOXA")
names(mygenes)
[1] "samp1" "samp2" "samp3"
mygenes
samp1 samp2 samp3
CRP 33 44 33
BRCA1 22 111 53
HOXA 12 13 65
mygenes["CRP", ]
samp1 samp2 samp3
CRP 33 44 33
mygenes[, "samp1"]
[1] 33 22 12
mygenes[, c("samp1", "samp3")]
samp1 samp3
CRP 33 33
BRCA1 22 53
HOXA 12 65
mygenes["HOXA", "samp2"]
[1] 13
Besides square brackets ([, [[), other useful functions exist for selecting data: duplicated, match, %in%, grep, is.na and $.
To select e.g. rows that are not duplicated:
mm <- matrix(c(1, 1, 2, 2), nrow = 4, byrow = TRUE)
mm
[,1]
[1,] 1
[2,] 1
[3,] 2
[4,] 2
mm[!duplicated(mm), ]
[1] 1 2
The above can also be done with unique, but the use of duplicated might be more appropriate in more complex situations:
unique(mm)
[,1]
[1,] 1
[2,] 2
Calling match returns, for each element of its first argument, the position of its first match in the second argument:
match(c("a", "b"), c("a", "c", "a", "b", "a", "b"))
[1] 1 4
whereas %in% tells you whether the elements of the first argument appear in the second argument:
c("a", "b", "d") %in% c("a", "c", "a", "b", "a", "b")
[1] TRUE TRUE FALSE
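The two functions are closely related: %in% answers whether match found anything. A minimal sketch with the same values as above; the variable names x and lookup are ours.

```r
x <- c("a", "b", "d")
lookup <- c("a", "c", "a", "b", "a", "b")

match(x, lookup)          # 1 4 NA: "d" has no match
x %in% lookup             # TRUE TRUE FALSE
!is.na(match(x, lookup))  # same answer as %in%

# Keep only the elements of x that occur in lookup
x[x %in% lookup]
```

Because match returns NA for elements without a match, %in% is usually the safer choice when the result is used directly as an index.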
Recall our data frame mygenes:
mygenes
samp1 samp2 samp3
CRP 33 44 33
BRCA1 22 111 53
HOXA 12 13 65
is.data.frame(mygenes)
[1] TRUE
Note that since mygenes is a data frame, it is also a list of columns, which means we can select columns with a single index, for example the positions returned by match:
mygenes[match(c("samp1", "samp3"), colnames(mygenes))]
samp1 samp3
CRP 33 33
BRCA1 22 53
HOXA 12 65
mygenes[colnames(mygenes) %in% c("samp1", "samp4")]
samp1
CRP 33
BRCA1 22
HOXA 12
However, in this case we could just use the names…
mygenes[c("samp1", "samp3")]
samp1 samp3
CRP 33 33
BRCA1 22 53
HOXA 12 65
But this gives an error:
mygenes[c("samp1", "samp30")] ## not run
where this does not:
mygenes[colnames(mygenes) %in% c("samp1", "samp30")]
samp1
CRP 33
BRCA1 22
HOXA 12
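The difference between the two approaches can be checked in a script without stopping it. In the sketch below tryCatch is our own addition to capture the error, and the variable names wanted and by_name are made up for the illustration.

```r
mygenes <- data.frame(samp1 = c(33, 22, 12),
                      samp2 = c(44, 111, 13),
                      samp3 = c(33, 53, 65))

wanted <- c("samp1", "samp30")

# Selecting by name fails because "samp30" is not a column
by_name <- tryCatch(mygenes[wanted], error = function(e) NULL)
is.null(by_name)

# The logical index simply has no TRUE for the unknown name
names(mygenes) %in% wanted
names(mygenes[names(mygenes) %in% wanted])
```

Which behaviour is preferable depends on the situation: the silent version is convenient, but the error can be a useful warning that a name was misspelled.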
We can also use functions like grep to search for the names we are interested in:
mygenes[grep(2, names(mygenes))]
samp2
CRP 44
BRCA1 111
HOXA 13
mygenes[grep("A", row.names(mygenes)), ]
samp1 samp2 samp3
BRCA1 22 111 53
HOXA 12 13 65
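By default grep returns positions, but it has two close relatives that are often handy for indexing. A short sketch on the gene symbols used above; the vector gene_names is ours.

```r
gene_names <- c("CRP", "BRCA1", "HOXA")

grep("A", gene_names)                # positions of the matches
grep("A", gene_names, value = TRUE)  # the matching names themselves
grepl("A", gene_names)               # one TRUE/FALSE per name

# A regular expression: names that end in a digit
grep("[0-9]$", gene_names, value = TRUE)
```

The logical result of grepl can be combined with other conditions using & and |, which is not possible with the positions returned by grep.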
If we want to find or exclude data with missing values, we can use is.na:
z <- c(1:4, NA, 5:10)
z
[1] 1 2 3 4 NA 5 6 7 8 9 10
is.na(z)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
which(is.na(z))
[1] 5
z[!is.na(z)]
[1] 1 2 3 4 5 6 7 8 9 10
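Excluding missing values matters as soon as we summarise the data. The sketch below shows this for the vector z, and then uses complete.cases to do the same for the rows of a small data frame; the data frame df is made up for the illustration.

```r
z <- c(1:4, NA, 5:10)

mean(z)                # NA: a single missing value makes the result missing
mean(z[!is.na(z)])     # 5.5
mean(z, na.rm = TRUE)  # same result, using the function's own argument

# For a data frame, complete.cases() flags the rows without any NA
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
df[complete.cases(df), ]
```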