Indexing and Subsetting

Author: {Karl Brand, Elizabeth Ribble and Sten Willemsen # Manipulating / Selecting Data}

Often we want to calculate some things only for a specific subgroup of our patient group. (For example we want to calculate the average for the variables age and weight, but only for the women in the data set and not for the men). So it is important to select specific variables and observations. In R making these selections is called indexing. We start by showing how we can make selections within a vector. Afterwards we will see how selecting variables and observations in more complex data structures such as matrices, data.frames and lists.

Indexing a vector

There are three ways in which we can make a selection:

Using numeric values, to make selections based on the position in a vector
Using character values, to select values based on their name
Using logical values (TRUE / FALSE), to make selections based on a condition

Using numeric values

The easiest way of selecting elements in a vector is by specify positions or indices i in numbers of the data we wish to select using square brackets which are the extract' function, i.e.,object[i]`.

v <- c(11, 13, 17, 15, 9)
v[2]

[1] 13

We can also do this for multiple elements at the same time:

v[c(2, 3)]

[1] 13 17

i <- c(1, 2)
v[i]

[1] 11 13

The vector of indices does not have to be sorted and indices may be duplicated. For example:

v[c(3, 2, 2)]

[1] 17 13 13

When we use a vector with negative integers we will select all observations except those on the specified positions:

v[-2]

[1] 11 17 15  9

Note that positive and negative indexes cannot be combined.

Using character values

When the elements in a vector all have a name we can use these names to select the elements.

bp_with_name <- c(sys=135, dia=85)
bp_with_name['dia']

dia 
 85

We can also give names to the vector by using the names function: Here we make use of the names attribute:

ages <- c(12, 3, 45)
names(ages) <- c("Kim" , 'Arthur', 'Mark')
ages

   Kim Arthur   Mark 
    12      3     45

str(ages)

 Named num [1:3] 12 3 45
 - attr(*, "names")= chr [1:3] "Kim" "Arthur" "Mark"

ages["Mark"]

Mark 
  45

names(ages)

[1] "Kim"    "Arthur" "Mark"

Note that we can also change the names of a data.frame in the same way we would do so for a list using the names attribute:

mygenes <- data.frame(samp1 = c(33, 22, 12), 
                      samp2 = c(44, 111, 13), 
                      samp3 = c(33, 53, 65))
names(mygenes) <- c("samp10", "samp20", "samp30")
mygenes

  samp10 samp20 samp30
1     33     44     33
2     22    111     53
3     12     13     65

## but let's change it back...
names(mygenes) <- c("samp1", "samp2", "samp3")

Note that the colnames function also performs the same job for data frames.

Using logical values

The third way to select elements from a vector is by using a vector of logical values (i.e.. TRUE/FALSE values) between the square brackets. In this way we select all values for which the value between the brackets is TRUE.

some_values<- c('foo', 'bar', 'baz')
some_values[c(TRUE, FALSE, TRUE)]

[1] "foo" "baz"

This way of selecting is especially useful when we use some comparison (using <, <=, != , etc.) or condition to select variables.

ages <- c(55, 78, 92, 44)
sex <- factor(c('Male', 'Female', 'Male', 'Female'))
ages[ages > 65]

[1] 78 92

# When we use a factor as filter we may compare to a character value
ages[sex == 'Female']

[1] 78 44

To understand why this works look at when the condition between the brackets is TRUE.

ages > 65

[1] FALSE  TRUE  TRUE FALSE

which(ages > 65)  # gives the TRUE indices

[1] 2 3

Note that when we aply filtering on a factor variable the possible levels are not changed:

levels(sex[sex == 'Female'] )

[1] "Female" "Male"

Making selections in a matrix or array

Making selections on a matrix works more or less the same as for vectors. But because we have two dimensions we need two indices, the first for the rows and the second for columns

m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), ncol = 3)
m[1, ] # first row

[1] 1 5 9

m[ , 2] # second column

[1] 5 6 7 8

m[c(2, 4), 1:2] # combination of rows and colums

     [,1] [,2]
[1,]    2    6
[2,]    4    8

m[, -c(1, 3)]   # negative indices

[1] 5 6 7 8

m[m[,1]>=3,] #logical indices

     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12

Note that when a single row or column is selected the object is converted to a vector; a frequent source of errors. if you want to prevent this you can use drop=FALSE.

m[1, , drop=FALSE]

     [,1] [,2] [,3]
[1,]    1    5    9

We can also use names but a matrix can have rownames as well as colnames:

rownames(m) <- LETTERS[1:4]
m

  [,1] [,2] [,3]
A    1    5    9
B    2    6   10
C    3    7   11
D    4    8   12

m["A", ]

[1] 1 5 9

There exist a special way of using a matrix with two columns to select individual elements out of a matrix based on their two-dimensional coordinates.

a <- matrix(c(2,3,3,2), ncol = 2)
a

     [,1] [,2]
[1,]    2    3
[2,]    3    2

m[a]

[1] 10  7

Indexing for arrays is similar as for matrices.

a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),dim = c(2,3,4))
dimnames(a) <- list(NULL, c('A', 'B', 'C'), c('a', 'b', 'c', 'd'))
a

, , a

     A B C
[1,] 1 3 5
[2,] 2 4 6

, , b

     A  B  C
[1,] 7  9 11
[2,] 8 10 12

, , c

     A B C
[1,] 1 3 5
[2,] 2 4 6

, , d

     A  B  C
[1,] 7  9 11
[2,] 8 10 12

colnames(a) # still 2nd dimension

[1] "A" "B" "C"

a[1, 2, ] # 3-dim array requires 3 positions between the brackets

a b c d 
3 9 3 9

a[,,'a']

     A B C
[1,] 1 3 5
[2,] 2 4 6

Making selections in a list

To make selections in a list, we can use single square brackets, double square brackets and the dollar sign. The use of single square brackets works in the same way as it does for vectors.

mylist <- list(
  foo=c(1,2, 3),
  bar=c('a', 'b'),
  baz=list(TRUE, c(2, 4))
  )
mylist[c(1,3)]

$foo
[1] 1 2 3

$baz
$baz[[1]]
[1] TRUE

$baz[[2]]
[1] 2 4

mylist[c(1,2,3)==2]

$bar
[1] "a" "b"

mylist['bar']

$bar
[1] "a" "b"

We can also use double square brackets and the dollar sign for a list. There are two important differences between using single and double square brackets: 1. Using double square brackets only allows us to select a single element from the list. 2. The result of a selection with double brackets is the element itself, while if we make the selection with single brackets the result is a list consisting of the selected elements.

mylist[[1]]

[1] 1 2 3

mode(mylist[[1]]) # a vector

[1] "numeric"

mode(mylist[1])   # a list with with a numeric vector as its single element

[1] "list"

There is another way to select a variable from a list which is by using the dollar sign (‘$’). This is an alternative to using double square brackets in combination with a name. When we use this the result is always a vector.

mylist$foo      # results is a vector

[1] 1 2 3

Instead of using the dollar sign we can use double square brackets. We now need to put the name between quotes like for single square brackets. We can also use the position of the variable using these double square brackets.

mydata <- data.frame(id=c(1, 2, 3, 4, 5),
                     sex=factor(c('M', 'F', 'M', 'F', 'F')),
                     weight=c(77, 44, 56, 88, 49),
                     treatm=c('A', 'A', 'B', 'B', 'A'))
mydata[['treatm']]

[1] "A" "A" "B" "B" "A"

mydata[[1]]

[1] 1 2 3 4 5

It is not possible to select more than one element using double brackets; The result will always be a vector (instead of a data.frame)

Selecting observations and variables in a `data.frame`

Selecting observations and variables in a data.frame works more or less the same for data.frames as it does for lists. However because a data.frame is two dimensional we can two indices between the square brackets. The first one corresponds to the observations (rows) and the second corresponds to the variables (columns). So, as an example, we can select the first two observations from the third variable in the data.frame using the syntax:

mydata <- data.frame(id=c(1, 2, 3, 4, 5),
                     sex=factor(c('M', 'F', 'M', 'F', 'F')),
                     weight=c(77, 44, 56, 88, 49))
mydata[c(1, 2), 3]

[1] 77 44

When the first or second position is left blank all rows or columns are selected. For example:

mydata[, 3]       # sex (3rd variable) for all patients

[1] 77 44 56 88 49

mydata[c(1,2), ]  # all variables for the first two patients

  id sex weight
1  1   M     77
2  2   F     44

mydata[c(-3,-4), 'sex'] # Negative numbers and names can also be used

[1] M F F
Levels: F M

We have to be careful when we want to select a single variable from a data.frame, as we do above. The result will now no longer be a data.frame but it is transformed to a vector. When we want to prevent this we can use drop=FALSE, as follows:

mydata[ , 3]

[1] 77 44 56 88 49

mydata[ , 3, drop=FALSE]

We can also use a single index between the square brackets. This works as if the data.frame was a list of variables (it’s columns).

mydata[1]

mydata[[1]]

[1] 1 2 3 4 5

A data.frame has row.names (note the dot) as well as variable names we can use for selection. Let’s look at an example where row names are gene symbols and column names are sample IDs:

mygenes <- data.frame(samp1 = c(33, 22, 12), 
                      samp2 = c(44, 111, 13), 
                      samp3 = c(33, 53, 65))
row.names(mygenes) <- c("CRP", "BRCA1", "HOXA")
names(mygenes)

[1] "samp1" "samp2" "samp3"

mygenes

      samp1 samp2 samp3
CRP      33    44    33
BRCA1    22   111    53
HOXA     12    13    65

mygenes["CRP", ]

    samp1 samp2 samp3
CRP    33    44    33

mygenes[, "samp1"]

[1] 33 22 12

mygenes[, c("samp1", "samp3")]

      samp1 samp3
CRP      33    33
BRCA1    22    53
HOXA     12    65

mygenes["HOXA", "samp2"]

[1] 13

Other useful functions

Besides square brackets ([, [[), other useful functions exist for selecting data: duplicated, match, %in%, grep, is.na and $.

To select e.g. rows that are not duplicated:

mm <- matrix(c(1, 1, 2, 2), nrow = 4, byrow = TRUE)


mm

     [,1]
[1,]    1
[2,]    1
[3,]    2
[4,]    2

mm[!duplicated(mm), ]

[1] 1 2

The above can also be done with unique, but the use of duplicated might be more appropriate in more complex situations:

unique(mm)

     [,1]
[1,]    1
[2,]    2

Calling match returns the position of the first match of its first argument in the second argument:

match(c("a", "b"), c("a", "c", "a", "b", "a", "b"))

[1] 1 4

whereas \%in\% tells you whether the elements of the first argument appear in the second argument:

c("a", "b", "d") %in% c("a", "c", "a", "b", "a", "b")

[1]  TRUE  TRUE FALSE

Recall our data frame mygenes:

mygenes

      samp1 samp2 samp3
CRP      33    44    33
BRCA1    22   111    53
HOXA     12    13    65

is.data.frame(mygenes)

[1] TRUE

Note that since mygenes is a data frame, it is therefore also an array, which means we can select by the name of the elements in the array:

mygenes[match(c("samp1", "samp3"), colnames(mygenes))]

      samp1 samp3
CRP      33    33
BRCA1    22    53
HOXA     12    65

mygenes[colnames(mygenes) %in% c("samp1", "samp4")]

      samp1
CRP      33
BRCA1    22
HOXA     12

However, in this case we could just use the names…

mygenes[c("samp1", "samp3")]

      samp1 samp3
CRP      33    33
BRCA1    22    53
HOXA     12    65

But this gives an error:

mygenes[c("samp1", "samp30")] ## not run

where this does not:

mygenes[colnames(mygenes) %in% c("samp1", "samp30")]

      samp1
CRP      33
BRCA1    22
HOXA     12

We can also use functions like grep to search for the names we are interested in:

mygenes[grep(2, names(mygenes))]

      samp2
CRP      44
BRCA1   111
HOXA     13

mygenes[grep("A", row.names(mygenes)), ]

      samp1 samp2 samp3
BRCA1    22   111    53
HOXA     12    13    65

If we want to find or exclude data with missing values, we can use is.na:

z <- c(1:4, NA, 5:10)
z

 [1]  1  2  3  4 NA  5  6  7  8  9 10

is.na(z)

 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

which(is.na(z))

[1] 5

z[!is.na(z)]

 [1]  1  2  3  4  5  6  7  8  9 10