# Chapter 7 Element 2: Objects

## 7.1 Take-home Messages

In R, everything that exists is an object:

- There are four common user-defined atomic vector
*types*you’ll encounter in R. - Vectors are 1-dimensional collections of 0 or more elements of a common data type.
- Data frames are 2-dimensional collections of 0 or more vectors.
- The
*class*of an object dictates how it will be handled by different functions.

Objects are anything which^{24} can be assigned a name. They can refer to a wide variety of different data types, including constants, various data structures, functions or graphs.^{25} Work in R is centered around objects and their classes, and as such is referred to as an *object-oriented language*.^{26}

R can store data in a variety of formats. In this workshop we will focus on the two most commonly used structures: vectors and data frames. However, as your R programming skills develop, you will likely encounter lists, matrices and arrays. These are described later in the chapter but will not be dealt with in class. The differences between the main data structures in R can be summarised as:^{27}

Homogeneous | Heterogeneous | |
---|---|---|

Dimensions | Data Types | Data Types |

1 | Vector | List |

2 | Matrix | Data frame |

n | Array |

In addition to these 5 data structures, we’ll explore some fundamental properties of objects, such as *type*, *attributes*, and *class*, which will help you understand how R deals with objects later on.

## 7.2 Homogeneous Data-types

### 7.2.1 Vectors

Vectors are one-dimensional groups of individual data values. We have already encountered vectors on page **??** when we defined `foo1`

and `foo2`

. Remember that these objects only contain the numerical series that we generated with the `seq()`

function:

We will focus on for common types of vectors in this workshop: Logical, double (numeric), integer and character.

`# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99`

`# [1] 1 7 13 19 25 31`

An important property of vectors, is that they can only contain one data type. There are *many* different data types in R. There are 6 user-defined *atomic vector types*.^{28} The four most common are:

Typ | Description |
---|---|

`Logical` |
TRUE/T/1 or FALSE/F/0 |

`Integer` |
Whole numbers |

`Double` |
Numeric (i.e. real numbers, including fractions) |

`Character` |
Words |

To find out the *type* of a vector, we can use the `typeof()`

function.

`# [1] "double"`

You may wonder why the type of `foo2`

is `double`

(aka numeric) and not `integer`

. This is due to how R predicted the type of data. If we want to force these values to be logical we’d have to use the special integer signifier `L`

:

`# [1] "integer"`

For the most part this will not make a difference in your work, but you may encounter problems when doing math on type `double`

due to floating point errors, discussed on page 9.7. For our purposes, we are not bothered if an `integer`

is stored as a `double`

.

We can use `c()`

to create vectors of other data types:

`# [1] "character"`

`# [1] "logical"`

#### 7.2.1.1 Atomic vector type hierarchy and coercion

When we assign a vector to an object, R automatically determines the atomic vector type.^{30} The atomic vector types have a hierarchical organisation according to increasing complexity: logical < integer < double < character. This kind of makes sense since logical can only be 0 or 1, and character can be everything. This means that R defaults to the highest level in the hierarchy.

```
# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
# [11] "bob"
```

`# [1] "character"`

Since `test`

is a character vector, I can’t do math on it!

`# [1] NA`

A family of *coercion* functions are used to coerce one type to another.

`# [1] 1 2 3 4 5 6 7 8 9 10 NA`

Everything that looked like a number is converted to a number, but the contaminating character is replaced by an NA. Now I can do math, but I need to accommodate for the missing values.^{31}

`# [1] 5.5`

We can determine the atomic vector type of an object by using `typeof()`

, but sometimes we want to ask specifically if an object is of a specific type, in particular when we shift from interactive programming to scripting and writing our own functions.^{32} For this purpose there is a long list of `is.`

functions that all return a logical vector.

`# [1] TRUE`

`# [1] FALSE`

`# [1] TRUE`

`# [1] TRUE`

### 7.2.2 Matrices

A matrix is a 2-dimensional vector, which means every element needs to be the same type. You probably won’t encounter matrices very often, but they are convenient if you have a 2-dimensional grid of numbers on which you want to perform math on. To understand how matrices work, it’s useful to get familiar with attributes.

#### 7.2.2.1 Setting object attributes

Almost all objects in R^{33} can have attributes, which can be accessed in a variety of ways. Table 7.2 lists the most typical attributes.

Description | Attribute name | Accessor functions |
---|---|---|

Class | `class` |
`class()` |

Column names | `names` |
`names()` |

Dimensions | `dim` |
`dim()` |

Row names | `row.names` |
`row.names()` |

Dimension names | `dimnames` |
`dimnames()` |

Commentary | `comment` |
`comment()` |

Matrices (and arrays, see below.) are just vectors with a `dim`

attribute. Attributes can be obtained using `attributes()`

and set in two ways. First, using `attr(x, name)`

, where `x`

in an object and `name`

is, typically, one of the attribute names in table 7.2.

`# NULL`

`# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99`

```
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 22 43 64 85
# [2,] 8 29 50 71 92
# [3,] 15 36 57 78 99
```

```
# $dim
# [1] 3 5
```

`# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99`

However, `name`

can be anything you want, so you are free to make your own attributes.

```
# $owner
# [1] "Rick"
```

`# [1] "Rick"`

Second, and more typical, is to default to the special accessor functions when they are available.

```
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 22 43 64 85
# [2,] 8 29 50 71 92
# [3,] 15 36 57 78 99
# attr(,"owner")
# [1] "Rick"
```

In the case of matrices, we have yet another method, `matrix()`

, which gives us more control, e.g. filling row-wise, instead of the default column-wise.

`# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99`

```
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 8 15 22 29
# [2,] 36 43 50 57 64
# [3,] 71 78 85 92 99
```

There are a some very convenient^{34} functions for working with columns and rows:{#matrices}

`# [1] 36 43 50 57 64`

`# [1] 108 129 150 171 192`

`# [1] 15 50 85`

`# [1] 15 50 85`

#### 7.2.2.2 Matrix algebra

If you use the `*`

operator to multiply two vectors – which we can now understand as 1 x n matrices – R will implement its particular type of vector recycling. If you want to do true matrix algebra, you’ll have to use the `\%*\%`

operator for inner multiplication or the `\%o\%`

operator for outer multiplication.

```
# [,1]
# [1,] 91
```

```
# [,1] [,2]
# [1,] 10 20
# [2,] 20 40
# [3,] 30 60
# [4,] 40 80
# [5,] 50 100
# [6,] 60 120
```

### 7.2.3 Arrays

Arrays are *n*-dimensional vectors, which is to say a stack of *n* matrices.^{35} Just like matrices, they have to be all of the same type. Arrays are useful if you are working on longitudinal data but are not very convenient to work with. You are more likely to arrange your data into a data frame. A useful example is when working with images, the R (red), G (green), and B (blue) channels may be stored in a separate matrix, which together are stored as a multi-dimensional array (MDA). We can create arrays like matrices, changing the dimensions of a vector, or using the `array()`

function for more control. Note that when we index arrays, there are three dimensions.

```
aa <- c(11:14, 21:24, 31:34)
dim(aa) = c(2, 2, 3)
ar <- array(c(11:14, 21:24, 31:34),
dim = c(2, 2, 3))
ar
```

```
# , , 1
#
# [,1] [,2]
# [1,] 11 13
# [2,] 12 14
#
# , , 2
#
# [,1] [,2]
# [1,] 21 23
# [2,] 22 24
#
# , , 3
#
# [,1] [,2]
# [1,] 31 33
# [2,] 32 34
```

```
# [,1] [,2]
# [1,] 11 13
# [2,] 12 14
```

`# [1] 11 13`

`# [1] 11 12`

```
# [,1] [,2] [,3]
# [1,] 11 21 31
# [2,] 13 23 33
```

```
# [,1] [,2] [,3]
# [1,] 11 21 31
# [2,] 12 22 32
```

Some extra material from the BioC course:

## 7.3 Heterogeneous Data-types

### 7.3.1 Lists

After 1-dimensional vectors, a 1-dimensional list is the most basic type of data structure in R. It’s basically a heterogeneous vector – where each element can be a different type. You won’t typically store data in a list, mostly because they are a bit cumbersome to work with, but at some point you will certainly encounter lists. Since they are a convenient way of storing heterogeneous data, many functions provide their results as a list.^{36} Actually, we’ve seen exactly that scenario already.

`# [1] "list"`

`Plant.lm`

is a list, and it has two attributes:

```
# $names
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "contrasts" "xlevels" "call" "terms"
# [13] "model"
#
# $class
# [1] "lm"
```

Which should be accessed with the appropriate accessor functions

```
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "contrasts" "xlevels" "call" "terms"
# [13] "model"
```

`# [1] "lm"`

Anything that is names can be accessed via the $ notation.

```
# (Intercept) grouptrt1 grouptrt2
# 5.03 -0.37 0.49
```

We can see a similar phenomenon with the ANOVA table, stored in `Plant.anova`

.

`# [1] "list"`

```
# $names
# [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
#
# $row.names
# [1] "group" "Residuals"
#
# $class
# [1] "anova" "data.frame"
#
# $heading
# [1] "Analysis of Variance Table\n" "Response: weight"
```

`# [1] "anova" "data.frame"`

Here the two objects are both `list`

types, but their classes are different.^{37} A class is simply an attribute that tells R what to do with this object. For example we can see that when we call print

```
#
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
#
# Coefficients:
# (Intercept) grouptrt1 grouptrt2
# 5.032 -0.371 0.494
```

```
#
# Call:
# lm(formula = weight ~ group, data = PlantGrowth)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.071 -0.418 -0.006 0.263 1.369
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 5.032 0.197 25.53 <2e-16 ***
# grouptrt1 -0.371 0.279 -1.33 0.194
# grouptrt2 0.494 0.279 1.77 0.088 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.62 on 27 degrees of freedom
# Multiple R-squared: 0.264, Adjusted R-squared: 0.21
# F-statistic: 4.85 on 2 and 27 DF, p-value: 0.0159
```

```
# Analysis of Variance Table
#
# Response: weight
# Df Sum Sq Mean Sq F value Pr(>F)
# group 2 3.77 1.883 4.85 0.016 *
# Residuals 27 10.49 0.389
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
# Df Sum Sq Mean Sq F value
# Min. : 2.0 Min. : 3.8 Min. :0.39 Min. :4.8
# 1st Qu.: 8.2 1st Qu.: 5.4 1st Qu.:0.76 1st Qu.:4.8
# Median :14.5 Median : 7.1 Median :1.14 Median :4.8
# Mean :14.5 Mean : 7.1 Mean :1.14 Mean :4.8
# 3rd Qu.:20.8 3rd Qu.: 8.8 3rd Qu.:1.51 3rd Qu.:4.8
# Max. :27.0 Max. :10.5 Max. :1.88 Max. :4.8
# NA's :1
# Pr(>F)
# Min. :0.02
# 1st Qu.:0.02
# Median :0.02
# Mean :0.02
# 3rd Qu.:0.02
# Max. :0.02
# NA's :1
```

^{38}This is a major feature of object-oriented programming (OOP) – functions behave differently given the class of an object.

We can make a list from scratch using `list()`

:

```
# $a
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99
#
# $b
# [1] 1 7 13 19 25 31
```

As we saw in the section on functions, `lapply()`

and `sapply()`

reiterate over each element in a list:

```
# $a
# [1] 50
#
# $b
# [1] 16
```

```
# a b
# 50 16
```

### 7.3.2 Data Frames

A data frame is a special type of list, where every every element is a vector of the same length. This means that data frames are two-dimensional tables, like you find in Excel.

**Variables** are vectors stored in vertical columns. Each variable can be it’s own data type (see table 7.1).

**Observations** are stored in horizontal rows.

This simple description leaves much room for interpretation. *How* you organise information in a data frame is important since it can influence how easily you can carry out downstream calculations. A very popular way to organise data frames is called *tidy data*, whereby data frames have

- One observation per row,
- One variable per column, and are composed of
- One observational unit.

We will discus the best methods for formatting data in the section on *tidy data* on page 14.

Regardless of how you organise information in a data frame, there are two note-worthy aspects of data frames:

- Each column must be the same length.
- All data within a column must be of the same type.

This makes a lot of sense, since a data frame is just a list of vectors. You can see this in action by using `data.frame()`

to make a data frame from the vectors `foo2`

, `foo3`

and `foo4`

.

```
# foo4 foo3 foo2
# 1 TRUE Liver 1
# 2 FALSE Brain 7
# 3 FALSE Testes 13
# 4 TRUE Muscle 19
# 5 TRUE Intestine 25
# 6 FALSE Heart 31
```

`# [1] "list"`

The following table lists some of the most commonly used functions for working with data frames:

Functio | Description |
---|---|

`data.frame()` |
Create a data frame |

`str()` |
Structure of a data frame |

`summary()` |
Calculate summary statistics |

`dim()` |
Dimensions (number of rows and columns) |

`names()` |
Column Names |

`row.names()` |
Row names |

`nrow()` |
Number of rows |

`ncol()` |
Number of columns |

`cbind()` |
Add columns |

`rbind()` |
Add rows |

`subset()` |
Extract a subset of records |

`head()` |
Show the head of the data frame |

`tail()` |
Show the tail of the data frame |

Let’s consider the `foo_df`

data frame further. There are 3 attributes associated with this data frame (see table 7.2).

```
# $names
# [1] "foo4" "foo3" "foo2"
#
# $class
# [1] "data.frame"
#
# $row.names
# [1] 1 2 3 4 5 6
```

Which we can access by:

`# [1] "foo4" "foo3" "foo2"`

`# [1] "1" "2" "3" "4" "5" "6"`

`# [1] "data.frame"`

There are 6 rows (observations) and 3 columns (variables).

Take special note of the following unique featues of working with data frames:

1.Variable names of a data frame (i.e. column headers) are themselves a character vector.
2. To access specific columns by name, use `$`

notation to specify the name of the column, e.g. `\$healthy`

.
3. To select only those rows of interest, use the `subset()`

function and specify the data frame and the logical test of interest corresponding to specific criteria.

Since the column names are just a character vector stored as an attribute, we can modify them independently of the data in the data frame.

`# [1] "foo4" "foo3" "foo2"`

`# [1] "healthy" "tissue" "quantity"`

`# [1] TRUE FALSE FALSE TRUE TRUE FALSE`

`# [1] 6 3`

We called that it’s always useful to run these two functions:

```
# Rows: 6
# Columns: 3
# $ healthy <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE
# $ tissue <chr> "Liver", "Brain", "Testes", "Muscle", "Intestine…
# $ quantity <dbl> 1, 7, 13, 19, 25, 31
```

```
# healthy tissue quantity
# Mode :logical Length:6 Min. : 1.0
# FALSE:3 Class :character 1st Qu.: 8.5
# TRUE :3 Mode :character Median :16.0
# Mean :16.0
# 3rd Qu.:23.5
# Max. :31.0
```

#### 7.3.2.1 stringsAsFactors

R version 4.0, released (April, 2020), changed the default value of the `data.frame()`

argument `stringsAsFactors`

to `FALSE`

. Thus, if you have run this code on an older version of R what you actually see is this:

```
pre4_foo_df <- data.frame(foo4,foo3,foo2, stringsAsFactors = TRUE)
names(pre4_foo_df) <- myNames
glimpse(pre4_foo_df)
```

```
# Rows: 6
# Columns: 3
# $ healthy <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE
# $ tissue <fct> Liver, Brain, Testes, Muscle, Intestine, Heart
# $ quantity <dbl> 1, 7, 13, 19, 25, 31
```

We’ll discuss factors in 11. For what we’ll be doing, it does not make a difference.

## 7.4 Merging Data Frames

Merging two data frames according to a common variable is a typical command in data analysis. There are many varieties of merge and redundancy between functions. This guide presents functions in four packages:

`base`

package functions will likely meet all your regular needs`dplyr`

– If you’re using a`dplyr`

workflow, see 14, default to these functions^{39}.

Beginning with two data frames:

```
df1 <- data.frame(Protein = c("CATB", "PSB1", "STUB1", "HUWE1"),
Value = c(2.0, 0.1, 0.5, 0.3))
df2 <- data.frame(Protein = c("UBE4b", "SEP15", "CATB", "PSB1"),
GO = c(rep("Protein Folding", 2), rep("Proteolysis", 2)))
```

Use merge functions (and optional parameters) from either `base`

package or `dplyr`

to complete the following exercises.

### 7.4.1 Inner join

**Exercise 7.1 (Inner join)**Return only the rows in which the left table has matching keys in the right table.

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 PSB1 0.1 Proteolysis
```

```
library(dplyr)
#inner
inner_join(df1, df2) # return all rows from x where there are matching values in y, and all columns from x and y
```

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 PSB1 0.1 Proteolysis
```

R automatically joins the frames by common variable names. To specify only the variables of interest, use `merge(df1, df2, by = "Protein")`

. Additionally, use the `by.x`

and `by.y`

parameters if the matching variables have different names in the two data frames.

### 7.4.2 Outer join

**Exercise 7.2 (Outer join)**Return all rows from both tables, join records from the left which have matching keys in the right table.

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 HUWE1 0.3 <NA>
# 3 PSB1 0.1 Proteolysis
# 4 SEP15 NA Protein Folding
# 5 STUB1 0.5 <NA>
# 6 UBE4b NA Protein Folding
```

```
# join(df1, df2, type = "full") # return all rows from x, and all columns from x and y
full_join(df1, df2)
```

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 PSB1 0.1 Proteolysis
# 3 STUB1 0.5 <NA>
# 4 HUWE1 0.3 <NA>
# 5 UBE4b NA Protein Folding
# 6 SEP15 NA Protein Folding
```

### 7.4.3 Left join

**Exercise 7.3 (Left join)**Return all rows from the left table, and any rows with matching keys from the right table.

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 HUWE1 0.3 <NA>
# 3 PSB1 0.1 Proteolysis
# 4 STUB1 0.5 <NA>
```

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 PSB1 0.1 Proteolysis
# 3 STUB1 0.5 <NA>
# 4 HUWE1 0.3 <NA>
```

### 7.4.4 Right join

**Exercise 7.4 (Right join)**Return all rows from the right table, and any rows with matching keys from the left table.

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 PSB1 0.1 Proteolysis
# 3 SEP15 NA Protein Folding
# 4 UBE4b NA Protein Folding
```

```
# #right outer (just reverse argument order)
# left_join(df2, df1)
#right join
right_join(df1, df2) # return all rows from x, and all columns from x and y
```

```
# Protein Value GO
# 1 CATB 2.0 Proteolysis
# 2 PSB1 0.1 Proteolysis
# 3 UBE4b NA Protein Folding
# 4 SEP15 NA Protein Folding
```

### 7.4.5 Anti-join & Semi-join

**Exercise 7.5 (Semi-join)**Return all the rows from df1, where there matching values in df2, keeping just columns from df1.

```
# Protein Value
# 1 CATB 2.0
# 2 PSB1 0.1
```

**Exercise 7.6 (Anti-join)**Return all the rows from df1 where there matching values in df2, keeping just columns from df1.

```
# Protein Value
# 1 STUB1 0.5
# 2 HUWE1 0.3
```

### 7.4.6 Cross-join

A cross join results in all combinations of all variables.

```
# Protein.x Value Protein.y GO
# 1 CATB 2.0 UBE4b Protein Folding
# 2 PSB1 0.1 UBE4b Protein Folding
# 3 STUB1 0.5 UBE4b Protein Folding
# 4 HUWE1 0.3 UBE4b Protein Folding
# 5 CATB 2.0 SEP15 Protein Folding
# 6 PSB1 0.1 SEP15 Protein Folding
# 7 STUB1 0.5 SEP15 Protein Folding
# 8 HUWE1 0.3 SEP15 Protein Folding
# 9 CATB 2.0 CATB Proteolysis
# 10 PSB1 0.1 CATB Proteolysis
# 11 STUB1 0.5 CATB Proteolysis
# 12 HUWE1 0.3 CATB Proteolysis
# 13 CATB 2.0 PSB1 Proteolysis
# 14 PSB1 0.1 PSB1 Proteolysis
# 15 STUB1 0.5 PSB1 Proteolysis
# 16 HUWE1 0.3 PSB1 Proteolysis
```

In R,

*objects*are synonymous with*variables*of other programming languages.↩︎*Objects*are like files on your computer: They contain a wide variety of different data structures, can be modified, and their name may not relate to their contents. However, the extension of a file, like the class of an object, tells us about how we should handle this file. e.g. Which program will open it?↩︎Object names can contain letters, numbers,

`.`

or`_`

, as long as it starts with a letter, or`.`

followed by a letter. Note that object names are case sensitive.↩︎Although lists, matrices and arrays are presented here, it is worthwhile to return to this section after we have completed the workshop and explored working with data frame in depth, including indexing and

*Split-Apply-Combine*. At that point you should be able to better understand how these data structures are handled in R.↩︎The remaining two,

`complex`

and`raw`

are unlikely to be encountered by beginners and we will not discuss them.↩︎Notice that R automatically assigns a type to each vector. A vector’s type determines what can be done with the data. Wrongly classified vectors are one of the most common causes of problems when programming in R!↩︎

Recall that we can influence this, as we saw above, using

`L`

to force an`integer`

type.↩︎Note that we can’t coerce a character to a logical vector, but if we did only have 0s and 1s, we could.↩︎

You’ll notice in the auto-complete that there are

*many*`is.`

functions, which can test for more than use atomic vector types. In the examples here, the output length is always 1, since we are just testing for the type of an object, but that is not always the case. e.g. see`is.na()`

on page**??**.↩︎The exception is the

`NULL`

type object.↩︎We typically use them on matrices since every value will be numeric or integer, but they also work on data frames.↩︎

Sometimes you’ll see arrays referred to as

*Multi-dimensional Arrays*.↩︎In some cases, multiple arguments are given in list form.↩︎

Notice also that an object can have more than one class associated with it.↩︎

You’ll see that the output for

`summary(Plant.anova)`

looks very similar to what we’ll get for a data frame, which makes sense, since one of the classes is`data frame`

.↩︎`plyr`

is outdated, but you may still find it used from time to time. It has some handy functions for working with data structures other than data frames. However, avoid mixing`plyr`

and`dplyr`

in a single R environment, loading both packages can cause conflicts. It’s better to use`::`

notation to call specific functions.`data.table`

is a more advanced data handling package, which is out of the scope of this workshop. Let these functions serve as a preview.↩︎