Chapter 2 Functions and Data Types

2.1 Functions

A function in R is a set of steps, operations, and procedures that are done to data in a specific order. R has some functions that are built into the language (many of which we’ll go through in this book), but you can also write your own.

Functions take arguments, which are data points and other items that the function needs to do its job. Think of an argument like a variable: you can change the value it takes each time you run the function. Arguments can be anything from data points themselves, to colors and sizes for plots, to even other functions if necessary. After performing the steps and calculations that it’s supposed to do, a function returns, or gives out, a result.

Note: in R, arguments of functions may have pre-defined values. In this case, unless you specify differently when you call (use) the function, this pre-defined, or default, value will be used instead.

To write your own functions, you need to make use of the function() function. You give your function a name, then specify its arguments as the arguments in function(), and include your steps inside of a set of curly braces ({}). To tell R what to return as the output of your new function, you have two options: you can either just leave it as the last line inside of the curly braces, or you can explicitly state it inside the parentheses of return().

We’ll put a simple example of a user-defined function here to illustrate how simple and useful writing a function can be, although you may not completely understand what’s going on right now. That’s okay, and honestly that’s expected at this point. We haven’t covered what’s going on here (it’s only Chapter 2!), but if we don’t introduce functions conceptually, we can’t refer to, write, or teach them as we go through the book. They’re helpful tools that can save you a ton of time as you get better in R.

The function we’re going to build is called doubler. It will take one argument, x, and return whatever the double of x is. See if you can match the parts in this function to the process we just outlined and with the information about doubler() that we just gave you!

doubler = function(x){
  2 * x
}

Then, to call the function, you simply put the name of the function followed, without a space, by a set of parentheses. Inside of these parentheses, specify the arguments required to make the function run properly. Here’s an example of how to call the doubler() function we just wrote:

doubler(x = 2)

## [1] 4

doubler(4)

## [1] 8

doubler(100)

## [1] 200

As you can see, x can be any number, and doubler() just takes the number (x) and doubles it.

Pro tip: you don’t always need to specify the name of an argument. In the second example of using doubler(), R interprets 4 to be what x is supposed to be. When there’s more than one argument needed for a function, you can either give them in the same order the function looks for them (which you’ll learn about here), or you can specify them by name.
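To make default values and argument matching concrete, here is a minimal sketch. The function power_up() and its arguments base and exponent are made up for this illustration; they are not built into R.

```r
# power_up() is a hypothetical function invented for this example.
# `exponent` has a default value of 2, so it is optional when calling.
power_up = function(base, exponent = 2){
  base ^ exponent
}

power_up(3)                       # exponent falls back to its default: 3^2 = 9
power_up(3, 3)                    # arguments matched by position: 3^3 = 27
power_up(exponent = 3, base = 2)  # arguments matched by name, in any order: 2^3 = 8

# The same function, written with an explicit return()
power_up_explicit = function(base, exponent = 2){
  return(base ^ exponent)
}

power_up_explicit(4)              # 4^2 = 16
```

Both versions give the same results; whether to end with the bare value or with return() is mostly a matter of style.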

2.2 The Types

Not all data is of the same type, or usage format. What this means is that different kinds of information from a dataset get evaluated differently in R. To check what type a piece of data is, you can use the class() function. Let’s go through a few of the most common types of data:

numeric: Numeric data is data that is only numbers. These can be positive, negative, 0, decimals, or even infinity ($\infty$)

character: Character data is anything that involves a letter or special character. These will be denoted by '' (single quotation marks) or "" (double quotation marks). Characters are also called strings. It’s important to note that 2 is of type numeric, while '2' is of type character. A quick check using the class() function:

class(2)

## [1] "numeric"

class('2')

## [1] "character"

logical: Logical data, also known as boolean data, is just a series of TRUE or FALSE values. While this may not necessarily seem like the most useful form of data right now, it’s important to know that this type of data exists. R evaluates T to be TRUE and F to be FALSE, so it’s valid to use T and F in place of TRUE and FALSE, but it’s better practice to use TRUE and FALSE since we may want to use T and F as variables. More on this in a little bit.

factor: Factor data is simply categories. This type of data is really useful for later when we want to split the information on variables such as gender, location, or a variety of other categorical features in the data.

NA: This isn’t actually its own type of data, but it represents a missing value. These can become pesky, but there are ways to work around them. We can choose to replace them with 0 or any other value we want, we can ignore them in our computations, or we can do something completely different with them altogether. The important thing to remember about NA values is that they exist and should be acknowledged.
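As a rough sketch of the types above and of a few ways to deal with NA values, consider the following. The scores vector is made-up data, and it uses the c() function, which is covered in the next section.

```r
# Quick type checks with class()
class(TRUE)             # "logical"
class(factor('apple'))  # "factor"
class(NA)               # "logical": NA is not a type of its own

# Made-up exam scores with one missing value
scores = c(90, NA, 75, 82)

is.na(scores)               # TRUE wherever a value is missing
mean(scores)                # NA: one missing value makes the mean unknown
mean(scores, na.rm = TRUE)  # ignore the NA and average the rest

# Replace missing values with 0 (one option among many)
scores_filled = scores
scores_filled[is.na(scores_filled)] = 0
scores_filled
```

Which option is appropriate depends on the data; replacing a missing score with 0 changes the average, while na.rm = TRUE simply leaves the missing value out.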

2.3 Vectors, Lists, and Data Frames

Each of the data types listed above describes a single point of data, called a scalar. However, we usually don’t have data given to us as one-by-one pieces of information. We’re normally given whole datasets at a time, or at least groups of related data, and they’re much easier to work with.

Vectors

A vector is a grouped set of data. Think about it as if it were the answer to a single question from a survey from all students in the class, or the heights of all basketball players in the NBA. We’ve actually been working with vectors all along! We’ll discuss it more in chapter 4.

R actually treats every value as a vector. That’s why, as you may have noticed, lines of output begin with [1]. This indicates the index (position) in the vector that is at the start of the line. Any time we’ve had any type of data, R has just treated it as a vector of length 1.

One important thing to note about vectors is that all members of the vector must be of the same type. If they aren’t, they will be coerced (changed) to be of the same type. To create a vector, we can use the c() function. This function combines the elements (individual data points) and turns them into a vector. Separate the elements with commas (,). Here are a few examples:

c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

c(TRUE, FALSE, T, F, T, FALSE)

## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

c('R', 'is', 'fun')

## [1] "R"   "is"  "fun"

c(1, 'apple', 2, 'banana')

## [1] "1"      "apple"  "2"      "banana"

Note how in the last example, everything appears inside of double quotation marks. This is an example of coercion in action. 1 and 2 are recognized as type numeric, and 'apple' and 'banana' are recognized as type character. Since it’s easier to change a number to a character than it is to go the other way, 1 and 2 become characters.
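One way to see coercion for yourself is to ask class() what type a mixed vector ended up with. This is a small sketch; the values are arbitrary.

```r
class(c(1, 2, 3))         # all numbers, so the vector is "numeric"
class(c(1, 'apple'))      # a number mixed with a character becomes "character"
class(c(1, TRUE, FALSE))  # a number mixed with logicals becomes "numeric"

# When logicals are coerced to numbers, TRUE becomes 1 and FALSE becomes 0
c(1, TRUE, FALSE)
```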

It may seem cumbersome, time-consuming, and tedious to type out numbers in order as we did in the first vector. : to the rescue! Another way that we can create that vector is by putting the first number we’d like in our vector on the left side of the :, and the last number on the right.

c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

As you can see, they both produce the same output. Let’s say we wanted to only include even numbers. Luckily, there’s a function that allows us to do that as well, and that’s the seq() function. To use it, we start by putting the first number in our vector, then the last number we want in the vector, and finally the amount we want to increment by in each place. To get the even numbers between 1 and 10, we want the following sequence:

seq(2, 10, 2)

## [1]  2  4  6  8 10

Now, let’s say that we want to actually get at the contents of a part of a vector. We can access it by using its index. Unlike some other languages (Python, for example), R starts its indexing at 1, not 0. If we want to access the third element of the vector we just created, we take the index of the part we’re interested in (3) and put it inside of a single set of square brackets ([]) next to the vector. This will return the element in that location.

seq(2, 10, 2)[3]

## [1] 6

To figure out how many elements are contained in a vector, we can use the length() function. This information is also displayed in the Environment tab we configured in chapter 1, assuming that the vector is stored as something in the global environment.
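Here is a short sketch that ties together seq(), :, indexing, and length(). The names odds and countdown are made up for this example, and the arguments of seq() are spelled out by name (from, to, by) instead of by position.

```r
# Odd numbers from 1 to 9, with the arguments of seq() given by name
odds = seq(from = 1, to = 9, by = 2)

# The : operator can also count down
countdown = 10:1

length(odds)        # 5 elements: 1 3 5 7 9
odds[1]             # the first element is at index 1, not 0
odds[length(odds)]  # the last element (9)
countdown[3]        # the third element of 10 9 8 ... is 8
```

Storing a vector under a name, as done here, means it can be reused without retyping it.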

The last function we should mention here is the rep() function. Similar to seq(), this function allows you to create a vector of the same number, repeated any number of times. Want the number 10 to be repeated 30 times? rep() makes this easy, as you can write rep(10, 30). The first argument is the number or vector you’d like to repeat, and the second is the number of times you’d like to repeat it. If the first argument is a vector, and you’d like to repeat each element a certain number of times, include the each = argument, with the number of times that you’d like each element to repeat. Let’s see rep() in action:

rep(2, 10)

##  [1] 2 2 2 2 2 2 2 2 2 2

rep(c(1, 2, 3), 3)

## [1] 1 2 3 1 2 3 1 2 3

rep(c(1, 2, 3), 3, each = 2)

##  [1] 1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3

Lists

On the surface, there’s not much difference between a list and a vector. The biggest difference is that a list can contain different types of data, whereas a vector cannot. To create a list, we can simply use the list() function, again putting all the different parts we want included inside the parentheses, separated by commas.

list(1:10)

## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10

list(1, 'apple', 2, 'banana')

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "apple"
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] "banana"

Lists can be made up of vectors as well. That is, each element of a list is able to be a vector, since a list doesn’t care what type of data each of its elements is. To access a list’s elements, we want to use a double set of square brackets ([[]]) with the index we’d like to access.

list(c('apple', 'banana'), c(1.25, 2.50), 3)[[2]]

## [1] 1.25 2.50

The above example also illustrates that lists don’t need all list elements to be of the same length. Note that the third element of the list is only of length 1 (it’s just the number 3), but the other two elements are of length 2.

Lastly, lists are able to have named elements. To name an element, all you have to do is type NAME OF ELEMENT = before each element, where NAME OF ELEMENT is whatever name you’d like to assign it. In the above example, let’s say we wanted to call the first element fruits, the second element prices, and the third element aisle. Then, our list would look like this:

list(fruits = c('apple', 'banana'), prices = c(1.25, 2.50), aisle = 3)

## $fruits
## [1] "apple"  "banana"
## 
## $prices
## [1] 1.25 2.50
## 
## $aisle
## [1] 3

We can then use a $ to go into the list and “pull out” that element (the vector with the corresponding name). We can then use vector indexing rules to get a particular element from the vector. If we wanted to get 'banana' from our list above, we have two options.

list(fruits = c('apple', 'banana'), prices = c(1.25, 2.50), aisle = 3)$fruits[2]

## [1] "banana"

list(fruits = c('apple', 'banana'), prices = c(1.25, 2.50), aisle = 3)[[1]][2]

## [1] "banana"

As you can see, both options return 'banana', so these options are equivalent.
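Retyping the whole list each time gets tedious. A sketch of the same idea with the list stored under a name (groceries is a name chosen for this example) might look like this:

```r
# Store the list once, then refer to it by name
groceries = list(fruits = c('apple', 'banana'), prices = c(1.25, 2.50), aisle = 3)

groceries$fruits[2]       # access by name with $, then index the vector
groceries[[1]][2]         # access by position with [[ ]], then index the vector
groceries[['prices']][1]  # [[ ]] also accepts the element name in quotes

names(groceries)          # the names of the list elements
length(groceries)         # the number of list elements (3)
```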

Data Frames

The last major type of combined data storing we need to talk about is a data frame. You can think of a data frame as a big table with the data you’d like. Each row of data is called an observation, and each column represents a feature or a variable. We’ll use a few of our own data frames throughout the semester, but it’s good to know that R comes with some of its own data frames already. The example below comes from the crabs data frame in the MASS package (see chapter 3 for more information on packages). The first 6 rows are shown below.

sp	sex	index	FL	RW	CL	CW	BD
B	M	1	8.1	6.7	16.1	19.0	7.0
B	M	2	8.8	7.7	18.1	20.8	7.4
B	M	3	9.2	7.8	19.0	22.4	7.7
B	M	4	9.6	7.9	20.1	23.1	8.2
B	M	5	9.8	8.0	20.3	23.0	8.2
B	M	6	10.8	9.0	23.0	26.5	9.8

To see a data frame, you’ll want to use the View() command. This will open the data frame in the Source pane.

A few things about data frames:

They’re really just an easy-to-see list. You can access any column (feature) by using the $ operator. The syntax (way to write the code) is: df_name$column_name, where df_name and column_name are the data frame name and column name respectively.
All columns (or list elements) must be of the same length. They may contain NA values, but their lengths must be the same.
To get the number of rows of a data frame, use the nrow() function. To get the number of columns, you can either use length() (since, as stated before, it’s just a list of vectors), or ncol(). This information is also in the Environment tab.
You can make your own with the data.frame() function. Just put the vectors you’d like to include, separated by commas, inside of the parentheses. Just like with lists, you can name the columns of a data frame as you create it. As long as the vectors are of the same length, you’ll be making data frames in no time!
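The points above can be sketched with a small, made-up data frame. The name produce and its columns are invented for this example.

```r
# Three vectors of the same length become three named columns
produce = data.frame(
  fruit = c('apple', 'banana', 'cherry'),
  price = c(1.25, 2.50, NA),   # NA is allowed, but the length must still match
  aisle = c(3, 3, 4)
)

produce$price    # pull out one column as a vector
nrow(produce)    # number of rows (observations): 3
ncol(produce)    # number of columns (features): 3
length(produce)  # also 3, since a data frame is a list of columns
```

Running View(produce) in RStudio would then open this table in the Source pane.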

2.4 Coercing To Other Types

The last point we’ll make about different kinds of data is that you can coerce it yourself to be of another type. There is a whole family of functions, the as.*() functions (such as as.numeric() and as.character()), that are helpful here. Have a character string that’s just a number? No problem! We saw before that '2' is of type character. What if we wanted it to be of type numeric?

class(as.numeric('2'))

## [1] "numeric"

Awesome.
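A few more of the as.*() functions in action, as a quick sketch. The last line shows what happens when a conversion is not possible; suppressWarnings() is used there only to keep the output tidy.

```r
as.numeric('2') + 1   # now that '2' is a number, we can do math with it: 3
as.character(42)      # the number 42 becomes the character "42"
as.logical('TRUE')    # the character 'TRUE' becomes the logical TRUE
as.numeric(TRUE)      # TRUE becomes 1

# A string that isn't a number can't be converted, so R gives back NA
# (normally along with a warning)
suppressWarnings(as.numeric('apple'))
```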

2.5 `identical()`

This is as good a time as any to introduce the identical() function. What it does is check whether the things supplied to it are identical. It returns TRUE if the arguments are identical, and FALSE if they’re different. Examples, with some of the syntax described above, are as follows:

identical(1, 1)

## [1] TRUE

x = 1:10 # This has length 10
y = 2:10 # This has length 9
identical(x, y)

## [1] FALSE

identical(x[10], y[9])

## [1] TRUE

This function is very helpful when you want to check if two vectors or lists have the same information. It’s also particularly useful when you want to check if the outputs or results of different functions are the same if you’re reorganizing/rewriting code.
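Here is a short sketch of both uses. Note that identical() is strict about types as well as values. The function doubler_v2() is a hypothetical rewrite of doubler() made up for this example, and 1L is R’s way of writing the integer 1, which we haven’t formally covered.

```r
# Comparing vectors and values
identical(c(1, 2, 3), c(1, 2, 3))  # TRUE: same values, same type
identical(2, '2')                  # FALSE: numeric vs. character
identical(1L, 1)                   # FALSE: 1L is stored as an integer, 1 is not

# Checking that a rewritten function gives the same result as the original
doubler = function(x){
  2 * x
}
doubler_v2 = function(x){
  x + x
}
identical(doubler(4), doubler_v2(4))  # TRUE: both return 8
```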

2.6 The `which()` Family

In addition to the identical() function, another useful family of functions is the which() family: which(), which.min(), and which.max(). When used on a vector, they’ll give you the index (or indices) where the specified condition holds. which() can be used for any condition you’d like, while which.min() and which.max() return the index of the minimum and maximum values of the vector, respectively. Here’s a quick example:

z = c(1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15)
which(z == 4) # Returns the index where z is 4 (should be 7)

## [1] 7

which.min(z) # Should return 1

## [1] 1

which.max(z) # Should return 2 (biggest value is 16 at index 2)

## [1] 2

Since the columns of a data frame are really just vectors, the which() family of functions, when used on a column of a data frame, will return the row number(s) corresponding to the condition. Not something we’ll spend a ton of time with right now, but a good thing to keep in mind going forward.
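As a minimal sketch of that idea, here is the which() family used on a column of a made-up data frame. The name menu and its columns are invented for this example.

```r
menu = data.frame(
  item  = c('apple', 'banana', 'cherry', 'mango'),
  price = c(1.25, 2.50, 0.75, 3.00)
)

which(menu$price > 1)   # rows where the price is above 1: 1 2 4
which.max(menu$price)   # row of the most expensive item: 4
which.min(menu$price)   # row of the cheapest item: 3

# The row number can then be used to look up another column
menu$item[which.min(menu$price)]  # the cheapest item (cherry)
```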