Hi!
My brain cannot function properly when there is so much fluffiness around. And how is yours 🧠?
Fortunately, we can learn how to make R do stuff for us.
So far we have been using the built-in functions in R, e.g. mean(), and got a nice, neat output in the blink of an eye. However, if we were to calculate it manually using a calculator, we would have to add all the values of interest together and divide the sum by their total number. So our calculations of the average laying date would look like this: (4 + (-2) + 4 + 6 + (-1) + …)/152.
If we wanted to replicate it in R, we would have to sum all the elements and divide this number by their total count (grab the training dataset first):
### Loading libraries ###
library(readr)
### Loading the dataset ###
blue_tits <- read_csv("../data/blue_tits.csv") # read_csv() is the readr function; base R's read.csv() would work without loading readr
### Two ways of calculating the average laying date ###
mean(blue_tits$Laying_day) # the built-in mean() function
# versus
sum(blue_tits$Laying_day) / length(blue_tits$Laying_day) # using the sum() and length() functions
Hopefully, you got the same results! Please remember that this is a simplified way of calculating the mean — the “Laying_day” column doesn’t have missing values that would otherwise mess up the calculations. To account for that, we would have to add one more step. Let’s test it on the “Hatching_day” column:
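column <- blue_tits$Hatching_day # so we don't have to type so much
total <- sum(column, na.rm = TRUE) # called "total" so we don't reuse the name of the sum() function
count <- length(column[!is.na(column)]) # how many values are not missing
total / count
# versus
mean(column, na.rm = TRUE)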
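If you are wondering why the extra step matters, here is a small sketch on a made-up vector (the name toy and its values are invented for illustration, they are not from the blue tit dataset). It shows what happens to sum() and length() when a value is missing:

```r
# Made-up values with one missing entry (not from the dataset)
toy <- c(4, -2, NA, 6)

sum(toy)                # NA: a single missing value makes the whole sum NA
length(toy)             # 4: length() counts the NA as well

sum(toy, na.rm = TRUE)  # 8: the NA is dropped before summing
sum(!is.na(toy))        # 3: counts only the values that are present
```

So without na.rm = TRUE the sum is NA, and without filtering the NAs the count is too high.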
If you don’t remember how to get rid of the missing values or what indices are, you can read about these topics in the previous tutorials.
Do you see where this is going? Typing all of these steps every time is tedious and error-prone. What if we could wrap them up and reuse them? This is where R functions come to help.
The general scaffold for the R function is:
my_function <- function() { }
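To make the scaffold a bit more concrete, here is a minimal sketch with one argument and a body. The function double_it is a hypothetical example made up for this illustration:

```r
# Hypothetical example: a function with one argument
double_it <- function(x) {
  result <- x * 2   # the body: what the function does with its input
  return(result)    # the value handed back to whoever calls the function
}

double_it(4)  # 8
```

The name inside the parentheses (here x) is just a placeholder for whatever you pass in when you call the function.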
In our case, it will look like:
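my_mean_function <- function(column) {
  total <- sum(column, na.rm = TRUE) # sum of the non-missing values
  count <- length(column[!is.na(column)]) # number of non-missing values
  result <- total / count
  # Return the result explicitly (an assignment on the last line would only return it invisibly)
  return(result)
}
my_mean_function(blue_tits$Hatching_day)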
Of course, this function works not only on columns. We just named the argument “column” to make it clearer in our specific case. We could use it just as well on a vector of random values.
random_vector <- c(1,23,4,67,9,76,4,32)
my_mean_function(random_vector)
Did you also get 27 as the output? Great!
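The random vector above has no missing values, so here is a quick sketch that checks the NA handling too. The vector with_gaps is made up for illustration, and the function is written out again (in a slightly shorter form) so the snippet runs on its own:

```r
# Written out again so this snippet is self-contained
my_mean_function <- function(column) {
  total <- sum(column, na.rm = TRUE)
  count <- sum(!is.na(column))  # same count as length(column[!is.na(column)])
  return(total / count)
}

with_gaps <- c(2, NA, 4, 6, NA)  # made-up values with two missing entries

my_mean_function(with_gaps)      # 4
mean(with_gaps, na.rm = TRUE)    # 4
```

Both lines should agree, because (2 + 4 + 6) / 3 = 4.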
Okay, this may seem to be just a nice little exercise. After all, we can just use the mean() function, right 🤔?
But what if R doesn’t have a built-in function we need? For example, if we wanted to check what the mode of the egg-laying dates is. First, let’s think about what steps we would have to follow to calculate such value(s):
The mode is the most frequent number (or numbers) in your list of values. So we first have to calculate how often each egg-laying date appears in our dataset using the table() function:
values <- blue_tits$Laying_day
frequency_table <- table(values)
Then we have to find the number that occurs most often using which.max():
which.max(frequency_table)
What we want from our function is the mode itself; we are not really interested in its index. which.max() returns the index (position) of the highest count in the table, and table() stores both the numbers and their frequencies. The names() function returns just the numbers from our “frequency_table” (as text), so with [ ] we can pick out the one that sits at that index.
Okay, that was a lot, so let’s break it down:
First, we index our frequency_table with the position we got from which.max():
frequency_table[which.max(frequency_table)]
This will give us both the number that occurs the most often and its frequency. In our case, it would be 6 & 13.
names(frequency_table)
This will give us all of the unique values from the column “Laying_day”.
After combining these two bits we get:
names(frequency_table)[which.max(frequency_table)]
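If you would like to see each of these steps on something smaller, here is a sketch on a made-up vector (toy_days and its values are invented for illustration):

```r
toy_days <- c(3, 6, 6, 2, 6, 3)  # made-up values: 6 appears three times

toy_table <- table(toy_days)     # counts per value: 2 -> 1, 3 -> 2, 6 -> 3
toy_table

which.max(toy_table)             # position 3 (labelled "6"): the highest count
toy_table[which.max(toy_table)]  # the value 6 together with its count, 3
names(toy_table)                 # "2" "3" "6"

names(toy_table)[which.max(toy_table)]              # "6" (text, not a number)
as.numeric(names(toy_table)[which.max(toy_table)])  # 6, converted back to a number
```

Note that names() gives back text, so you may want as.numeric() if you plan to do maths with the result.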
And here is our final function:
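my_mode <- function(x) {
  frequency_table <- table(x) # how often each value appears
  # names() returns text, so the mode comes back as a character string, e.g. "6"
  most_frequent <- names(frequency_table)[which.max(frequency_table)]
  return(most_frequent)
}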
Check it by running:
values <- blue_tits$Laying_day
my_mode(values)
Did it work 🙃?
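One thing to keep in mind: which.max() only reports the first maximum, so if two values are tied for the most frequent, only one of them comes back. Here is a possible sketch of a variant that returns all of them. The name my_modes and the test vector are made up for illustration, and this is just one way to do it:

```r
# Hypothetical variant: return every value that shares the highest count
my_modes <- function(x) {
  frequency_table <- table(x)
  counts <- as.vector(frequency_table)  # plain vector of counts
  # keep every name whose count equals the highest count
  return(names(frequency_table)[counts == max(counts)])
}

my_modes(c(1, 2, 2, 3, 3))  # "2" "3": both appear twice
my_modes(c(1, 2, 2, 3))     # "2"
```

As before, the values come back as text, so wrap the call in as.numeric() if you need numbers.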
Have a great evening,
Aga
PS: Here is the survey in which you can tell me what R topic you find particularly confusing and why you want to learn it so that we can shape this space together!