Hi,
Sometimes, being a zoologist means waking up at ungodly hours when it feels like only you and some deer have things to do.
But other times, it means spending an entire day in the office, just like everyone else. Recently, I’ve been working with large datasets that include lots of grouping factors (e.g., data collected over many years, months, and days). My goal is to see whether trends differ across years. This means that my current favourite functions are ⭐group_by()⭐ and ⭐summarise()⭐ from the dplyr package.
First, though, I have to say one more thing about good coding practices. If you use multiple packages in R, you might get tired at the mere thought of checking the versions of each package. Or perhaps you worry that, when trying to replicate someone else’s work, you’ll spend hours searching for the exact package versions they used. Making a note of — or even just paying attention to — the package version is a good habit, but the best approach is to get familiar with the concept of a conda environment. I won’t dive into this topic now, but thanks, Stanislav, for bringing it up.
As always, you can get our mini dataset here and explore some basic but very useful things you can do with group_by() and summarise(). My brain sometimes likes to confuse the latter with summary() — another very useful function that provides a good overview of the descriptive statistics of your data.
### Libraries ###
library(readr)
library(dplyr)
### Loading the data ###
blue_tits <- read_csv("../data/blue_tits.csv") # Remember to use the correct file path
### Descriptive statistics ###
data_summary <- summary(blue_tits)
Often, we need to explore more complex trends. For example, if we wanted to calculate how many nests had females laying eggs on the same day, we could type:
output <- summarise(
  blue_tits,
  count = n(),
  .by = Laying_day
)
Alternatively, we could use:
output_2 <- summarise(group_by(blue_tits, Laying_day), count = n())
I like to break tasks into smaller steps, so the above command could also look like this:
grouped_bt <- group_by(blue_tits, Laying_day)
output_3 <- summarise(grouped_bt, count = n())
In all of these examples, the function n() counts the rows in each group, that is, how many records share the same laying day. The first and second outputs will be almost identical.
However, in the second case, the laying dates are automatically arranged in ascending order, whereas .by keeps the groups in the order in which they first appear in the data. Additionally, if you use the str() function on grouped_bt, you’ll notice that group_by() returns a grouped tibble (class grouped_df) rather than a plain one (though a tibble is essentially a type of data frame, and read_csv() already gave us one).
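If you don't have the dataset at hand, the ordering difference can be sketched with a tiny made-up tibble. Note that toy_tits, its Nest column, and all its values are my assumptions for illustration and not the real data; the .by argument also needs dplyr 1.1.0 or newer.

```r
library(dplyr)

# Hypothetical toy data: one row per nest, Laying_day deliberately unsorted
toy_tits <- tibble(
  Nest = c("A", "B", "C", "D", "E"),
  Laying_day = c(12, 10, 12, 11, 10)
)

# .by keeps the groups in order of first appearance
by_version <- summarise(toy_tits, count = n(), .by = Laying_day)

# group_by() sorts the groups in ascending order
group_version <- summarise(group_by(toy_tits, Laying_day), count = n())

by_version$Laying_day     # 12 10 11
group_version$Laying_day  # 10 11 12
```

The counts are the same in both versions; only the row order differs.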
When using group_by(), it’s crucial to think about your intentions. Do you want to group the data only for a specific operation? Or do you want the data to remain grouped for subsequent steps? If it’s the former, remember to ungroup your data afterwards using the ungroup() function. When you view grouped_bt in RStudio, you might not notice any difference at first, but try running these commands:
group_vars(grouped_bt) # And
group_vars(blue_tits)
Did you notice the difference?
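Here is a small sketch of the same check, followed by ungroup(), on made-up data (toy_tits and its values are hypothetical, just so the snippet runs on its own):

```r
library(dplyr)

# Hypothetical toy data standing in for blue_tits
toy_tits <- tibble(Laying_day = c(12, 10, 12, 11, 10))
toy_grouped <- group_by(toy_tits, Laying_day)

group_vars(toy_grouped)  # "Laying_day"
group_vars(toy_tits)     # character(0), i.e. no grouping variables

# ungroup() removes the grouping again
toy_ungrouped <- ungroup(toy_grouped)
group_vars(toy_ungrouped)     # character(0)
is_grouped_df(toy_ungrouped)  # FALSE
```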
Understanding when and why to (un)group your data is important, because forgetting that your data is still grouped can lead to unexpected behaviour when you run a command. While the following example doesn’t make much biological sense, it illustrates my point well:
behaviour <- summarise(blue_tits, total = sum(Laying_day))
behaviour_2 <- summarise(grouped_bt, total = sum(Laying_day))
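To make the difference concrete without the real file, here is the same pair of calls on a made-up tibble (toy_tits and its numbers are my assumption, not the actual dataset):

```r
library(dplyr)

# Hypothetical toy data standing in for blue_tits
toy_tits <- tibble(Laying_day = c(12, 10, 12, 11, 10))
toy_grouped <- group_by(toy_tits, Laying_day)

# Ungrouped: one row with the overall sum
toy_behaviour <- summarise(toy_tits, total = sum(Laying_day))
toy_behaviour$total    # 55

# Grouped: one row per laying day, each with its own sum
toy_behaviour_2 <- summarise(toy_grouped, total = sum(Laying_day))
toy_behaviour_2$total  # 20 11 24 (for days 10, 11, 12)
```

So the same command gives one number on ungrouped data and one number per group on grouped data, which is easy to miss if you forgot the data was still grouped.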
If you would like to play with more examples, consider the mutate() function we met in the past, which also behaves differently depending on whether the data is grouped or not:
example <- mutate(grouped_bt, Latest_LD = max(Laying_day))
example_2 <- mutate(blue_tits, Latest_LD = max(Laying_day))
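A quick sketch of what to expect, again on a made-up tibble (toy_tits and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for blue_tits
toy_tits <- tibble(Laying_day = c(12, 10, 12, 11, 10))

# Grouped: max() is computed within each laying day,
# so every row simply gets its own Laying_day back
toy_example <- mutate(group_by(toy_tits, Laying_day), Latest_LD = max(Laying_day))
toy_example$Latest_LD    # 12 10 12 11 10

# Ungrouped: max() is computed over the whole column
toy_example_2 <- mutate(toy_tits, Latest_LD = max(Laying_day))
toy_example_2$Latest_LD  # 12 12 12 12 12
```

Unlike summarise(), mutate() keeps every row, and the result of the grouped version is itself still grouped.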
Wish you a great evening!
Aga
PS: Here is the survey in which you can tell me what R topic you find particularly confusing and why you want to learn it so that we can shape this space together!