Hi!
I’m looking for a fellow biologist to help me with some calculations.
Do you know what 2.8×10^(-7) represents? It’s a good estimate of the mutation rate in human somatic cells per generation.
Apparently, my body consists of roughly 28 trillion cells, which means that over my lifetime I can expect around 7.84 million mutations to pop up here and there in my genetic code.
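If you’d like to double-check my arithmetic, R is happy to act as a calculator. A tiny sketch using the two numbers above (the variable names are mine, made up purely for illustration):

```r
# Numbers taken from the text above; the variable names are invented for this example
n_cells <- 28e12          # roughly 28 trillion cells
mutation_rate <- 2.8e-7   # the mutation rate quoted above

expected_mutations <- n_cells * mutation_rate
expected_mutations / 1e6  # the same number expressed in millions
```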
This number seems quite big! Fortunately, most mutations are neutral. Or at least “nearly neutral”.
Okay, I know mutations are a fascinating but sometimes slightly daunting topic. That’s why I hasten to tell you that the 🌟mutate()🌟 function from the dplyr package in R is much friendlier, and you may find it quite useful. It helps with, well, mutating an existing data frame: you add new columns (or change existing ones) so that the data frame holds all the information you need.
For my PhD, I record the number of chicks that survive to 7 and 14 days post-hatching — two checkpoints of their development. Those that make it to two weeks are lucky enough to be weighed, measured in every possible way and ringed.
So, how would this look using mutate()?
Knowing the hatch date, which we record each year when visiting nests, I can create two additional columns with values that represent these two checkpoints. To get started, grab the data and load it:
### Loading the data ###
library(readr)
library(dplyr)
blue_tits <- read_csv("../data/blue_tits.csv") # Remember to use the correct file path
Then you can create two new columns by typing:
bt_new_df <- mutate(blue_tits,
                    First_Check = Hatching_day + 7,
                    Final_check = Hatching_day + 14)
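If you don’t have the data file at hand, here is a minimal sketch of the same idea on a toy data frame. The nests and numbers are invented, and I’m assuming Hatching_day is stored as a plain day count (the same code would also work if it were a Date column):

```r
library(dplyr)

# Toy data, invented for illustration only
toy_tits <- data.frame(
  Nest = c("A", "B", "C"),
  Hatching_day = c(40, 43, 51)  # assumed to be a numeric day count
)

toy_checks <- mutate(
  toy_tits,
  First_Check = Hatching_day + 7,   # first checkpoint, one week after hatching
  Final_check = Hatching_day + 14   # second checkpoint, two weeks after hatching
)

toy_checks  # the original two columns plus the two new ones
```

Note that mutate() returns a new data frame and leaves toy_tits untouched, which is why the result is saved under a new name.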
If we want to be fancy, we can also nest functions, that is, call one function inside another (ooh, some jargon here! You can use it to impress your supervisor, if you have one. I’m not sure your friends will get excited but, joking aside, it’s always good to know).
We can add one more column to our new data frame (we will overwrite bt_new_df with the extended version). If we are interested in whether a given bird hatched earlier or later than the “average” bird, we can type:
bt_new_df <- mutate(bt_new_df,
                    Early_or_Late = Hatching_day - mean(Hatching_day, na.rm = TRUE))
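It’s important to add the “na.rm = TRUE” bit as well; otherwise every value in our new column will be NA! This is because a single missing value is enough to make mean() return NA, and subtracting NA from anything gives NA again. Setting na.rm = TRUE tells R to drop the missing values before calculating (I would say it’s a bit like when we were taught not to divide anything by zero: in this case, remove the NAs before performing calculations).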
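To see what a single missing value does to mean(), here is a small sketch on a made-up vector (the numbers are invented, not taken from the blue tit data):

```r
# Toy vector with one missing value, invented for illustration
hatching_days <- c(40, 43, NA, 51)

mean(hatching_days)                # NA: one missing value makes the whole mean missing
mean(hatching_days, na.rm = TRUE)  # the mean of the three known values

# Subtracting an NA mean gives NA for every bird, even those with a known hatching day
hatching_days - mean(hatching_days)
```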
Negative values will denote early birds, while positive values will represent late birds. This is useful because it allows us to compare the two groups and, for example, test whether birds that hatch earlier in the season are heavier than those that hatch later (chick weight is often used as a health proxy 💪).
We can also refine our line of code to make it more realistic: since decimal places aren’t necessary for days, let’s round the result to the nearest whole day.
mean_hatching_day <- mean(bt_new_df$Hatching_day, na.rm = TRUE)
bt_new_df <- mutate(bt_new_df, Early_or_Late = round(Hatching_day - mean_hatching_day))
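A quick word on what round() actually does: it goes to the nearest whole number rather than always rounding upwards (that would be ceiling()). A small sketch with made-up values:

```r
# Made-up deviations from the mean hatching day, in days
deviations <- c(-2.4, 0.3, 1.6)

round(deviations)    # nearest whole day: -2  0  2
ceiling(deviations)  # always upwards:    -2  1  2
```

So a bird that hatched only a fraction of a day off the mean ends up with 0 after rounding.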
If we want a nice, neat list of the birds in each group, we can save them as:
early_birds <- filter(bt_new_df, Early_or_Late < 0)
late_birds <- filter(bt_new_df, Early_or_Late > 0)
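One thing to keep in mind: birds whose rounded value is exactly 0 land in neither list. If you would rather keep everyone in a single data frame, one option is to label the groups with dplyr’s case_when(). This is only a sketch on invented toy data; the Timing column and its labels are hypothetical names of mine, not part of the original data set:

```r
library(dplyr)

# Toy data, invented for illustration; Early_or_Late holds rounded deviations from the mean
toy_tits <- data.frame(
  Nest = c("A", "B", "C", "D"),
  Early_or_Late = c(-5, 0, 0, 6)
)

# Timing is a hypothetical column name chosen for this example
toy_tits <- mutate(
  toy_tits,
  Timing = case_when(
    Early_or_Late < 0  ~ "early",
    Early_or_Late > 0  ~ "late",
    Early_or_Late == 0 ~ "average"  # birds with a missing value would stay NA
  )
)

count(toy_tits, Timing)  # how many birds fall into each group
```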
And, at the end of the day, if you decide you don’t need the “Early_or_Late” column after all, you can use mutate() to remove it by setting it to NULL:
final_bt_df <- mutate(bt_new_df, Early_or_Late = NULL)
head(final_bt_df) # check whether we've succeeded
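If you want to convince yourself that the column really is gone, names() is a quick check. The sketch below uses invented toy data and also shows select() with a minus sign, another dplyr way of dropping a column that gives the same set of columns:

```r
library(dplyr)

# Toy data, invented for illustration only
toy_tits <- data.frame(
  Nest = c("A", "B"),
  Hatching_day = c(40, 43),
  Early_or_Late = c(-2, 1)
)

via_mutate <- mutate(toy_tits, Early_or_Late = NULL)  # the approach used above
via_select <- select(toy_tits, -Early_or_Late)        # an alternative way to drop a column

names(via_mutate)  # only Nest and Hatching_day remain
names(via_select)  # same columns
```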
Carpe diem,
Aga
PS: Here is the survey in which you can tell me what R topic you find particularly confusing and why you want to learn it so that we can shape this space together!