They don't exist but they still make you suffer

Meet missing values!

Oct 02, 2024

Hi, beautiful creature!

The moment you start collecting biological (or other kinds of) data yourself, chances are your spreadsheets will have some holes. Sometimes they are caused by a far-from-ideal data collection protocol, and sometimes they arise from the nature of your research 🤷.

For example, in my work, I put down the dates when birds build nests, lay eggs and take care of their chicks 🐣. My missing values come from three scenarios:

A) Not every nest box is occupied.

B) Not every bird finishes building its nest after moving into a nest box.

C) Not every parent raises chicks that fledge, because some couples abandon their nests and some nests are predated (sad, I know).

Missing values are not inherently bad, but they can cause a lot of pain if not accounted for. They mess up our calculations, stop certain R commands from running and sometimes make it impossible to visualise the data 🥲.

Okay, so how do you deal with them? There are many ways of doing that but let me show you 3 simple ones.

Grab the dataset here, and let’s start:

If you have just empty cells

EDIT: Apologies, I made some mistakes here, i.e. in the original version of the tutorial. I usually run every single command line in RStudio myself, but I wrote this bit on a train and clearly messed up a few concepts. This shows missing values can indeed make one suffer… Sadly, there is no such thing as na.remove() function — instead, you should use ⭐na.omit()⭐ from the stats package. There is, however, an argument called na.rm, which, when set to TRUE, makes R ignore NAs when performing calculations.

dataset_without_NAs <- na.omit(name_of_your_dataset) 

#For example

blue_tits_cleaned <- na.omit(blue_tits)

This removes every row that contains at least one NA, in any column.
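If you want to watch this happen before touching your real data, here is a minimal sketch on a toy data frame. The values and the “Nest_box” column name are made up for illustration; only “Laying_day” and “Hatching_day” come from my dataset.

```r
# Toy data frame standing in for the real dataset (made-up values)
toy_tits <- data.frame(
  Nest_box     = c("A1", "A2", "A3", "A4"),
  Laying_day   = c(12, NA, 15, 20),
  Hatching_day = c(30, 33, NA, 41)
)

# Drop every row that has an NA in any column
toy_tits_cleaned <- na.omit(toy_tits)

nrow(toy_tits)          # 4 rows before
nrow(toy_tits_cleaned)  # 2 rows after: A2 and A3 each had one NA
```

Notice that A2 is gone even though its hatching day was recorded, which is why na.omit() can be quite greedy on datasets with many columns.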

mean(blue_tits$Hatching_day, na.rm = TRUE) # remember to use the correct name of your original dataset

This is an example of calculating an average hatching day using the na.rm argument, which doesn’t affect the dimensions of the original data frame (so no rows are disappearing).
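To see the difference na.rm makes, here is a small sketch with a made-up vector of hatching days (not my real data):

```r
hatching_day <- c(30, 33, NA, 41)  # made-up values

mean(hatching_day)                # NA: a single missing value spoils the result
mean(hatching_day, na.rm = TRUE)  # the mean of 30, 33 and 41
length(hatching_day)              # still 4: nothing was removed from the vector
```

The same argument exists in many other base R summary functions, such as sum(), median(), min(), max() and sd().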

Sadly, again, na.omit() applied to a whole data frame drops rows with an NA in any column, so on its own it cannot remove only the rows that have NAs in a specific column (and even if it could, I still got the code wrong here 🤦🤦🤦…). The correct way to do it is:

data_frame <- data_frame[!(is.na(data_frame$column)), ]

#For example

bt_cleaned <- bt_cleaned[!(is.na(bt_cleaned$Hatching_day)), ]

This removes the rows that have NAs in one of the columns (the “Hatching_day” column in this case). Here, we are using the is.na() function to trace all the missing values…

… and then we apply “!” to flip the result, so that we keep only the rows whose value in the column of interest is NOT an NA.
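Here is a rough step-by-step sketch of what is.na() and “!” return, using a toy data frame with made-up values (the “Nest_box” column name is my invention for this example):

```r
toy_tits <- data.frame(
  Nest_box     = c("A1", "A2", "A3"),
  Hatching_day = c(30, NA, 41)
)

# TRUE wherever the value is missing
missing <- is.na(toy_tits$Hatching_day)
missing    # FALSE  TRUE FALSE

# "!" flips every TRUE to FALSE and vice versa
!missing   # TRUE FALSE  TRUE

# Rows marked TRUE are kept, so A2 is dropped
toy_tits[!missing, ]
```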

I’ve also realised that my dataset contains NAs only in the “Hatching_day” column so to fully appreciate this way of removing missing values, we could introduce some NAs to the “Laying_day” column:

new_row <- list("A333", NA, 44) # NA without quotes; a list keeps 44 numeric, whereas c() would turn every value into text

bt_cleaned <- rbind(bt_cleaned, new_row)

Now, after removing the NAs from “Hatching_day”, you should still see one missing value in the “Laying_day” column. Alternatively, you can remove the row with the missing value in “Laying_day” and keep all the NAs in “Hatching_day” 🙃.
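A minimal sketch of this little experiment, assuming a three-column data frame in the order ID, “Laying_day”, “Hatching_day” (the values and the “Nest_box” name are made up). The new row is passed as a list so that each value keeps its own type:

```r
toy_tits <- data.frame(
  Nest_box     = c("A1", "A2"),
  Laying_day   = c(12, 14),
  Hatching_day = c(30, NA)
)

# Add a row with a real NA (no quotes) in Laying_day
toy_tits <- rbind(toy_tits, list("A333", NA, 44))

# Remove NAs column by column
no_na_hatching <- toy_tits[!is.na(toy_tits$Hatching_day), ]
no_na_laying   <- toy_tits[!is.na(toy_tits$Laying_day), ]

sum(is.na(no_na_hatching$Laying_day))  # 1: the NA in Laying_day survives
sum(is.na(no_na_laying$Hatching_day))  # 1: the NA in Hatching_day survives
```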

You can also achieve the same goal using the subset() command (see step 3).
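For comparison, here is how the subset() version might look next to the bracket version, again on a made-up toy data frame:

```r
toy_tits <- data.frame(
  Nest_box     = c("A1", "A2", "A3"),
  Hatching_day = c(30, NA, 41)
)

# Bracket version
with_brackets <- toy_tits[!is.na(toy_tits$Hatching_day), ]

# subset() version: column names can be used directly, without the "$"
with_subset <- subset(toy_tits, !is.na(Hatching_day))

with_brackets$Nest_box  # "A1" "A3"
with_subset$Nest_box    # "A1" "A3"
```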

If you already populated your empty cells with NAs

Ha, this one is easy, same as above 🙃!

If you decided to be creative and populate empty cells with “na” or something like this
```
your_data_without_some_NAs <- subset(name_of_your_dataset, name_of_the_column != "na")
```

Remember, R is case-sensitive, so “na” is not equal to “Na” or “nA”.
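If you suspect your placeholder was typed inconsistently, one possible workaround (my suggestion, not the only way) is to lower-case the column with tolower() before comparing. A sketch with made-up values:

```r
toy_tits <- data.frame(
  Nest_box     = c("A1", "A2", "A3", "A4"),
  Hatching_day = c("30", "na", "Na", "41")  # text column because of the placeholders
)

# Exact match: "Na" slips through
strict <- subset(toy_tits, Hatching_day != "na")
nrow(strict)   # 3

# Case-insensitive match: catches "na", "Na", "nA" and "NA"
relaxed <- subset(toy_tits, tolower(Hatching_day) != "na")
nrow(relaxed)  # 2
```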

If you decide to populate all the empty cells with “puppies” 🐶 to soothe the pain of data analysis, the command will look like this:

your_data_without_some_NAs <- subset(name_of_your_dataset, name_of_the_column != "puppies")

If you want to get rid of the missing values from the entire dataset, it is worth first converting whatever placeholder you currently use for empty cells into proper NAs (which R recognises as missing values) by typing:

name_of_your_dataset[name_of_your_dataset == "na"] <- NA

and then using na.omit() from step 1.
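Put together, the two steps could look roughly like this on a toy data frame (made-up values, with “na” as the placeholder). One thing to watch out for: columns that contained a text placeholder are stored as text, so you may want to convert them back to numbers afterwards.

```r
toy_tits <- data.frame(
  Nest_box     = c("A1", "A2", "A3"),
  Laying_day   = c("12", "na", "15"),
  Hatching_day = c("30", "33", "na")
)

# Step 1: turn the placeholder into proper NAs
toy_tits[toy_tits == "na"] <- NA

# Step 2: drop every row that contains an NA
toy_tits_cleaned <- na.omit(toy_tits)
nrow(toy_tits_cleaned)  # 1: only A1 has no missing values

# The day columns are still text, so convert them back to numbers
toy_tits_cleaned$Laying_day   <- as.numeric(toy_tits_cleaned$Laying_day)
toy_tits_cleaned$Hatching_day <- as.numeric(toy_tits_cleaned$Hatching_day)
```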

If you don’t know the [ ] brackets yet, fear not, I will cover this topic soon!


Just remember:

Always keep the original dataset that contains empty cells and work on subsequent versions. This will save you in case you get carried away and remove too many rows.
Missing values are NOT the same as 0s! For example, in my dataset, 1 denotes the 1st day of April, so 0 denotes the 31st of March.
0 is a perfectly valid value that affects the outcome of calculations, so don't try to replace your missing values with 0s!
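To see how much damage the zeros can do, here is a quick sketch with made-up hatching days:

```r
hatching_day <- c(30, NA, 40, NA)  # made-up values

# Ignoring the NAs: the average of the two real observations
mean(hatching_day, na.rm = TRUE)   # 35

# Replacing the NAs with 0s: two fake "31st of March" records sneak in
with_zeros <- hatching_day
with_zeros[is.na(with_zeros)] <- 0
mean(with_zeros)                   # 17.5
```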

Now you are ready to deal with your patchy datasets!

To take a break from coding, I recommend picking up “The Greatest Show on Earth” by Richard Dawkins.

Or having a cup of hot chocolate in a LEGO café.

Don’t miss me too much,

Aga

PS: Here is the survey in which you can tell me what R topic you find particularly confusing and why you want to learn it so that we can shape this space together!

DaRwinian Lass
