Spic & Span: Cleaning & Screening Your Data in R

Learn how to Clean and Screen Data in R

This post was written by Kelly Spellman as part of her coursework for RS 932 at the MGH Institute of Health Professions.

Tutorial Outline:

  • Converting variables to numeric, factors, etc
  • Checking data for range, NAs, extreme values, data entry error
  • Re-coding
  • Excluding NAs
  • Re-labeling Factors
  • Collapsing Factors
  • Re-leveling Factors

First, we’ll need to load some packages so that we can take a look at and manipulate our data. Let’s call the following packages: library(psych) library(tidyverse) library(forcats)

Note: If you don’t have the above packages installed yet, you can get them by running the following code: install.packages("packageName") You’ll need to insert the name of the package where I’ve written “packageName” See? I told you I would walk you through this step by step.

Next, we’ll need to load some data and take a look at it.

For the purposes of this tutorial, I’ve created a simple dataset below in R, which you can copy and paste into your console so that you can follow along with the tutorial.
This fictitional dataset includes information about 25 adults receiving services in a therapy clinic. The info included is their height, weight, gender, and diagnosis.

set.seed(1)
height <- sample(65:73, 25, replace=TRUE)
weight <- c(200, 185, 160, 220, 134, 175, 172, 155, 190, 122, 
            NA, 207, 156, 187, 137, 250, 175, 202, 214, 145, 
            177, 180, 144, 209, 18)
gender <- c("male", "female", "female", "male", "female", 
            "male", "female", "female", "male", "female", 
            "male", "male", "female", "male", "female", 
            "male", "female", "male", "male", "female", 
            "male", "female", "female", "male", "male")
diagnosis <- c("CVA", "TBI", "hip fracture", "NA", "Multiple Sclerosis", 
               "CVA", "TBI", "hip fracture", "knee replacement", "Multiple Sclerosis", 
               "CVA", "TBI", "hip fracture", "knee replacement", "Multiple Sclerosis", 
               "CVA", "TBI", "hip fracture", "knee replacement", "Multiple Sclerosis", 
               "CVA", "TBI", "hip fracture", "knee replacement", "Multiple Sclerosis")

clinic.data <- data.frame(height, weight, gender, diagnosis, stringsAsFactors = FALSE)

First, let’s get an idea about what the structure of our data looks like. To do that, let’s use the str function:

str(clinic.data)
## 'data.frame':	25 obs. of  4 variables:
##  $ height   : int  73 68 71 65 66 71 66 67 65 69 ...
##  $ weight   : num  200 185 160 220 134 175 172 155 190 122 ...
##  $ gender   : chr  "male" "female" "female" "male" ...
##  $ diagnosis: chr  "CVA" "TBI" "hip fracture" "NA" ...

This output is telling us that we have 25 observations of 4 variables.

  • The variable height is an integer.
  • The vairable weight is numeric.
  • The variables gender and diagnosis are character variables.

Converting variables to numeric, factors, etc

In order to have a better look at our data (and to eventually run our statistical analyses), it would be best if we first convert these variables to numeric and factor variables.
To do that, we’re going to mutate these variables using the as_factor and as.numeric functions in a tunnel. Don’t be scared. Here we go:

clinic.data <- clinic.data %>%
  mutate(gender = as_factor(gender),
         diagnosis = as_factor(diagnosis), 
         weight = as.numeric(weight))

Checking data for range, NAs, extreme values, data entry error

Great!
Now let’s take a look at our data in a more organized way to really see what’s going on in there.
There are lots of different ways to do this in R. Here, I”m going to use summary and describe.

summary(clinic.data)
##      height       weight         gender                diagnosis
##  Min.   :65   Min.   : 18.0   male  :13   CVA               :5  
##  1st Qu.:67   1st Qu.:152.5   female:12   TBI               :5  
##  Median :69   Median :176.0               hip fracture      :5  
##  Mean   :69   Mean   :171.4               NA                :1  
##  3rd Qu.:71   3rd Qu.:200.5               Multiple Sclerosis:5  
##  Max.   :73   Max.   :250.0               knee replacement  :4  
##               NA's   :1
describe(clinic.data)
##            vars  n   mean    sd median trimmed   mad min max range  skew
## height        1 25  69.00  2.72     69   69.00  2.97  65  73     8  0.11
## weight        2 24 171.42 44.88    176  175.20 37.06  18 250   232 -1.40
## gender*       3 25   1.48  0.51      1    1.48  0.00   1   2     1  0.08
## diagnosis*    4 25   3.32  1.82      3    3.29  2.97   1   6     5  0.18
##            kurtosis   se
## height        -1.27 0.54
## weight         3.33 9.16
## gender*       -2.07 0.10
## diagnosis*    -1.52 0.36

You can see here that the summary function takes a look at what kind of data you have input (numeric vs. factor) and gives you relevant output. The numeric data output contains minimal values, quartiles, median, mean, maximal values, and how many NAs there are (if applicable). The factor data output gives you a table with the name of all the factors included, and a count of how many times they appear in your dataset.
The describe function gives you all of the same numeric information that summary did, plus even more: an n, standard deviation, trimmed median, range, skew, kurtosis and standard error. The function is not so useful for factors, which it more or less treats like numeric values. There are other things you can do with describe, such as use describe.by to sort your numeric data by factor data, but that’s a little outside of the realm of this tutorial, so we won’t go much further into that.

This is output really useful stuff. We can use it to check that our range is as expected, look for data entry errors, missing data, and extreme values. Let’s focus now on the summary output.

  • Right off the bat, height looks pretty good. A range of 65 - 73 inches seems pretty standard for adults, and we don’t have any NA’s. Great! Moving on.
  • Weight is a little more problematic. We’ve already established that these are adults, so the minimum value of 18 for weight doesn’t make any sense. We also have one NA.
  • Gender looks fine, so we’ll move on to diagnosis.
  • Here we can see that there are 5 diagnoses that are fairly evenly distributed in this dataset, and one NA.

Now that we’ve identified some problems, (a data entry error of 18 lbs, an NA for weight, an NA for diagnosis), we have to decide what to do with these errors.

Re-coding

Let’s imagine a situation where for sure you knew that you absolutely had 5 people with each diagnosis in your dataset. You are therefore certain that the NA from diagnosis was supposed to be coded as “knee replacement”.
There are a couple of ways to do this. One way is to pull up the data and look for it. We can do that easily here, because there are only 25 people.

clinic.data$diagnosis
##  [1] CVA                TBI                hip fracture       NA                
##  [5] Multiple Sclerosis CVA                TBI                hip fracture      
##  [9] knee replacement   Multiple Sclerosis CVA                TBI               
## [13] hip fracture       knee replacement   Multiple Sclerosis CVA               
## [17] TBI                hip fracture       knee replacement   Multiple Sclerosis
## [21] CVA                TBI                hip fracture       knee replacement  
## [25] Multiple Sclerosis
## Levels: CVA TBI hip fracture NA Multiple Sclerosis knee replacement

In this output we can see that the NA is the fourth entry. If you wanted to fix that manually, you could use the code: clinic.data$diagnosis[4] = "knee replacement", which re-codes the fourth diagnosis entry as “knee replacement”.
This would work for our small dataset here pretty well, but in the real world, when you have way more than just 25 subjects you are analyzing, this is going to get tedious and leave you prone to errors.

Another way to find your NA is to ask R where it is using the which function:

which(clinic.data$diagnosis == "NA")
## [1] 4

We got the same answer!
Okay, now that we’ve found it, let’s go in and fix it. When we do, we’re going to save it to a new dataset that we’ll call clinic.data.clean. This is good practice, so that if we ever need to go back and look at the original data or use it for something else, it still exists in its untouched form.

clinic.clean <- clinic.data   #this creates a new copy of the dataset called "clinic.clean", which we will manipulate

clinic.clean$diagnosis[4] = "knee replacement"

summary(clinic.clean)  #let's check our work
##      height       weight         gender                diagnosis
##  Min.   :65   Min.   : 18.0   male  :13   CVA               :5  
##  1st Qu.:67   1st Qu.:152.5   female:12   TBI               :5  
##  Median :69   Median :176.0               hip fracture      :5  
##  Mean   :69   Mean   :171.4               NA                :0  
##  3rd Qu.:71   3rd Qu.:200.5               Multiple Sclerosis:5  
##  Max.   :73   Max.   :250.0               knee replacement  :5  
##               NA's   :1

Looks good!

Excluding NAs

Now, let’s tackle that person who supposedly weighs 18 pounds.
There are many ways to handle this missing data. For our purposes here, let’s assume that you have no idea what that person’s true weight is, and that in your statistical analysis the weight is an important variable. We’re going to have to drop that person from our analysis, because they unfortunately don’t have the critical information that we need.
One way to get rid of all of your subjects with missing data is to use the function na.omit. I’m not going to run that here, but basically, you could execute the following code: clinic.clean <- clinic.clean %>% na.omit() and that would drop all cases with missing data.
Instead, we’re going to use the subset function. It’s a useful little function that can be used for more than just this purpose. You could, say, use it to create a subset of all cases who weight more than 200 lbs if you wanted, or a subset of only people with TBI. Here, we’re going to use it do drop NA cases in the weight column.

clinic.clean <- subset(clinic.clean, weight != "NA")

summary(clinic.clean)
##      height          weight         gender                diagnosis
##  Min.   :65.00   Min.   : 18.0   male  :12   CVA               :4  
##  1st Qu.:66.75   1st Qu.:152.5   female:12   TBI               :5  
##  Median :69.00   Median :176.0               hip fracture      :5  
##  Mean   :69.00   Mean   :171.4               NA                :0  
##  3rd Qu.:71.00   3rd Qu.:200.5               Multiple Sclerosis:5  
##  Max.   :73.00   Max.   :250.0               knee replacement  :5

Take a look in the top right corner of your screen, at the “Environment” window. You should have just seen clinic.clean drop to 24 observations of 4 variables.

Re-labeling Factors

Let’s imagine that for some reason I don’t like the labels that came in automatically when I used as_factor earlier and I want to call them something else. I’m going to use the levels function to replace “Multiple Sclerosis” with “MS”.

levels(clinic.clean$diagnosis)[levels(clinic.clean$diagnosis)=="Multiple Sclerosis"] <- "MS"

summary(clinic.clean)
##      height          weight         gender              diagnosis
##  Min.   :65.00   Min.   : 18.0   male  :12   CVA             :4  
##  1st Qu.:66.75   1st Qu.:152.5   female:12   TBI             :5  
##  Median :69.00   Median :176.0               hip fracture    :5  
##  Mean   :69.00   Mean   :171.4               NA              :0  
##  3rd Qu.:71.00   3rd Qu.:200.5               MS              :5  
##  Max.   :73.00   Max.   :250.0               knee replacement:5

Collapsing Factors

Let’s pretend that for the purposes of your analyses, you don’t really care very much about specific diagnoses, and instead want to compare two general groups of diagnoses: those with orthopedic diagnoses and those with neurologic diagnosis. To do this, we’re going to use the forcats package to collapse some of these categories.

clinic.clean$diagnosis <- fct_collapse(clinic.clean$diagnosis,
  ortho = c("hip fracture", "knee replacement"),
  neuro = c("CVA", "TBI", "MS"))

fct_count(clinic.clean$diagnosis)
## # A tibble: 3 x 2
##   f         n
##   <fct> <int>
## 1 neuro    14
## 2 ortho    10
## 3 NA        0

Now we have just two categories: ortho and neuro. You can see that I used a new funciton to visualize this: fct_count from the forcats package.

Re-leveling Factors

Lastly, we need to change the order of our levels, because I want the ortho group to be my reference group for my analysis. I’m again going to use the forcats package for this one. There are lots of complicated things you can do to re-level, and if you want to learn more you should certainly take a look at the forcats resources available on the web. Here, we’re going to keep it simple, using the fct_relevel function, and we’ll tell it which category we want to be first. Then we’ll check the order using fct_count.

clinic.clean$diagnosis <- fct_relevel(clinic.clean$diagnosis, "ortho")

fct_count(clinic.clean$diagnosis)
## # A tibble: 3 x 2
##   f         n
##   <fct> <int>
## 1 ortho    10
## 2 neuro    14
## 3 NA        0

We did it! Pat yourself on the back, that is one squeaky clean dataset!


This tutorial wa created by Kelly Spellmanat MGH Institute of Health Professions. It is released under a Creative Commons Attribution-ShareAlike 4.0 Unported https://creativecommons.org/licenses/by-sa/4.0/.
Avatar
Annie B. Fox
Assistant Professor of Quantitative Methods

My research interests include mental illness stigma, women’s health, and quantitative methods