Before we can run any data analyses or create figures in R, we need to import that data into R. Preparing and cleaning the data for analyses is essential and often more time consuming than the statistical analyses themselves. It is unusual for raw data to be in the correct format and without errors. Data cleaning is the process of identifying and fixing any problems so data can be analysed easily.
Importing data is a major challenge for beginners. This module will give instructions for one of most common ways we import data, and some of challenges you may face and how to overcome them. Some of these issues can be avoided by following good data entry and management practices (first read Data entry to get help with this.
Importing data as a data frame
We recommend entering data into a spreadsheet program (e.g., Excel), and saving that data as a comma separated values (.csv) file. These are easily read into R and shared among users with different spreadsheet programs.
In this module, we are going to use a sample data set on feeding behaviour of a marine snail to demonstrate how to import the data, and the most common issues that arise with data import and cleaning in R.
Second, import the data file to a data frame called Snail_feeding with the
Snail_feeding <- read.csv("Snail_feeding.csv")
Cleaning data frames
When you use
read.csv, R uses several default arguments which can be altered to ensure your data are imported with fewer errors. Have a look at the help file within R (by typing
?read.csv to familiarise yourself with some of these arguments.
The ones that are particularly useful are:
header = T – specifying this at T (i.e., TRUE) will ensure that the values in the first row of your data are treated as variable names.
strip.white = T – this will remove trailing and leading white space from factors. This is a common typing error made during data entry (i.e., “males” vs. “males_”). If we set this as TRUE, they both become “males”, otherwise R will think there are two different levels.
na.strings = "" – this will ensure that empty cells are replaced by NA (the way R records missing data). Annoyingly, R imports missing values within characters/factors types as a value “”. Using
na.strings = "" will insert NAs instead. In addition, if you have coded missing values as something other then blank space you can define that missing value using this argument (i.e
na.strings = c("", "-", "*")).
Putting all these together in the
read.csv function will give us a cleaner data frame.
Snail_feeding <- read.csv("Snail_feeding.csv", header = T, strip.white = T, na.strings = "")
Checking the data
If something is a character when it should be numeric you might see messages such as “‘x’ must be numeric” or “non-numeric argument to binary operator”. Likewise, if something is a factor when it should be character, some character operations might fail. To avoid some of these issues, check your data using
summary before analyses.
str enables you to check the structure of your data and that your variables are the correct type (i.e., numeric, characters, integers, or factors). See Data types and structure for explanations of these different types.
str(Snail_feeding) ## 'data.frame': 768 obs. of 12 variables: ## $ Snail.ID: int 1 1 1 1 1 1 1 1 1 1 ... ## $ Sex : Factor w/ 5 levels "female","female s",..: 3 3 5 3 3 3 3 3 3 3 ... ## $ Size : Factor w/ 2 levels "large","small": 2 2 2 2 2 2 2 2 2 2 ... ## $ Feeding : logi FALSE FALSE FALSE FALSE FALSE TRUE ... ## $ Distance: Factor w/ 104 levels "0","0,356452",..: 19 91 24 15 38 88 73 64 89 63 ... ## $ Depth : num 1.66 1.26 1.43 1.46 1.21 1.56 1.62 162 1.96 1.93 ... ## $ Temp : int 21 21 18 19 21 21 20 20 19 19 ... ## $ X : logi NA NA NA NA NA NA ... ## $ X.1 : logi NA NA NA NA NA NA ... ## $ X.2 : logi NA NA NA NA NA NA ... ## $ X.3 : logi NA NA NA NA NA NA ... ## $ X.4 : logi NA NA NA NA NA NA ...
summary allows you to look at basic statistics for each of your variables and can be used to identify any obvious typos (i.e., extreme maximums or minimums relative to the mean or median or extra groups within a categorical vector).
summary(Snail_feeding) ## Snail.ID Sex Size Feeding Distance ## Min. : 1.00 female :383 large:383 Mode :logical 0.48 : 15 ## 1st Qu.: 4.75 female s: 1 small:384 FALSE:502 0.13 : 12 ## Median : 8.50 male :381 NA's : 1 TRUE :266 0.18 : 12 ## Mean : 8.50 Male : 2 NA's :0 0.32 : 12 ## 3rd Qu.:12.25 males : 1 0.42 : 12 ## Max. :16.00 0.45 : 12 ## (Other):693 ## Depth Temp X X.1 ## Min. : 1.000 Min. :18.00 Mode:logical Mode:logical ## 1st Qu.: 1.260 1st Qu.:19.00 NA's:768 NA's:768 ## Median : 1.510 Median :19.00 ## Mean : 1.716 Mean :19.49 ## 3rd Qu.: 1.760 3rd Qu.:20.00 ## Max. :162.000 Max. :21.00 ## NA's :6 ## X.2 X.3 X.4 ## Mode:logical Mode:logical Mode:logical ## NA's:768 NA's:768 NA's:768 ## ## ## ## ##
Note that for the factor of sex, errors in data entry resulted in five levels (female, female s, male, Male and males) when there should be only two. See below for the fix.
Common problems when importing data
You don’t get the number of columns or rows you expect
If you see a bunch of extra columns (X, X.1,X.2,X.3 etc) or rows in your data frame filled with NAs its likely because a character (most likely white space(i.e. space bar or tab) was entered in cells beyond your actual data in Excel. This problem can be avoided during data entry by removing all the colors/formatting and emptying columns/rows except for those needed.
If the problem is still there we can tackle it in R, by indexing within square brackets
dataframe[row, column]. For example, the following code will replace the data frame in this example with one that only includes the first 7 columns and removes the 5 unwanted ones (X, X.1,X.2,X.3 etc).
Snail_feeding <- Snail_feeding[,1:7]
If you had very many columns and didn’t want to count them, you could use logic (&, or, ==, >, < , !=) to remove the unwanted rows and columns. We don’t want to remove all NAs, only additional rows and columns that are entirely filled with NAs. These have the property of the number of NAs being equal to the number of rows in a given column (or number of columns in a given row).
We do this by selecting columns from our data frame where the column sum,
colSums, of all the NAs,
>is.na, in our data frame are not equal,
!= to the number of rows,
nrow, of our data frame. Using the same logic you can do this for rows as well.
# Selects rows that are not entirely NA Snail_feeding <- Snail_feeding[, colSums(is.na(Snail_feeding)) != nrow(Snail_feeding)] # Select columns that are not entirely NA Snail_feeding <- Snail_feeding[rowSums(is.na(Snail_feeding)) != ncol(Snail_feeding), ]
The columns are not the type of data you expect
Characters as factors When data frames are imported into R, characters are automatically coerced as factors. For statistical work this makes a lot of sense; we are more likely to want to work over factors (defined earlier) than over characters. Factors are fantastic when one is doing statistical analysis and actually exploring the data. However, prior to that when one is reading, cleaning, troubleshooting, merging and generally manipulating the data, factors are a total pain.
You can use the argument
stringsAsFactors=FALSE when importing the data to avoid automatic coercion to factors or you can use
as.character to convert individual vectors to characters.
Tip: Use factors as late as possible in the analysis process. Set
stringsAsFactors=FALSE when importing your data, and when you need factors someplace within your script, then coerce the data to a factor.
Factors as integers In this data frame, Snail ID is an integer where it should be a factor or character. This is because in the original data sheet Snail ID was coded using numbers. This is a common problem and is easy to fix. Simply use
as.character(), and then
class() to check our coercion worked. Remember to use
$ to access the vector from within the data frame.
Snail_feeding$Snail <- as.factor(Snail_feeding$Snail) class(Snail_feeding$Snail) ##  "factor"
Numerics as characters (factors) Due to the automatic coercion in R, any non-numeric digits within a numerical variable will result in the whole variable being coerced to a character (R imports as factors). In this example, the variable Distance has been imported as a factor with 768 levels, when we would like it to be a numerical variable. This indicates a typo somewhere within the distance vector. Such typos are common (e.g., accidentally using comma instead of full stop when entering decimals), but a little trickier to solve.
With a small data set, it is quickest to go back to the original data and find the error. Once fixed, import again and the variable should now be numeric.
With a larger data set, we have the problem of needing to find the typos. Unfortunately, we can’t just convert our data straight into numeric from a vector of type Factor. This is because of the levels attribute (see Data types and structure). If you try and convert a factor directly to a numerical variable, the values become a number that corresponds to the number of levels (1:n, ordered alphabetically or ascending) rather than the actual value.
Thus, we must first convert to a character and then to a numeric. When converting to a numeric, you will get a warning message, NAs introduced by coercion. This is because the non numeric values (i.e., our typos) cannot be coerced to a numeric and get replaced by NA.
We can use this to our advantage,
is.na in combination with
which will identify where the typos or missing values are within the vector.
Snail_feeding$Distance <- as.character(Snail_feeding$Distance) Snail_feeding$Distance <- as.numeric(Snail_feeding$Distance) ## Warning: NAs introduced by coercion which(is.na(Snail_feeding$Distance)) ##  682 755
This tells us that something weird happened in rows 682 and 755. Now we have identified where the problem is, it is simple to replace values within our data frame with correct values (go back to your original datasheet). You could either fix the error in the data set and import again, or replace values in R by indexing within square bracketsand assigning,
<-, the new value. Use
which(is.na()) to check if your fix worked.
Snail_feeding[682,"Distance"] <- 0.356452 Snail_feeding[755,"Distance"]<- 0.42336 which(is.na(Snail_feeding$Distance)) ## integer(0)
You have more variable levels then expected
One of the most important steps in any data analyses or processing task is to verify that your data values are correct. For example a variable called “Sex” would be expected to have only two levels. However in our data frame it has five levels (see
You can check the levels of a factor, or unique character values with
levels (for factors only), or
unique (for characters and factors).
levels(Snail_feeding$Sex) ##  "female" "female s" "male" "Male" "males" unique (Snail_feeding$Sex) ##  male males Male female female s ## Levels: female female s male Male males
There are several typos which we can correct with
which and the logical operators
== (equals) and
| (or) to identify and replace typos.
To replace any values that are are “males” or “Male” with “male”, we would use:
Snail_feeding$Sex[which(Snail_feeding$Sex == "males" | Snail_feeding$Sex == "Male")] <- "male"
To replace any values that are “female s” with “female”, we would use:
Snail_feeding$Sex[which(Snail_feeding$Sex == "female s")] <- "female"
Check it worked using
unique, but also look what happens when you check
unique(Snail_feeding$Sex) ##  male female ## Levels: female female s male Male males levels(Snail_feeding$Sex) ##  "female" "female s" "male" "Male" "males"
When we use
unique to check our categories there are just male and female, however when we look at levels we still have the five different levels including the typos. This is because of the behavior of factors. Once the levels have been defined, they will still be there regardless if they are included in any samples. As our extra levels were typos not true levels, we should remove these from the attributes.
factor will remove extra levels from a vector.
Snail_feeding$Sex <- factor(Snail_feeding$Sex) levels(Snail_feeding$Sex) ##  "female" "male"
summary above is a useful tool to check for potential typos within our numeric variables. Compare the Max or Min against the Median of each numeric variable. If either value is an order of magnitude greater or less then your median it could be a typo. For example look at the max depth, this looks like the decimal point has been forgotten.
Again we, can use logical operators and indexing to identify potential numeric typos. Given all the values of our depth variable appear to be between 1 and 2 (based on inter quartile ranges from summary), we will look for rows with a depth greater then 2.
Snail_feeding[which(Snail_feeding$Depth > 2), ] ## Snail.ID Sex Size Feeding Distance Depth Temp Snail ## 8 1 male small TRUE 0.6 162 20 1
There is only 1 row. After confirming with our original data that this is in fact a typo we will replace it with the real value. Selecting the row and column we replace the value 162 with 1.62.
Snail_feeding[which(Snail_feeding$Depth > 2),6] <- 1.62
Why do this in R ?
You might be wondering why go through the trouble of fixing this in R? Why not just go to the .csv file and fix all the problems directly. It is usually OK to do this if you know the problem is a typo etc. However to maintain data integrity it is important to have a record of every change that has been made to your original data set. By doing all the manipulations and fixes in R, you are also keeping a record of all the changes that are occurring to your data.
Also, you may want to start to explore data before you have collected the entire data set. If you set up a script that checks all of these things before making graphs and running analyses, it is quick to run the script again on the full data once you have it all. The script acts to remind you of all the things to check and is better at picking up typos than you are in a huge spreadsheet.
The final check
Finally, once you have checked and fixed all the problems in your data re run
summary. Actually it is a good idea to run these functions regularly throughout data cleaning to keep tabs on any changes you make.
str(Snail_feeding) ## 'data.frame': 768 obs. of 8 variables: ## $ Snail.ID: int 1 1 1 1 1 1 1 1 1 1 ... ## $ Sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ... ## $ Size : Factor w/ 2 levels "large","small": 2 2 2 2 2 2 2 2 2 2 ... ## $ Feeding : logi FALSE FALSE FALSE FALSE FALSE TRUE ... ## $ Distance: num 0.17 0.87 0.22 0.13 0.36 0.84 0.69 0.6 0.85 0.59 ... ## $ Depth : num 1.66 1.26 1.43 1.46 1.21 1.56 1.62 1.62 1.96 1.93 ... ## $ Temp : int 21 21 18 19 21 21 20 20 19 19 ... ## $ Snail : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... summary(Snail_feeding) ## Snail.ID Sex Size Feeding Distance ## Min. : 1.00 female:384 large:383 Mode :logical Min. :0.0000 ## 1st Qu.: 4.75 male :384 small:384 FALSE:502 1st Qu.:0.2775 ## Median : 8.50 NA's : 1 TRUE :266 Median :0.5100 ## Mean : 8.50 NA's :0 Mean :0.5120 ## 3rd Qu.:12.25 3rd Qu.:0.7500 ## Max. :16.00 Max. :1.0000 ## ## Depth Temp Snail ## Min. :1.000 Min. :18.00 1 : 48 ## 1st Qu.:1.260 1st Qu.:19.00 2 : 48 ## Median :1.510 Median :19.00 3 : 48 ## Mean :1.507 Mean :19.49 4 : 48 ## 3rd Qu.:1.760 3rd Qu.:20.00 5 : 48 ## Max. :2.000 Max. :21.00 6 : 48 ## NA's :6 (Other):480
Author: Keryn Bain
##  "Mon May 16 10:05:13 2016"