When you started to learn R, one of the first things you probably figured out was how to get your data into R. You had to use some method to read your lovely Excel datasets into R so you could achieve what you wanted. Maybe you used read.csv or read.table, or maybe Hadley Wickham’s more recent package, readr.

This tutorial goes ‘back to basics’ a little and shows how to ‘create’ your own data.

Why would you want to create your own data?

You probably already have your data in a nice fancy Excel document and, as such, there may seem to be no good reason to ‘make up’ data. There are, however, a number of situations where this comes in handy:

  1. To work on small-scale problems before using all your data
  2. You might want to run some sort of simulation study
  3. You might want to simulate data you expect to collect, to ensure you have the right methods listed in your proposal
  4. You might get stuck and want to ask for help

The focus here will be on the last point. It is pretty inevitable you’ll get stuck with some sort of coding problem. We all do! When this happens, and you want to send some code to a friend, colleague, or even ask a question on the internet, you need to provide a reproducible example!

By doing this, you avoid having to send your code file and entire data set as attachments for a friend or colleague to look at. A reproducible example means that someone can quickly and efficiently copy and paste just the code that you send and reproduce the error or issue you are having.

It’s important to note here that in order for someone to help you, they don’t need the whole dataset. They only need to be able to see the problem and the question that goes with it!

 

This tutorial is intended to help people who are relatively new to R create a reproducible example and also make fake data for other purposes.

So, how do you create a ‘fake’ dataset?

In the simplest case, you can create multiple vectors and then combine them into a data.frame.

factor <- c("a", "b", "c", "d", "e")

value <- c(1, 2, 3, 4, 5)

df <- data.frame(factor, value)

df
##   factor value
## 1      a     1
## 2      b     2
## 3      c     3
## 4      d     4
## 5      e     5

Alternatively, you can do it all in one step (noting that you now use ‘=’, not ‘<-’, when specifying the vectors inside a data.frame call):

df <- data.frame(
  factor = c("a", "b", "c", "d", "e"),
  value = c(1, 2, 3, 4, 5)
)

df
##   factor value
## 1      a     1
## 2      b     2
## 3      c     3
## 4      d     4
## 5      e     5
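The printed data frame does not show how each column is stored, which often matters when someone is trying to reproduce a problem. The sketch below uses str to check. Setting stringsAsFactors explicitly is a choice made for this illustration: its default was TRUE before R 4.0.0 and is FALSE from 4.0.0 onwards, so being explicit keeps the example the same for everyone.

```r
# Build the same small data frame, but state explicitly how the
# character column should be stored (a choice for this illustration).
df <- data.frame(
  factor = c("a", "b", "c", "d", "e"),
  value = c(1, 2, 3, 4, 5),
  stringsAsFactors = TRUE
)

# str() shows the type of each column, which print() does not
str(df)
```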

This technique works in a variety of situations, but it can be too simple at times. For instance, if you need many vectors, typing each one out and then combining them gets tedious, and it becomes harder still if you would like to replicate a complicated experimental design (like a hierarchical blocked design).

A nice shortcut is to use sample, rnorm or runif to create some data.

sample draws a random sample of a specified size from a set of values, with or without replacement. For example, sample(10) draws from the numbers 1 to 10 without replacement, which gives those ten numbers in random order:

data <- sample(10)

data
##  [1]  6 10  5  8  3  9  1  2  4  7

Or, 10 random numbers with replacement:

data <- sample(10, replace=TRUE)

data
##  [1]  7  3  7  3  9  2  5 10  7  8

You can create a vector and then sample from it:

factor <- c("a", "b", "c", "d", "e")

data <- sample(factor, replace=TRUE)

data
## [1] "d" "b" "a" "d" "a"

You can also sample n number of times, making it a very convenient function. For example, draw from the four suits of cards, 100 times:

suits <- c("Hearts", "Spades", "Clubs", "Diamonds")

cards <- sample(suits, size=100, replace=TRUE)
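A quick way to check a draw like this is to tally it with table. The sketch below is only an illustration: the seed value is arbitrary, and the exact counts you see will depend on it.

```r
# Fix the random number generator so the draw can be repeated
# (the seed value 1 is arbitrary)
set.seed(1)

suits <- c("Hearts", "Spades", "Clubs", "Diamonds")
cards <- sample(suits, size = 100, replace = TRUE)

# Count how many times each suit was drawn
tally <- table(cards)
tally
```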

Note that sample returns a vector, not a data frame, in the examples above, which may or may not matter. Use as.data.frame to convert it, if necessary.

data <- as.data.frame(sample(10))
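Because these draws are random, the person you send your code to will get different numbers than you did unless the random number generator is seeded. A minimal sketch using set.seed follows; the seed value 42 is arbitrary, and first and second are names made up for this example.

```r
# Seeding the generator makes a random draw repeatable
set.seed(42)
first <- sample(10)

# Re-using the same seed gives exactly the same draw
set.seed(42)
second <- sample(10)

identical(first, second)
```

Putting a set.seed call at the top of a reproducible example means everyone who runs it sees the same ‘random’ data.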

Creating data from a known distribution

rnorm creates data from a normal distribution

data <- rnorm(100)

By default, rnorm draws from a population with mean = 0 and sd = 1. We can change either of these to get a sample from a normal distribution with a specified mean and standard deviation. For example, to get 100 numbers from a normal distribution with a mean of 25 and s.d. of 1.5:

data <- rnorm(100, mean=25, sd=1.5)
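It is worth checking that the simulated values look the way you intended. In the sketch below (the seed and the name heights are assumptions for illustration), the sample mean and standard deviation should land close to, but not exactly on, the values requested.

```r
set.seed(1)  # arbitrary seed, so the check can be repeated

heights <- rnorm(100, mean = 25, sd = 1.5)

# Summary statistics of the sample: near 25 and 1.5, but not exact,
# because this is a random sample of only 100 values
mean(heights)
sd(heights)
```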

 

runif creates data from a uniform distribution

data <- runif(100)

head(data)
## [1] 0.868665082 0.007564901 0.838849667 0.178191374 0.565068135 0.314484094

As with rnorm, there are defaults: runif draws from a uniform distribution with min=0 and max=1. We can change these to whatever we want. For example, to get 100 random numbers between -10 and 5:

data <- runif(100, min=-10, max =5)
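A simple sanity check for a uniform draw is to look at its range. In this sketch the name x and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed

x <- runif(100, min = -10, max = 5)

# Smallest and largest values drawn; both should fall between -10 and 5
range(x)
```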

Luckily, R has most of the commonly used distributions built in to draw from. This is really helpful if you want to think through what your data might look like before you start collecting it! A comprehensive list is here.
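As an illustration of two other built-in generators, rpois draws counts from a Poisson distribution and rbinom draws from a binomial distribution. The variable names and parameter values below are invented for the example and are not taken from any real study.

```r
set.seed(1)  # arbitrary seed

# 20 counts (e.g. individuals seen per survey) with an average of 3
counts <- rpois(20, lambda = 3)

# 20 yes/no outcomes (1 = present, 0 = absent) with a 70% chance of 1
present <- rbinom(20, size = 1, prob = 0.7)

head(counts)
head(present)
```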

What if your problem is a little more complicated?

For instance, what if your data come from four replicates from each of five sites and you want to recreate a vector of the repeating factor values?

You could do this:

site <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d", "d", "e", "e", "e", "e")

A much better approach is to take advantage of the rep function, which replicates the values in a vector or list. The same outcome as above is achieved with this:

site <- c(rep("a", 4), rep("b", 4), rep("c", 4), rep("d", 4), rep("e", 4))

Alternatively:

site <- rep(c("a", "b", "c", "d", "e"), each=4)

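Or, you can replicate until the vector reaches a certain length. Note that this sets the total length, not the number of replicates per site: the call below cycles through the four sites until the vector has 50 elements, so sites a and b appear 13 times and sites c and d appear 12 times. (To get 50 replicates from each site, use each=50 instead.)

site <- rep(c("a", "b", "c", "d"), length.out=50)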

This becomes increasingly valuable as you increase the number of repetitions and/or factors to include!
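The difference between the each and length.out arguments of rep is easy to mix up, so here is a small sketch that tallies the result of both. The object names are made up for illustration.

```r
sites <- c("a", "b", "c", "d")

# each = 50 repeats every site 50 times: 200 values in total
by_site <- rep(sites, each = 50)

# length.out = 50 cycles a, b, c, d, a, b, ... and stops at 50 values,
# so the sites are not equally represented (50 is not a multiple of 4)
cycled <- rep(sites, length.out = 50)

table(by_site)
table(cycled)
```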

Creating all combinations of multiple categorical factors

expand.grid is very useful for creating a data frame that has every combination of all levels from multiple factors. For example, if we had sampled four sites from each of four regions in each of three states, we could use it to create the full study design:

study <- expand.grid(state=c("NSW", "VIC", "QLD"),
                     region=c("N", "E", "S", "W"),
                     site=c("a", "b", "c", "d"))

We could then add some data to simulate species richness at each site:

study$richness <- rnorm(nrow(study), mean=15, sd=3)

Here, nrow(study) tells rnorm how many values to generate, so that there is exactly one value for each row of the data frame (in this case 48, the number of rows in the study design).
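To confirm that the design came out as intended, you can count the rows and cross-tabulate two of the factors. This sketch repeats the design so that it runs on its own; the seed is arbitrary.

```r
study <- expand.grid(state = c("NSW", "VIC", "QLD"),
                     region = c("N", "E", "S", "W"),
                     site = c("a", "b", "c", "d"))

set.seed(123)  # arbitrary seed
study$richness <- rnorm(nrow(study), mean = 15, sd = 3)

# 3 states x 4 regions x 4 sites
nrow(study)

# Every state-by-region combination should contain 4 sites
table(study$state, study$region)
```

Note that rnorm returns decimal values, and can in principle return negative ones. If whole-number counts matter for your question, a count distribution such as rpois may be a closer stand-in.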

What are some other options to make a reproducible example?

You could use a built-in dataset that ships with base R to reproduce your problem. You can quickly see the list of built-in datasets:

library(help="datasets")

Then load a dataset using:

data(iris)
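Even with a built-in dataset, it usually helps to cut it down to the few rows that show your problem. As a sketch, the rows picked below are one from each species in iris; small is a made-up name.

```r
data(iris)

# Rows 1, 51 and 101 are the first row of each of the three species
small <- iris[c(1, 51, 101), ]
small

# Check the column types before describing your problem
str(small)
```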

 

What if you NEED to use your own data?

Maybe you have ultra-complicated data and can’t figure out how to reproduce the problem using fake data. Well, that’s what dput is for. dput writes a text representation of an R object, either to the console or to a file, and that text can then be run as code to recreate the object.

Let’s give an example. Say you are working with the quakes dataset.

data(quakes)

head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

If it was just the structure of the data that we wanted to reproduce, then we could just use head combined with dput.

dput(head(quakes))
## structure(list(lat = c(-20.42, -20.62, -26, -17.97, -20.42, -19.68
## ), long = c(181.62, 181.03, 184.1, 181.66, 181.96, 184.31), depth = c(562L, 
## 650L, 42L, 626L, 649L, 195L), mag = c(4.8, 4.2, 5.4, 4.1, 4, 
## 4), stations = c(41L, 15L, 43L, 19L, 11L, 12L)), .Names = c("lat", 
## "long", "depth", "mag", "stations"), row.names = c(NA, 6L), class = "data.frame")

We can then copy and paste this output into an email, etc. However, be sure to assign it to a name first, so that whoever runs the code ends up with an object they can work with!

reproduced_df <- structure(list(lat = c(-20.42, -20.62, -26, -17.97, -20.42, -19.68
), long = c(181.62, 181.03, 184.1, 181.66, 181.96, 184.31), depth = c(562L, 
650L, 42L, 626L, 649L, 195L), mag = c(4.8, 4.2, 5.4, 4.1, 4, 
4), stations = c(41L, 15L, 43L, 19L, 11L, 12L)), .Names = c("lat", 
"long", "depth", "mag", "stations"), row.names = c(NA, 6L), class = "data.frame")

What if it’s only certain rows we are having trouble with?

tmp <- quakes[30:40,]
dput(tmp)
## structure(list(lat = c(-19.84, -22.58, -16.32, -15.55, -23.55, 
## -16.3, -25.82, -18.73, -17.64, -17.66, -18.82), long = c(182.37, 
## 179.24, 166.74, 185.05, 180.8, 186, 179.33, 169.23, 181.28, 181.4, 
## 169.33), depth = c(328L, 553L, 50L, 292L, 349L, 48L, 600L, 206L, 
## 574L, 585L, 230L), mag = c(4.4, 4.6, 4.7, 4.8, 4, 4.5, 4.3, 4.5, 
## 4.6, 4.1, 4.4), stations = c(17L, 21L, 30L, 42L, 10L, 10L, 13L, 
## 17L, 17L, 17L, 11L)), .Names = c("lat", "long", "depth", "mag", 
## "stations"), row.names = 30:40, class = "data.frame")

This is a really great way to send code to someone to ask for help!
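Before sending dput output to someone, it can be worth checking that it really rebuilds the object you started with. One way to do that, sketched here under the assumption that you are happy to write to a temporary file, is to save the dput output and read it back with dget. The names original, tmp_file and restored are made up for this example.

```r
data(quakes)
original <- head(quakes)

# Write the dput output to a temporary file, then read it back in
tmp_file <- tempfile(fileext = ".R")
dput(original, file = tmp_file)
restored <- dget(tmp_file)

# Should be TRUE if the round trip preserved the data
all.equal(original, restored)
```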

What about really complex problems?

This tutorial is mainly intended for new R users, and the tips and tricks above will probably be enough for other people to help you most of the time. When they aren’t, it might be necessary to give some extra information. sessionInfo() gives a summary of the R version currently running, the operating system, and which packages are loaded:

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.1.0 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
##  [5] htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.12    stringi_1.1.5  
##  [9] rmarkdown_1.6   knitr_1.16      stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10.1
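If the full sessionInfo() output feels like too much for a short question, the R version and the version of one particular package can be reported on their own. In this sketch the stats package is used only because it is always installed; swap in whichever package your problem involves.

```r
# A one-line description of the R version in use
R.version.string

# The installed version of a single package
packageVersion("stats")
```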

 

Other important notes

  1. Be sure to clearly define what you are after. Do you have a purely statistics question, or do you have a coding question?
  2. It is a good idea to include the library() calls for any packages your code needs in order to reproduce the problem
  3. You should always note what you have already tried, including the code, and any reference sites you have used.
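Putting the pieces together, a complete reproducible example might be laid out like the sketch below. Everything in it (the seed, the object names, the fake richness values and the question) is invented for illustration.

```r
# 1. Packages: none needed beyond base R for this example

# 2. A small fake data set, seeded so everyone gets the same values
set.seed(2017)
df <- data.frame(
  site = rep(c("a", "b"), each = 3),
  richness = rpois(6, lambda = 15)
)

# 3. The code you are asking about
site_means <- tapply(df$richness, df$site, mean)
site_means

# 4. The question: say what you expected and what you got instead,
#    and mention what you have already tried
```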

Concluding remarks

  1. There are a number of reasons we may want to use fake data
  2. It is pretty easy to create fake data
  3. The simpler the reproducible example you send to someone, the more likely they are to help you, and the more efficiently they can do so
  4. A lot of the time, by simplifying the problem, you may even solve it yourself!
  5. Learn how to use dput, but don’t forget to name the object when copying the code from the R console.

Where can you get further help?

All of this information isn’t really useful unless you have someone to answer your question after you’ve made your nice reproducible example. One alternative to asking for help from colleagues and friends is to use online websites, such as Stack Overflow. The tips in this tutorial will help you to ask a question that is not removed or banned. Also, when you ask a question be sure to show you have done previous research.

There is more help on the web for making reproducible examples. For instance, see here, here, here, or here.

Author: Corey T. Callaghan
Last updated:

## [1] "Fri Sep 15 16:04:16 2017"