9 Getting data in and out of R

Till now, we have created our own datasets or we have used sample datasets available in R. In practical usage, there will hardly be any case when we get the data imported in R by itself. So we have to import and load our data sets in R, for our analytics tasks. Even after completing the analytics, the summarised data, other reports, charts, etc. need to be exported. This chapter is intended to do all these tasks.

First section deals about functions related to reading external data i.e. importing objects into R. This is followed by another section dealing with writing data to external files i.e. exporting data out of R.

9.1 Importing external data in R - Base R methods

Base R has many functions which can fulfill nearly any of our jobs to import external data into R. However, there are packages which are customised to do certain tasks in easier way and thus, we will also learn two of the packages from tidyverse also.

There are some important functions in base R, which are used frequently to import external data into R environment. Let us discuss these one by one.

Reading tables through `read.table()` and/or `read.csv()`

Basically these two functions are most commonly used functions in R, to get tabular data out of flat files. The two functions namely read.table() and read.csv() are used respectively to read tabular data out of flat files having simple text format (.txt) and comma separated values (.csv) formats respectively. There are three more cousins to these functions.

read.csv2() to read csv files where ; is used as delimiter and , is used for decimals instead.
read.delim() to read delimited files where tab character has been used as delimiter and . as decimal.
read.delim2(), similarly to read delimited files where tab character has been used as delimiter and , as decimal.

Most important arguments to these functions are -

file the name of the file along with complete path as string⁹.
header a logical value indicating if first line of the file has to be read as header or not.
sep a string indicating a separator value to separate columns.
colClasses a character vector indicating type of the columns if these are to be read explicitly in these types/formats only.
skip an integer, indicating how many rows (from beginning) are to be skipped.

Readers may check results of ?read.table() to get a complete list of arguments of these functions.

Example: Let us try to download data related to World Happiness Report 201=21 which is available on data.world portal.

wh2021 <- read.csv("https://query.data.world/s/qbsbmxlfj54sl4mq3y6uxsr3pkhhmo")
# Check dimensions
dim(wh2021)
# Check column names
colnames(wh2021)

We can see that dataset named wh2021 having 20 columns and 149 rows is now available in our environment.

Tip: Use "clipboard" in file argument for pasting copied data into R. E.g. Copy a few rows and cells from excel spreadsheet and run this command read.delim("clipboard", header = FALSE).

Read data into a vector through `scan()` or `readline()`

As afore-mentioned functions read.table() et al are used in reading data into tables, we may also require reading data into a vector or simple list. The two functions scan() and readline() are used for these purposes.

The basic syntax of scan() is -

scan(file = "",
     what = double(),
     nmax = -1,
     ...
)

Where -

file is name of the file, or link
what is format/data type to be read. Default type is double
nmax is mux number of data values to be read, default is -1 which means all.

There are many other usefule arguments, for which please check results of ?scan in your console.

Example -

scan("http://mattmahoney.net/iq/digits.txt", nmax = 10)

This function is sometimes useful to read data from keyboard into a vector. Just use a blank string "" in file name. See this example

Function readline() on the other hand does similar job, but with a prompt. See this example

Reading text files through `readLines()`

Function readLines() is used to read text lines from a file (or connection). To see this in action, prepare a text file (say "txt.txt") and try reading it using readLines("txt.txt").

9.2 Exporting data out of R - Base R methods

Since the nature of most of the data anaytic jobs carried out in R, will be followed after reading the external data which will be followed by wrangling, transformation, modelling, etc., all in R. Exporting files will not be used as much as reading external data. Still, there will be times, when wrangled data tables need to be exported out of R. For each of the different use cases, the following functions will almost complete our export requirements.

Let’s learn these.

Writing tabular data through `write.table()` and/or `write.csv()`

Exporting data frames, whether after cleaning, wrangling, or transformation, etc., can be exported using these functions. Latter will be used to write data frames in csv formats specifically. The syntax is -

write.table(x, file = "", sep = " ", ...)
write.csv(x, file = "", ...)
write.csv2(x, file = "", ...)

where -

x is the data frame object to be exported
file is used to give file name (along with path)
sep is separator
... - there are many more arguments which are used to cutomised export needs. See ?write.table() for full details.

E.g. - The following command will export iris data frame as iris.csv file in the current working directory.

write.csv(iris, 'iris.csv')

Writing character data line by line to a file through `writeLines()`

Similar to readLines, function writeLines() will write the text data into a file with given file name. Type the following code in your console and check that a new file with name my_new_file.txt has been created with the given contents in your current working directory.

writeLines("Andrew 25
           Bob 45
           Charles 56", "my_new_file.txt")

Using `dput()` to get a code representation of R object

This function will output the code representation of the given R object. This function is particularly useful, when you are searching for help online and you need to give some sample data to reproduce the problem. E.g. on Stack Overflow when asking for a solution to a specific problem, reproducible data needs to be furnished. Please also refer to section 9.3.3 for more.

Example-1:

my_data <- data.frame(Name = c("Andrew", "Bob", "Charles"),
                      Age = c(25, 45, 56))
dput(my_data)

## structure(list(Name = c("Andrew", "Bob", "Charles"), Age = c(25, 
## 45, 56)), class = "data.frame", row.names = c(NA, -3L))

And while reproducing someone else’s dput-

now_mydata <- structure(list(Name = c("Andrew", "Bob", "Charles"), Age = c(25, 
45, 56)), class = "data.frame", row.names = c(NA, -3L))

now_mydata

##      Name Age
## 1  Andrew  25
## 2     Bob  45
## 3 Charles  56

9.3 Using external packages for reading/writing data

9.3.1 Package `readr`

The readr package¹⁰ is part of core tidyverse and is loaded directly when we load it through library(tidyverse). It provides a range of analogous functions for each of the reading functions in base R.

Base R	readr	Uses
read.table	read_table	Reading table
read.csv	read_csv	Reading CSV file with comma as sep
read.csv2	read_csv2	Reading CSV file with semi-colon as sep
read.delim	read_delim	Reading files with any delimiter
read.fwf	read_fwf	Reading fixed width files
read.tsv	read_tsv	Reading tab delimited file
--	write_delim	Writing files with any delimiter
write.csv	write_csv	Writing files with comma delimiter
write.csv2	write_csv2	Writing files with semi-colon delimiter
--	write_tsv	writing a tab delimited file

So a question may be asked here that what’s the difference between these two sets of functions. Firstly, readr alternatives are much faster than their base R counterparts. Secondly, it provides an informative problem report when parsing leads to unexpected results.

For these, check results of these examples-

read_csv(readr_example("chickens.csv"))

## Rows: 5 Columns: 4
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): chicken, sex, motto
## dbl (1): eggs_laid
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 5 × 4
##   chicken                 sex     eggs_laid motto                               
##   <chr>                   <chr>       <dbl> <chr>                               
## 1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
## 2 Chicken Little          hen             3 The sky is falling!                 
## 3 Ginger                  hen            12 Listen. We'll either die free chick…
## 4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
## 5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold.

Note that the column types, while parsing the data frame, have now been printed. These column types have been guessed by readr actually. If the column types are not what were actually required to be parsed, then argument col_types may be used. We can also use spec() function to retrieve the data types guessed by readr later-on, so these can be modified and used again in col_types argument. See this example

write.csv(iris, "iris.csv") #write a dummy data
spec(read_csv("iris.csv"))

## New names:
## Rows: 150 Columns: 6
## ── Column specification
## ────────────────────────────────────────────────────────────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Species dbl (5): ...1, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to
## quiet this message.
## • `` -> `...1`

## cols(
##   ...1 = col_double(),
##   Sepal.Length = col_double(),
##   Sepal.Width = col_double(),
##   Petal.Length = col_double(),
##   Petal.Width = col_double(),
##   Species = col_character()
## )

read_csv("iris.csv",
         col_select = 2:6,
         col_types = cols(
           Sepal.Length = col_double(),
           Sepal.Width = col_double(),
           Petal.Length = col_double(),
           Petal.Width = col_double(),
           Species = col_factor(levels = c('setosa', 'versicolor', 'virginica'))
         )
) %>% head(2)

## New names:
## • `` -> `...1`

## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa

9.3.2 Package `readxl`

This package is also part of tidyverse but this one has to be loaded specifically using library(readxl). As the name suggests, it has functions which are useful to read/write data to/from excel files. Excel files (having extension .xls or .xlsx) are slightly different in a way that these may contain several sheets of data at once. The functions read_excel(), read_xlsx, and read_xls have been designed for reading sheets from excel files. The syntax is

read_excel(path, sheet = NULL, range = NULL)
read_xlsx(path, sheet = NULL, range = NULL)
read_xls(path, sheet = NULL, range = NULL)

To get the names of sheets in an excel file we can use excel_sheets() function from the library.

excel_sheets(path)

9.3.3 Package `reprex`

This is again part of tidyverse, but has to be loaded specifically by calling library(reprex). The name reprex is actually short for reproducible example. This is useful particularly when we are stuck in some problem and seek for online help on some forum such as Stack Overflow, R Studio Community, etc.

As an example do this in your console

library(reprex)

x <- dput(head(iris))
x

Thereafter run reprex(), a small window in the Viewer tab will be opened like this. Moreover, the code has been copied on the clipboard.

For more reading please refer this page.¹¹

8 File handling operations in R

10 Data Cleaning in R

9 Getting data in and out of R

9.1 Importing external data in R - Base R methods

Reading tables through read.table() and/or read.csv()

Read data into a vector through scan() or readline()

Reading text files through readLines()

9.2 Exporting data out of R - Base R methods

Writing tabular data through write.table() and/or write.csv()

Writing character data line by line to a file through writeLines()

Using dput() to get a code representation of R object