Comma-separated value (csv) files are one of the most common file formats used in data analysis today. Sometimes we need to read multiple csv files from disk and combine them into a single data frame or data table object in R. We shall explore five different approaches to that task and determine the most efficient one. First, let us make sure that we know how to answer the following question:
How to list the files in a given directory?
The function list.files()
lists all files in a given directory whose names contain a specific character pattern. An example of its use can be the following:
list.files(path = "./csv/", pattern = "^f.*199", full.names = TRUE)
[1] "./csv/football-results-1998.csv" "./csv/football-results-1999.csv"
The output is a character vector giving the names of the files matching the search criterion. In our example, we have called the function with the following options:
-
path = "./csv/"
the directory path is set to the csv subdirectory of the working directory; -
pattern = "^f.*199"
a regex expression matching strings starting with the character f, followed by an unspecified number of arbitrary characters, followed by 199; -
full.names = TRUE
the path is attached to the returned file names.
Examining five approaches
The stated problem is different from looking for the fastest way to read one large csv file into memory. In our use case we need to import multiple csv files (say 10 or more), each having many columns (say 10 or more), which we need to row bind together to produce a single large data frame or data table object. We next list five approaches to that task in increasing order of efficiency.
The variable files
in the code below is a character vector containing a list of names of csv files, for example as produced by list.files()
.
1. Sequentially reading and binding all data frames
table = NULL
for (file in files)
table = rbind(table, read.csv(file))
This is the least efficient solution in which we simply append each data frame sequentially, starting with an empty data.frame
object. At each step R stores the intermediate data frame at some new address in memory, which results in copying the data of the earlier data frames multiple times.
2. Making a single call to rbind()
table = lapply(files, read.csv) %>% do.call(rbind, .)
Here we use the fact that the function rbind()
can simultaneously bind together multiple data frames, which is significantly faster than the previous approach. A special feature of R functions like rbind()
is having a dynamic argument list. The R way of passing a variable number of arguments to a function is by invoking the method do.call()
.
3. Further optimization with rbindlist()
table = lapply(files, read.csv) %>% rbindlist()
The row binding of multiple data frames and other matrix-like structures with the same column names can be optimized with the function rbindlist()
, a member of the data.table
package.
4. A solution based on the newer readr
package
col_spec = spec_csv(files[1])
table = lapply(files, read_csv, col_types = col_spec) %>% rbindlist()
Another optimization may come from the replacement of read.csv()
with a faster analogue of it. Such is the function read_csv()
from the package readr
. The function read_csv()
, however, may be very slow and inefficient if it has to determine the column types of the data very often. To overcome this problem we have first invoked readr::spec_csv()
to automatically set up the column types before running the rest of the code.
5. A solution based on the data.table
package
table = lapply(files, fread) %>% rbindlist()
The package data.table
is a modern and much optimized analogue to the more familiar package data.frame
. The function fread()
from data.table
is designed to read a csv file into a data.table
object. For large csv files it can be an order of magnitude faster than read.csv()
.
Comparison of running times
We have benchmarked the relative efficiency of each approach in the table below.
test | replications | elapsed | user | system |
---|---|---|---|---|
fread | 100 | 3.37 | 2.84 | 0.53 |
read_csv | 100 | 9.02 | 5.83 | 2.81 |
rbindlist | 100 | 16.51 | 16.09 | 0.42 |
do.call | 100 | 19.48 | 19.06 | 0.41 |
sequential | 100 | 36.83 | 36.26 | 0.50 |
Apart from these results, we also note the following:
-
The test involves reading 22 csv files with 15 columns (a mixture of character and integer values) and around 1000 rows each with actual data on international football matches. The elapsed time in the table above is given in seconds for 100 repetitions.
-
The relative performance of the functions
fread()
,read_csv()
andread.csv()
under equal conditions may be gleaned from the first three results in the table above. -
The method
rbindlist()
is faster thanrbind()
called withdo.call()
for multipledata.frame
elements, but not fordata.table
ones. So, in fact replacing Solution 5 above with the codetable = lapply(files, fread) %>% do.call(rbind, .)
results in marginally faster running times. -
Without specifying the column types in
read_csv()
the function runs very slowly and we get a 4-5 fold reduction in speed. Its default usage, such as inlapply(files, read_csv)
is not recommended in a set-up like ours.