
Importing only every Nth row from a .csv file in R

Just a quick question: is there a way to use read.csv to import every Nth row from a large file?

For example, a 50-60 million line file where you only need every 4th row, starting at row 2.

I thought about maybe incorporating the 'seq' function, but I am not sure if that is possible.

Any suggestions?

For a large data file the best option is to filter out the unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands like sed, awk, grep etc. The following example reads every 4th line from the file:

# create a small example file: a header line plus 1000 data rows
write.csv(1:1000, file = 'test.csv')

# awk keeps every 4th physical line of the file (lines 4, 8, 12, ...)
file.pipe <- pipe("awk 'BEGIN{i=0}{i++;if (i%4==0) print $1}' < test.csv")
res <- read.csv(file.pipe)
res

> res
     X3 X3.1
1     7    7
2    11   11
3    15   15
4    19   19
5    23   23
6    27   27
7    31   31
8    35   35

(The first selected line, row 3, is consumed as the column header, which is why the columns are named X3 and X3.1; pass header = FALSE to read.csv to keep it as data.)
read.csv("filename.csv")[c(FALSE, TRUE, FALSE, FALSE), ]

will do the trick.

This works because the logical vector is recycled until it matches the number of rows of the data frame returned by read.csv.
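The same recycling can be seen on a plain vector, which makes the pattern easy to check:

(1:12)[c(FALSE, TRUE, FALSE, FALSE)]
#> [1]  2  6 10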

As @df239 suggested, it's much better to filter the rows beforehand using a command-line tool.

Here's a simpler version using sed:

df <- read.csv(pipe("sed -n '2~4p' test.csv")) 

The 2~4p address tells sed to print every 4th line, starting at line 2. Note that first~step is a GNU sed extension, so it is not available in BSD sed (e.g. on macOS). Also, because the selection starts at line 2, the file's header line is dropped and read.csv will treat the first selected row as the header; pass header = FALSE if that is not what you want.
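If GNU sed is unavailable, a POSIX awk equivalent should behave the same way (a sketch, reusing the test.csv file from above):

# keep lines 2, 6, 10, ...; the header on line 1 is skipped, hence header = FALSE
df <- read.csv(pipe("awk 'NR % 4 == 2' test.csv"), header = FALSE)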

Sven gave a great answer for moderately sized files. But if you are doing this because the entire file does not fit into memory, then you need to take a different approach.

It may be simplest to use an external tool like Perl or awk to preprocess the file so that it contains only the lines that you want. You can use pipe to read from the output of another program, so you do not have to create an intermediate file.
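For instance, a minimal sketch with Perl, again reusing the test.csv file from above ($. is Perl's current line number, so this keeps lines 2, 6, 10, and so on):

# the header on line 1 is skipped, hence header = FALSE
df <- read.csv(pipe("perl -ne 'print if $. % 4 == 2' test.csv"),
               header = FALSE)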

Another approach would be to transfer the file to a database, then select just the rows that you want from the database.
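A minimal sketch of that idea with DBI and RSQLite (the file big.sqlite and the table name big are assumptions here; the data would be loaded once beforehand, e.g. with the sqlite3 CLI's .import command):

library(DBI)

# connect to a database that already contains the data in a table "big"
con <- dbConnect(RSQLite::SQLite(), "big.sqlite")
# rowid is SQLite's implicit row number; this keeps rows 2, 6, 10, ...
res <- dbGetQuery(con, "SELECT * FROM big WHERE (rowid - 2) % 4 = 0")
dbDisconnect(con)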

You can also loop through the file. If you explicitly open the file, you can read a few rows at a time, keep just the ones that you want, then read the next chunk starting where you left off. The skip and nrows arguments to read.csv are helpful here, as sketched below.
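A minimal sketch of that loop in base R (the function name read_every_nth and its defaults are made up for illustration; it assumes the file has a header row):

read_every_nth <- function(path, nth, offset = 2, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  cols <- NULL
  seen <- 0        # data rows consumed so far
  pieces <- list()
  repeat {
    chunk <- tryCatch(
      if (is.null(cols)) read.csv(con, nrows = chunk_size)
      else read.csv(con, header = FALSE, nrows = chunk_size, col.names = cols),
      error = function(e) NULL)  # read.csv errors once the input is exhausted
    if (is.null(chunk) || nrow(chunk) == 0) break
    if (is.null(cols)) cols <- names(chunk)
    idx <- seen + seq_len(nrow(chunk))            # global data-row numbers
    keep <- idx >= offset & (idx - offset) %% nth == 0
    pieces[[length(pieces) + 1]] <- chunk[keep, , drop = FALSE]
    seen <- seen + nrow(chunk)
  }
  do.call(rbind, pieces)
}

df <- read_every_nth("test.csv", nth = 4, offset = 2)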

While the sed and awk solutions are great, it might be nice to do this within R itself (say, on Windows machines, or to avoid GNU sed vs. BSD sed differences). Using readr::read_*_chunked from the tidyverse with a callback that samples every nth row works rather well:

library(readr)
library(dplyr)

read_tsv_sample <- function(fn, nth, ...) {
  # keep every nth row within each chunk; this lines up globally
  # as long as chunk_size is a multiple of nth
  sample_df_cb <- function(df, idx) {
    df[seq(1, nrow(df), nth), ]
  }

  read_tsv_chunked(fn,
                   ...,
                   chunk_size = 10000,
                   callback = DataFrameCallback$new(sample_df_cb)
  ) %>%
    bind_rows()
}

For example...

iris %>% write_tsv("iris.tsv")

iris %>% dim
#> [1] 150   5

"iris.tsv" %>%
    read_tsv_sample(10,
                    col_types = cols(.default = col_double())
                    ) %>%
    dim
#> [1] 15  5
