How to read unquoted extra \r with data.table::fread

The data I have to process contains unquoted text with stray \r characters. The files are big (500MB), numerous (>600), and changing the export is not an option. The data might look like this:

A,B,C
blah,a,1
bloo,a\r,b
blee,c,d

  1. How can this be handled with data.table's fread?
  2. Is there a better R function for reading CSVs that handles this and is similarly performant?

Repro

library(data.table)
csv<-"A,B,C\r\n
      blah,a,1\r\n
      bloo,a\r,b\r\n
      blee,c,d\r\n"
fread(csv)

Error in fread(csv) : Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0: bloo,a

Advanced repro

The simple repro might be too trivial to give a sense of scale, so here is a larger one:

samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")

# Naive approach
fread("sample.csv")

# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) :  file name conversion problem -- name too long?

# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage  48029706 is too close to the limit

Following @dirk-eddelbuettel's and @nrussell's suggestions, one way of solving this is to pre-process the file. The pre-processor could also be called within fread(), but here it is run as separate steps:

samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
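
The answer mentions that the pre-processor could also be called from within fread() itself. A minimal sketch of that variant, assuming data.table 1.11.6 or later (which provides the `cmd` argument), a `tr` binary on the PATH, and a Unix-like shell for the quoting; the file name `sample_small.csv` is made up for the example:

```r
library(data.table)

# Small stand-in for the real export: one field carries a stray \r.
writeLines(c("A,B,C", "blah,a,1", "bloo,a\r,b", "blee,c,d"),
           "sample_small.csv")

# fread() runs the shell command and parses whatever it prints.
# tr -d '\r' deletes every carriage return, including those in \r\n pairs.
dt <- fread(cmd = "tr -d '\\r' < sample_small.csv")
print(dt)
```

This skips the intermediate file that the two-step version creates, at the cost of depending on the shell for every read.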

We can try with gsub:

fread(gsub("\r\n|\r", "", csv))
#      A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
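
For a file on disk, the same removal can be done without building one large character string first. A minimal base-R sketch, assuming each file fits in memory as raw bytes; `strip_cr` and the demo file names are hypothetical:

```r
# Hypothetical helper: copy a file while dropping every \r byte (0x0d).
# Working on raw bytes avoids any encoding or string-length concerns.
strip_cr <- function(path_in, path_out) {
  bytes <- readBin(path_in, what = "raw", n = file.size(path_in))
  writeBin(bytes[bytes != as.raw(0x0d)], path_out)
  invisible(path_out)
}

# Write the demo file in binary mode so the bytes are identical on every OS.
con <- file("demo_cr.csv", "wb")
writeChar("A,B,C\r\nblah,a,1\r\nbloo,a\r,b\r\nblee,c,d\r\n", con, eos = NULL)
close(con)

strip_cr("demo_cr.csv", "demo_clean.csv")
cleaned <- readLines("demo_clean.csv")
print(cleaned)
```

The cleaned file can then be passed to fread() by name. Like `tr -d '\r'`, this drops the \r instead of preserving it inside the field.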

You can also do this with tidyverse packages, if you'd like.

> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
      A     B     C
  <chr> <chr> <chr>
1  blah     a     1
2  bloo     a     b
3  blee     c     d

If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it resumes reading or writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.

In the following code, the file is processed in chunks:

library(data.table)

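input_csv <- "sample.csv"
output_csv <- "out.csv"
# Open both connections in binary mode so readChar()/writeChar() see the
# bytes exactly as stored, with no newline translation on Windows.
in_conn <- file(input_csv, "rb")
out_conn <- file(output_csv, "wb")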

chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"

buffer <- ""

repeat {
  new_chars <- readChar(in_conn, chunk_size)
  buffer <- paste0(buffer, new_chars)
  while (grepl("[\r\n]$", buffer, perl = TRUE)) {
    next_char <- readChar(in_conn, 1)
    buffer <- paste0(buffer, next_char)
    if (!length(next_char))
      break
  }
  chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
  buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))
  cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
  writeChar(cleaned, out_conn, eos = NULL)
  if (!length(new_chars))
    break
}

writeChar('\n', out_conn, eos = NULL)

close(in_conn)
close(out_conn)

result <- fread(output_csv)

Process:

  • If a chunk ends with a \r or \n, another character is added until it doesn't.
  • Quotes are put around values containing a \r that isn't adjacent to a \n.
  • The cleaned chunk is appended to the end of another file.
  • Rinse and repeat.
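
The quoting step can be tried in isolation on a short string. This sketch reuses return_pattern from the code above; the sample chunk is made up for illustration and uses \r\n line endings:

```r
# Same pattern as in the chunked code: a field containing a \r that is
# neither preceded nor followed by \n.
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"

# Made-up chunk: \r\n ends each line, and the field "a\r" has a stray \r.
chunk <- "A,B,C\r\nblah,a,1\r\nbloo,a\r,b"

# Wrap each offending field in double quotes so a CSV reader keeps it whole.
cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
print(cleaned)
```

Only the a\r field ends up wrapped in quotes; the \r\n line endings are left untouched.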

This code simplifies the problem by assuming that no field in sample.csv is quoted. It's not especially fast, but not terribly slow either. Larger values of chunk_size should reduce the time spent in I/O operations. If it is used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.
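
One possible shape for that tryCatch(...) wrapper is sketched below. The names `with_connections`, `process_fun`, and `copy_chunk` are hypothetical; `process_fun` stands in for the chunk loop above, and the demo simply copies a small file:

```r
# Hypothetical wrapper: opens both files, runs process_fun, and closes the
# connections whether or not process_fun raises an error.
with_connections <- function(input_csv, output_csv, process_fun) {
  in_conn <- file(input_csv, "rb")
  out_conn <- file(output_csv, "wb")
  tryCatch(
    process_fun(in_conn, out_conn),
    finally = {
      close(in_conn)
      close(out_conn)
    }
  )
}

# Demo: a stand-in for the chunk loop that copies up to 1e6 characters.
writeLines(c("A,B,C", "blah,a,1"), "demo_in.csv")
copy_chunk <- function(in_conn, out_conn) {
  writeChar(readChar(in_conn, 1e6), out_conn, eos = NULL)
}
with_connections("demo_in.csv", "demo_out.csv", copy_chunk)
```

The `finally` clause runs on both success and failure, so an error inside the loop still propagates to the caller but does not leave the files open.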
