How to read an unquoted \r with data.table::fread

Question

The data I have to process has unquoted text with some additional \r characters. The files are big (500MB), copious (>600), and changing the export is not an option. The data might look like

A,B,C
blah,a,1
bloo,a\r,b
blee,c,d

How can this be handled with data.table's fread?
Is there a better R function for reading CSV files in this situation that is similarly performant?

Repro

library(data.table)
csv<-"A,B,C\r\n
      blah,a,1\r\n
      bloo,a\r,b\r\n
      blee,c,d\r\n"
fread(csv)

Error in fread(csv) : Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0: bloo,a

Advanced repro

The simple repro might be too trivial to give a sense of scale...

samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")

# Naive approach
fread("sample.csv")

# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) :  file name conversion problem -- name too long?

# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage  48029706 is too close to the limit

Answer 1

Further to @dirk-eddelbuettel & @nrussell's suggestions, one way of solving this is to pre-process the file. The processor could also be called within fread(), but here it is performed in separate steps:

samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
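
For the "called within fread()" variant mentioned above, a minimal sketch follows. It assumes a data.table version whose fread() has the `cmd` argument, and a Unix-like shell with `tr` on the PATH; the small temporary file is purely illustrative. On Windows the command string would need to point at the Rtools `tr.exe` instead.

```r
library(data.table)

# Illustrative input: one field contains a stray carriage return.
path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "blah,a,1", "bloo,a\r,b", "blee,c,d"), path)

# Let fread run the shell command and parse its output directly,
# so no intermediate cleaned file has to be written by hand.
# Assumption: `tr` is available on the PATH (Unix-like system).
dt <- fread(cmd = paste("tr -d '\\r' <", shQuote(path)))
```

This deletes every \r, including those that belong to \r\n line endings, which fread reads without trouble once only \n remains.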

Answer 2

We can try with gsub:

fread(gsub("\r\n|\r", "", csv))
#      A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
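
When the data lives in a file rather than in a short string, the same idea can be sketched as below. This is an assumption-laden variant, not the answer's own code: it reads the whole file into memory (which may be impractical at 500MB), normalises \r\n to \n before dropping any remaining \r, and passes the result through fread's `text` argument so that a long string is not mistaken for a file name.

```r
library(data.table)

# Illustrative input: CRLF line endings plus one stray CR inside a field.
path <- tempfile(fileext = ".csv")
writeChar("A,B,C\r\nblah,a,1\r\nbloo,a\r,b\r\nblee,c,d\r\n", path, eos = NULL)

# Read the whole file as one string (the file must fit in memory).
raw_text <- readChar(path, file.size(path), useBytes = TRUE)

# Normalise CRLF to LF first, then drop any CR that is left over.
lf_only <- gsub("\r\n", "\n", raw_text, fixed = TRUE)
clean_text <- gsub("\r", "", lf_only, fixed = TRUE)

# `text =` marks the string as data rather than a file name or command.
dt <- fread(text = clean_text)
```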

Answer 3

You can also do this with tidyverse packages, if you'd like.

> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
      A     B     C
  <chr> <chr> <chr>
1  blah     a     1
2  bloo     a     b
3  blee     c     d

Answer 4

If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it will start reading/writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.

In the following code, the file is processed in chunks:

library(data.table)

input_csv <- "sample.csv"
in_conn <- file(input_csv)
output_csv <- "out.csv"
out_conn <- file(output_csv, "w+")
open(in_conn)

chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"

buffer <- ""

repeat {
  new_chars <- readChar(in_conn, chunk_size)
  buffer <- paste0(buffer, new_chars)
  while (grepl("[\r\n]$", buffer, perl = TRUE)) {
    next_char <- readChar(in_conn, 1)
    buffer <- paste0(buffer, next_char)
    if (!length(next_char))
      break
  }
  chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
  buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))
  cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
  writeChar(cleaned, out_conn, eos = NULL)
  if (!length(new_chars))
    break
}

writeChar('\n', out_conn, eos = NULL)

close(in_conn)
close(out_conn)

result <- fread(output_csv)

Process:

If a chunk ends with a \r or \n, further characters are read one at a time until it no longer does.
Quotes are put around values containing a \r which isn't adjacent to a \n.
The cleaned chunk is added to the end of another file.
Rinse and repeat.
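
To see the quoting step in isolation, here is a small sketch that applies the same `return_pattern` to a single made-up chunk. The chunk string is illustrative only; the pattern and the replacement are copied from the code above.

```r
# Pattern copied from the answer: a comma-delimited value containing a \r
# that is neither preceded nor followed by \n.
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"

# Illustrative chunk: LF line endings, one stray CR in the second record.
chunk <- "blah,a,1\nbloo,a\r,b\nblee,c,d"

# Wrap the offending value in double quotes; the \r itself is kept.
cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)

# Only the value "a\r" is now quoted; every other field is untouched.
stopifnot(identical(cleaned, "blah,a,1\nbloo,\"a\r\",b\nblee,c,d"))
```

Note that this approach keeps the \r inside the quoted value rather than deleting it, so the resulting column still contains the carriage return.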

This code simplifies the problem by assuming no quoting is done for any field in sample.csv. It's not especially fast, but not terribly slow. Larger values for chunk_size should reduce the amount of time spent in I/O operations. If used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.
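
One possible shape for that tryCatch(...) wrapper is sketched below. The helper name `with_csv_connections` and its `process` callback are hypothetical, introduced only for illustration; the point is that the `finally` clause closes both connections whether or not the processing step fails.

```r
# Hypothetical helper: run process(in_conn, out_conn) and always close
# both connections, even if process() throws an error.
with_csv_connections <- function(input_csv, output_csv, process) {
  in_conn <- file(input_csv, open = "rb")
  out_conn <- tryCatch(
    file(output_csv, open = "wb"),
    error = function(e) {
      close(in_conn)  # do not leak the input if the output cannot be opened
      stop(e)
    }
  )
  tryCatch(
    process(in_conn, out_conn),
    finally = {
      close(in_conn)
      close(out_conn)
    }
  )
}

# Usage with a deliberately trivial processing step (not the chunked loop):
input_csv <- tempfile(fileext = ".csv")
output_csv <- tempfile(fileext = ".csv")
writeChar("A,B,C\nbloo,a\r,b\n", input_csv, eos = NULL)

with_csv_connections(input_csv, output_csv, function(in_conn, out_conn) {
  chunk <- readChar(in_conn, 1e6)
  writeChar(gsub("\r", "", chunk, fixed = TRUE), out_conn, eos = NULL)
})
```

The chunked repeat loop from the answer could be placed inside the callback in place of the single read and write shown here.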

4 answers

Answer 1: 4 votes, accepted, 2016-12-28 11:48:13
Answer 2: 2 votes, 2016-12-27 17:18:21
Answer 3: 1 vote, 2016-12-27 17:48:21
Answer 4: 1 vote, 2016-12-27 22:13:35
