How to read unquoted extra \r with data.table::fread
The data I have to process has unquoted text containing some extra \r characters. The files are big (500MB), numerous (>600), and changing the export is not an option. The data might look like:
A,B,C
blah,a,1
bloo,a\r,b
blee,c,d
Is there a way to get fread to handle this?
library(data.table)
csv<-"A,B,C\r\n
blah,a,1\r\n
bloo,a\r,b\r\n
blee,c,d\r\n"
fread(csv)
Error in fread(csv) : Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0: bloo,a
The simple repro might be too trivial to give a sense of scale, so here is a larger one:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Naive approach
fread("sample.csv")
# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) : file name conversion problem -- name too long?
# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage 48029706 is too close to the limit
Further to @dirk-eddelbuettel & @nrussell's suggestions, one way of solving this is to pre-process the file. The processor could also be called within fread(), but here it is performed in separate steps:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
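As a rough sketch of the "called within fread()" variant mentioned above: fread() accepts a shell command through its cmd argument and reads that command's output. The example below assumes a tr executable is on the PATH and that the shell understands single quotes (typical on Linux/macOS; on Windows the Rtools path and quoting would need adjusting). The temporary file is only there for illustration.

```r
library(data.table)

# Small illustrative file with one stray \r inside a field (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("A,B,C\nblah,a,1\nbloo,a\r,b\nblee,c,d\n"), tmp)

# Let tr delete every \r and hand the cleaned stream straight to fread.
# Assumption: `tr` is available on the PATH and the shell accepts '\r' in single quotes.
dt <- fread(cmd = paste("tr -d '\\r' <", shQuote(tmp)))
print(dt)
```

Note that tr -d removes every \r, including those in \r\n line endings, which is harmless here because the \n that remains still terminates each record.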
We can try with gsub:
fread(gsub("\r\n|\r", "", csv))
# A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
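One caveat with the pattern above: it works on the csv string from the question because every "\r\n" there is followed by a literal newline, so a "\n" survives the replacement. If the records are separated by "\r\n" alone, deleting "\r\n" would join all the rows together. A hedged alternative is to remove only a \r that is not followed by \n, using a Perl-style lookahead. The string below is an illustrative variant of the question's data, not the original.

```r
# Records separated by \r\n only, with one stray \r inside a field.
csv <- "A,B,C\r\nblah,a,1\r\nbloo,a\r,b\r\nblee,c,d\r\n"

# Drop a \r only when the next character is not \n (negative lookahead).
cleaned <- gsub("\r(?!\n)", "", csv, perl = TRUE)

# Split into records to check that the line endings survived.
lines <- strsplit(cleaned, "\r\n", fixed = TRUE)[[1]]
print(lines)

# If data.table is installed, the cleaned text can be passed to fread directly.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(data.table::fread(text = cleaned))
}
```

Like the original gsub answer, this needs the whole file in memory as a single string, so it does not address the scale problem on its own.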
You can also do this with tidyverse packages, if you'd like.
> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
A B C
<chr> <chr> <chr>
1 blah a 1
2 bloo a b
3 blee c d
If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it will start reading/writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.
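To illustrate the point about an open connection remembering its position, here is a minimal base R sketch. The temporary file and its contents are made up for the example.

```r
# Write a small illustrative file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "blah,a,1", "blee,c,d"), tmp)

con <- file(tmp, open = "r")
tryCatch({
  # Each call continues where the previous one stopped.
  first  <- readLines(con, n = 1)
  second <- readLines(con, n = 1)
}, finally = {
  # Closing the connection is our responsibility.
  close(con)
})

print(first)
print(second)
```

Had the connection been passed unopened (for example readLines(tmp, n = 1) twice), each call would open and close the file itself and return the first line both times.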
In the following code, the file is processed by chunks:
library(data.table)
input_csv <- "sample.csv"
in_conn <- file(input_csv)
output_csv <- "out.csv"
out_conn <- file(output_csv, "w+")
open(in_conn)
chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"
buffer <- ""
repeat {
new_chars <- readChar(in_conn, chunk_size)
buffer <- paste0(buffer, new_chars)
while (grepl("[\r\n]$", buffer, perl = TRUE)) {
next_char <- readChar(in_conn, 1)
buffer <- paste0(buffer, next_char)
if (!length(next_char))
break
}
chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))
cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
writeChar(cleaned, out_conn, eos = NULL)
if (!length(new_chars))
break
}
writeChar('\n', out_conn, eos = NULL)
close(in_conn)
close(out_conn)
result <- fread(output_csv)
Process:
Up to chunk_size characters are read and appended to the buffer. If the buffer ends with a \r or \n, another character is added until it doesn't.
Everything up to the last , or \n in the buffer is split off as the current chunk; the rest stays in the buffer for the next pass.
In the chunk, fields are wrapped in double quotes if they contain a \r which isn't adjacent to a \n.
The cleaned chunk is written to the output file.
This code simplifies the problem by assuming no quoting is done for any field in sample.csv. It's not especially fast, but not terribly slow. Larger values for chunk_size should reduce the amount of time spent in I/O operations. If used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.
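A possible shape for the tryCatch(...) wrapping is sketched below. To keep it short it swaps in a simpler cleaning step than the code above: instead of quoting the affected fields, it deletes every \r byte that is not followed by \n, working on raw bytes. The function name strip_stray_cr and its arguments are hypothetical, and the sketch assumes a single-byte-compatible encoding such as ASCII or UTF-8.

```r
# Hypothetical helper: copy input_csv to output_csv, dropping any \r that is
# not immediately followed by \n. Connections are closed even on error.
strip_stray_cr <- function(input_csv, output_csv, chunk_size = 1e6) {
  cr <- as.raw(0x0d)
  lf <- as.raw(0x0a)
  in_conn  <- file(input_csv, "rb")
  out_conn <- file(output_csv, "wb")
  tryCatch({
    carry <- raw(0)
    repeat {
      bytes <- readBin(in_conn, what = "raw", n = chunk_size)
      if (!length(bytes)) break
      bytes <- c(carry, bytes)
      n <- length(bytes)
      # A trailing \r cannot be judged until the next chunk arrives: hold it back.
      if (bytes[n] == cr) {
        carry <- bytes[n]
        bytes <- bytes[-n]
      } else {
        carry <- raw(0)
      }
      if (length(bytes)) {
        is_cr      <- bytes == cr
        next_is_lf <- c(bytes[-1] == lf, FALSE)
        writeBin(bytes[!(is_cr & !next_is_lf)], out_conn)
      }
    }
    # A \r still held back at end of file has no \n after it, so it is dropped.
  }, finally = {
    close(in_conn)
    close(out_conn)
  })
  invisible(output_csv)
}

# Usage (file names as in the example above):
# strip_stray_cr("sample.csv", "out.csv")
# result <- data.table::fread("out.csv")
```

The finally clause runs whether or not the loop fails, which is the property the tryCatch(...) advice is after; on.exit(close(...)) inside the function would be an equally reasonable choice.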