Read comma-separated CSV file with fields containing commas using fread in R
I have a CSV file separated by commas. However, some fields contain commas, such as the company name "Apple, Inc". These fields get split into two columns, which leads to the following error when using fread.
"Stopped early on line 5. Expected 26 fields but found 27."
Any suggestions on how to load this file properly? Thanks in advance!
Add:
Example rows are as follows. It seems that some fields contain a comma but are not quoted. In those fields, however, the comma is followed by whitespace.
100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
1) Using the test file created in the Note at the end, and assuming that the file contains no semicolons (use some other character if it does), read in the lines, replace the first and last comma on each line with a semicolon, and then read the result as a semicolon-separated file.
L <- readLines("firms.csv")
read.table(text = sub(",(.*),", ";\\1;", L), sep = ";")
##    V1          V2         V3
## 1 100   Microsoft  azure.com
## 2 300         IBM    ibm.com
## 3 500      Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400    SAP, Inc    sap.com
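The substitution in 1) relies on `.*` being greedy, so the pattern spans from the first comma on a line to the last one. Note that this only fits files with exactly three columns where the problem field is the middle one. A minimal sketch on a single in-memory line (the variable names `line`, `converted` and `parsed` are illustrative, not from the original answer):

```r
# One problem line taken from the question's example rows.
line <- "100,Amazon, Inc,amazon.com"

# The greedy (.*) stretches from the first comma to the last comma,
# so only those two commas are turned into semicolons; the comma
# inside the company name is left untouched.
converted <- sub(",(.*),", ";\\1;", line)
print(converted)

# Parse the converted line as semicolon-separated text.
parsed <- read.table(text = converted, sep = ";")
print(parsed)
```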
2) Another approach is to use gsub to replace every comma followed by a space with a semicolon followed by a space, then use chartr to swap the two characters (every comma becomes a semicolon and every semicolon becomes a comma), and then read the result as a semicolon-separated file.
L <- readLines("firms.csv")
read.table(text = chartr(",;", ";,", gsub(", ", "; ", L)), sep = ";")
##    V1          V2         V3
## 1 100   Microsoft  azure.com
## 2 300         IBM    ibm.com
## 3 500      Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400    SAP, Inc    sap.com
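The two steps in 2) can be easier to follow when the intermediate strings are inspected. A small sketch on two of the example rows, held in memory rather than read from the file (the names `rows`, `step1`, `step2` and `parsed` are illustrative; as in the answer, it assumes the data contain no semicolons and that in-field commas are always followed by a space):

```r
rows <- c('100,Amazon, Inc,amazon.com',
          '400,"SAP, Inc",sap.com')

# Step 1: mark the in-field commas (those followed by a space)
# by turning them into semicolons.
step1 <- gsub(", ", "; ", rows)
print(step1)

# Step 2: swap the two characters, so the real separators become
# semicolons and the in-field marks go back to being commas.
step2 <- chartr(",;", ";,", step1)
print(step2)

# Read the result as semicolon-separated text.
parsed <- read.table(text = step2, sep = ";")
print(parsed)
```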
3) Another possibility, if there are not too many such rows, is to locate them and then put quotes around the offending fields in a text editor. The file can then be read in normally.
which(count.fields("firms.csv", sep = ",") != 3)
## [1] 4
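Before opening an editor it may help to print the offending lines themselves. A minimal sketch, assuming a temporary copy of the sample rows (the temporary path and the names `n_fields` and `bad` are illustrative, not from the original answer; the line numbers only match when the file has no blank lines or multi-line quoted fields):

```r
# Write the sample rows to a temporary file so nothing is overwritten.
path <- tempfile(fileext = ".csv")
writeLines(c('100,Microsoft,azure.com',
             '300,IBM,ibm.com',
             '500,Google,google.com',
             '100,Amazon, Inc,amazon.com',
             '400,"SAP, Inc",sap.com'), path)

# count.fields honours quotes, so the quoted "SAP, Inc" row still
# counts as three fields; only the unquoted row is flagged.
n_fields <- count.fields(path, sep = ",")
bad <- which(n_fields != 3)

# Show the raw text of the rows that need quotes added.
print(readLines(path)[bad])
```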
Note: the test file firms.csv used above is created as follows.
Lines <- '100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
'
cat(Lines, file = "firms.csv")
Works fine for me. Can you provide a reproducible example?
library(data.table)
# Create example and write out
df_out <- data.frame(X = c("A", "B", "C"),
                     Y = c("a,A", "b,B", "C"))
write.csv(df_out, file = "df.csv", row.names = FALSE)
# Read in CSV with fread
df_in <- fread("./df.csv")
df_in
   X   Y
1: A a,A
2: B b,B
3: C   C
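This example most likely reads cleanly because write.csv wraps character fields in double quotes by default, unlike the unquoted "Amazon, Inc" row in the question. A short sketch that inspects the raw file to show this (it writes to a temporary path instead of df.csv; the names `path` and `raw` are illustrative):

```r
# Recreate the example data and write it to a temporary file.
path <- tempfile(fileext = ".csv")
df_out <- data.frame(X = c("A", "B", "C"),
                     Y = c("a,A", "b,B", "C"))
write.csv(df_out, file = path, row.names = FALSE)

# Look at the raw text: every field is quoted, so the embedded
# commas are not treated as separators.
raw <- readLines(path)
print(raw)

# Each line therefore parses as exactly two fields.
print(count.fields(path, sep = ","))
```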