Read a comma-separated CSV file with fields containing commas using fread in R

I have a comma-separated CSV file. However, some fields contain commas, such as the company name "Apple, Inc", so those fields get split into two columns, which leads to the following error when using fread:

"Stopped early on line 5. Expected 26 fields but found 27."

Any suggestions on how to appropriately load this file? Thanks in advance!

Add:

Example rows are shown below. It seems that some fields contain a comma but are not quoted. In those fields, the comma is followed by whitespace.

100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com

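1) Using the test file created in the Note at the end, and assuming that the file contains no semicolons (use some other character if it does), read in the lines, replace the first and last comma on each line with a semicolon, and then read the result as a semicolon-separated file. The greedy (.*) matches everything between the first and last comma, so this assumes that only the middle field can contain commas.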

L <- readLines("firms.csv")
read.table(text = sub(",(.*),", ";\\1;", L), sep = ";")
##    V1          V2         V3
## 1 100   Microsoft  azure.com
## 2 300         IBM    ibm.com
## 3 500      Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400    SAP, Inc    sap.com
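
Since the question asks about fread, the same substitution can probably be passed to fread through its text argument. This is only a sketch: it assumes data.table is installed, that the file has no header row, and that only the middle field contains commas. The sample rows are inlined here so the snippet runs on its own.

```r
library(data.table)

# Sample rows from the question, inlined so the example is self-contained
L <- c('100,Microsoft,azure.com',
       '300,IBM,ibm.com',
       '500,Google,google.com',
       '100,Amazon, Inc,amazon.com',
       '400,"SAP, Inc",sap.com')

# Replace the first and last comma on each line with a semicolon,
# then let fread parse the result as semicolon-separated text
DT <- fread(text = sub(",(.*),", ";\\1;", L), sep = ";", header = FALSE)
DT
```

With a real file, L would come from readLines("firms.csv") as in the code above.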

2) Another approach is to use gsub to replace every comma followed by a space with a semicolon followed by a space, then use chartr to swap commas and semicolons, and then read the result as a semicolon-separated file. After the swap, the field separators are semicolons and the commas inside fields are restored. This also assumes the file contains no semicolons.

L <- readLines("firms.csv")
read.table(text = chartr(",;", ";,", gsub(", ", "; ", L)), sep = ";")
##    V1          V2         V3
## 1 100   Microsoft  azure.com
## 2 300         IBM    ibm.com
## 3 500      Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400    SAP, Inc    sap.com

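3) Another possibility, if there are not too many such rows, is to locate them and then put quotes around the offending fields in a text editor. The file can then be read in normally. count.fields gives the number of fields on each line, so the lines that do not have 3 fields are the ones to fix.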

which(count.fields("firms.csv", sep = ",") != 3)
## [1] 4
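
If there are too many such rows to fix by hand, the quoting could also be done in R. The following is a sketch under the same assumption as 1), namely that only the middle field contains commas. The temporary file and the names path, bad and fixed are made up for illustration.

```r
# Recreate the sample file from the Note in a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c('100,Microsoft,azure.com',
             '300,IBM,ibm.com',
             '500,Google,google.com',
             '100,Amazon, Inc,amazon.com',
             '400,"SAP, Inc",sap.com'), path)

L <- readLines(path)

# Lines whose field count is not 3 contain an unquoted comma
bad <- count.fields(path, sep = ",") != 3

# On those lines, put double quotes around everything between
# the first and last comma
L[bad] <- sub('^([^,]*),(.*),([^,]*)$', '\\1,"\\2",\\3', L[bad])

# The repaired lines can now be read as an ordinary CSV
fixed <- read.csv(text = L, header = FALSE)
fixed
```

Lines that are already quoted correctly, such as the SAP row, have 3 fields and are left untouched.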

Note

Lines <- '100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
'
cat(Lines, file = "firms.csv")

Works fine for me. Can you provide a reproducible example?

library(data.table)

# Create example and write out; write.csv quotes character fields by default
df_out <- data.frame("X" = c("A", "B", "C"),
                     "Y" = c("a,A", "b,B", "C"))

write.csv(df_out, file = "df.csv", row.names = FALSE)

# Read in CSV with fread
df_in <- fread("./df.csv")
df_in
   X   Y
1: A a,A
2: B b,B
3: C   C
