
Replace improper commas in CSV file

This question may have been asked before, but I couldn't find it. I have a set of CSV files (439 or so) where, in a few of the files, someone also used commas in editorial comments. As a result, I can't put the files into a data frame, because the lines no longer split into the same number of elements. The problem I'm facing looks like this:

vec1 <- paste("484,1213,0,62.0006,1,go -- late F1 max, but glide?")
vec2 <- paste("467,1387,0,62.0026,1,goes2")

ls <- list(vec1, vec2)

What I want is a data frame with six columns. If there weren't a comma in the editorial comment in vec1, I could use the following (and have been using it, until I found this problematic example):

library(plyr)  # provides ldply()
df <- ldply(ls, function(x) unlist(strsplit(x[1], split = ",")))

However, I get the expected error that the results do not all have the same length. Is there any way of getting rid of that comma, or turning it into a semicolon, or ensuring that, if a vector has 7 elements, elements 6 and 7 are combined?

If it helps, this is how I'm reading the files into R. I'm using scan because there is other information in the files that I want. There are some odd encoding issues going on here as well, but this seems to work.

data <- scan(file, fileEncoding="latin1", blank.lines.skip = FALSE, what = "list", sep = "\n", quiet = TRUE)   

If you need the comments, you can still replace the 6th comma with a semicolon and use your previous solution:

gsub("((?:[^,]*,){5}[^,]*),", "\\1;", vec1, perl=TRUE)

Regex explanation:

  • ((?:[^,]*,){5}[^,]*) - a capturing group, referred to as Group 1 via \\1 in the replacement pattern, matching
    • (?:[^,]*,){5} - 5 sequences of non-comma characters followed by a comma
    • [^,]* - 0 or more non-commas
  • , - the comma we'll turn into a ; in the replacement
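To see the replacement in context, here is a minimal sketch that applies it to the two sample lines from the question and then splits them. It uses base R's do.call(rbind, ...) instead of ldply only to stay self-contained. The variable names `lines`, `cleaned` and `fields` are made up for the example, and the columns get the default names V1 to V6.

```r
vec1 <- "484,1213,0,62.0006,1,go -- late F1 max, but glide?"
vec2 <- "467,1387,0,62.0026,1,goes2"
lines <- c(vec1, vec2)

# Turn the 6th comma (if there is one) into a semicolon;
# lines with only five commas are left unchanged
cleaned <- gsub("((?:[^,]*,){5}[^,]*),", "\\1;", lines, perl = TRUE)

# Each line now splits into exactly six fields
fields <- strsplit(cleaned, ",", fixed = TRUE)
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
```

All columns come back as character, so numeric columns would still need converting, for example with as.numeric.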

Or (as @CathG pointed out, the \\K operator can also be used with Perl-like expressions):

sub("^([^,]+,){5}[^,]+\\K,", ";", vec1, perl=TRUE)

From the PCRE documentation:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence.

However, neither expression will "normalize" any further commas in the comment: only the 6th comma is replaced.
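If a comment may contain more than one comma, one option is to skip the regex, split normally, and then glue everything from the 6th field onward back together, which is the "combine 6 and 7" idea from the question. The sketch below is one way to do that and rests on a few assumptions. The helper name `split_keep_tail` and its argument `n` are hypothetical. The first sample line is the one from the question with an extra ", maybe" appended purely for illustration. Every line is assumed to have at least six fields.

```r
# Hypothetical helper: split on commas, but keep at most n fields by
# re-joining everything from field n onward (its commas are preserved).
split_keep_tail <- function(x, n = 6) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  if (length(parts) > n) {
    tail_part <- paste(parts[n:length(parts)], collapse = ",")
    parts <- c(parts[seq_len(n - 1)], tail_part)
  }
  parts
}

lines <- c("484,1213,0,62.0006,1,go -- late F1 max, but glide?, maybe",
           "467,1387,0,62.0026,1,goes2")
df <- as.data.frame(do.call(rbind, lapply(lines, split_keep_tail)),
                    stringsAsFactors = FALSE)
```

Note that strsplit drops a trailing empty field, so a comment ending in a comma would lose that final comma with this approach.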
