
Regular expression to match stray commas in R

I'm working with HTML text data in R. A snippet of the data looks like this:

text <- "<<p channel=\"test.com\" class=\"wordpress\">  ,  , LONDON — British 
supporters of the Black Lives Matter movement stormed the runway of London 
City Airport Tuesday, forcing a halt to flights in one of the boldest acts of 
protest by the group as it spreads beyond U.S. borders. ,  , ,"

I want to remove the stray commas but preserve the commas that occur as intended (e.g., "Airport Tuesday, forcing a..."). The stray commas usually appear with spaces (sometimes one, sometimes more) between them.

I can only seem to chip away at a few commas at a time with this:

gsub(",  +", "", text)

Thanks for your suggestions

You can use

gsub(",(?:\\s+,)+", ",", text)

See the R demo.

Details:

  • , - a comma
  • (?:\\s+,)+ - one or more repetitions of: one or more whitespace characters followed by a comma.
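As a rough illustration of the breakdown above, here is a minimal sketch on a short made-up string (the variable `x` and its contents are hypothetical, not the question's data). It shows that runs of whitespace-separated commas collapse to one comma, while a comma followed directly by text is not matched.

```r
# Hypothetical sample string, chosen only to illustrate the pattern.
x <- "a ,  , b, c ,  , ,"

# Each run of commas separated by whitespace collapses to a single comma.
# The comma after "b" is followed by text, so the pattern does not match it.
result <- gsub(",(?:\\s+,)+", ",", x)
print(result)
# [1] "a , b, c ,"
```

Note that one comma is kept for every run, including the run at the end of the string.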

If the commas may also appear with no spaces between them, use \s* instead of \s+:

gsub(",(?:\\s*,)+", ",", text)
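To make the difference between the two quantifiers concrete, the following sketch runs both patterns on a made-up string (the names `x`, `with_plus` and `with_star` are hypothetical) that contains adjacent commas with no space between them.

```r
# Hypothetical sample string with directly adjacent commas (",,").
x <- "a,, b ,, , c, d"

# \\s+ requires at least one whitespace character between commas,
# so ",," is not treated as a run.
with_plus <- gsub(",(?:\\s+,)+", ",", x)
print(with_plus)
# [1] "a,, b ,, c, d"

# \\s* also accepts zero whitespace, so ",," and ",, ," are both collapsed.
with_star <- gsub(",(?:\\s*,)+", ",", x)
print(with_star)
# [1] "a, b , c, d"
```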

To also remove any whitespace before the first comma of a run, add \s* at the start:

gsub("\\s*,(?:\\s*,)+", ",", text)
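A small sketch of what the leading \s* changes, again on a made-up string (`x`, `without_lead` and `with_lead` are hypothetical names): without it, the space in front of the run survives; with it, that space is consumed as part of the match.

```r
# Hypothetical sample string: a run of stray commas preceded by a space.
x <- "end of sentence. ,  , , next"

# Without the leading \\s*, the space before the run is left in place.
without_lead <- gsub(",(?:\\s*,)+", ",", x)
print(without_lead)
# [1] "end of sentence. , next"

# With the leading \\s*, the space before the run is removed as well.
with_lead <- gsub("\\s*,(?:\\s*,)+", ",", x)
print(with_lead)
# [1] "end of sentence., next"
```

Because the pattern still requires at least two commas, an ordinary single comma with a space in front of it is not affected.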

To remove runs of commas at the start and end of the string entirely, and "shrink" those inside to a single comma, you can use

gsub("^\\s*,(?:\\s*,)+\\s*|\\s*,(?:\\s*,)+\\s*$|\\s*(,)(?:\\s*,)+", "\\1", text)
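The sketch below applies this combined pattern to a shortened, made-up version of the text (the names `x`, `pattern` and `result` are hypothetical). One assumption to flag: `perl = TRUE` is added here, which the call above does not use, so that the alternatives are tried in order by the PCRE engine and the unset group `\\1` is replaced by an empty string for the first two alternatives.

```r
# Hypothetical sample string with comma runs at the start, middle and end.
x <- " ,  , LONDON Tuesday, forcing a halt ,  , to flights. ,  , ,"

# Alternative 1: a run at the start of the string (removed entirely).
# Alternative 2: a run at the end of the string (removed entirely).
# Alternative 3: a run elsewhere, with the first comma captured in group 1.
pattern <- "^\\s*,(?:\\s*,)+\\s*|\\s*,(?:\\s*,)+\\s*$|\\s*(,)(?:\\s*,)+"

# perl = TRUE is an assumption of this sketch, not part of the original call.
result <- gsub(pattern, "\\1", x, perl = TRUE)
print(result)
# [1] "LONDON Tuesday, forcing a halt, to flights."
```

Each alternative requires at least two commas, so a single stray comma at the start or end of a string would be left in place by this pattern.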
