Regular expression to match stray commas in R

Question

I'm working with HTML text data in R. A snippet of the data looks like this:

text <- "<<p channel=\"test.com\" class=\"wordpress\">  ,  , LONDON — British 
supporters of the Black Lives Matter movement stormed the runway of London 
City Airport Tuesday, forcing a halt to flights in one of the boldest acts of 
protest by the group as it spreads beyond U.S. borders. ,  , ,"

I want to remove the stray commas but preserve the commas that occur as intended (i.e., "Airport Tuesday, forcing a..."). The stray commas usually appear with spaces (sometimes one, sometimes more) between them.

I can only seem to chip away at a few commas at a time with this:

gsub(",  +", "", text)

Thanks for your suggestions

Answer 1

You can use

gsub(",(?:\\s+,)+", ",", text)

Details:

, - a comma
(?:\\s+,)+ - a non-capturing group, repeated one or more times, that matches one or more whitespace chars followed by a comma.
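
As a quick check of the breakdown above, here is a minimal sketch on a shortened, hypothetical sample (sample_text and collapsed are illustrative names, not from the question). Note that each run of commas is collapsed to a single comma rather than removed, and commas followed directly by a word are left alone.

```r
# Hypothetical sample, shortened from the question's text
sample_text <- "  ,  , LONDON, Airport Tuesday, forcing a halt. ,  , ,"

# Collapse each run of whitespace-separated commas into a single comma.
# The commas after "LONDON" and "Tuesday" are followed by a word, not by
# whitespace plus another comma, so the pattern does not match them.
collapsed <- gsub(",(?:\\s+,)+", ",", sample_text)

print(collapsed)
# [1] "  , LONDON, Airport Tuesday, forcing a halt. ,"
```

The leading spaces and the space before the final comma remain, because this pattern only starts matching at a comma.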

If the commas may appear with no whitespace between them, use \s* instead of \s+:

gsub(",(?:\\s*,)+", ",", text)

To also remove any whitespace before the first comma of a run, add \s* at the start:

gsub("\\s*,(?:\\s*,)+", ",", text)
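
To see how the three variants differ, here is a small side-by-side sketch. The input string x and the variable names are made up for illustration; x has one run with no space between the commas and one run with spaces.

```r
# Hypothetical input: ",," has no whitespace between commas, ",  ," does
x <- "one ,, two ,  , three, four"

# \s+ : whitespace between the commas is required, so ",," is left as is
with_plus <- gsub(",(?:\\s+,)+", ",", x)

# \s* : whitespace between the commas is optional, so ",," collapses too
with_star <- gsub(",(?:\\s*,)+", ",", x)

# leading \s* : the whitespace before each run is consumed as well
with_lead <- gsub("\\s*,(?:\\s*,)+", ",", x)

print(c(with_plus, with_star, with_lead))
# [1] "one ,, two , three, four"
# [2] "one , two , three, four"
# [3] "one, two, three, four"
```

In all three cases the single comma in "three, four" is untouched, since every pattern needs at least two commas to match.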

And to remove runs of commas at the start and end of the string, and "shrink" those inside to a single comma, you can use

gsub("^\\s*,(?:\\s*,)+\\s*|\\s*,(?:\\s*,)+\\s*$|\\s*(,)(?:\\s*,)+", "\\1", text)
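
A minimal sketch of that last pattern on a hypothetical string (edge_pattern, x, cleaned and single are illustrative names). One assumption to flag: the sketch passes perl = TRUE, which is not in the call above, so that the alternatives are tried strictly left to right and the start/end branches win over the "shrink" branch. When one of the first two branches matches, group 1 is unset and \\1 expands to an empty string.

```r
# The three alternatives: run at start | run at end | run in the middle
edge_pattern <- "^\\s*,(?:\\s*,)+\\s*|\\s*,(?:\\s*,)+\\s*$|\\s*(,)(?:\\s*,)+"

# Hypothetical input with a comma run at the start, in the middle, and at the end
x <- " , , LONDON ,  , Tuesday, halt. ,  , ,"

# perl = TRUE is an assumption added here, not part of the original call
cleaned <- gsub(edge_pattern, "\\1", x, perl = TRUE)
print(cleaned)
# [1] "LONDON, Tuesday, halt."

# Every branch needs at least two commas, so a lone trailing comma stays
single <- gsub(edge_pattern, "\\1", "a, b,", perl = TRUE)
print(single)
# [1] "a, b,"
```

If a single leading or trailing comma should also go, the pattern would need adjusting; that case is not covered by the answer as written.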

solution1
1 ACCEPTED 2021-03-12 18:21:35