I'm working with HTML text data in R. A snippet of the data looks like this:
text <- "<<p channel=\"test.com\" class=\"wordpress\"> , , LONDON — British
supporters of the Black Lives Matter movement stormed the runway of London
City Airport Tuesday, forcing a halt to flights in one of the boldest acts of
protest by the group as it spreads beyond U.S. borders. , , ,"
I want to remove the stray commas but preserve the commas that occur as intended (i.e., "Airport Tuesday, forcing a..."). The stray commas usually appear with spaces between them (sometimes one, sometimes more).
I can only seem to chip away at a few commas at a time with this:
gsub(", +", "", text)
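To illustrate why this falls short, here is a sketch on a made-up, shortened string (not the full data above): ", +" matches every comma followed by spaces, including the legitimate one, so too much is removed.

```r
# Shortened illustrative sample, not the full data
x <- "Tuesday, forcing a halt , , to flights"

# ", +" matches ANY comma followed by spaces, so the legitimate
# comma after "Tuesday" is deleted along with the stray ones
gsub(", +", "", x)
# "Tuesdayforcing a halt to flights"
```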
Thanks for your suggestions.
You can use
gsub(",(?:\\s+,)+", ",", text)
Details:

- , - a comma
- (?:\s+,)+ - one or more occurrences of one or more whitespace characters followed by a comma
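A quick check on a shortened sample (this string is illustrative, not the full data from the question):

```r
# Stray ", ," runs plus one legitimate comma after "Tuesday"
x <- "LONDON , , British supporters stormed the runway Tuesday, forcing a halt. , , ,"

# Each run of space-separated commas collapses into a single comma;
# the lone comma after "Tuesday" is untouched because the pattern
# requires at least one more comma after whitespace
gsub(",(?:\\s+,)+", ",", x)
# "LONDON , British supporters stormed the runway Tuesday, forcing a halt. ,"
```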
If the commas can also appear with no spaces between them, use \s* instead of \s+:
gsub(",(?:\\s*,)+", ",", text)
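For instance, on a short made-up string where some commas are adjacent:

```r
# ",," with no space between, plus a spaced stray comma
x <- "flights halted,, , on Tuesday"

# \s* allows zero spaces between commas, so the whole run collapses
gsub(",(?:\\s*,)+", ",", x)
# "flights halted, on Tuesday"
```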
To also remove all whitespace before the first comma, add \s* at the start:
gsub("\\s*,(?:\\s*,)+", ",", text)
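Sketched on a small sample, this also eats the space left hanging before the comma run:

```r
# Note the space before the first stray comma
x <- "flights halted , , on Tuesday"

# Leading \s* consumes the whitespace before the run as well
gsub("\\s*,(?:\\s*,)+", ",", x)
# "flights halted, on Tuesday"
```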
And to remove all commas at the start and end of string, and "shrink" those inside, you can use
gsub("^\\s*,(?:\\s*,)+\\s*|\\s*,(?:\\s*,)+\\s*$|\\s*(,)(?:\\s*,)+", "\\1", text)
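A minimal sketch of the combined pattern on a made-up string with a leading comma run (the three alternatives handle the start of string, the end of string, and interior runs, in that order):

```r
x <- ", , flights halted , , on Tuesday"

# Leading run is deleted outright (group 1 is empty there);
# interior runs shrink to the single captured comma
gsub("^\\s*,(?:\\s*,)+\\s*|\\s*,(?:\\s*,)+\\s*$|\\s*(,)(?:\\s*,)+", "\\1", x)
# "flights halted, on Tuesday"
```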