简体   繁体   中英

Text Mining R Package & Regex to handle Replace Smart Curly Quotes

I've got a bunch of texts like this below with different smart quotes - for single and double quotes. All I could end up with the packages I'm aware of is to remove those characters but I want them to replaced with the normal quotes.

textclean::replace_non_ascii("You don‘t get “your” money’s worth")

Received Output: "You dont get your moneys worth"

Expected Output: "You don't get "your" money's worth"

Also would appreciate if someone's got the regex to replace every such quotes in one shot.

Thanks!

Use two gsub operations: 1) to replace double curly quotes, 2) to replace single quotes:

> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"

See the online R demo . Tested in both Linux and Windows, and works the same.

The [“”] construct is a positive character class that matches any single char defined in the class.

To normalize all chars similar to double quotes, you might want to use

> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»““”„‟≪≫《》〝〞〟\"″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF8"))) 
> cat(res, sep="\n")
You don't get "your" money's worth

Here, [«»““”„‟≪≫《》〝〞〟"″‶] matches

«   00AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
»   00BB  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
“   05F4  HEBREW PUNCTUATION GERSHAYIM
“   201C  LEFT DOUBLE QUOTATION MARK
”   201D  RIGHT DOUBLE QUOTATION MARK
„   201E  DOUBLE LOW-9 QUOTATION MARK
‟   201F  DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪  226A  MUCH LESS-THAN
≫  226B  MUCH GREATER-THAN
《  300A  LEFT DOUBLE ANGLE BRACKET
》  300B  RIGHT DOUBLE ANGLE BRACKET
〝  301D  REVERSED DOUBLE PRIME QUOTATION MARK
〞  301E  DOUBLE PRIME QUOTATION MARK
〟  301F  LOW DOUBLE PRIME QUOTATION MARK
"  FF02  FULLWIDTH QUOTATION MARK
″   2033  DOUBLE PRIME
‶   2036  REVERSED DOUBLE PRIME

The [ʻʼʽ٬''‚‛՚︐] is used to normalize some chars similar to single quotes:

ʻ  02BB  MODIFIER LETTER TURNED COMMA
ʼ  02BC  MODIFIER LETTER APOSTROPHE
ʽ  02BD  MODIFIER LETTER REVERSED COMMA
٬  066C  ARABIC THOUSANDS SEPARATOR
‘  2018  LEFT SINGLE QUOTATION MARK
’  2019  RIGHT SINGLE QUOTATION MARK
‚  201A  SINGLE LOW-9 QUOTATION MARK
‛  201B  SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚   055A  ARMENIAN APOSTROPHE
︐  FE10  PRESENTATION FORM FOR VERTICAL COMMA

There's a function in {proustr} to normalize punctuation, called pr_normalize_punc() :

https://github.com/ColinFay/proustr#pr_normalize_punc

It turns :

 => ″‶«  »“”`´„“ into "
 => ՚ ’ into ' 
 => … into ...

For example :

library(proustr)
a <- data.frame(text = "Il l՚a dit : « La ponctuation est chelou » !")
pr_normalize_punc(a, text)
# A tibble: 1 x 1
                                            text
*                                          <chr>
1 "Il l'a dit : \"La ponctuation est chelou\" !"

For your text :

pr_normalize_punc(data.frame( text = "You don‘t get “your” money’s worth"), text)
# A tibble: 1 x 1
                                    text
*                                  <chr>
1 "You don‘t get \"your\" money's worth"

在此输入图像描述

We can use gsub here for a base R option. Replace each curly quoted term at a time.

text <- "You don‘t get “your” money’s worth"
new_text <- gsub("“(.*?)”", "\"\\1\"", text)
new_text <- gsub("’", "'", new_text)
new_text
[1] "You don‘t get \"your\" money's worth"

I have assumed here that your curly quotes are always balanced, ie they always wrap a word. If not, then you might have to do more work.

Doing a blanket replacement of opening/closing double curly quotes may not play out as intended, if you want them to remain as is when not quoting a word.

Demo

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM