Extracting substrings from a string in R when special characters are encountered

Question

I am trying to extract the base name (the U+... code) of every emoji in a string. I currently have a data frame with a column of Instagram messages. (For ethical reasons I can't post the real data here, so I will post a self-generated example instead.)

I want to extract all emojis from the message strings.

So far I have been successful in using gsub to extract a single emoji from a single piece of text. For example:

    gsub(".*[<]([^.]+)[>].*", "\\1", "I know <U+0001F621><U+0001F923>")

This gives me only the last emoji:

    [1] "U+0001F923"

However, I'd like it to catch all emojis in the string, like this:

    [1] "U+0001F923"  [2] "U+0001F621"

Furthermore, I have tried to use this gsub code to extract the data from a 2-column data frame. (Below is a snippet from a much larger data frame.)

df:

    name                     value
    <chr>                    <chr>
    Participant1             instahandle1   
    Participant2             instahandle2   
    conversation.sender      instahandle2   
    conversation.created_at  2019-03-24T19:08:25.632223+00:00   
    conversation.text        I know <U+0001F923><U+0001F923>x   
    conversation.sender      instahandle1   
    conversation.created_at  2019-03-24T19:04:01.042261+00:00   
    conversation.text        Me too! it was cool    
    conversation.sender      instahandle2   
    conversation.created_at  2019-03-24T19:03:42.065983+00:00

    gsub(".*[<]([^.]+)[>].*", "\\1", df$value)

However, this just returns every value unchanged:

    [1] "instahandle1"
    [2] "instahandle2"
    [3] "instahandle2"
    [4] "2019-03-24T19:08:25.632223+00:00"
    [5] "I know \U0001f923\U0001f923x"
    [6] "instahandle1"
    [7] "2019-03-24T19:04:01.042261+00:00"
    [8] "Me too! it was cool"
    [9] "instahandle2"
    [10] "2019-03-24T19:03:42.065983+00:00"

I would like it to extract every emoji and nothing else, like this:

     [1] "U+0001F923"  [2] "U+0001F621"

Answer 1

You may use

    x <- "I know \U0001F621\U0001F923s"
    regmatches(x, gregexpr("[^[:ascii:]]+", x, perl=TRUE))
    ## => [[1]]
    ##    [1] "😡🤣"

This code extracts all chunks of consecutive non-ASCII characters from the input. See the online R demo.

Since the [:ascii:] character class is not POSIX compliant, perl=TRUE is required.
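The same call also accepts a whole character vector, which is closer to the data frame column in the question. The following is a minimal sketch, assuming the column really holds emoji characters (as the `\U0001f923` in the printed output suggests); the vector `value` is a made-up stand-in for `df$value`, not data from the question.

```r
# Hypothetical stand-in for df$value (not the asker's real data)
value <- c("instahandle1",
           "I know \U0001F923\U0001F923x",
           "Me too! it was cool")

# One list element per input string; strings without a match yield character(0)
chunks <- regmatches(value, gregexpr("[^[:ascii:]]+", value, perl = TRUE))

# Flatten to a plain character vector holding only the non-ASCII chunks
emoji_chunks <- unlist(chunks)
print(lengths(chunks))
print(emoji_chunks)
```

Note that this pattern keeps every non-ASCII character, so accented letters or CJK text in a message would be returned as well.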

If you want to extract only the emojis, each as a separate match, use

    emoji_rx <- "[\\x{1f300}-\\x{1f5ff}\\x{1f900}-\\x{1f9ff}\\x{1f600}-\\x{1f64f}\\x{1f680}-\\x{1f6ff}\\x{2600}-\\x{26ff}\\x{2700}-\\x{27bf}\\x{1f1e6}-\\x{1f1ff}\\x{1f191}-\\x{1f251}\\x{1f004}\\x{1f0cf}\\x{1f170}-\\x{1f171}\\x{1f17e}-\\x{1f17f}\\x{1f18e}\\x{3030}\\x{2b50}\\x{2b55}\\x{2934}-\\x{2935}\\x{2b05}-\\x{2b07}\\x{2b1b}-\\x{2b1c}\\x{3297}\\x{3299}\\x{303d}\\x{00a9}\\x{00ae}\\x{2122}\\x{23f3}\\x{24c2}\\x{23e9}-\\x{23ef}\\x{25b6}\\x{23f8}-\\x{23fa}]"
    x <- "I know \U0001F621\U0001F923s"
    regmatches(x, gregexpr(emoji_rx, x, perl=TRUE))
    ## => [[1]]
    ##    [1] "😡" "🤣"
    ## Or, to get them as single chunks, append a + quantifier to the same class
    emoji_rx_chunks <- paste0(emoji_rx, "+")
    regmatches(x, gregexpr(emoji_rx_chunks, x, perl=TRUE))
    ## => [[1]]
    ##    [1] "😡🤣"

See this online R demo.
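The question asks for the U+... names rather than the emoji characters themselves. One possible way to get there, shown as a sketch and not as part of the original answer, is to match one non-ASCII character at a time and convert each match with base R's `utf8ToInt()` and `sprintf()`. The variable names `emoji` and `codes` are illustrative.

```r
x <- "I know \U0001F621\U0001F923s"

# Match single non-ASCII characters (no + quantifier), so each emoji is separate
emoji <- regmatches(x, gregexpr("[^[:ascii:]]", x, perl = TRUE))[[1]]

# Convert each character to its code point and format it as U+XXXXXXXX
codes <- sprintf("U+%08X",
                 vapply(emoji, utf8ToInt, integer(1), USE.NAMES = FALSE))
print(codes)
```

This assumes every emoji is a single code point. Emojis built from several code points (for example skin-tone or joined sequences) would come out as several separate U+ values.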

Answer 2

EDIT: It turns out you have to escape the backslashes as well:

    <(U\\\\+\\\\S*?)> /g

Try it here; it works.

This captures all the emojis as expected. An emoji is assumed to be enclosed in angle brackets and to begin with U+.

Demo
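In R itself there is no `/g` flag; a rough equivalent, sketched here under the assumption that the text really contains literal `<U+...>` tokens (as in the question's first example) rather than actual emoji characters, is to collect every match with `gregexpr()` and then strip the brackets. The variable names are illustrative.

```r
x <- "I know <U+0001F621><U+0001F923>"

# Find every <U+...> token; \\S*? is lazy, so each match stops at the first >
tokens <- regmatches(x, gregexpr("<U\\+\\S*?>", x, perl = TRUE))[[1]]

# Drop the surrounding angle brackets to keep only the U+... part
codes <- gsub("[<>]", "", tokens)
print(codes)
```

If the strings hold real emoji characters, as the data frame output in the question suggests, this pattern finds nothing and a character-based match is needed instead.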

2 answers

Answer 1: score 2, accepted, 2019-07-24 11:55:03

Answer 2: score 0, 2019-07-24 10:31:24