Extract emojis from Dataframe column and add them into a different Column of the same Dataframe Scala Spark

I have the following dataframe:

+----------------------------------
|______value______________________|
| I am going to school 😀        |
| why are you crying 🙁 😞       |
| You are not very good my friend |

I want to extract the emojis from each row and insert them into a new column of the same dataframe, as shown below:

+-------------------------------------------------
|______value______________________|______emoji___|
| I am going to school 😀        |      😀      |
| why are you crying 🙁 😞       |    🙁 😞    |
--------------------------------------------------

I have the following code to filter the sentences in the value column that contain smileys:

kafkaTopicDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")

But I am not sure how to add a new column containing the smileys for the respective rows using Spark Scala.

Edit 2

To make the emoji column contain an array of the distinct emojis, I wrote the following code:

df.filter(
      regexp_extract(col("value"), raw"(\p{block=Emoticons})", 1) =!= ""
    ).withColumn(
      "emoji", array(regexp_replace(
        col("value"),raw"([^\p{block=Emoticons}|\p{block=Miscellaneous Symbols and Pictographs}|\uD83E\uDD00-\uD83E\uDDFF])",
        ""
      ))

    )

Actual output

+-------------------------------------------------
|______value______________________|______emoji___|
| I am going to school 😀😀      |    [😀😀]   |
| why are you crying 🙁 😞       |    [🙁😞]   |
--------------------------------------------------

Expected output

+-------------------------------------------------
|______value______________________|______emoji___|
| I am going to school 😀 😀     |    [😀]      |
| why are you crying 🙁 😞       |    [🙁,😞]   |
--------------------------------------------------

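You can replace the non-emoji characters with an empty string. Note the ^ at the start of the character class in the regex pattern: it negates the class, so the pattern matches every character that is not one of those specified.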

val df2 = df.filter(
    regexp_extract($"value", raw"(\p{block=Emoticons})", 1) =!= ""
).withColumn(
    "emoji", 
    regexp_replace(
        col("value"), 
        raw"([^\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\uD83E\uDD00-\uD83E\uDDFF])", 
        ""
    )
)

df2.show(false)
+-------------------------------+-----+
|value                          |emoji|
+-------------------------------+-----+
|I am going to school 😀        |😀   |
|why are you crying 🙁 😞       |🙁😞 |
+-------------------------------+-----+
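
The negated character class can also be checked outside Spark, since Spark's regexp_replace and Java's String.replaceAll both use Java regular expressions. The following is a minimal sketch in plain Scala; the object and method names (EmojiRegexDemo, keepEmojis) are hypothetical, and the range is written as \x{1F900}-\x{1F9FF}, which denotes the same code points as \uD83E\uDD00-\uD83E\uDDFF.

```scala
object EmojiRegexDemo {
  // Negated character class: matches any character that is NOT in one of the
  // listed emoji blocks. \x{1F900}-\x{1F9FF} is the same code point range as
  // the surrogate-pair form \uD83E\uDD00-\uD83E\uDDFF used in the Spark code.
  val nonEmoji: String =
    """[^\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]"""

  // Deletes every non-emoji character, leaving only the emojis (concatenated).
  def keepEmojis(text: String): String = text.replaceAll(nonEmoji, "")
}
```

For example, EmojiRegexDemo.keepEmojis("why are you crying 🙁 😞") returns "🙁😞", and a sentence without emojis comes back as an empty string.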

Edit:

val df2 = df.filter(
    regexp_extract(col("value"), raw"(\p{block=Emoticons})", 1) =!= ""
).withColumn(
    "emoji", 
    // Step 1: delete every non-emoji character
    regexp_replace(
        col("value"),
        raw"([^\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\uD83E\uDD00-\uD83E\uDDFF])",
        ""
    )
).withColumn(
    "emoji", 
    // Step 2: append a space after each emoji
    regexp_replace(
        col("emoji"),
        raw"([\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\uD83E\uDD00-\uD83E\uDDFF])", 
        "$1 "
    )
).withColumn(
    "emoji", 
    // Step 3: trim the trailing space and split on spaces to get an array
    split(trim(col("emoji")), " ")
)

df2.show(false)
+------------------------+--------+
|value                   |emoji   |
+------------------------+--------+
|I am going to school 😀 |[😀]    |
|why are you crying 🙁 😞|[🙁, 😞]|
+------------------------+--------+
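
The question's expected output lists each emoji only once per row ([😀] for a sentence containing 😀 twice), which the split above does not do by itself. One possible way to get there, as a sketch assuming Spark 2.4 or later (where array_distinct is available), is to wrap the split in array_distinct. The object and method names below (DistinctEmojis, withDistinctEmojis) are hypothetical, and unlike the code above this version keeps any row that contains at least one character from the three emoji ranges, not only rows matching the Emoticons block.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array_distinct, col, regexp_replace, split, trim}

object DistinctEmojis {
  // Body of the character class shared by both patterns.
  // \x{1F900}-\x{1F9FF} is the same range as \uD83E\uDD00-\uD83E\uDDFF.
  private val emojiClass =
    """\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}"""

  // Adds an "emoji" column holding the distinct emojis found in "value".
  def withDistinctEmojis(df: DataFrame): DataFrame = {
    // Delete every non-emoji character.
    val onlyEmojis = regexp_replace(col("value"), s"[^$emojiClass]", "")
    // Append a space after each emoji so the string can be split.
    val spaced = regexp_replace(onlyEmojis, s"([$emojiClass])", "$1 ")

    df.filter(onlyEmojis =!= "") // drop rows without any emoji
      .withColumn("emoji", array_distinct(split(trim(spaced), " ")))
  }
}
```

Calling DistinctEmojis.withDistinctEmojis(df).show(false) on the sample data should then give [😀] for a row containing 😀 twice and [🙁, 😞] for the second row, because array_distinct removes duplicates while keeping the order of first occurrence.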
