
Add a new column to a dataframe that will indicate if another column contains a word (PySpark)

I have a dataframe and I want to add a column that will indicate whether the word "yes" is in that row's text column (1 if the word is in that row, 0 if not). I need to put 1 in check only if "yes" appears as a word and not as a substring, or if "yes" is next to a punctuation mark (example: yes?). How can I do that in Spark? For example:

id  group  text
1   a       hey there
2   c       no you can
3   a       yes yes yes
4   b       yes or no
5   b       you need to say yes.
6   a       yes you can
7   d       yes!
8   c       no&
9   b       ok

The result of that will be:

id  group  text                  check
1   a       hey there             0
2   c       no you can            0
3   a       yes yes yes           1
4   b       yes or no             1
5   b       you need to say yes.  1
6   a       yes you can           1
7   d       yes!                  1
8   c       no&                   0
9   b       ok                    0
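For reference, here is a minimal sketch (not part of the original question) that builds this sample DataFrame; the column names id, group and text are taken from the tables above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question's table
data = [
    (1, "a", "hey there"),
    (2, "c", "no you can"),
    (3, "a", "yes yes yes"),
    (4, "b", "yes or no"),
    (5, "b", "you need to say yes."),
    (6, "a", "yes you can"),
    (7, "d", "yes!"),
    (8, "c", "no&"),
    (9, "b", "ok"),
]
df = spark.createDataFrame(data, ["id", "group", "text"])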

You can check with rlike and cast to Integer:

import pyspark.sql.functions as F
df.withColumn("check",F.col("text").rlike("yes").cast("Integer")).show()

+---+-----+--------------------+-----+
| id|group|                text|check|
+---+-----+--------------------+-----+
|  1|    a|           hey there|    0|
|  2|    c|          no you can|    0|
|  3|    a|         yes yes yes|    1|
|  4|    b|           yes or no|    1|
|  5|    b|you need to say yes.|    1|
|  6|    a|         yes you can|    1|
|  7|    d|                yes!|    1|
|  8|    c|                 no&|    0|
|  9|    b|                  ok|    0|
+---+-----+--------------------+-----+

For the edited question (flagging "yes" only when it appears as a whole word, possibly next to punctuation), you can try with higher order functions:

import string
import re

# Build a regex pattern that matches any single punctuation character
pat = '|'.join([re.escape(i) for i in string.punctuation])

(df.withColumn("text1", F.regexp_replace(F.col("text"), pat, ""))    # strip punctuation
   .withColumn("Split", F.split("text1", " "))                       # tokenize on spaces
   .withColumn("check",
       F.expr('''exists(Split, x -> x = "yes")''').cast("Integer"))  # 1 if any token equals "yes"
   .drop("Split", "text1")).show()

+---+-----+--------------------+-----+
| id|group|                text|check|
+---+-----+--------------------+-----+
|  1|    a|           hey there|    0|
|  2|    c|          no you can|    0|
|  3|    a|         yes yes yes|    1|
|  4|    b|           yes or no|    1|
|  5|    b|you need to say yes.|    1|
|  6|    a|         yes you can|    1|
|  7|    d|                yes!|    1|
|  8|    c|                 no&|    0|
|  9|    b|               okyes|    0|
+---+-----+--------------------+-----+
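As an alternative, here is a sketch that reuses the pat pattern above but replaces the lambda with array_contains; this is an illustration, not part of the original answer:

import pyspark.sql.functions as F

# Strip punctuation, split into tokens, then check whether any token is exactly "yes"
df.withColumn(
    "check",
    F.array_contains(
        F.split(F.regexp_replace(F.col("text"), pat, ""), " "),
        "yes"
    ).cast("Integer")
).show()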

I need to put 1 in check only if "yes" appears as a word and not as a substring.

You could address this by matching text against a regex that uses word boundaries ( \b ). This is a handy regex feature that matches the position between a word character and a non-word character (a space, a punctuation mark, and so on).

In SQL, you would do:

select
    t.*,
    -- depending on the SQL parser's string-escaping rules, the backslashes may need doubling: '\\byes\\b'
    case when text rlike '\byes\b' then 1 else 0 end as check
from mytable t
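A PySpark equivalent of this word-boundary approach could look like the sketch below (an illustration, assuming the df built earlier; in Python the regex is passed directly to rlike, so a raw string is enough):

import pyspark.sql.functions as F

# 1 when "yes" occurs as a whole word (bounded by \b), 0 otherwise
df.withColumn(
    "check",
    F.col("text").rlike(r"\byes\b").cast("Integer")
).show()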
