Add a new column to a DataFrame that indicates whether another column contains a word (PySpark)
I have a DataFrame and I want to add a column that indicates whether the word "yes" appears in each row's text column (1 if the word is in that row, 0 if not). The check column should be 1 only if "yes" appears as a whole word and not as a substring; "yes" next to a punctuation mark (for example, "yes?") should still count. How can I do that in Spark? For example:
id group text
1 a hey there
2 c no you can
3 a yes yes yes
4 b yes or no
5 b you need to say yes.
6 a yes you can
7 d yes!
8 c no&
9 b ok
The result should be:
id group text check
1 a hey there 0
2 c no you can 0
3 a yes yes yes 1
4 b yes or no 1
5 b you need to say yes. 1
6 a yes you can 1
7 d yes! 1
8 c no& 0
9 b ok 0
You can check with rlike and cast the result to Integer:
import pyspark.sql.functions as F
df.withColumn("check",F.col("text").rlike("yes").cast("Integer")).show()
+---+-----+--------------------+-----+
| id|group| text|check|
+---+-----+--------------------+-----+
| 1| a| hey there| 0|
| 2| c| no you can| 0|
| 3| a| yes yes yes| 1|
| 4| b| yes or no| 1|
| 5| b|you need to say yes.| 1|
| 6| a| yes you can| 1|
| 7| d| yes!| 1|
| 8| c| no&| 0|
| 9| b| ok| 0|
+---+-----+--------------------+-----+
For the edited question, you can try higher-order functions:
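import re
import string
import pyspark.sql.functions as F

# Regex alternation of every escaped ASCII punctuation character
pat = '|'.join(re.escape(ch) for ch in string.punctuation)

(df.withColumn("text1", F.regexp_replace(F.col("text"), pat, ""))  # strip punctuation
   .withColumn("Split", F.split("text1", " "))  # split the cleaned text on spaces
   # 1 if any token equals "yes" exactly, else 0
   .withColumn("check",
               F.expr('''exists(Split, x -> x = "yes")''').cast("Integer"))
   .drop("Split", "text1")).show()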
+---+-----+--------------------+-----+
| id|group| text|check|
+---+-----+--------------------+-----+
| 1| a| hey there| 0|
| 2| c| no you can| 0|
| 3| a| yes yes yes| 1|
| 4| b| yes or no| 1|
| 5| b|you need to say yes.| 1|
| 6| a| yes you can| 1|
| 7| d| yes!| 1|
| 8| c| no&| 0|
| 9| b| okyes| 0|
+---+-----+--------------------+-----+
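To make the logic of the strip, split, and exists steps easier to follow, here is a minimal plain-Python sketch of the same idea, with no Spark involved. The helper name `has_word` and the sample list are made up for illustration and are not part of the answer above.

```python
import re
import string

# Same idea as `pat` above: an alternation of every escaped ASCII punctuation character.
pat = "|".join(re.escape(ch) for ch in string.punctuation)

def has_word(text, word="yes"):
    """Return 1 if `word` is one of the space-separated tokens of `text`, else 0.

    Hypothetical helper, for illustration only.
    """
    stripped = re.sub(pat, "", text)   # mirrors regexp_replace: drop punctuation
    tokens = stripped.split(" ")       # mirrors split: break on single spaces
    return int(any(tok == word for tok in tokens))  # mirrors exists(...)

rows = ["hey there", "yes yes yes", "you need to say yes.", "yes!", "no&", "okyes"]
checks = [has_word(t) for t in rows]
print(checks)
```

One consequence of stripping punctuation before splitting is that hyphenated text such as "yes-no" collapses into the single token "yesno", so it would not count as a match under this approach.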
I need to put 1 in check only if "yes" appears as a word and not as a substring.
You could address this by matching text against a regex that uses word boundaries (\b). This is a handy regex feature that matches the position between a word character and a non-word character (a space, a punctuation mark, the start or end of the string, and so on).
In SQL, you would do:
select
    t.*,
    case when text rlike '\byes\b' then 1 else 0 end as check
from mytable t
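As a quick sanity check of the word-boundary pattern, the following plain-Python sketch runs it over the sample rows from the question, plus the substring case "okyes". It uses Python's `re` module rather than Spark; the names `WORD_YES`, `check`, and `samples` are hypothetical and chosen only for this illustration.

```python
import re

# \b matches the position between a word character and a non-word character.
WORD_YES = re.compile(r"\byes\b")

def check(text):
    """Return 1 if "yes" occurs as a whole word in `text`, else 0."""
    return int(WORD_YES.search(text) is not None)

samples = [
    "hey there", "no you can", "yes yes yes", "yes or no",
    "you need to say yes.", "yes you can", "yes!", "no&", "ok", "okyes",
]
results = {text: check(text) for text in samples}
for text, flag in results.items():
    print(f"{text!r}: {flag}")
```

In PySpark, the same pattern could presumably be passed to rlike as in the first answer, for example F.col("text").rlike(r"\byes\b").cast("Integer"). When writing the pattern inside a SQL string literal instead, the backslashes may need to be doubled ('\\byes\\b'), depending on how the SQL dialect handles escape sequences.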