How to verify an array contains another array
I have a pyspark dataframe that contains 4 columns.
Example Dataframe:
id                  | name                      | age | job
--------------------|---------------------------|-----|----
["98475","748574"]  | ["98475","748574"]        |     |
["75473","98456"]   | ["98456"]                 |     |
["23456","28596"]   | ["84758","56849","86954"] |     |
I want to compare 2 columns (array<string> type):
Example:
Array_A (id) | Array_B(name)
------------------------------
if all the values in Array_B match the values in Array_A ==> ok
if all the values in Array_B are in Array_A ==> medium
if none of the values of Array_B exist in Array_A ==> not found
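In plain Python terms, the three rules above can be sketched with sets (a minimal illustration, independent of Spark; the function name `classify` and the fourth "partial" label are my own, not from the question):

```python
def classify(id_vals, name_vals):
    """Classify how name_vals relates to id_vals, using set semantics."""
    sa, sb = set(id_vals), set(name_vals)
    if sa == sb:
        return "ok"           # every value matches
    if sb <= sa:
        return "medium"       # all of name_vals appear in id_vals
    if sa.isdisjoint(sb):
        return "not found"    # no overlap at all
    return "partial"          # hypothetical fourth case: only some overlap

print(classify(["98475", "748574"], ["98475", "748574"]))         # ok
print(classify(["75473", "98456"], ["98456"]))                    # medium
print(classify(["23456", "28596"], ["84758", "56849", "86954"]))  # not found
```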
I wrote a UDF:
def contains(x, y):
    z = len(set(x) - set(y))
    if ((z == 0) & (set(x) == set(y))):
        return "ok"
    elif (set(y).isin(set(x))) & (z != 0):
        return "medium"
    else set(y) != set(x):
        return "not found in raw"
contains_udf = udf(contains)
Then:
new_df = df.withColumn(
    "new_column",
    F.when(
        (df.id.isNotNull() & df.name.isNotNull()),
        contains_udf(df.id, df.name)
    ).otherwise(
        F.lit(None)
    )
)
I got this error:
else set(y) != set(x):
^
SyntaxError: invalid syntax
How can I resolve it using a udf, or with another solution such as array_contains? Thank you
else set(y) != set(x):
^
SyntaxError: invalid syntax
This is because an else statement doesn't require a condition. It contains code to be executed only if none of the previous conditions are met. Use instead:
elif set(y) != set(x):
    #code
OR
else:
    #code
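A tiny illustration of the difference (the values here are made up, not from the question):

```python
x, y = {"a", "b"}, {"a"}

# elif form: the branch carries its own condition
if y == x:
    result_elif = "equal"
elif y != x:
    result_elif = "different"

# else form: no condition; runs whenever nothing above matched
if y == x:
    result_else = "equal"
else:
    result_else = "different"

print(result_elif, result_else)  # different different
```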
As @Buckeye14Guy and @Sid pointed out the main problems in your code. You might also need to clean up some of the logic:
from pyspark.sql.functions import udf

def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sy) == 0:
            return 'list is empty'
        elif sx == sy:
            return "ok"
        elif sy.issubset(sx):
            return "medium"
        # below, none of sy is in sx
        elif sx - sy == sx:
            return "none found in raw"  # including empty x
        else:
            return "some missing in raw"
    # in exception, for example `x` or `y` is None (not a list)
    except:
        return "not an iterable or other errors"

udf_contains = udf(contains, 'string')

df.withColumn('new_column', udf_contains('id', 'name')).show(truncate=False)
+---------------+---------------------+-----------------+
|id |name |new_column |
+---------------+---------------------+-----------------+
|[98475, 748574]|[98475, 748574] |ok |
|[75473, 98456] |[98456] |medium |
|[23456, 28596] |[84758, 56849, 86954]|none found in raw|
+---------------+---------------------+-----------------+