How to verify an array contains another array
I have a pyspark dataframe that contains 4 columns.
Example Dataframe:
id                  | name                      | age | job
--------------------|---------------------------|-----|----
["98475","748574"]  | ["98475","748574"]        |     |
["75473","98456"]   | ["98456"]                 |     |
["23456","28596"]   | ["84758","56849","86954"] |     |
I want to compare 2 columns (array<string> type):
Example:
Array_A (id) | Array_B(name)
------------------------------
if all the values in Array_B match the values in Array_A ==> ok
if all the values in Array_B are in Array_A ==> medium
if none of the values of Array_B exist in Array_A ==> not found
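In plain Python terms, the three rules above can be sketched with sets (a minimal illustration, independent of Spark; the function name `classify` and the fourth "partial" label are my own, not from the question):

```python
def classify(id_vals, name_vals):
    """Classify how name_vals relates to id_vals, using set semantics."""
    sa, sb = set(id_vals), set(name_vals)
    if sa == sb:
        return "ok"           # every value matches
    if sb <= sa:
        return "medium"       # all of name_vals appear in id_vals
    if sa.isdisjoint(sb):
        return "not found"    # no overlap at all
    return "partial"          # hypothetical fourth case: only some overlap

print(classify(["98475", "748574"], ["98475", "748574"]))         # ok
print(classify(["75473", "98456"], ["98456"]))                    # medium
print(classify(["23456", "28596"], ["84758", "56849", "86954"]))  # not found
```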
I wrote a UDF:
def contains(x, y):
    z = len(set(x) - set(y))
    if ((z == 0) & (set(x) == set(y))):
        return "ok"
    elif (set(y).isin(set(x))) & (z != 0):
        return "medium"
    else set(y) != set(x):
        return "not found in raw"
contains_udf = udf(contains)
Then:
new_df = df.withColumn(
    "new_column",
    F.when(
        (df.id.isNotNull() & df.name.isNotNull()),
        contains_udf(df.id, df.name)
    ).otherwise(
        F.lit(None)
    )
)
I got this error:
else set(y) != set(x):
^
SyntaxError: invalid syntax
How can I resolve it using a udf, or with another solution such as array_contains? Thank you
else set(y) != set(x):
^
SyntaxError: invalid syntax
This is because an else statement doesn't require a condition. It contains code to be executed only if none of the previous conditions are met. Use instead:
elif set(y) != set(x):
    #code
OR
else:
    #code
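A tiny illustration of the difference (the values here are made up, not from the question):

```python
x, y = {"a", "b"}, {"a"}

# elif form: the branch carries its own condition
if y == x:
    result_elif = "equal"
elif y != x:
    result_elif = "different"

# else form: no condition; runs whenever nothing above matched
if y == x:
    result_else = "equal"
else:
    result_else = "different"

print(result_elif, result_else)  # different different
```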
As @Buckeye14Guy and @Sid pointed out the main problems in your code. You might also need to clean up some of the logic:
from pyspark.sql.functions import udf

def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sy) == 0:
            return 'list is empty'
        elif sx == sy:
            return "ok"
        elif sy.issubset(sx):
            return "medium"
        # below, none of sy is in sx
        elif sx - sy == sx:
            return "none found in raw"  # including empty x
        else:
            return "some missing in raw"
    # in exception, for example `x` or `y` is None (not a list)
    except:
        return "not an iterable or other errors"

udf_contains = udf(contains, 'string')

df.withColumn('new_column', udf_contains('id', 'name')).show(truncate=False)
+---------------+---------------------+-----------------+
|id |name |new_column |
+---------------+---------------------+-----------------+
|[98475, 748574]|[98475, 748574] |ok |
|[75473, 98456] |[98456] |medium |
|[23456, 28596] |[84758, 56849, 86954]|none found in raw|
+---------------+---------------------+-----------------+