简体   繁体   中英

To check if elements in a given list present in array column in DataFrame

I have below function which works on pandas dataframe

def event_list(df,steps):
    df['steps_present'] =  df['labels'].apply(lambda x:all(step in x for step in steps))
    return df

DataFrame have a column called labels with values as list. This function accepts dataframe and steps(which is a list) and outputs dataframe with a new column Steps present if all the elements in parameter list present in the dataframe column

value in df['labels'] =  [EBBY , ABBY , JULIE , ROBERTS]

event_list(df,['EBBY','ABBY']) will return True for the that record since EBBY and ABBY are present in the dataframe list column.

I would like to create a similar function in pyspark.

You can convert the function as a UDF, may be something like below, which works.

from pyspark.sql.functions import lit, array

values = [(["EBBY" , "ABBY" , "JULIE" , "ROBERTS"],),
           (["EBBY" , "ABBY"],)]
columns = ['labels']
df = spark.createDataFrame(values, columns)

@udf
def event_list(column_to_test, input_values):
    return all(value in column_to_test for value in input_values)

steps = ["EBBY", "JULIE"]
df.withColumn("steps_present", event_list(df['labels'], array([lit(x) for x in steps]))).show(truncate=False)

在此处输入图像描述

You can use array_except to check whether every element in the provided list is present in the labels column. If it is, the size of the result of array_except would be 0. Comparing the size to 0 will give you a Boolean value as you desired.

import pyspark.sql.functions as F

def event_list(df, steps):
    return df.withColumn(
        'steps_present', 
        F.size(F.array_except(F.array(*[F.lit(l) for l in steps]), 'labels')) == 0
    )

df2 = event_list(df, ["EBBY", "ABBY"])

df2.show(truncate=False)
+----------------------------+-------------+
|labels                      |steps_present|
+----------------------------+-------------+
|[EBBY, ABBY, JULIE, ROBERTS]|true         |
|[EBBY, JULIE]               |false        |
+----------------------------+-------------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM