Filter expected value from list in df column
I have a data frame with the following column:
```
raw_col
['a','b','c']
['b']
['a','b']
['c']
```
I want to return a column with a single value based on a conditional statement. I wrote the following function:
```python
def filter_func(elements):
    if "a" in elements:
        return "a"
    else:
        return "Other"
```
When running the function on the column:

```python
df.withColumn("col", filter_func("raw_col"))
```

I get the following error: `col should be Column`.
What's wrong here? What should I do?
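The error happens because `filter_func("raw_col")` is evaluated eagerly by plain Python before Spark ever sees it: the function receives the literal string `"raw_col"` (not the column's values) and returns an ordinary `str`, while `withColumn` requires a `Column` expression. A minimal Spark-free check illustrates this:

```python
def filter_func(elements):
    if "a" in elements:
        return "a"
    else:
        return "Other"

# The column *name* is passed in, not the column's row values.
# "a" happens to be a character of the string "raw_col", so the
# function returns the plain string "a" -- not a pyspark Column.
result = filter_func("raw_col")
print(type(result).__name__)  # str, which is why withColumn rejects it
```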
You can use the `array_contains` function:
```python
import pyspark.sql.functions as f

df = df.withColumn(
    "col",
    f.when(f.array_contains("raw_col", f.lit("a")), f.lit("a")).otherwise(f.lit("Other"))
)
```
But if you have complex logic and really need to use `filter_func`, you have to create a UDF:
```python
@f.udf()
def filter_func(elements):
    if "a" in elements:
        return "a"
    else:
        return "Other"

df = df.withColumn("col", filter_func("raw_col"))
```
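Either approach gives the same result on the sample data from the question. A plain-Python sketch of the per-row logic (purely illustrative, no Spark session needed):

```python
def filter_func(elements):
    # Same conditional as the question's function.
    return "a" if "a" in elements else "Other"

# The four sample rows of raw_col from the question.
rows = [["a", "b", "c"], ["b"], ["a", "b"], ["c"]]
print([filter_func(r) for r in rows])  # ['a', 'Other', 'a', 'Other']
```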