Selecting values from non-null columns in a PySpark DataFrame

There is a PySpark DataFrame with missing values:

from pyspark.sql import Row

tbl = sc.parallelize([
        Row(first_name='Alice', last_name='Cooper'),
        Row(first_name='Prince', last_name=None),
        Row(first_name=None, last_name='Lenon')
    ]).toDF()
tbl.show()

Here's the table:

  +----------+---------+
  |first_name|last_name|
  +----------+---------+
  |     Alice|   Cooper|
  |    Prince|     null|
  |      null|    Lenon|
  +----------+---------+

I would like to create a new column as follows:

  • if first_name is None, take the last name
  • if last_name is None, take the first name
  • if both are present, concatenate them
  • we can safely assume that at least one of them is present

I can construct a simple function:

def combine_data(row):
    if row.last_name is None:
        return row.first_name
    elif row.first_name is None:
        return row.last_name
    else:
        return '%s %s' % (row.first_name, row.last_name)
tbl.map(combine_data).collect()

I do get the correct result, but I can't append it to the table as a column: tbl.withColumn('new_col', tbl.map(combine_data)) results in AssertionError: col should be Column.

What is the best way to convert the result of map to a Column? Is there a preferred way to deal with null values?

As always, it is best to operate directly on the native representation instead of fetching the data to Python:

from pyspark.sql.functions import concat_ws, coalesce, lit, trim

def combine(*cols):
    return trim(concat_ws(" ", *[coalesce(c, lit("")) for c in cols]))

tbl.withColumn("foo", combine("first_name", "last_name")).

You just need to use a UDF that receives two columns as arguments:

from pyspark.sql.functions import udf, col
from pyspark.sql import Row

tbl = sc.parallelize([
        Row(first_name='Alice', last_name='Cooper'),             
        Row(first_name='Prince', last_name=None),
        Row(first_name=None, last_name='Lenon')
    ]).toDF()

tbl.show()

def combine(c1, c2):
    if c1 is not None and c2 is not None:
        return c1 + " " + c2
    elif c1 is None:
        return c2
    else:
        return c1

combineUDF = udf(combine)

expr = ["first_name", "last_name"] + [combineUDF(col("first_name"), col("last_name")).alias("full_name")]

tbl.select(*expr).show()

#+----------+---------+------------+
#|first_name|last_name|   full_name|
#+----------+---------+------------+
#|     Alice|   Cooper|Alice Cooper|
#|    Prince|     null|      Prince|
#|      null|    Lenon|       Lenon|
#+----------+---------+------------+
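
Since the original goal was to append the result as a new column, the same UDF also plugs straight into withColumn, which sidesteps the AssertionError caused by passing an RDD where a Column is expected. A minimal sketch, reusing the combineUDF defined above:

tbl.withColumn("full_name", combineUDF(col("first_name"), col("last_name"))).show()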
