Selecting values from non-null columns in a PySpark DataFrame

There is a PySpark DataFrame with missing values:

from pyspark.sql import Row

tbl = sc.parallelize([
        Row(first_name='Alice', last_name='Cooper'),
        Row(first_name='Prince', last_name=None),
        Row(first_name=None, last_name='Lenon')
    ]).toDF()
tbl.show()

Here's the table:

  +----------+---------+
  |first_name|last_name|
  +----------+---------+
  |     Alice|   Cooper|
  |    Prince|     null|
  |      null|    Lenon|
  +----------+---------+

I would like to create a new column as follows:

  • if first_name is None, take the last name
  • if last_name is None, take the first name
  • if both are present, concatenate them
  • we can safely assume that at least one of them is present

I can construct a simple function:

def combine_data(row):
    if row.last_name is None:
        return row.first_name
    elif row.first_name is None:
        return row.last_name
    else:
        return '%s %s' % (row.first_name, row.last_name)
tbl.map(combine_data).collect()

I do get the correct result, but I can't append it to the table as a column: tbl.withColumn('new_col', tbl.map(combine_data)) results in AssertionError: col should be Column.

What is the best way to convert the result of map to a Column? Is there a preferred way to deal with null values?

As always, it is best to operate directly on the native representation instead of fetching the data to Python:

from pyspark.sql.functions import concat_ws, coalesce, lit, trim

def combine(*cols):
    return trim(concat_ws(" ", *[coalesce(c, lit("")) for c in cols]))

tbl.withColumn("foo", combine("first_name", "last_name")).

You just need to use a UDF that receives two columns as arguments:

from pyspark.sql.functions import udf, col
from pyspark.sql import Row

tbl = sc.parallelize([
        Row(first_name='Alice', last_name='Cooper'),             
        Row(first_name='Prince', last_name=None),
        Row(first_name=None, last_name='Lenon')
    ]).toDF()

tbl.show()

def combine(c1, c2):
    if c1 is not None and c2 is not None:
        return c1 + " " + c2
    elif c1 is None:
        return c2
    else:
        return c1

combineUDF = udf(combine)

expr = ["first_name", "last_name"] + [combineUDF(col("first_name"), col("last_name")).alias("full_name")]

tbl.select(*expr).show()

#+----------+---------+------------+
#|first_name|last_name|   full_name|
#+----------+---------+------------+
#|     Alice|   Cooper|Alice Cooper|
#|    Prince|     null|      Prince|
#|      null|    Lenon|       Lenon|
#+----------+---------+------------+
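
Since the original goal was to append the result as a new column, the same UDF also plugs straight into withColumn, which sidesteps the AssertionError caused by passing an RDD where a Column is expected. A minimal sketch, reusing the combineUDF defined above:

tbl.withColumn("full_name", combineUDF(col("first_name"), col("last_name"))).show()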
