PySpark: Add a column to DataFrame when column is a list

I have read similar questions but couldn't find a solution to my specific problem.

I have a list

l = [1, 2, 3]

and a DataFrame

df = sc.parallelize([
    ['p1', 'a'],
    ['p2', 'b'],
    ['p3', 'c'],
]).toDF(('product', 'name'))

I would like to obtain a new DataFrame where the list l is added as a further column, namely

+-------+----+---------+
|product|name| new_col |
+-------+----+---------+
|     p1|   a|     1   |
|     p2|   b|     2   |
|     p3|   c|     3   |
+-------+----+---------+

Approaches with JOIN, where I was joining df with an

 sc.parallelize([[1], [2], [3]])

have failed. Approaches using withColumn, as in

new_df = df.withColumn('new_col', l)

have failed because the list is not a Column object.

So, from reading some interesting stuff here, I've ascertained that you can't really just append an arbitrary column to a given DataFrame object. It appears what you want is more of a zip than a join. I looked around and found this ticket, which makes me think you won't be able to zip given that you have DataFrame rather than RDD objects.

The only way I've been able to solve your issue involves leaving the world of DataFrame objects and returning to RDD objects. I've also needed to create an index for the purpose of the join, which may or may not work with your use case.

# pair each value in l with a positional index
l = sc.parallelize([1, 2, 3])
index = sc.parallelize(range(0, l.count()))
z = index.zip(l)

# index the original rows the same way
rdd = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']])
rdd_index = index.zip(rdd)

# just in case!
assert(rdd.count() == l.count())
# perform an inner join on the index we generated above, then map it to look pretty.
new_rdd = rdd_index.join(z).map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]])
new_df = new_rdd.toDF(["product", "name", "new_col"])

When I run new_df.show(), I get:

+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
|     p1|   a|      1|
|     p2|   b|      2|
|     p3|   c|      3|
+-------+----+-------+
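
One caveat, as a hedged aside: RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition, which happens to hold here but is fragile. A sketch of the same join built with zipWithIndex instead, which carries no such requirement:

z = l.zipWithIndex().map(lambda kv: (kv[1], kv[0]))          # (index, value)
rdd_keyed = rdd.zipWithIndex().map(lambda kv: (kv[1], kv[0]))  # (index, row)
new_rdd = rdd_keyed.join(z).map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]])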

Sidenote: I'm really surprised this didn't work. Looks like an outer join?

from pyspark.sql import Row
l = sc.parallelize([1, 2, 3])
new_row = Row("new_col_name")
l_as_df = l.map(new_row).toDF()
new_df = df.join(l_as_df)

When I run new_df.show(), I get:

+-------+----+------------+
|product|name|new_col_name|
+-------+----+------------+
|     p1|   a|           1|
|     p1|   a|           2|
|     p1|   a|           3|
|     p2|   b|           1|
|     p3|   c|           1|
|     p2|   b|           2|
|     p2|   b|           3|
|     p3|   c|           2|
|     p3|   c|           3|
+-------+----+------------+
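
This is in fact the expected behavior rather than an outer join: a join with no join condition is a Cartesian (cross) product, so each of the 3 rows of df is paired with each of the 3 rows of l_as_df, giving 9 rows. A minimal sketch that makes the intent explicit (assuming Spark 2.1+, where DataFrame.crossJoin is available):

# crossJoin spells out the Cartesian product; same 3 x 3 = 9 rows as above
new_df = df.crossJoin(l_as_df)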

If the product column is unique then consider the following approach:

original dataframe:

df = spark.sparkContext.parallelize([
    ['p1', 'a'],
    ['p2', 'b'],
    ['p3', 'c'],
]).toDF(('product', 'name'))

df.show()

+-------+----+
|product|name|
+-------+----+
|     p1|   a|
|     p2|   b|
|     p3|   c|
+-------+----+

new column (and new index column):

lst = [1, 2, 3]
indx = ['p1','p2','p3']

create a new dataframe from the list above (with an index):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
myschema = StructType([
    StructField("indx", StringType(), True),
    StructField("newCol", IntegerType(), True),
])
df1 = spark.createDataFrame(zip(indx, lst), schema=myschema)
df1.show()
+----+------+
|indx|newCol|
+----+------+
|  p1|     1|
|  p2|     2|
|  p3|     3|
+----+------+
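
As an aside (a sketch, not part of the original answer), the explicit schema is optional here: passing column names alone lets Spark infer the types (newCol then comes out as long rather than int):

# let Spark infer the column types from the Python values
df1 = spark.createDataFrame(zip(indx, lst), ["indx", "newCol"])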

join this to the original dataframe, using the index created:

dfnew = df.join(df1, df.product == df1.indx,how='left')\
          .drop(df1.indx)\
          .sort("product")

to get:

dfnew.show()

+-------+----+------+
|product|name|newCol|
+-------+----+------+
|     p1|   a|     1|
|     p2|   b|     2|
|     p3|   c|     3|
+-------+----+------+
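
Because this answer already assumes product is unique, the join can also be avoided entirely with a literal map lookup; a minimal sketch of that variant (my assumption, using pyspark.sql.functions.create_map):

from itertools import chain
from pyspark.sql import functions as F

# build a literal map {'p1': 1, 'p2': 2, 'p3': 3} and look each product up in it
mapping = F.create_map([F.lit(x) for x in chain(*zip(indx, lst))])
dfnew = df.withColumn("newCol", mapping[F.col("product")])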

This is achievable via RDDs.

1. Convert the DataFrame and the list to indexed RDDs:

# zipWithIndex yields (element, index); swap so the index becomes the join key
df_rdd = df.rdd.zipWithIndex().map(lambda row: (row[1], (row[0][0], row[0][1])))
l_rdd = sc.parallelize(l).zipWithIndex().map(lambda row: (row[1], row[0]))

2. Join the two RDDs on the index, drop the index and rearrange the elements:

res_rdd = df_rdd.join(l_rdd).map(lambda row: [row[1][0][0], row[1][0][1], row[1][1]])

3. Convert the result to a DataFrame:

res_df = res_rdd.toDF(['product', 'name', 'new_col'])
res_df.show()

+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
|     p1|   a|      1|
|     p2|   b|      2|
|     p3|   c|      3|
+-------+----+-------+
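
For completeness, a hedged sketch of a pure-DataFrame variant of the same idea (my addition, not from the original answers): number the rows on both sides with row_number and join on that index. Note that row order is only dependable if neither side has been shuffled, and an unpartitioned window collects all rows into a single partition:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assign consecutive row numbers on both sides, then join on them
w = Window.orderBy(F.monotonically_increasing_id())
df_idx = df.withColumn("_idx", F.row_number().over(w))
l_df = spark.createDataFrame([(v,) for v in l], ["new_col"]) \
            .withColumn("_idx", F.row_number().over(w))
res_df = df_idx.join(l_df, "_idx").sort("_idx").drop("_idx")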
