[英]PySpark: Add a column to DataFrame when column is a list
I have read similar questions but couldn't find a solution to my specific problem.
I have a list
l = [1, 2, 3]
and a DataFrame
df = sc.parallelize([
['p1', 'a'],
['p2', 'b'],
['p3', 'c'],
]).toDF(('product', 'name'))
I would like to obtain a new DataFrame where the list l is added as a further column, namely
+-------+----+---------+
|product|name| new_col |
+-------+----+---------+
| p1| a| 1 |
| p2| b| 2 |
| p3| c| 3 |
+-------+----+---------+
Approaches with JOIN, where I was joining df with an
sc.parallelize([[1], [2], [3]])
have failed. Approaches using withColumn, as in
new_df = df.withColumn('new_col', l)
have failed because the list is not a Column object.
So, from reading some interesting stuff here, I've ascertained that you can't really just append a random / arbitrary column to a given DataFrame object. It appears what you want is more of a zip than a join. I looked around and found this ticket, which makes me think you won't be able to zip given that you have DataFrame rather than RDD objects.
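The zip-versus-join distinction is easy to see outside Spark: a zip pairs rows purely by position, which is exactly the alignment wanted here, whereas a relational join would need a shared key. A minimal plain-Python sketch (no Spark involved) of the positional pairing:

```python
# Plain-Python sketch: zip aligns rows by position, which is the
# alignment we want; a relational join would need a shared key instead.
rows = [['p1', 'a'], ['p2', 'b'], ['p3', 'c']]
l = [1, 2, 3]

zipped = [row + [v] for row, v in zip(rows, l)]
print(zipped)  # [['p1', 'a', 1], ['p2', 'b', 2], ['p3', 'c', 3]]
```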
The only way I've been able to solve your issue involves leaving the world of DataFrame objects and returning to RDD objects. I've also needed to create an index for the purpose of the join, which may or may not work with your use case.
l = sc.parallelize([1, 2, 3])
index = sc.parallelize(range(0, l.count()))
z = index.zip(l)
rdd = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']])
rdd_index = index.zip(rdd)
# just in case!
assert(rdd.count() == l.count())
# perform an inner join on the index we generated above, then map it to look pretty.
# perform an inner join on the index we generated above, then map it to look pretty.
# (tuple unpacking in lambdas was removed in Python 3, so index explicitly)
new_rdd = rdd_index.join(z).map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]])
new_df = new_rdd.toDF(["product", 'name', 'new_col'])
When I run new_df.show(), I get:
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+-------+
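For intuition, the index-zip-join pipeline above can be emulated in plain Python (no Spark needed): index.zip(...) becomes enumerate(...), and the inner join becomes a dict lookup on the shared index.

```python
# Plain-Python emulation of the RDD pipeline above: pair each element
# with an index, inner-join on that index, then flatten each result
# into [product, name, new_col].
l = [1, 2, 3]
rows = [['p1', 'a'], ['p2', 'b'], ['p3', 'c']]

z = dict(enumerate(l))             # like index.zip(l)
rdd_index = dict(enumerate(rows))  # like index.zip(rdd)

# inner join on the shared index, then flatten
new_rows = [rdd_index[i] + [z[i]] for i in sorted(rdd_index) if i in z]
print(new_rows)  # [['p1', 'a', 1], ['p2', 'b', 2], ['p3', 'c', 3]]
```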
Sidenote: I'm really surprised this didn't work. Looks like an outer join?
from pyspark.sql import Row
l = sc.parallelize([1, 2, 3])
new_row = Row("new_col_name")
l_as_df = l.map(new_row).toDF()
new_df = df.join(l_as_df)
When I run new_df.show(), I get:
+-------+----+------------+
|product|name|new_col_name|
+-------+----+------------+
| p1| a| 1|
| p1| a| 2|
| p1| a| 3|
| p2| b| 1|
| p3| c| 1|
| p2| b| 2|
| p2| b| 3|
| p3| c| 2|
| p3| c| 3|
+-------+----+------------+
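That behaviour is expected, though: df.join(l_as_df) with no join condition is a cross (Cartesian) join, not an outer join, so every row of df is paired with every value, giving 3 × 3 = 9 rows. A plain-Python sketch reproduces the shape of the result:

```python
# A join with no condition is a Cartesian product: every row is paired
# with every value, giving len(rows) * len(l) output rows.
from itertools import product

rows = [['p1', 'a'], ['p2', 'b'], ['p3', 'c']]
l = [1, 2, 3]

crossed = [r + [v] for r, v in product(rows, l)]
print(len(crossed))  # 9
```

(Spark may emit the nine rows in a different order, as in the output above, but the row multiset is the same.)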
If the product column is unique then consider the following approach:
original dataframe:
df = spark.sparkContext.parallelize([
['p1', 'a'],
['p2', 'b'],
['p3', 'c'],
]).toDF(('product', 'name'))
df.show()
+-------+----+
|product|name|
+-------+----+
| p1| a|
| p2| b|
| p3| c|
+-------+----+
new column (and new index column):
lst = [1, 2, 3]
indx = ['p1','p2','p3']
create a new dataframe from the list above (with an index):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

myschema = StructType([StructField("indx", StringType(), True),
                       StructField("newCol", IntegerType(), True)])
# list(...) so the zip iterator is materialized (needed on Python 3)
df1 = spark.createDataFrame(list(zip(indx, lst)), schema=myschema)
df1.show()
+----+------+
|indx|newCol|
+----+------+
| p1| 1|
| p2| 2|
| p3| 3|
+----+------+
join this to the original dataframe, using the index created:
dfnew = df.join(df1, df.product == df1.indx,how='left')\
.drop(df1.indx)\
.sort("product")
to get:
dfnew.show()
+-------+----+------+
|product|name|newCol|
+-------+----+------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+------+
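The essence of this answer is a keyed left join, which can be sketched in plain Python (no Spark): build a lookup from the product key, assumed unique, to the new value, then attach it to each row.

```python
# Plain-Python sketch of the keyed left join above: the product column
# acts as the join key, so uniqueness of 'product' is what makes it safe.
rows = [['p1', 'a'], ['p2', 'b'], ['p3', 'c']]
indx = ['p1', 'p2', 'p3']
lst = [1, 2, 3]

lookup = dict(zip(indx, lst))
# left join: rows with no matching key would get None, like Spark's null
joined = [r + [lookup.get(r[0])] for r in rows]
print(joined)  # [['p1', 'a', 1], ['p2', 'b', 2], ['p3', 'c', 3]]
```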
This is achievable via RDDs.
1 Convert dataframes to indexed rdds:
df_rdd = df.rdd.zipWithIndex().map(lambda row: (row[1], (row[0][0], row[0][1])))
l_rdd = sc.parallelize(l).zipWithIndex().map(lambda row: (row[1], row[0]))
2 Join two RDDs on index, drop index and rearrange elements:
res_rdd = df_rdd.join(l_rdd).map(lambda row: [row[1][0][0], row[1][0][1], row[1][1]])
3 Convert result to Dataframe:
res_df = res_rdd.toDF(['product', 'name', 'new_col'])
res_df.show()
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+-------+