Adding a column to a dataframe in pyspark
I want to add a new column to a dataframe based on the value of an existing column, using pyspark.
For example, if this is the original dataframe, I want to add a new column called "parent's data", which contains the data of the parent row looked up via the "parent_id" column, so that the resulting dataframe looks like the one below.
Any help would be appreciated. Thank you.
I am sure there are multiple ways to achieve this. However, the simplest way is to create a new dataframe from 2 columns of the existing dataframe, then join the 2 dataframes to achieve this.
Here is the code:
import pandas as pd

df1 = pd.DataFrame([[1, 'a', 2], [2, 'b', 3], [3, 'c', 1]],
                   columns=["id", "data", "parent_id"])
print(df1)
sparkdf = spark.createDataFrame(df1)
sparkdf.show()

# Keep only the columns needed for the parent lookup
sparkdf2 = sparkdf.select('id', 'data')
sparkdf2.show()

# Register both dataframes as temporary views so they can be joined in SQL
# (createOrReplaceTempView replaces the deprecated registerTempTable)
sparkdf.createOrReplaceTempView("sparkdf")
sparkdf2.createOrReplaceTempView("sparkdf2")

# Alias b.data so the result does not contain two columns named "data"
sparkdf3 = spark.sql(
    'select a.id, a.data, a.parent_id, b.data as parent_data '
    'from sparkdf as a join sparkdf2 as b on a.parent_id = b.id'
)
sparkdf3.show()
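For comparison, the same self-join lookup can be sketched in plain pandas, with no Spark session required. This is not part of the original answer, just an illustration of the join logic using the example data above:

```python
# Self-merge in plain pandas: match each row's parent_id against
# another row's id, exposing that row's data as parent_data.
import pandas as pd

df1 = pd.DataFrame(
    [[1, "a", 2], [2, "b", 3], [3, "c", 1]],
    columns=["id", "data", "parent_id"],
)

# Build a lookup table keyed by parent_id with the parent's data
lookup = df1[["id", "data"]].rename(
    columns={"id": "parent_id", "data": "parent_data"}
)
result = df1.merge(lookup, on="parent_id", how="left")
print(result)
```

A left merge keeps every original row even if a parent_id has no matching id, which mirrors what a left outer join would do on the Spark side.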
You can create a dictionary from your two columns, id and data, and then add your new column using withColumn:
>>> d = {row["id"]: row["data"] for row in df.collect()}
>>> d
{1: 'a', 2: 'b', 3: 'c'}
from itertools import chain
from pyspark.sql.functions import create_map, lit
m = create_map([lit(x) for x in chain(*d.items())])
df = df.withColumn('parent_data', m[df['parent_id']])
Which prints back:
>>> df.show(truncate=False)
+---+----+---------+-----------+
|id |data|parent_id|parent_data|
+---+----+---------+-----------+
|1 |a |2 |b |
|2 |b |3 |c |
|3 |c |1 |a |
+---+----+---------+-----------+
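The create_map column is, in effect, a per-row dictionary lookup. A minimal pure-Python sketch of that logic, using the example values (note that df.collect() pulls every row to the driver, so this approach suits small lookup tables):

```python
# What the create_map expression computes, in plain Python terms:
# look up each row's parent_id in the id -> data dictionary.
d = {1: "a", 2: "b", 3: "c"}                     # built from df.collect()
rows = [(1, "a", 2), (2, "b", 3), (3, "c", 1)]   # (id, data, parent_id)

# d.get returns None for a missing parent_id, matching the null a
# map lookup would produce in Spark for an unmatched key.
parent_data = [d.get(parent_id) for _, _, parent_id in rows]
print(parent_data)  # -> ['b', 'c', 'a']
```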