Add column to Pyspark DataFrame from another DataFrame
I have this:
df_e :=
|country, name, year, c2, c3, c4|
|Austria, Jon Doe, 2003, 21.234, 54.234, 345.434|
...
df_p :=
|name, 2001, 2002, 2003, 2004|
|Jon Doe, 2849234, 12384312, 123908234, 12398193|
...
Both PySpark DataFrames are read from CSV files.
How can I create a new column named "amount" inside df_e, which takes the name and year of every record in df_e as a reference and looks up the corresponding amount in df_p? Using Pyspark.
In this case I should get the following DataFrame:
df_e :=
|country, name, year, c2, c3, c4, amount|
|Austria, Jon Doe, 2003, 21.234, 54.234, 345.434, 123908234|
...
Thanks for the help!
EDIT:
This is how I'm reading the files:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate(SparkConf().setMaster('local[*]'))
spark = SparkSession.builder.getOrCreate()
df_e = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/e.csv')
df_p = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/p.csv')
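One thing worth checking after reading the files is what inferSchema produced, since the year column in df_e will typically come out as an integer while the year columns in df_p keep their string headers; a quick check:
df_e.printSchema()
df_p.printSchema()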
I'm starting out with Pyspark, so I don't really know what functions I can use for this problem.
With pandas I would do it by iterating over the DataFrame, like this:
p = []
for i in df_e.index:
    # for each row of df_e, find the matching name in df_p and take the value in that year's column
    p.append(df_p.query('name == "{}"'.format(df_e['name'][i]))['{}'.format(df_e['year'][i])].iloc[0])
And then adding the list p as a new column to df_e (although I know there may be a much better way to do it).
import pyspark.sql.functions as F

### I am assuming all the columns in df_p except the first one are years
### you can also specify the list manually, e.g. ['2003', '2005']
columns_to_pivot = df_p.columns[1:]

k = []
for x in columns_to_pivot:
    k.append(F.struct(F.lit(x).alias('year'), F.col(x).alias('year_value')))

df_p_new = (df_p
    .withColumn('New', F.explode(F.array(k)))
    .select(
        F.col('name').alias('JOIN_NAME'),
        F.col('New')['year'].alias('NEW_YEAR'),
        F.col('New')['year_value'].alias('YEAR_VALUE'),
    ))
df_p_new.show()
+---------+--------+----------+
|JOIN_NAME|NEW_YEAR|YEAR_VALUE|
+---------+--------+----------+
|  Jon Doe|    2001|   2849234|
|  Jon Doe|    2002|  12384312|
|  Jon Doe|    2003| 123908234|
|  Jon Doe|    2004|  12398193|
+---------+--------+----------+
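The same unpivot can also be written with Spark SQL's stack() function instead of building the array of structs by hand; a rough equivalent, assuming the year columns are exactly the ones after name:
year_cols = df_p.columns[1:]
# stack(n, 'key1', value1, 'key2', value2, ...) emits one row per (key, value) pair
stack_expr = "stack({}, {}) as (NEW_YEAR, YEAR_VALUE)".format(
    len(year_cols),
    ", ".join("'{0}', `{0}`".format(c) for c in year_cols))
df_p_unpivoted = df_p.selectExpr("name as JOIN_NAME", stack_expr)
df_p_unpivoted.show()   # same shape as df_p_new above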
## column name matching may be case sensitive (depending on spark.sql.caseSensitive), so keep the aliases consistent
df_answer = df_e.join(
    df_p_new,
    (df_p_new.JOIN_NAME == df_e.name) & (df_p_new.NEW_YEAR == df_e.year),
    how='left',
).select(*df_e.columns, 'YEAR_VALUE')
df_answer.show()
+-------+-------+----+------+------+-------+----------+
|country|   name|year|    c2|    c3|     c4|YEAR_VALUE|
+-------+-------+----+------+------+-------+----------+
|Austria|Jon Doe|2003|21.234|54.234|345.434| 123908234|
+-------+-------+----+------+------+-------+----------+
df_answer.select([*df_e.columns, 'YEAR_VALUE'])
## you can use alias() to rename the columns
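For instance, to get the exact layout the question asks for (the looked-up value in a column named amount), the joined column can simply be renamed; a minimal sketch:
# rename the looked-up value to "amount" so df_e ends up with the requested schema
df_e_with_amount = df_answer.withColumnRenamed('YEAR_VALUE', 'amount')
df_e_with_amount.show()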