Add column to Pyspark DataFrame from another DataFrame
I have this:
df_e :=
|country, name, year, c2, c3, c4|
|Austria, Jon Doe, 2003, 21.234, 54.234, 345.434|
...
df_p :=
|name, 2001, 2002, 2003, 2004|
|Jon Doe, 2849234, 12384312, 123908234, 12398193|
...
Both PySpark DataFrames are read from CSV files.
How can I create a new column named "amount" inside df_e, which takes the name and year of every record in df_e as a reference and looks up the corresponding amount in df_p? Using Pyspark.
In this case I should get the following DataFrame:
df_e :=
|country, name, year, c2, c3, c4, amount|
|Austria, Jon Doe, 2003, 21.234, 54.234, 345.434, 123908234|
...
Thanks for the help!
EDIT:
This is how I'm reading the files:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate(SparkConf().setMaster('local[*]'))
spark = SparkSession.builder.getOrCreate()
df_e = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/e.csv')
df_p = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/p.csv')
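One thing worth checking after reading the files is what inferSchema produced, since the year column in df_e will typically come out as an integer while the year columns in df_p keep their string headers; a quick check:
df_e.printSchema()
df_p.printSchema()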
I'm starting out with Pyspark, so I don't really know what functions I can use for this problem.
With pandas I would do it by iterating over the DataFrame, like this:
p = []
for i in df_e.index:
    # for each row of df_e, find the matching name in df_p and take the value in that year's column
    p.append(df_p.query('name == "{}"'.format(df_e['name'][i]))['{}'.format(df_e['year'][i])].iloc[0])
And then adding the list p as a new column to df_e (although I know there may be a much better way to do it).
import pyspark.sql.functions as F

### I am assuming all the columns in df_p except the first one are years
### you can also specify the list manually, e.g. ['2003', '2005']
columns_to_pivot = df_p.columns[1:]

k = []
for x in columns_to_pivot:
    k.append(F.struct(F.lit(x).alias('year'), F.col(x).alias('year_value')))

df_p_new = (df_p
    .withColumn('New', F.explode(F.array(k)))
    .select(
        F.col('name').alias('JOIN_NAME'),
        F.col('New')['year'].alias('NEW_YEAR'),
        F.col('New')['year_value'].alias('YEAR_VALUE'),
    ))
df_p_new.show()
+---------+--------+----------+
|JOIN_NAME|NEW_YEAR|YEAR_VALUE|
+---------+--------+----------+
|  Jon Doe|    2001|   2849234|
|  Jon Doe|    2002|  12384312|
|  Jon Doe|    2003| 123908234|
|  Jon Doe|    2004|  12398193|
+---------+--------+----------+
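The same unpivot can also be written with Spark SQL's stack() function instead of building the array of structs by hand; a rough equivalent, assuming the year columns are exactly the ones after name:
year_cols = df_p.columns[1:]
# stack(n, 'key1', value1, 'key2', value2, ...) emits one row per (key, value) pair
stack_expr = "stack({}, {}) as (NEW_YEAR, YEAR_VALUE)".format(
    len(year_cols),
    ", ".join("'{0}', `{0}`".format(c) for c in year_cols))
df_p_unpivoted = df_p.selectExpr("name as JOIN_NAME", stack_expr)
df_p_unpivoted.show()   # same shape as df_p_new above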
## column name matching may be case sensitive (depending on spark.sql.caseSensitive), so keep the aliases consistent
df_answer = df_e.join(
    df_p_new,
    (df_p_new.JOIN_NAME == df_e.name) & (df_p_new.NEW_YEAR == df_e.year),
    how='left',
).select(*df_e.columns, 'YEAR_VALUE')
df_answer.show()
+-------+-------+----+------+------+-------+----------+
|country|   name|year|    c2|    c3|     c4|YEAR_VALUE|
+-------+-------+----+------+------+-------+----------+
|Austria|Jon Doe|2003|21.234|54.234|345.434| 123908234|
+-------+-------+----+------+------+-------+----------+
df_answer.select([*df_e.columns, 'YEAR_VALUE'])
## you can use alias() to rename the columns
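For instance, to get the exact layout the question asks for (the looked-up value in a column named amount), the joined column can simply be renamed; a minimal sketch:
# rename the looked-up value to "amount" so df_e ends up with the requested schema
df_e_with_amount = df_answer.withColumnRenamed('YEAR_VALUE', 'amount')
df_e_with_amount.show()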