
Add column to Pyspark DataFrame from another DataFrame

I have this:

df_e :=     
|country, name, year, c2, c3, c4|       
|Austria, Jon Doe, 2003, 21.234, 54.234, 345.434|       
...

df_p :=     
|name, 2001, 2002, 2003, 2004|       
|Jon Doe, 2849234, 12384312, 123908234, 12398193|       
...

Both PySpark DataFrames are read from CSV files.

How can I create a new column named "amount" in df_e that, for every record, uses the name and year values as a reference and looks up the corresponding amount in df_p? Using PySpark.

In this case I should get the following DataFrame:

df_e :=     
|country, name, year, c2, c3, c4, amount|       
|Austria, Jon Doe, 2003, 21.234, 54.234, 345.434, 123908234|       
...

Thanks for the help!

EDIT:

This is how I'm reading the files:

from pyspark import SparkContext, SparkConf       
from pyspark.sql import SparkSession       

sc = SparkContext.getOrCreate(SparkConf().setMaster('local[*]'))       
spark = SparkSession.builder.getOrCreate()       

df_e = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/e.csv')       
df_p = spark.read.option('header', 'true').option('inferSchema', 'true').csv('data/p.csv')       

I'm starting out with PySpark, so I don't really know which functions I can use for this problem.

With pandas I would do it by iterating over the DataFrame, like this:

p = []
for i in df_e.index:
    # pick the df_p row for this name, then the column matching this year
    p.append(df_p.query('name == "{}"'.format(df_e['name'][i]))['{}'.format(df_e['year'][i])].iloc[0])

And then I would add the list p as a new column to df_e (although I know there may be a much better way to do it).
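For reference, the better pandas way would be to reshape df_p from wide to long and merge instead of iterating row by row. A minimal sketch, assuming the year columns of df_p are named exactly as in the sample data:

# reshape df_p from wide (one column per year) to long (name, year, amount)
df_p_long = df_p.melt(id_vars='name', var_name='year', value_name='amount')
df_p_long['year'] = df_p_long['year'].astype(int)
# attach the matching amount to every (name, year) row of df_e
df_e = df_e.merge(df_p_long, on=['name', 'year'], how='left')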

import pyspark.sql.functions as F

# assuming every column of df_p except the first ('name') is a year;
# you can also specify the list manually, e.g. ['2003', '2005']
columns_to_transpose = df_p.columns[1:]
k = []
for x in columns_to_transpose:
    k.append(F.struct(F.lit(x).alias('year'), F.col(x).alias('year_value')))

# explode the array of (year, year_value) structs into one row per name and year
df_p_new = (df_p
            .withColumn('New', F.explode(F.array(k)))
            .select(F.col('name').alias('JOIN_NAME'),
                    F.col('New')['year'].alias('NEW_YEAR'),
                    F.col('New')['year_value'].alias('YEAR_VALUE')))

>>> df_p_new.show()
+---------+--------+----------+
|JOIN_NAME|NEW_YEAR|YEAR_VALUE|
+---------+--------+----------+
|  Jon Doe|    2001|   2849234|
|  Jon Doe|    2002|  12384312|
|  Jon Doe|    2003| 123908234|
|  Jon Doe|    2004|  12398193|
+---------+--------+----------+
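As an aside, the same wide-to-long step can be written more compactly with Spark SQL's stack() expression; a minimal sketch, hard-coding the four year columns from the sample data:

# alternative unpivot: stack(4, ...) emits 4 rows of (NEW_YEAR, YEAR_VALUE) per input row
df_p_new = df_p.selectExpr(
    'name as JOIN_NAME',
    "stack(4, '2001', `2001`, '2002', `2002`, '2003', `2003`, '2004', `2004`) as (NEW_YEAR, YEAR_VALUE)")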

# the column names here must match the aliases created above
df_answer = df_e.join(df_p_new,
                      (df_p_new.JOIN_NAME == df_e.name) & (df_p_new.NEW_YEAR == df_e.year),
                      how='left').select(*df_e.columns, 'YEAR_VALUE')
df_answer.show()

    
+-------+-------+----+------+------+-------+----------+
|country|   name|year|    c2|    c3|     c4|YEAR_VALUE|
+-------+-------+----+------+------+-------+----------+
|Austria|Jon Doe|2003|21.234|54.234|345.434| 123908234|
+-------+-------+----+------+------+-------+----------+
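Since this is a left join, rows of df_e with no matching (name, year) pair in df_p are kept with a null YEAR_VALUE; a quick sanity check:

# rows of df_e that found no matching (name, year) pair in df_p
df_answer.filter(F.col('YEAR_VALUE').isNull()).show()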


# use alias() to rename the looked-up column to the name asked for
df_answer = df_answer.select(*df_e.columns, F.col('YEAR_VALUE').alias('amount'))
