
Pyspark: how to add a column to a dataframe from another dataframe?

I have two dataframes, each with 10 rows.

df1.show()
+-------------------+------------------+--------+-------+
|                lat|               lon|duration|stop_id|
+-------------------+------------------+--------+-------+
|  -6.23748779296875| 106.6937255859375|     247|      0|
|  -6.23748779296875| 106.6937255859375|    2206|      1|
|  -6.23748779296875| 106.6937255859375|     609|      2|
| 0.5733972787857056|101.45503234863281|   16879|      3|
| 0.5733972787857056|101.45503234863281|    4680|      4|
| -6.851855278015137|108.64261627197266|     164|      5|
| -6.851855278015137|108.64261627197266|     220|      6|
| -6.851855278015137|108.64261627197266|    1669|      7|
|-0.9033176600933075|100.41548919677734|   30811|      8|
|-0.9033176600933075|100.41548919677734|   23404|      9|
+-------------------+------------------+--------+-------+

I would like to add the column bank_and_post from df2 to df1.

df2 comes from a function:

import numpy as np
from pyspark.sql.functions import pandas_udf, col, lit
from pyspark.sql.types import DoubleType

def assignPtime(x, mu, std):
  # estimate the probability density of each duration (in minutes)
  # against a histogram of normal(mu, std) samples
  mu = mu.values[0]
  std = std.values[0]
  x1 = np.random.normal(mu, std, 100000)
  a1, b1 = np.histogram(x1, density=True)
  val = x / 60
  for k, v in enumerate(val):
    prob = 0
    for i, j in enumerate(b1[:-1]):
      v1 = b1[i]
      v2 = b1[i + 1]
      if (v >= v1) and (v < v2):
        prob = a1[i]
    x[k] = prob
  return x

ff = pandas_udf(assignPtime, returnType=DoubleType())
df2 = df1.select(ff(col("duration"), lit(15), lit(15)).alias("bank_and_post"))
df2.show()
+--------------------+
|       bank_and_post|
+--------------------+
|0.021806558032484918|
|0.014366417828826784|
|0.021806558032484918|
|                 0.0|
|                 0.0|
|0.021806558032484918|
|0.021806558032484918|
|0.014366417828826784|
|                 0.0|
|                 0.0|
+--------------------+

If I try

df2 = df2.withColumn("stop_id", monotonically_increasing_id())

I get the error

ValueError: assignment destination is read-only
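
(Side note: this ValueError is most likely raised inside assignPtime itself rather than by monotonically_increasing_id(). The Series that Arrow passes to a pandas UDF is backed by a read-only NumPy array, so the in-place assignment x[k] = prob fails once the plan is evaluated. Below is a minimal sketch of a workaround, assuming the assignPtime above is kept unchanged and using a hypothetical wrapper named assignPtime_writable: work on a copy of the input, and, since df2 is derived from df1, attach the column with withColumn instead of building a separate dataframe.)

# Sketch of a workaround (assumes the assignPtime defined above is unchanged):
# copy the Arrow-backed Series before mutating it, then attach the column
# to df1 directly instead of building a separate dataframe.
def assignPtime_writable(x, mu, std):
  x = x.copy()                      # the incoming Series is read-only
  return assignPtime(x, mu, std)    # reuse the original logic on the copy

ff = pandas_udf(assignPtime_writable, returnType=DoubleType())
df1 = df1.withColumn("bank_and_post", ff(col("duration"), lit(15), lit(15)))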

Use the row_number() window function to add a row-number column to both df1 and df2, then join the two dataframes on that column.

Example:

1. Using the row_number() function:

df1 = spark.createDataFrame([(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)], ["stop_id"])

df2 = spark.createDataFrame(
    [("0.021806558032484918",), ("0.014366417828826784",), ("0.021806558032484918",),
     ("0.0",), ("0.0",), ("0.021806558032484918",), ("0.021806558032484918",),
     ("0.014366417828826784",), ("0.0",), ("0.0",)],
    ["bank_and_post"])

from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number, monotonically_increasing_id

# ordering by a constant puts all rows into a single window partition,
# which is fine for small dataframes like these
w = Window.orderBy(lit(1))

df4 = df2.withColumn("rn", row_number().over(w) - 1)
df3 = df1.withColumn("rn", row_number().over(w) - 1)

df3.join(df4, ["rn"]).drop("rn").show()

#+-------+--------------------+
#|stop_id|       bank_and_post|
#+-------+--------------------+
#|      0|0.021806558032484918|
#|      1|0.014366417828826784|
#|      2|0.021806558032484918|
#|      3|                 0.0|
#|      4|                 0.0|
#|      5|0.021806558032484918|
#|      6|0.021806558032484918|
#|      7|0.014366417828826784|
#|      8|                 0.0|
#|      9|                 0.0|
#+-------+--------------------+

2. Using the monotonically_increasing_id() function:

# monotonically_increasing_id() only guarantees increasing, unique ids (not consecutive),
# so this alignment relies on both dataframes having the same partitioning
df1.withColumn("mid", monotonically_increasing_id()).\
join(df2.withColumn("mid", monotonically_increasing_id()), ["mid"]).\
drop("mid").\
orderBy("stop_id").\
show()
#+-------+--------------------+
#|stop_id|       bank_and_post|
#+-------+--------------------+
#|      0|0.021806558032484918|
#|      1|0.014366417828826784|
#|      2|0.021806558032484918|
#|      3|                 0.0|
#|      4|                 0.0|
#|      5|0.021806558032484918|
#|      6|0.021806558032484918|
#|      7|0.014366417828826784|
#|      8|                 0.0|
#|      9|                 0.0|
#+-------+--------------------+

3. Using row_number() over monotonically_increasing_id():

w = Window.orderBy("mid")
df3 = df1.withColumn("mid", monotonically_increasing_id()).withColumn("rn", row_number().over(w) - 1)
df4 = df2.withColumn("mid", monotonically_increasing_id()).withColumn("rn", row_number().over(w) - 1)
df3.join(df4, ["rn"]).drop("rn", "mid").show()

#+-------+--------------------+
#|stop_id|       bank_and_post|
#+-------+--------------------+
#|      0|0.021806558032484918|
#|      1|0.014366417828826784|
#|      2|0.021806558032484918|
#|      3|                 0.0|
#|      4|                 0.0|
#|      5|0.021806558032484918|
#|      6|0.021806558032484918|
#|      7|0.014366417828826784|
#|      8|                 0.0|
#|      9|                 0.0|
#+-------+--------------------+

4. Using zipWithIndex:

# zipWithIndex assigns each row a consecutive index; select("_1.*", "_2") flattens
# the (row, index) pairs back into columns
df3 = df1.rdd.zipWithIndex().toDF().select("_1.*", "_2")
df4 = df2.rdd.zipWithIndex().toDF().select("_1.*", "_2")
df3.join(df4, ["_2"]).drop("_2").orderBy("stop_id").show()
#+-------+--------------------+
#|stop_id|       bank_and_post|
#+-------+--------------------+
#|      0|0.021806558032484918|
#|      1|0.014366417828826784|
#|      2|0.021806558032484918|
#|      3|                 0.0|
#|      4|                 0.0|
#|      5|0.021806558032484918|
#|      6|0.021806558032484918|
#|      7|0.014366417828826784|
#|      8|                 0.0|
#|      9|                 0.0|
#+-------+--------------------+
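
Applied to the original df1 and df2, option 1 looks like the sketch below, assuming both dataframes keep the row order shown above (which is what lines the rows up):

from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

w = Window.orderBy(lit(1))

# index both dataframes, join on the index, and keep df1's columns plus bank_and_post
df1_idx = df1.withColumn("rn", row_number().over(w) - 1)
df2_idx = df2.withColumn("rn", row_number().over(w) - 1)
df1_idx.join(df2_idx, ["rn"]).drop("rn").orderBy("stop_id").show()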
