
Adding a new column as a sum with map in a PySpark dataframe

I have a PySpark dataframe as follows:

Stock | open_price | list_price
A     | 100        | 1
B     | 200        | 2
C     | 300        | 3

I am trying to use map on the RDD to produce the output below, where each row is printed as a separate tuple containing the stock, open_price * list_price, and the sum of the whole open_price column:

(A, 100 , 600)
(B, 400, 600)
(C, 900, 600)

So using the table above, the first row for example is: A, 100*1, 100+200+300.

I am able to get the first two columns using the code below.

stockNames = sqlDF.rdd.map(lambda p: (p.stock, p.open_price * p.list_price)).collect()
for name in stockNames:
    print(name)

However, when I try to include sum(p.open_price), as shown below:

stockNames = sqlDF.rdd.map(lambda p: (p.stock, p.open_price * p.list_price, sum(p.open_price))).collect()
for name in stockNames:
    print(name)

it gives me the error below:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 75.0 failed 1 times, most recent failure: Lost task 0.0 in stage 75.0 (TID 518, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
  File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
  File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-48-f08584cc31c6>", line 19, in <lambda>
TypeError: 'int' object is not iterable

How can I add the sum of open_price inside my RDD map?

Thank you in advance; I am still quite new to RDDs and map.

Python's built-in sum expects an iterable, but p.open_price is a single int in each row, which is why you get TypeError: 'int' object is not iterable. Compute the total separately instead:

df = spark.createDataFrame(
    [("A", 100, 1), ("B", 200, 2), ("C", 300, 3)],
    ("stock", "price", "list_price")
)

total = df.selectExpr("sum(price) AS total")

and add it as a column:

from pyspark.sql.functions import lit

df.withColumn("total", lit(total.first()[0])).show()

# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+
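
Another option (not from the original answer, just a sketch) is a sum over an unpartitioned window, which avoids collecting the total on the driver; note that Spark moves all rows into a single partition to evaluate it:

from pyspark.sql import Window
from pyspark.sql.functions import sum as sum_

# An empty partitionBy() puts every row in the same window, so sum_("price")
# is the grand total; Spark will warn about using a single partition.
w = Window.partitionBy()
df.withColumn("total", sum_("price").over(w)).show()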

Or use crossJoin:

df.crossJoin(total).show()

# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+

RDD.map is not really applicable here (you could use it instead of withColumn, but it would be inefficient and I would not recommend it).
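
If you do want to stay at the RDD level to reproduce the tuples from the question, a minimal sketch (assuming the question's sqlDF with columns stock, open_price and list_price) would compute the total first and close over it in the map:

# Grand total of open_price, computed once on the driver (600 for the sample data).
total_open = sqlDF.rdd.map(lambda p: p.open_price).sum()

# Each row becomes (stock, open_price * list_price, total of open_price).
stockNames = sqlDF.rdd.map(
    lambda p: (p.stock, p.open_price * p.list_price, total_open)
).collect()

for name in stockNames:
    print(name)

# ('A', 100, 600)
# ('B', 400, 600)
# ('C', 900, 600)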
