Define partition for window operation using Pyspark.pandas
I am trying to learn how to use pyspark.pandas and I am coming across an issue that I don't know how to solve. I have a df of about 700k rows and 7 columns. Here is a sample of my data:
import pyspark.pandas as ps
import pandas as pd
data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia'],
        'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan'],
        'Product': ['ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}
df = ps.DataFrame(data)
Even when I run the simplest of operations like df.head(), I get the following warning and I'm not sure how to fix it:
WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I know how to work around this with pyspark dataframes, but I'm not sure how to define a partition for the window operation using the Pandas API for Pyspark. Does anyone have any suggestions?

For Koalas, repartition seems to only take in a number of partitions here: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html
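For context, the kind of plain pyspark workaround mentioned above looks roughly like this (a minimal sketch using the sample data from this question; partitioning by Country is just for illustration and is not part of the original post):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame(data))

# With plain PySpark the window is explicitly partitioned, so the
# "No Partition Defined" warning does not appear:
w = Window.partitionBy("Country").orderBy("Year")
sdf.withColumn("row_num", F.row_number().over(w)).show()

# The pyspark.pandas/Koalas spark accessor, by contrast, only accepts a
# number of partitions, not a set of partitioning columns:
# df.spark.repartition(10)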
I think the goal here is to run Pandas functions on a Spark DataFrame. One option you can use is Fugue. Fugue can take a Python function and apply it on Spark per partition. Example code below.
from typing import List, Dict, Any
import pandas as pd
df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": (["A"]*3 + ["B"]*3 + ["C"]*3),
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})

def count(df: pd.DataFrame) -> pd.DataFrame:
    # this assumes the data is already partitioned by id
    id = df.iloc[0]["id"]
    count = df.shape[0]
    return pd.DataFrame({"id": [id], "count": [count]})
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
from fugue import transform
# Pandas
pdf = transform(df.copy(),
                count,
                schema="id:str, count:int",
                partition={"by": "id"})
print(pdf.head())
# Spark
transform(sdf,
          count,
          schema="id:str, count:int",
          partition={"by": "id"},
          engine=spark).show()
You just need to annotate your function with input and output types and then you can use it with the Fugue transform function. Schema is a requirement for Spark, so you need to pass it. If you supply spark as the engine, then the execution will happen on Spark. Otherwise, it will run on Pandas by default.
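As a follow-up, here is a rough sketch of how the same pattern might be applied to the question's original data, partitioning by Country. This assumes the data dict from the question and the spark session and transform import from the snippet above are in scope; the total_value function and its schema are illustrative, not part of the original answer:

def total_value(pdf: pd.DataFrame) -> pd.DataFrame:
    # each call receives the rows of exactly one Country partition
    country = pdf.iloc[0]["Country"]
    return pd.DataFrame({"Country": [country], "TotalValue": [pdf["Value"].sum()]})

country_sdf = spark.createDataFrame(pd.DataFrame(data))
transform(country_sdf,
          total_value,
          schema="Country:str, TotalValue:long",
          partition={"by": "Country"},
          engine=spark).show()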