Defining a partition for window operations using pyspark.pandas

Question

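I'm trying to learn how to use pyspark.pandas and have run into a problem I don't know how to solve. I have a df with about 700k rows and 7 columns. Here is a sample of my data: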

import pyspark.pandas as ps
import pandas as pd

data = {'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia', 'Asia'],
        'Country': ['South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'Japan', 'Japan', 'Japan'],
        'Product': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'DEF', 'DEF', 'DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}

df = ps.DataFrame(data)

Even when I run the simplest operations, such as df.head(), I get the following warning, and I'm not sure how to fix it:

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

I know how to work around this with PySpark, but I'm not sure how to define a partition for the window operation using the pandas API on Spark.

Does anyone have any suggestions?

Answer 1

For Koalas, repartition seems to accept only a number of partitions, not partitioning columns: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html

I think the goal here is to run pandas functions on a Spark DataFrame. One option is Fugue. Fugue can take a Python function and apply it to each Spark partition. Example code below.

import pandas as pd

df = pd.DataFrame({"date":["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": (["A"]*3 + ["B"]*3 + ["C"]*3),
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})


def count(df: pd.DataFrame) -> pd.DataFrame:
    # this assumes the data is already partitioned,
    # so every row in df shares the same id
    group_id = df.iloc[0]["id"]
    n_rows = df.shape[0]
    return pd.DataFrame({"id": [group_id], "count": [n_rows]})

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(df)

from fugue import transform

# Pandas
pdf = transform(df.copy(),
          count,
          schema="id:str, count:int",
          partition={"by": "id"})
print(pdf.head())

# Spark
transform(sdf,
          count,
          schema="id:str, count:int",
          partition={"by": "id"},
          engine=spark).show()

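You only need to annotate your function with input and output types, and then you can use it with the Fugue transform function. The schema is a Spark requirement, so you need to pass it in. If you provide spark as the engine, execution happens on Spark. Otherwise, it runs on pandas by default.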
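
To make the partitioning behaviour concrete, here is a minimal plain-Python sketch of what partition={"by": "id"} amounts to conceptually: the rows are split into groups that share an id, the function runs once per group, and the per-group results are concatenated. This is not how Fugue is implemented. transform_by_key is a hypothetical helper written only for illustration, and rows are plain dicts rather than DataFrames so the sketch needs nothing beyond the standard library.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]


def count(rows: List[Row]) -> List[Row]:
    # Mirrors count() above: assumes every row in `rows` has the same id.
    return [{"id": rows[0]["id"], "count": len(rows)}]


def transform_by_key(rows: List[Row],
                     func: Callable[[List[Row]], List[Row]],
                     by: str) -> List[Row]:
    # Hypothetical helper (not part of Fugue): split rows into one
    # partition per distinct value of `by`, then apply func to each.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[by]].append(row)

    result: List[Row] = []
    for key in sorted(partitions):  # sorted only to make the output order stable
        result.extend(func(partitions[key]))
    return result


# Same ids and values as the sample DataFrame in the answer.
ids = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
values = [3, 4, 2, 1, 2, 5, 3, 2, 3]
rows = [{"id": i, "value": v} for i, v in zip(ids, values)]

result = transform_by_key(rows, count, by="id")
print(result)
```

Because each call to count only ever sees rows for a single id, no step needs all of the data in one place, which is the situation the warning in the question is about.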
