
Define partition for window operation using Pyspark.pandas

I am trying to learn how to use pyspark.pandas and I am coming across an issue that I don't know how to solve. I have a df of about 700k rows and 7 columns. Here is a sample of my data:

import pyspark.pandas as ps
import pandas as pd

data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF'],
         'Year': [2016, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2019],
         'Price': [500, 0,450,750,0,0,890,19,120,3],
         'Quantity': [1200,0,330,500,190,70,120,300,50,80],
         'Value': [600000,0,148500,350000,0,29100,106800,74300,5500,20750]}

df = ps.DataFrame(data)

Even when I run the simplest of operations like df.head(), I get the following warning and I'm not sure how to fix it:

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

I know how to work around this with pyspark dataframes, but I'm not sure how to fix it using the pandas API for PySpark, i.e. how to define a partition for the window operation there.
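For reference, this is roughly the plain-PySpark workaround I mean; it reuses the sample data dict from above, and the cumulative sum is only an illustrative aggregation:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame(data))

# Explicit partition columns keep Spark from moving all rows into a single partition.
w = Window.partitionBy("Region", "Country", "Product").orderBy("Year")
sdf.withColumn("cum_value", F.sum("Value").over(w)).show()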

Does anyone have any suggestions?

For Koalas, the repartition seems to only take in a number of partitions here: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html
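As far as I can tell, the pyspark.pandas equivalent behaves the same way; a minimal sketch (the count of 10 is arbitrary), which is why repartitioning alone doesn't give me a column-based partition for the window:

# Only a target number of partitions can be passed, not partitioning columns.
repartitioned = df.spark.repartition(10)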

I think the goal here is to run Pandas functions on a Spark DataFrame. One option you can use is Fugue. Fugue can take a Python function and apply it on Spark per partition. Example code below.

from typing import List, Dict, Any
import pandas as pd 

df = pd.DataFrame({"date":["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": (["A"]*3 + ["B"]*3 + ["C"]*3),
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})


def count(df: pd.DataFrame) -> pd.DataFrame:
    # this assumes the data is already partitioned
    id = df.iloc[0]["id"]
    count = df.shape[0]
    return pd.DataFrame({"id": [id], "count": [count]})

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(df)

from fugue import transform

# Pandas: with no engine specified, transform runs locally on pandas
pdf = transform(df.copy(),
          count,
          schema="id:str, count:int",
          partition={"by": "id"})
print(pdf.head())

# Spark: with engine=spark, the same function runs on Spark, one call per id partition
transform(sdf,
          count,
          schema="id:str, count:int",
          partition={"by": "id"},
          engine=spark).show()

You just need to annotate your function with input and output types and then you can use it with the Fugue transform function. Schema is a requirement for Spark so you need to pass it. If you supply spark as the engine, then the execution will happen on Spark. Otherwise, it will run on Pandas by default.
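To connect this back to the original sample data, here is a sketch of the same pattern with a multi-column partition; the totals function and the TotalValue column are just illustrative and not part of the question:

def totals(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives one Region/Country/Product group as a pandas DataFrame.
    return pd.DataFrame({"Region": [pdf.iloc[0]["Region"]],
                         "Country": [pdf.iloc[0]["Country"]],
                         "Product": [pdf.iloc[0]["Product"]],
                         "TotalValue": [pdf["Value"].sum()]})

transform(spark.createDataFrame(pd.DataFrame(data)),
          totals,
          schema="Region:str, Country:str, Product:str, TotalValue:long",
          partition={"by": ["Region", "Country", "Product"]},
          engine=spark).show()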
