
Applying Python function to Pandas grouped DataFrame - what's the most efficient approach to speed up the computations?

I'm dealing with a quite large Pandas DataFrame - my dataset resembles the following df setup:

import pandas as pd
import numpy  as np

#--------------------------------------------- SIZING PARAMETERS :
R1 =                    20        # .repeat( repeats = R1 )
R2 =                    10        # .repeat( repeats = R2 )
R3 =                541680        # .repeat( repeats = [ R3, R4 ] )
R4 =                576720        # .repeat( repeats = [ R3, R4 ] )
T  =                 55920        # .tile( , T)
A1 = np.arange( 0, 2708400, 100 ) # ~ 20x re-used
A2 = np.arange( 0, 2883600, 100 ) # ~ 20x re-used

#--------------------------------------------- DataFrame GENERATION :
df = pd.DataFrame.from_dict(
         { 'measurement_id':        np.repeat( [0, 1], repeats = [ R3, R4 ] ), 
           'time':np.concatenate( [ np.repeat( A1,     repeats = R1 ),
                                    np.repeat( A2,     repeats = R1 ) ] ), 
           'group':        np.tile( np.repeat( [0, 1], repeats = R2 ), T ),
           'object':       np.tile( np.arange( 0, R1 ),                T )
           }
        )

#--------------------------------------------- DataFrame RE-PROCESSING :
df = pd.concat( [ df,
                  df                                                  \
                    .groupby( ['measurement_id', 'time', 'group'] )    \
                    .apply( lambda x: np.random.uniform( 0, 100, 10 ) ) \
                    .explode()                                           \
                    .astype( 'float' )                                    \
                    .to_frame( 'var' )                                     \
                    .reset_index( drop = True )
                  ], axis = 1
                )

Note: For the purpose of having a minimal example, it can easily be subsetted (for example with df.loc[df['time'] <= 400, :] ), but since I simulate the data anyway I thought that the original size would give a better overview.

For each group defined by ['measurement_id', 'time', 'group'] I need to call the following function:

from sklearn.cluster import SpectralClustering
from pandarallel     import pandarallel

def cluster( x, index ):
    if len( x ) >= 2:
        data = np.asarray( x )[:, np.newaxis]
        clustering = SpectralClustering( n_clusters   =  5,
                                         random_state = 42
                                         ).fit( data )
        return pd.Series( clustering.labels_ + 1, index = index )
    else:
        return pd.Series( np.nan, index = index )

To enhance the performance I tried two approaches:

Pandarallel package

The first approach was to parallelise the computations using the pandarallel package:

pandarallel.initialize( progress_bar = True )
df \
  .groupby( ['measurement_id', 'time', 'group'] ) \
  .parallel_apply( lambda x: cluster( x['var'], x['object'] ) )

However, this seems to be sub-optimal as it consumes a lot of RAM and not all cores are used in the computations (even despite specifying the number of cores explicitly in the pandarallel.initialize() method). Also, sometimes the computations are terminated with various errors, although I have not had a chance to find the reason for that (possibly a lack of RAM?).
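
For reference, this is the kind of more explicit initialisation I have been experimenting with (the particular values are just illustrative guesses, not a recommendation; psutil is only used here to get the physical core count):

import psutil
from pandarallel import pandarallel

# pin the worker count to the physical cores and skip the progress bar,
# which carries some overhead of its own; use_memory_fs = False switches
# the data transfer from /dev/shm to pipes when shared memory is tight
pandarallel.initialize( nb_workers    = psutil.cpu_count( logical = False ),
                        progress_bar  = False,
                        use_memory_fs = False )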

PySpark Pandas UDF

I also gave a Spark Pandas UDF a go, although I am totally new to Spark. Here's my attempt:

import findspark;  findspark.init()

from pyspark.sql           import SparkSession
from pyspark.conf          import SparkConf
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types     import *

spark = SparkSession.builder.master( "local" ).appName( "test" ).config( conf = SparkConf() ).getOrCreate()
df = spark.createDataFrame( df )

@pandas_udf( StructType( [StructField( 'id', IntegerType(), True )] ), functionType = PandasUDFType.GROUPED_MAP )
def cluster( df ):
    if len( df['var'] ) >= 2:
        data = np.asarray( df['var'] )[:, np.newaxis]
        clustering = SpectralClustering( n_clusters   =  5,
                                         random_state = 42
                                         ).fit( data )
        return pd.DataFrame( { 'id': clustering.labels_ + 1 },
                             index = df['object']
                             )
    else:
        return pd.DataFrame( { 'id': np.nan },
                             index = df['object']
                             )

res = df                                           \
        .groupBy( ['measurement_id', 'time', 'group'] ) \
        .apply( cluster )                            \
        .toPandas()

Unfortunately, the performance was unsatisfactory as well, and from what I have read on the topic, this may simply be the burden of using a UDF written in Python and the associated need to convert all Python objects to Spark objects and back.
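
One adjustment I have seen suggested, though I have not benchmarked whether it helps here, is to let the local master use all cores and to enable Arrow-based conversion, which should cut part of the cost of createDataFrame() / toPandas():

from pyspark.sql import SparkSession

# Arrow-based Spark <-> Pandas conversion; the option name depends on the version:
#   Spark 3.x : spark.sql.execution.arrow.pyspark.enabled
#   Spark 2.x : spark.sql.execution.arrow.enabled
spark = SparkSession.builder                                                \
                    .master( "local[*]" )                                    \
                    .appName( "test" )                                        \
                    .config( "spark.sql.execution.arrow.pyspark.enabled",
                             "true" )                                          \
                    .getOrCreate()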

So here are my questions:

  1. Could either of my approaches be adjusted to eliminate possible bottlenecks and improve the performance? (eg PySpark setup, adjusting sub-optimal operations etc.)
  2. Are there any better alternatives? How do they compare to the provided solutions in terms of performance?

Q : " Could either of my approaches be adjusted to eliminate possible bottlenecks and improve the performance? ( eg PySpark setup, adjusting sub-optimal operations etc. ) "是否可以调整我的任一方法以消除可能的瓶颈并提高性能? (例如 PySpark 设置、调整次优操作等)

+1 for mentioning the setup add-on overhead costs for either strategy of computing. These overheads always create a break-even point, only after which a non-[SERIAL] strategy may achieve any of the wished-to-have [TIME]-domain speedup (provided that the other, typically [SPACE]-domain, costs permit it or stay feasible - yes, RAM ... the existence of and access to such a sized device, the budget and other similar real-world constraints).

First,
the pre-flight check, before we take off.
The new, overhead-strict formulation of Amdahl's Law is able to incorporate both of these add-on pSO + pTO overheads and reflects them in predicting the achievable speedup levels, including the break-even point, only past which it may become meaningful (in a costs/effects, efficiency sense) to go parallel.

[ figure : the overhead-strict re-formulation of Amdahl's Law ]
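
A minimal sketch of that estimate, under my reading of the usual overhead-strict form (the exact formula in the figure is an assumption here; s is the [SERIAL]-fraction, N the number of workers, pSO / pTO the setup / termination add-on overheads, all expressed as fractions of the original single-process runtime):

def amdahl_overhead_strict( s, N, pSO, pTO ):
    # speedup = 1 / ( serial part + setup overhead + parallel part / N + termination overhead )
    return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )

# Example: a 95% parallelisable workload on 8 workers with a combined
# 10% setup + termination overhead already tops out near ~ 3.7x :
print( amdahl_overhead_strict( s = 0.05, N = 8, pSO = 0.05, pTO = 0.05 ) )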

Yet,
that is not our core problem here.
This comes next :

Next,
given the computational costs of SpectralClustering(), which is going to use the Radial Basis Function kernel ~ exp( -gamma * distance( data, data )**2 ), there seems to be nothing to gain from a split of the data-object over any number of disjunct work-units, as the distance( data, data )-component, by definition, has to visit all the data-elements (ref.: the communication costs of any-to-any value-passing { process | node }-distributed topologies are, for obvious reasons, awfully bad, if not the worst use-cases for { process | node }-distributed processing, if not straight anti-patterns (except for some indeed arcane, memory-less / state-less, yet still computing fabrics)).
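
For concreteness, a simplified stand-in for the RBF affinity such a clustering builds internally (illustrative only, not sklearn's actual code path) - every matrix entry needs a pair of samples, which is exactly the any-to-any access pattern argued above:

import numpy as np

def rbf_affinity( data, gamma = 1.0 ):
    # data : column vector of shape ( N, 1 ), e.g. the 1D 'var' values
    # the ( N, N ) squared-distance matrix touches every pair of samples
    sq_dist = ( data - data.T )**2
    return np.exp( -gamma * sq_dist )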

For pedantic analysts, yes - add to this (and we may already call it a bad state) the costs of - again - the any-to-any k-means-processing, here about O( N^( 1 + 5 * 5 ) ), which goes, for N ~ len( data ) ~ 1.12E6+, awfully against our wish to have some smart and fast processing.
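
A rough back-of-the-envelope estimate (mine, derived only from the N quoted above) of why that any-to-any component does not split cheaply - one dense N x N float64 distance / affinity matrix alone already weighs in at about ten terabytes:

import numpy as np

N = 1_120_000                                      # ~ len( data ), as quoted above
bytes_per_value = np.dtype( np.float64 ).itemsize  # 8 B

print( N * N * bytes_per_value / 1e12, "TB" )      # ~ 10 TB for one dense N x N matrix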

So what?

While the setup costs are not neglected, the increased communication costs will almost for sure disable any improvement from using the above-sketched attempts to move from a pure-[SERIAL] process-flow into some form of just-[CONCURRENT] or True-[PARALLEL] orchestration of some work-sub-units, due to the increased overheads related to the need to implement (a tandem pair of) any-to-any value-passing topologies.

If it weren't for 'em?

Well, this sounds like a Computing Science oxymoron - even if it were possible, the costs of the any-to-any pre-computed distances (which would take those immense [TIME]-domain complexity costs "beforehand" (Where? How? Is there any other, un-avoidable latency, permitting a possible latency masking by some (so far unknown) incremental build-up of a complete-in-future any-to-any distance matrix?)) would only re-position these principally present costs to some other location in the [TIME]- and [SPACE]-domains, not reduce them.

Q : "Are they any better alternatives? "“它们有更好的选择吗?

The only one I am aware of so far is to try, if the problem can be re-formulated into another, QUBO-formulated, problem fashion (ref.: Quadratic Unconstrained Binary Optimisation; the good news is that tools for doing so, a base of first-hand knowledge and practical problem-solving experience, exist and grow larger).

Q : How do they compare to the provided solutions in terms of performance?

The performance is breathtaking - the QUBO-formulated problem has a promising O(1) (!) solver in constant time (in the [TIME]-domain) and is somewhat restricted in the [SPACE]-domain (where the recently announced LLNL tricks may help avoid this physical-world, current QPU-implementation constraint on problem sizes).

This is not an answer, but...

If you run

df.groupby(['measurement_id', 'time', 'group']).apply(
    lambda x: cluster(x['var'], x['object']))

(ie, with Pandas alone), you will notice that you are already using several cores. This is because sklearn uses joblib by default to parallelise the work. You can swap out the scheduler in favour of Dask (see the sketch below) and perhaps get more efficiency out of sharing the data between threads, but so long as the work you are doing is CPU-bound like this, there will be nothing you can do to speed it up.
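
A minimal sketch of what that scheduler swap could look like, assuming a local dask.distributed cluster is acceptable (how much of SpectralClustering's work actually goes through joblib, and thus benefits, is something I have not measured):

import joblib
from dask.distributed import Client

client = Client( processes = False )   # local, thread-based Dask scheduler

# nested joblib calls made inside sklearn are routed to Dask within this block
with joblib.parallel_backend( "dask" ):
    res = df.groupby( ['measurement_id', 'time', 'group'] ).apply(
        lambda x: cluster( x['var'], x['object'] ) )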

In short, this is an algorithm problem: figure out what you really need to compute, before trying to consider different frameworks for computing it.

I am not an expert on Dask, but I provide the following code as a baseline:

import dask.dataframe as ddf

df = ddf.from_pandas(df, npartitions=4) # My PC has 4 cores

task = df.groupby(["measurement_id", "time", "group"]).apply(
    lambda x: cluster(x["var"], x["object"]),
    meta=pd.Series(np.nan, index=pd.Series([0, 1, 1, 1])),
)

res = task.compute()
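
A possible variation (untested here, just a knob worth knowing about): Dask lets you choose the scheduler at compute time, and a process-based one sidesteps the GIL at the cost of pickling each group back and forth - whether that wins for this workload has to be measured.

# process-based scheduler instead of the default one used by task.compute()
res = task.compute( scheduler = "processes" )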
