
How to efficiently compare two 1x1 Spark DataFrames?

I have a use case where I need to efficiently compare the average values of two columns. More specifically, I want to find the percent change between the two values (which involves some algebra and comparisons between the numbers).

To do this, I start by grouping and aggregating the average over the column that I want, which gives me a DataFrame with a single float in it (i.e. a DataFrame with one cell). Now, what I originally did was grab this value from the DataFrame using:

my_df.head()[0]

but it turns out that this is very slow (several seconds to bring this DataFrame to the driver, I believe). I am unsure how else to get this value, or how to compare it with another average value (which is aggregated/obtained in the same way). Side note: .collect()[0][0] also has this speed issue.
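
For concreteness, here is a minimal sketch of the pattern described above; the DataFrame my_df and its column name "value" are placeholders, not the actual names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
my_df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

# aggregate down to a 1x1 DataFrame, then pull the single float to the driver
avg_df = my_df.agg(F.avg("value"))
print(avg_df.head()[0])  # this single lookup still launches a full Spark job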

Is there a way to get this average value without such a slow runtime, or otherwise compare the two average values in these separate DataFrames?

If your two dataframes have a common key, you can join them on that key and compare the aggregated average columns directly. If they don't, and you just need the average across the complete dataframe, you can add a key column with a constant value to both dataframes using F.lit:

df = df.withColumn("key", F.lit(1))

However, as soon as you want to actually see the results, they need to be collected anyway, and this will take some time even for really small dataframes. Spark can compute an average over 1 billion rows very quickly compared to tools like Pandas, because it builds up an infrastructure that allows it to compute subtasks of the problem in a distributed fashion. Building this infrastructure takes Spark some time. If you only want to compute the average of 3 rows, it is not worth building such complex infrastructure. That means: either don't use Spark for tasks like that, or live with the fact that Spark is slower on small data sets than Pandas. To speed things up a little for small dataframes in Spark, cache them (that is, load them into memory) before you work with them:

df = df.cache()

Putting the pieces together, here is a complete example (with the imports and SparkSession setup needed to run it):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([
    [1, 1],
    [1, 2],
    [1, 3]
], ["key", "a"])

dfB = spark.createDataFrame([
    [1, 2],
    [1, 3],
    [1, 4]
], ["key", "b"])

# inner join on the shared key, then compute both averages in a single aggregation
dfC = dfA.join(dfB, on=["key"], how="inner")
(
    dfC
    .groupBy("key")
    .agg(
        F.avg("a").alias("avg_a"), 
        F.avg("b").alias("avg_b")
    )
    .withColumn("avg_is_equal", F.expr("avg_a = avg_b"))
    .show()
)

Output

+---+-----+-----+------------+
|key|avg_a|avg_b|avg_is_equal|
+---+-----+-----+------------+
|  1|  2.0|  3.0|       false|
+---+-----+-----+------------+
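
Since the original goal was a percent change rather than an equality check, the same single-job pattern extends naturally. The sketch below assumes avg_a is the baseline (and non-zero):

(
    dfC
    .groupBy("key")
    .agg(
        F.avg("a").alias("avg_a"),
        F.avg("b").alias("avg_b")
    )
    # percent change of avg_b relative to avg_a
    .withColumn("pct_change", F.expr("(avg_b - avg_a) / avg_a * 100"))
    .show()
)

With the sample data above, pct_change comes out to 50.0.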
