

What is the Exact Apache-Spark NA Treatment Difference Pandas vs MLLib for Covariance Computation?

I recently observed significant differences between covariance results computed in Pandas and the MLLib equivalent. The results are reasonably close for fully specified inputs (i.e. without any NAs) but deviate significantly when values are missing. The Pandas source explains how NAs are treated, but I could not reproduce those results using Spark. I could not find documentation in the source on what exactly `RowMatrix().computeCovariance()` does with regard to NAs - but my Scala is fair at best and I am unfamiliar with BLAS, so perhaps I missed something. There is also the following BLAS warning, whose cause I could not track down since I am using a pre-built macOS Spark setup:

WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS

Given the importance of covariance for many applications, I wonder if someone could shed some light on the exact treatment of missing values in the covariance calculation of Apache Spark MLLib?

EDIT: Additionally, this is not resolved in the current Spark 3.2 release, since "The method `pd.DataFrame.cov()` is not implemented yet".

Assuming the following setup:

from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("MyApp") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()
sc = spark.sparkContext
good_rows = sc.parallelize([[11, 12, 13, 14, 16, 17, 18], 
                            [21, 22, 23, 42, 26, 27, 28],
                            [31, 32, 33, 34, 36, 37, 38],
                            [41, 42, 43, 44, 46, 47, 48],
                            [51, 52, 53, 54, 56, 57, 58],
                            [ 1,  2,  3,  4,  6,  7,  8]])
bad_rows = sc.parallelize([[11, 12, None, 14, 16, None, 18], 
                           [21, 22, None, 42, 26, None, 28],
                           [31, 32, None, 34, 36, None, 38],
                           [41, 42, 43, 44, 46, 47, 48],
                           [51, 52, 53, 54, 56, 57, 58],
                           [ 1,  2,  3,  4,  6,  7,  8]])

The covariance matrices computed from good_rows are identical for Pandas and Spark:

good_rows.toDF().toPandas().cov()
# Results in:
       _1     _2     _3     _4     _5     _6     _7
_1  350.0  350.0  350.0  332.0  350.0  350.0  350.0
_2  350.0  350.0  350.0  332.0  350.0  350.0  350.0
_3  350.0  350.0  350.0  332.0  350.0  350.0  350.0
_4  332.0  332.0  332.0  368.0  332.0  332.0  332.0
_5  350.0  350.0  350.0  332.0  350.0  350.0  350.0
_6  350.0  350.0  350.0  332.0  350.0  350.0  350.0
_7  350.0  350.0  350.0  332.0  350.0  350.0  350.0

spark.createDataFrame(RowMatrix(good_rows).computeCovariance().toArray().tolist()).toPandas()
# Results in:
      _1     _2     _3     _4     _5     _6     _7
0  350.0  350.0  350.0  332.0  350.0  350.0  350.0
1  350.0  350.0  350.0  332.0  350.0  350.0  350.0
2  350.0  350.0  350.0  332.0  350.0  350.0  350.0
3  332.0  332.0  332.0  368.0  332.0  332.0  332.0
4  350.0  350.0  350.0  332.0  350.0  350.0  350.0
5  350.0  350.0  350.0  332.0  350.0  350.0  350.0
6  350.0  350.0  350.0  332.0  350.0  350.0  350.0
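As a quick sanity check of mine (not something taken from either library's docs), the 350.0 values are simply the sample variance with the n-1 denominator, so both libraries agree on the unbiased estimator when nothing is missing:

import numpy as np
# Sample variance (ddof=1) of column _1 of good_rows; both Pandas and MLLib
# report this same value on the diagonal above.
np.var([11, 21, 31, 41, 51, 1], ddof=1)
# 350.0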

Running the same with bad_rows produces very different covariance matrices, unless Pandas' cov() is run with min_periods=(bad_rows.count()/2)+1:

bad_rows.toDF().toPandas().cov()
#Results in: 
       _1     _2     _3     _4     _5     _6     _7
_1  350.0  350.0  700.0  332.0  350.0  700.0  350.0
_2  350.0  350.0  700.0  332.0  350.0  700.0  350.0
_3  700.0  700.0  700.0  700.0  700.0  700.0  700.0
_4  332.0  332.0  700.0  368.0  332.0  700.0  332.0
_5  350.0  350.0  700.0  332.0  350.0  700.0  350.0
_6  700.0  700.0  700.0  700.0  700.0  700.0  700.0
_7  350.0  350.0  700.0  332.0  350.0  700.0  350.0
spark.createDataFrame(RowMatrix(bad_rows).computeCovariance().toArray().tolist()).toPandas()
# Results in:
      _1     _2  _3     _4     _5  _6     _7
0  350.0  350.0 NaN  332.0  350.0 NaN  350.0
1  350.0  350.0 NaN  332.0  350.0 NaN  350.0
2    NaN    NaN NaN    NaN    NaN NaN    NaN
3  332.0  332.0 NaN  368.0  332.0 NaN  332.0
4  350.0  350.0 NaN  332.0  350.0 NaN  350.0
5    NaN    NaN NaN    NaN    NaN NaN    NaN
6  350.0  350.0 NaN  332.0  350.0 NaN  350.0

bad_rows.toDF().toPandas().cov(min_periods=(bad_rows.count()/2)+1)
# With 50% of dataframe rows +1 Pandas equals the Spark result:
       _1     _2  _3     _4     _5  _6     _7
_1  350.0  350.0 NaN  332.0  350.0 NaN  350.0
_2  350.0  350.0 NaN  332.0  350.0 NaN  350.0
_3    NaN    NaN NaN    NaN    NaN NaN    NaN
_4  332.0  332.0 NaN  368.0  332.0 NaN  332.0
_5  350.0  350.0 NaN  332.0  350.0 NaN  350.0
_6    NaN    NaN NaN    NaN    NaN NaN    NaN
_7  350.0  350.0 NaN  332.0  350.0 NaN  350.0
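For reference, a quick check of the non-null counts per column (my own check, assuming toDF().toPandas() turns the None values into NaN) shows why min_periods=(6/2)+1 = 4 blanks out exactly the _3 and _6 rows and columns:

bad_rows.toDF().toPandas().count()
# Non-null observations per column:
# _1    6
# _2    6
# _3    3
# _4    6
# _5    6
# _6    3
# _7    6
# Columns _3 and _6 have only 3 complete values, below min_periods=4,
# so every pair involving them is reported as NaN.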

I did try setting None to 0 and to the mean, but could not reproduce the MLLib covariance results with these standard imputations, see below.

# Zero NA fill:
zeroed_na_rows = sc.parallelize([[11, 12, 0, 14, 16, 0, 18], 
                       [21, 22, 0, 42, 26, 0, 28],
                       [31, 32, 0, 34, 36, 0, 38],
                       [41, 42, 43, 44, 46, 47, 48],
                       [51, 52, 53, 54, 56, 57, 58],
                       [1, 2, 3, 4, 6, 7, 8]])
spark.createDataFrame(RowMatrix(zeroed_na_rows).computeCovariance().toArray().tolist()).toPandas()
# Results in:
      _1     _2     _3     _4     _5     _6     _7
0  350.0  350.0  379.0  332.0  350.0  391.0  350.0
1  350.0  350.0  379.0  332.0  350.0  391.0  350.0
2  379.0  379.0  606.7  319.6  379.0  646.3  379.0
3  332.0  332.0  319.6  368.0  332.0  324.4  332.0
4  350.0  350.0  379.0  332.0  350.0  391.0  350.0
5  391.0  391.0  646.3  324.4  391.0  690.7  391.0
6  350.0  350.0  379.0  332.0  350.0  391.0  350.0

# Mean NA fill:
mean_rows = sc.parallelize([[11, 12, 27, 14, 16, 37, 18], 
                           [21, 22, 27, 42, 26, 37, 28],
                           [31, 32, 27, 34, 36, 37, 38],
                           [41, 42, 43, 44, 46, 47, 48],
                           [51, 52, 53, 54, 56, 57, 58],
                           [ 1,  2,  3,  4,  6,  7,  8]])
spark.createDataFrame(RowMatrix(mean_rows).computeCovariance().toArray().tolist()).toPandas()
#Results in (still different from Pandas.cov()):
      _1     _2     _3     _4     _5     _6     _7
0  350.0  350.0  298.0  332.0  350.0  280.0  350.0
1  350.0  350.0  298.0  332.0  350.0  280.0  350.0
2  298.0  298.0  290.8  287.2  298.0  280.0  298.0
3  332.0  332.0  287.2  368.0  332.0  280.0  332.0
4  350.0  350.0  298.0  332.0  350.0  280.0  350.0
5  280.0  280.0  280.0  280.0  280.0  280.0  280.0
6  350.0  350.0  298.0  332.0  350.0  280.0  350.0

If it's not that, what is going on here, and how do I get Spark MLLib to produce reasonably similar results to Pandas?

I don't think there is an easy way to reproduce Pandas' treatment of NaNs in Spark without re-implementing your own cov method.

The reason is that Pandas simply ignores every NaN - it does not replace it with any value - which is why replacing the NaNs with 0 or the mean does not lead to the same results. Pandas instead appears to throw away any pair of observations with a missing value and computes the covariance on the remaining pairs.
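To illustrate with the data above (a small check of mine, not something from the Pandas docs), the _1/_3 entry of 700.0 in the Pandas result comes from the three rows where both columns are present:

pdf = bad_rows.toDF().toPandas()
pair = pdf[["_1", "_3"]].dropna()   # drops the first three rows, where _3 is missing
pair.cov()
# All four entries are 700.0, matching the _1/_3 (and _3/_3) entries of pdf.cov() above.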

The Spark implementation, on the other hand, returns NaN when it is asked to compute the covariance of a set of pairs that contains a NaN. I don't know at what point exactly this happens in the code/calculation, but as far as I can see you can't change it easily by just changing a default parameter. You might have to create your own version of the cov function, or find a way to pre- and post-process the columns with NaNs, e.g. remove their NaNs, calculate the covariance on what remains, and replace the NaNs in the resulting covariance matrix with those values.
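A rough sketch of that pre-/post-processing idea (my own, assuming the data is small enough to collect to the driver; `patched` is just an illustrative name):

import numpy as np

pdf = bad_rows.toDF().toPandas()
spark_cov = RowMatrix(bad_rows).computeCovariance().toArray()

# Fill every NaN entry of the MLLib result with the pairwise-complete
# covariance (or variance) computed in Pandas, mimicking pdf.cov().
patched = spark_cov.copy()
for i, j in zip(*np.where(np.isnan(patched))):
    if i == j:
        patched[i, j] = pdf.iloc[:, i].dropna().var()        # sample variance, ddof=1
    else:
        pair = pdf.iloc[:, [i, j]].dropna()                   # rows where both columns are present
        patched[i, j] = pair.cov().iloc[0, 1]
# For this example, patched should now match pdf.cov() entry for entry.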
