Difference between approxCountDistinct and approx_count_distinct in Spark functions

Original post: 2020-09-01 21:44:05 | Tags: python, apache-spark, pyspark

Can anyone explain the difference between pyspark.sql.functions.approxCountDistinct (I know it is deprecated) and pyspark.sql.functions.approx_count_distinct? I have used both versions in a project and got different values.

1 Answer

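As you mentioned, pyspark.sql.functions.approxCountDistinct is deprecated. The reason is most likely just a matter of style: they probably wanted everything in snake_case. As you can see in the source code, pyspark.sql.functions.approxCountDistinct simply calls pyspark.sql.functions.approx_count_distinct and does nothing else apart from issuing a warning. So whichever one you use, the same code ends up running.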
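The wrapper pattern described here can be sketched in plain Python. This is only an illustration, not the PySpark source: the two functions below are stand-ins that work on an ordinary list instead of a Spark column, the stand-in returns an exact count, and the warning text is an assumption.

```python
import warnings


def approx_count_distinct(values, rsd=None):
    # Stand-in for illustration only: counts distinct items of a Python
    # iterable exactly. The real PySpark function builds a Column expression.
    return len(set(values))


def approxCountDistinct(values, rsd=None):
    # Deprecated camelCase alias: emit a warning, then delegate unchanged.
    warnings.warn(
        "approxCountDistinct is deprecated, use approx_count_distinct instead.",
        DeprecationWarning,
    )
    return approx_count_distinct(values, rsd)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_result = approxCountDistinct([1, 2, 2, 3])

new_result = approx_count_distinct([1, 2, 2, 3])
print(old_result == new_result, [w.category.__name__ for w in caught])
```

Because the alias only forwards its arguments, any difference between two runs would have to come from somewhere other than the choice of name.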

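Also, still according to the source code, approx_count_distinct is based on the HyperLogLog++ algorithm. I am not very familiar with the algorithm, but it is based on repeated set merging. The result therefore probably depends on the order in which the partial results from the various executors are merged. Since that order is not deterministic in Spark, this could explain why you see different results.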
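To make the merging step concrete, here is a minimal pure-Python sketch of a plain HyperLogLog. It is not Spark's HyperLogLog++ implementation, which adds bias correction and other refinements. The register count `P`, the SHA-256-based hash and all function names are assumptions chosen for illustration. Each "partition" builds its own sketch, and the partial sketches are then combined.

```python
import hashlib
import math

P = 10          # index bits (assumed for this sketch)
M = 1 << P      # number of registers


def _hash64(value):
    # 64-bit hash taken from the first 8 bytes of SHA-256 (illustrative choice).
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                    # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)       # remaining 64 - P bits
    # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
    rank = (64 - P) - rest.bit_length() + 1
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a, b):
    # Combine two partial sketches register by register.
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)     # small-range (linear counting) correction
    return raw


# Three overlapping "partitions" covering 20000 distinct values in total.
partitions = [range(0, 8000), range(6000, 14000), range(12000, 20000)]
sketches = []
for part in partitions:
    s = new_sketch()
    for v in part:
        add(s, v)
    sketches.append(s)

merged = merge(merge(sketches[0], sketches[1]), sketches[2])
print("estimated distinct values:", round(estimate(merged)))
```

In this simplified version `merge` is a register-wise maximum, so combining the partial sketches in a different order gives the same registers. The sketch says nothing about how Spark's own implementation or execution plan behaves, so it does not settle where the differing values in the question come from.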

Related questions

How to use approx_count_distinct to count distinct combinations of two columns in a Spark DataFrame?
Why is there a difference between starting with a "0" and "3" for approx
Count distinct sets between two columns, while using agg function Pyspark Spark Session
Difference between functions and methods
What is the difference between PySpark and Spark?
Pandas: count difference between dates
Difference between copy functions in python
Difference between DISTRIBUTE BY and Shuffle in Spark-SQL
Difference between functions and non-functions?
What is the difference between read and open functions?


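Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please credit this site's URL or the address of the original post.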
