
Spark: PySpark + Cassandra query performance

I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB RAM) for testing purposes and edited spark-defaults.conf as follows:

spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4

Next I imported 1.5 million rows into Cassandra:

CREATE TABLE test (
    tid int,
    cid int,
    pid int,
    ev list<double>,
    PRIMARY KEY (tid)
);

test.ev is a list containing numeric values, i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]

Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra and ran a simple select count:

cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
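
For reference, here is a minimal sketch of how such a session could be created; the connector package, application name and connection host below are assumptions rather than details taken from the question:

from pyspark.sql import SparkSession

# Hypothetical session setup: the Cassandra connector jar is assumed to be on the
# classpath (e.g. supplied via spark-submit --packages with a 2.0.x build of
# com.datastax.spark:spark-cassandra-connector_2.11), and the host is assumed local.
spark = (SparkSession.builder
         .appName("cassandra-count-test")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())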

At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed across 13 Tasks (in the Spark UI, the total Input for the Tasks is 331.6MB).

Questions:

  • Is that the expected performance? If not, what am I missing?
  • Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting spark.sql.shuffle.partitions to 4, why is it creating 13 Tasks? (I also verified the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)

Update

A common operation I would like to test over this data:

  • Query a large data set, say, from 100,000 ~ N rows grouped by pid
  • Select ev, a list<double>
  • Perform an average on each member, assuming by now each list has the same length, i.e. df.groupBy('pid').agg(avg(df['ev'][1])) (see the sketch after this list)
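
A rough sketch of that aggregation extended to every list position follows; the fixed length EV_LEN and the column aliases are assumptions made for illustration:

from pyspark.sql.functions import avg, col

EV_LEN = 15  # assumed fixed length of every ev list

# one avg(ev[i]) expression per list position, aliased ev_0 ... ev_14
aggs = [avg(col("ev")[i]).alias("ev_%d" % i) for i in range(EV_LEN)]

result = df.groupBy("pid").agg(*aggs)
result.show()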

As @zero323 suggested, I deployed an external machine (2 GB RAM, 4 cores, SSD) running Cassandra just for this test and loaded the same data set. The result of df.select().count() was, as expected, higher latency and overall poorer performance compared with my previous test (it took about 70 seconds to finish the Job).

Edit: I misunderstood his suggestion. @zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained here.

I also want to point out that I am aware of the inherent anti-pattern of using a list<double> instead of a wide row for this type of data, but my concern at the moment is more the time spent retrieving a large dataset than the actual average computation time.

Is that the expected performance? If not, what am I missing?

It looks slowish but it is not exactly unexpected. In general count is expressed as

SELECT 1 FROM table

followed by Spark-side summation. So while it is optimized, it is still rather inefficient because you have to fetch N long integers from the external source just to sum them locally.
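
If you want to see this for yourself, a quick way is to inspect the plan of the equivalent global aggregation (a sketch; groupBy() with no columns builds the same kind of global count):

# df is the Cassandra-backed DataFrame from above; explain(True) prints the parsed,
# analyzed, optimized and physical plans, showing a scan of the Cassandra relation
# followed by a local aggregation
df.groupBy().count().explain(True)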

As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.

Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is it creating (...) Tasks?

Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or for global aggregations like count(*) (which always use 1 partition for the final aggregation).

If you are interested in controlling the number of initial partitions you should take a look at spark.cassandra.input.split.size_in_mb, which defines:

Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism

As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it in general is not an optimal choice.
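
A hedged sketch of adjusting the split size from the session configuration; the 256 MB figure is only an illustrative value, not a recommendation, and the setting must be in place before the session is first created:

from pyspark.sql import SparkSession

# larger split size -> fewer, bigger input partitions from Cassandra (value is illustrative)
spark = (SparkSession.builder
         .config("spark.cassandra.input.split.size_in_mb", "256")
         .getOrCreate())

df = spark.read.format("org.apache.spark.sql.cassandra").load(keyspace="testks", table="test")
print(df.rdd.getNumPartitions())  # how many input partitions the connector actually created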

I see that this is a very old question but maybe someone needs it now. When running Spark on a local machine, it is very important to set the SparkConf master to "local[*]", which, according to the documentation, allows Spark to run with as many worker threads as there are logical cores on your machine.

It helped me increase the performance of the count() operation by 100% on my local machine compared to master "local".
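
A minimal sketch of that setting (the application name is an arbitrary placeholder):

from pyspark.sql import SparkSession

# "local[*]" uses one worker thread per logical core, whereas plain "local" runs a single thread
spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()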
