
Cassandra + Spark executor hyperconvergence

As Apache Spark is a suggested distributed processing engine for Cassandra, I know that it is possible to run Spark executors alongside the Cassandra nodes. My question is whether the driver and the Spark connector are smart enough to understand partitioning and shard allocation, so that data is processed in a hyper-converged manner.

Put simply: do the executors read data from the partitions hosted on the nodes where those executors are running, so that no unnecessary data is transferred across the network, the way Spark does when it runs over HDFS?

Yes, the Spark Cassandra Connector is able to do this. From the source code:

The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
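
To illustrate, here is a minimal sketch (the contact point 10.0.0.1 and the table ks.users are hypothetical placeholders) of how this locality can be observed from the RDD API: the connector's cassandraTable RDD exposes the replica hosts of each token range through Spark's standard preferredLocations, which is what the scheduler uses to place tasks on collocated executors.

```scala
// Minimal sketch, assuming the DataStax Spark Cassandra Connector is on the
// classpath; the contact point and table names are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("cassandra-locality-check")
  .set("spark.cassandra.connection.host", "10.0.0.1") // hypothetical Cassandra node
val sc = new SparkContext(conf)

// cassandraTable builds a CassandraTableScanRDD; its getPreferredLocations
// returns the Cassandra replicas owning each token range of the scan.
val rdd = sc.cassandraTable("ks", "users")

// If executors run on the same machines as Cassandra, tasks are scheduled on
// these hosts and reads stay node-local instead of crossing the network.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> preferred hosts: ${rdd.preferredLocations(p).mkString(", ")}")
}
```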

Theoretically, yes. The same holds for HDFS. However, in practice I have seen less of it in the cloud, where separate nodes are used for Spark and Cassandra when their respective cloud services are used. If you use IaaS and set up your own Cassandra and Spark clusters, then you can achieve it.

I would like to add to Alex's answer:

Yes, the Spark Cassandra Connector is able to do this. From the source code:

The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.

However, I would argue that this is actually bad behavior.

In Cassandra, when you ask for the data of a particular partition, only one node is accessed. Spark can actually access 3 nodes thanks to replication. So, without any shuffling, you have 3 nodes participating in the job.

In Hadoop, however, when you ask for the data of a particular partition, usually all nodes in the cluster are accessed, and Spark then uses all nodes in the cluster as executors.

So if you have 100 nodes: in Cassandra, Spark will take advantage of 3 nodes; in Hadoop, Spark will take advantage of all 100 nodes.
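
To make that concrete, here is a hedged sketch (hypothetical table ks.users with partition key user_id) of the connector mechanism that keeps a lookup job on the replicas of the requested partitions: repartitionByCassandraReplica groups the keys by the replica that owns them, and joinWithCassandraTable then performs partition-key reads against that local replica, so only the (typically 3) replica nodes of each partition do the work.

```scala
// Hedged sketch, assuming the DataStax Spark Cassandra Connector and a
// hypothetical table ks.users(user_id int PRIMARY KEY, name text).
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

case class UserKey(user_id: Int)

val conf = new SparkConf()
  .setAppName("replica-local-lookup")
  .set("spark.cassandra.connection.host", "10.0.0.1") // hypothetical contact point
val sc = new SparkContext(conf)

// Keys of the Cassandra partitions we want to read.
val keys = sc.parallelize(Seq(UserKey(42), UserKey(7)))

val rows = keys
  // Shuffle the keys so that each Spark partition only holds keys owned by one
  // Cassandra replica; tasks are then scheduled on that replica's host.
  .repartitionByCassandraReplica("ks", "users")
  // Point lookups by partition key, served by the local replica.
  .joinWithCassandraTable("ks", "users")

rows.collect().foreach(println)
```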

Cassandra is optimized for real-time operational systems, and is therefore not optimized for analytics workloads such as data lakes.
