How does data locality work with OpenStack Swift on IBM Bluemix?

Original post: 2015-09-14 20:38:57 · Tags: apache-spark, ibm-cloud, openstack-swift

I'm currently playing around with the Apache Spark Service in IBM Bluemix. Since the IBM Cloud relies on OpenStack Swift as data storage for this service, I'm wondering whether there is any data locality (or whether it is at least possible) with that architecture.

If I'm right, with HDFS the Spark driver asks the HDFS NameNode which DataNodes hold the various blocks of a file and then schedules the work on the Spark workers accordingly.
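
To make the HDFS behaviour described above concrete, here is a toy model of locality-aware assignment. It is only a sketch: `assign_blocks`, the block IDs, and the host names are all hypothetical, and Spark's real scheduler is considerably more involved (locality levels, delay scheduling, and so on).

```python
def assign_blocks(block_locations, workers):
    """Assign each block to a worker, preferring a worker that holds a replica.

    block_locations: dict mapping block id -> list of hosts holding a replica
                     (the kind of answer a NameNode gives for a file).
    workers:         list of hosts that run a Spark worker.
    Returns a dict mapping block id -> (chosen worker, is_local).
    """
    load = {w: 0 for w in workers}  # number of blocks assigned per worker
    assignment = {}
    for block, hosts in block_locations.items():
        # Workers co-located with a replica of this block, if any.
        local = [w for w in workers if w in hosts]
        # Fall back to any worker when no replica is local.
        candidates = local or workers
        chosen = min(candidates, key=lambda w: load[w])  # least-loaded candidate
        load[chosen] += 1
        assignment[block] = (chosen, bool(local))
    return assignment


# Hypothetical cluster: three workers, one block with no local replica.
locations = {
    "blk_0": ["node1", "node2"],
    "blk_1": ["node2", "node3"],
    "blk_2": ["node9"],
}
plan = assign_blocks(locations, ["node1", "node2", "node3"])
```

With a store that does not expose replica locations to the driver, every block would take the fallback branch, which is the situation the question is asking about.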

So I've checked the Swift API: there is a Range parameter that would allow a Spark worker to at least read only the local blocks, but how can the Spark driver find out these ranges?
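
For reference, a ranged read against Swift is an ordinary object GET carrying an HTTP `Range` header. The sketch below only builds such a request with Python's standard library and does not send it. The helper name `build_ranged_get`, the storage URL, container, object name, and token are made-up placeholders, and it assumes the names need no URL quoting. It shows how a worker could read a byte range, not how the driver could learn which ranges are local.

```python
import urllib.request


def build_ranged_get(storage_url, container, obj, token, start, end):
    """Build (but do not send) a GET for bytes start..end (inclusive) of a Swift object."""
    # Swift object URLs have the form <storage URL>/<container>/<object>.
    url = f"{storage_url.rstrip('/')}/{container}/{obj}"
    req = urllib.request.Request(url, method="GET")
    req.add_header("X-Auth-Token", token)             # Swift auth token header
    req.add_header("Range", f"bytes={start}-{end}")   # standard HTTP byte range
    return req


# Hypothetical endpoint and credentials, for illustration only.
req = build_ranged_get(
    "https://swift.example.com/v1/AUTH_demo",
    "logs", "events.csv", "demo-token", 0, 1023,
)
# Sending it with urllib.request.urlopen(req) against a real cluster should
# return only the requested bytes (HTTP 206 Partial Content).
```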

Any ideas?

1 answer

This is the disaggregation of compute and storage. That is, the Spark compute nodes are not shared at all with the Swift cluster's storage nodes. This lets compute scale independently of storage, and vice versa. But in this model you cannot have data locality, by definition.

Roughly, this works as follows: each Spark executor pulls its own range of blocks of the object from the Swift cluster, so no executor has to pull in all of the object data (which would be inefficient) just to operate on its own portion. But the blocks are still pulled from the remote Swift cluster, so they are not local. The only question here is how long it takes to pull the blocks into each executor, and whether that slows you down. In the case of the Bluemix Apache Spark Service and the Bluemix or SoftLayer Object Storage service, there is low latency and a fast network between them.
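
As a rough illustration of "each executor pulls its own range", the sketch below splits an object of known size into contiguous, inclusive byte ranges, one per executor. The function name `split_byte_ranges` is hypothetical, and the even split is an assumption made for illustration; the actual partitioning in the Spark/Swift connector is driven by its configured split size and is not shown here.

```python
def split_byte_ranges(object_size, num_executors):
    """Split bytes 0..object_size-1 into at most num_executors inclusive ranges."""
    if object_size <= 0 or num_executors <= 0:
        return []
    chunk = -(-object_size // num_executors)  # ceiling division
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk, object_size) - 1  # inclusive end offset
        ranges.append((start, end))
        start = end + 1
    return ranges


# A 10-byte object shared by 3 executors.
parts = split_byte_ranges(10, 3)
```

Each `(start, end)` pair maps directly onto an HTTP `Range: bytes=start-end` header, so every executor transfers only its own slice over the network, even though none of the slices is local.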

re: "Since the IBM Cloud relies on OpenStack Swift as Data Storage for this service". There will be other data sources available to the Spark service as the beta progresses, so it will not rely 100% on Swift.