
Setup and configuration of Titan for a Spark cluster and Cassandra

There are already several questions on the Aurelius mailing list, as well as here on Stack Overflow, about specific problems with configuring Titan to work with Spark. But what is missing, in my opinion, is a high-level description of a simple setup that uses Titan and Spark.

What I am looking for is a fairly minimal setup that uses recommended settings. For example, for Cassandra the replication factor should be 3, and a dedicated datacenter should be used for analytics.

From the information I found in the documentation of Spark, Titan, and Cassandra, such a minimal setup could look like this:

  • Real-time processing DC: 3 nodes with Titan + Cassandra (RF: 3)
  • Analytics DC: 1 Spark master + 3 Spark slaves with Cassandra (RF: 3)

Some questions I have about that setup and Titan + Spark in general:

  1. Is that setup correct?
  2. Should Titan also be installed on the 3 Spark slave nodes and/or the Spark master?
  3. Is there another setup that you would use instead?
  4. Will the Spark slaves only read data from the analytics DC, and ideally even from Cassandra on the same node?

Maybe someone can even share a config file that supports such a setup (or a better one).

So I just tried it out and set up a simple Spark cluster to work with Titan (and Cassandra as the storage backend), and here is what I came up with:

High-Level Overview

I concentrate only on the analytics side of the cluster here, so I leave out the real-time processing nodes.

[Figure: High-level overview of the analytics datacenter]

Spark consists of one (or more) master and multiple slaves (workers). Since the slaves do the actual processing, they need access to the data they work on. Therefore, Cassandra is installed on the workers and holds the graph data from Titan.

Jobs are sent from the Titan nodes to the Spark master, which distributes them to its workers. Therefore, Titan basically only communicates with the Spark master.

HDFS is only needed because TinkerPop stores intermediate results in it. Note that this changed in TinkerPop 3.2.0 (see the sketch below).
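
For reference, since TinkerPop 3.2.0 the intermediate results can be kept in Spark's own storage instead of HDFS. A minimal sketch of the relevant properties, assuming a TinkerPop 3.2.x setup (hypothetical for this Titan 1.0.0 guide; verify the property names against the TinkerPop documentation for your version):

# Keep intermediate results in Spark instead of writing them to HDFS (TinkerPop 3.2.x, assumed)
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD
# Keep the Spark context alive between jobs so persisted RDDs stay available
gremlin.spark.persistContext=true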

Installation

HDFS

I just followed a tutorial I found here. There are only two things to keep in mind for Titan:

  • Choose a Hadoop version that is compatible with Titan; for Titan 1.0.0, this is 1.2.1.
  • TaskTrackers and JobTrackers from Hadoop are not needed, as we only want HDFS and not MapReduce.

Spark

Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically means extracting the archive of a pre-built version. Finally, you can configure Spark to use your HDFS by exporting HADOOP_CONF_DIR, which should point to the conf directory of Hadoop, as sketched below.
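
For example, the export can go into Spark's conf/spark-env.sh. A minimal sketch, assuming Hadoop is unpacked under /opt/hadoop (a hypothetical path; adjust it to your installation):

# conf/spark-env.sh: point Spark at Hadoop's configuration so it can reach HDFS
export HADOOP_CONF_DIR=/opt/hadoop/conf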

Configuration of Titan

You also need a HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It needs to contain a core-site.xml file that specifies the NameNode:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
     <name>fs.default.name</name>
     <value>hdfs://COORDINATOR:54310</value>
     <description>The name of the default file system.  A URI whose
       scheme and authority determine the FileSystem implementation.  The
       uri's scheme determines the config property (fs.SCHEME.impl) naming
       the FileSystem implementation class.  The uri's authority is used to
       determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

Add the HADOOP_CONF_DIR to your CLASSPATH, and TinkerPop should be able to access HDFS. The TinkerPop documentation contains more information about this and about how to check whether HDFS is configured correctly; a quick check is sketched below.
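
As a quick check, you can list the HDFS root from the Gremlin Console. A minimal sketch, assuming the Hadoop plugin that ships with Titan 1.0.0 / TinkerPop 3.0.x and a CLASSPATH that contains the HADOOP_CONF_DIR:

// Gremlin Console: activate the Hadoop plugin, then list the HDFS root
gremlin> :plugin use tinkerpop.hadoop
gremlin> hdfs.ls()

If this prints the contents of HDFS rather than the local file system, the NameNode configuration is being picked up correctly.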

Finally, a config file that worked for me:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
titanmr.ioformat.conf.storage.port=9160
titanmr.ioformat.conf.storage.keyspace=titan
titanmr.ioformat.cf-name=edgestore

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647

#
# SparkGraphComputer Configuration
#
spark.master=spark://COORDINATOR:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
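
To run an OLAP job with this config file, you can open it as a HadoopGraph from the Gremlin Console and hand the traversal to SparkGraphComputer. A minimal sketch, assuming the file is saved as conf/hadoop-cassandra.properties (a hypothetical name) and the TinkerPop 3.0.x syntax bundled with Titan 1.0.0:

// Open the Hadoop graph that is backed by the config file above
graph = GraphFactory.open('conf/hadoop-cassandra.properties')
// Route traversals through Spark instead of executing them locally (OLAP)
g = graph.traversal(computer(SparkGraphComputer))
// Example job: count all vertices in the graph
g.V().count()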

Answers

This leads to the following answers:

Is that setup correct?

It seems to be. At least it works with this setup.

Should Titan also be installed on the 3 Spark slave nodes and/or the Spark master?

Since it isn't required, I wouldn't do it, as I prefer to keep the Spark servers separate from the Titan servers that users can access.

Is there another setup that you would use instead?

I would be happy to hear from someone else who has a different setup.

Will the Spark slaves only read data from the analytics DC, and ideally even from Cassandra on the same node?

Since the Cassandra nodes (from the analytics DC) are explicitly configured, the Spark slaves shouldn't be able to pull data from completely different nodes. But I am still not sure about the second part (reading from Cassandra on the same node). Maybe someone else can provide more insight here?
