Using Spark to save data to Cassandra

Now in my current architecture I have one module that is responsible for writing data to and reading it from Cassandra, and another module responsible for downloading the data. Recently I started using Datastax and Spark, and I want to run some transformations on newly acquired data. What's the right approach to this problem? Do I use my module to store the data and run the Spark computations separately, or do I send the downloaded data directly to Spark using Spark Streaming and, in that job, save both the original data and the transformed data to Cassandra? I'm operating on stock quotes, so there is a lot of data downloaded continuously and a lot of transformations.
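To make the second option concrete, here is a rough sketch of the kind of Spark Streaming job I have in mind (this assumes the DataStax spark-cassandra-connector; the socket source, host, keyspace, and table names are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Placeholder record type for a stock quote
case class Quote(symbol: String, ts: Long, price: Double)

object QuoteStreamingIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("quote-streaming-ingest")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source: quotes arrive as "SYMBOL,TIMESTAMP,PRICE" lines
    val quotes = ssc.socketTextStream("localhost", 9999)
      .map(_.split(','))
      .map(f => Quote(f(0), f(1).toLong, f(2).toDouble))

    // Save the original data unchanged...
    quotes.saveToCassandra("market", "raw_quotes")

    // ...and a transformed view in the same job (per-batch average per symbol)
    quotes
      .map(q => (q.symbol, (q.price, 1)))
      .reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
      .map { case (sym, (sum, n)) => (sym, sum / n) }
      .saveToCassandra("market", "avg_quotes", SomeColumns("symbol", "avg_price"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```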

In my opinion, it's better to keep them separated.

First store the raw data, then process it.
It's easier to scale and maintain each component separately later.

For example: if you want to change something in your downloading module, such as adding a new download source or fixing a bug, it won't affect the data processing done in Spark; likewise, changing the code running on Spark won't have any effect on (or introduce a bug into) the raw data you downloaded.
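As a rough illustration of that separation, the Spark side can be a standalone batch job that only reads the raw table your download module writes and saves derived data to a different table (again assuming the DataStax spark-cassandra-connector; the keyspace, table names, and the transformation itself are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Placeholder record types; adjust to your real schema
case class Quote(symbol: String, ts: Long, price: Double)
case class AvgQuote(symbol: String, avgPrice: Double)

object QuoteTransformJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("quote-transform")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val sc = new SparkContext(conf)

    // Read the raw quotes the download module stored on its own
    val raw = sc.cassandraTable[Quote]("market", "raw_quotes")

    // Placeholder transformation: average price per symbol
    val avgs = raw
      .map(q => (q.symbol, (q.price, 1)))
      .reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
      .map { case (sym, (sum, n)) => AvgQuote(sym, sum / n) }

    // Derived data goes to its own table; the raw table is never written here,
    // so a bug in this job cannot corrupt the downloaded data
    avgs.saveToCassandra("market", "avg_quotes")

    sc.stop()
  }
}
```

Assuming the derived table is keyed by symbol, re-running the job after a fix simply upserts the derived rows and leaves the raw data untouched, which is exactly the decoupling described above.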
