Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.

We plan to extract this data on a periodic basis and load it into a relational database (like MySQL). What extraction mechanisms exist that can scale to hundreds of millions of records daily? Expensive third-party ETL tools like Informatica are not an option for us. So far my web searches have turned up only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself while running. Are there other options as well?

You should take a look at sqoop; it has an integration with Cassandra, as shown here.

This will also scale easily. You need a Hadoop cluster to get sqoop working; the way it works is basically:

  • Slice your dataset into different partitions.
  • Run a Map/Reduce job where each mapper is responsible for transferring one slice (a hand-rolled sketch of this idea follows the list).
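For illustration, here is a minimal sketch of that slice-and-transfer mechanism using the DataStax Python driver. The contact point, keyspace ks, and table events(id, payload) are hypothetical placeholders, and the row counting is a stand-in for where a real job would load each slice into MySQL:

```python
# Sketch of the "slice the token ring, transfer slices in parallel" idea
# that sqoop's mappers implement. The table ks.events(id, payload) and
# the contact point are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

from cassandra.cluster import Cluster  # pip install cassandra-driver

# Murmur3Partitioner tokens span the full signed 64-bit range.
MIN_TOKEN, MAX_TOKEN = -(2**63), 2**63 - 1
NUM_SLICES = 8  # one sqoop mapper would own one slice

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("ks")  # Session objects are thread-safe

def slice_bounds(n):
    """Split the token ring into n contiguous inclusive [lo, hi] slices."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    for i in range(n):
        lo = MIN_TOKEN + i * step
        hi = MAX_TOKEN if i == n - 1 else lo + step - 1
        yield lo, hi

def transfer_slice(bounds):
    """Read one token slice; a real job would INSERT these rows into MySQL."""
    lo, hi = bounds
    rows = session.execute(
        "SELECT id, payload FROM events "
        "WHERE token(id) >= %s AND token(id) <= %s",
        (lo, hi),
    )
    return sum(1 for _ in rows)  # stand-in for the load step

with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
    counts = list(pool.map(transfer_slice, slice_bounds(NUM_SLICES)))
print(f"transferred {sum(counts)} rows across {NUM_SLICES} slices")
```

This is essentially what sqoop automates for you: more slices means more parallel transfers, bounded only by the resources of the two clusters.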

So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep growing your cluster, the throughput will keep increasing. It's all a matter of what resources you have.

As far as the load on the Cassandra cluster goes, I am not certain, since I have not personally used the Cassandra connector with sqoop, but if you want to extract data you will need to put some load on the cluster anyway. You could, for example, run the export once a day at the time when traffic is lowest, so that if your Cassandra availability drops, the impact is minimal.

I'm also thinking that, if this is related to your other question, you might want to consider exporting to Hive instead of MySQL. Sqoop works for that case too, because it can export to Hive directly. And once the data is in Hive, you can use the same cluster sqoop runs on for your analytics jobs.

There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command and CQL queries -- all of them time out, regardless of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a 'where' clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.
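For reference, the usual way to attempt a full-table read without a 'where' clause is a paged query rather than a single result set, so each round trip stays small; whether this avoids the timeouts depends on the cluster. A minimal sketch with the DataStax Python driver, again with hypothetical names:

```python
# Paged full-table read with the DataStax Python driver; the keyspace
# and table names are hypothetical. Instead of one huge (timeout-prone)
# result set, the driver fetches the table one small page at a time.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("ks")

# fetch_size sets rows per page; iteration transparently requests the
# next page from the server when the current one is exhausted.
stmt = SimpleStatement("SELECT id, payload FROM events", fetch_size=1000)

count = 0
for row in session.execute(stmt):
    count += 1  # a real extractor would write the row to the target DB here
print(f"read {count} rows")
```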
