Getting data OUT of Cassandra?

How can I export data, over a period of time (hourly or daily, say) or just updated records, from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in cqlsh when I try that by hand, so I'm concerned that approach isn't reliable.

If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc.)? It's not a Java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of Cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?

It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra doesn't seem to be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...

Spark is the most typical tool for exactly that (as you say). It does it efficiently and is widely used, so it's pretty reliable. Cassandra is not really designed for OLAP workloads, but things like the Spark connector help bridge the gap. DataStax Enterprise might have some more options available to you, but I'm not sure of their current offerings.

You can still just query and page through the whole data set with normal CQL queries; it's just not as fast. You can even use ALLOW FILTERING — just be wary, as it's very expensive and can impact your cluster (creating a separate DC for the workload and running LOCAL consistency-level queries against it helps). In that scenario you will probably also add token() > x and token() <= y bounds to the WHERE clause to split up the query and prevent too much work landing on any one coordinator. Organizing your data so that this query is more efficient would be strongly recommended (i.e. if doing time slices, put things in partitions bucketed by time with a timeuuid clustering key, so each slice of time is a sequential read).
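A minimal sketch of both ideas in CQL — the keyspace, table, and column names here are hypothetical, and the token bounds assume the default Murmur3Partitioner:

```sql
-- Hypothetical table bucketed by hour, as suggested above:
CREATE TABLE app.events_by_hour (
  bucket  text,       -- e.g. '2020-01-01T13', one partition per hour
  id      timeuuid,   -- clustering key: time-ordered within the bucket
  payload text,
  PRIMARY KEY (bucket, id)
);

-- Page through the full token range in slices (Murmur3Partitioner
-- spans -2^63 .. 2^63 - 1); each subrange can be issued separately
-- so no single coordinator does all the work:
SELECT * FROM app.events_by_hour
 WHERE token(bucket) > -9223372036854775808
   AND token(bucket) <= 0;

SELECT * FROM app.events_by_hour
 WHERE token(bucket) > 0
   AND token(bucket) <= 9223372036854775807;
```

In practice you would cut the range into many more slices than two and run them in parallel from your export job.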

Kinda cheesy sounding, but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
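For reference, that dump is cqlsh's COPY TO command (the keyspace, table, and column names below are made up for illustration):

```sql
-- Run inside cqlsh; COPY pages through the table in chunks,
-- so it can dump tables a single naive SELECT would time out on:
COPY app.events_by_hour (bucket, id, payload)
  TO '/tmp/events.csv'
  WITH HEADER = TRUE AND PAGESIZE = 1000;
```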

I would not recommend going to the sstables directly unless you are familiar with the internals and are using Hadoop or Spark.
