How to efficiently write large amounts of data to Cassandra via Java or Python?

There are a few million rows of data that need to be written to Cassandra. I have tried the following methods:

The first: following the reference code given for the DataStax java-driver and python-driver on GitHub, my code looks like this:

    // Imports needed by the snippets in this question.
    import java.net.InetSocketAddress;
    import java.util.List;
    import com.datastax.oss.driver.api.core.ConsistencyLevel;
    import com.datastax.oss.driver.api.core.CqlIdentifier;
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;

    // The following setup code is fixed and is omitted from the later snippets.
    String cassandraHost = "******";
    String keyspace = "******";
    String table = "******";
    String insertCqlStr = "insert into " + keyspace + "." + table
            + " (id, date, value) values (?, ?, ?);";
    CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress(cassandraHost, 9042))
            .withLocalDatacenter("datacenter1")
            .withKeyspace(CqlIdentifier.fromCql(keyspace))
            .build();

    PreparedStatement preparedStatement = session.prepare(insertCqlStr);

    // The code below is the part that changes in each attempt.
    for (List<String> row : rows) {
        session.execute(
            preparedStatement.bind(row.get(0), row.get(1), row.get(2))
                .setConsistencyLevel(ConsistencyLevel.ANY));
    }
    session.close();

This code works, but it is far too slow for me to accept. So I tried the asynchronous API provided by the driver; the code is almost the same as above:

    for (List<String> row : rows) {
        session.executeAsync(
            preparedStatement.bind(row.get(0), row.get(1), row.get(2))
                .setConsistencyLevel(ConsistencyLevel.ANY));
    }
    session.close();

Please excuse my lack of asynchronous programming experience. This works, but it has a fatal problem: I found that it does not write all of the data into the database. I would like to know the correct way to call the async API.
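My guess at the cause, for what it's worth: nothing ever waits on the CompletionStages returned by executeAsync, so session.close() can cut off writes that are still in flight. A minimal, untested sketch of waiting for them (assuming the driver 4.x CompletionStage API; needs java.util.ArrayList, java.util.concurrent.CompletableFuture, and com.datastax.oss.driver.api.core.cql.AsyncResultSet in addition to the imports above):

    // Collect every future and wait for all of them before closing,
    // so no in-flight write is silently dropped.
    List<CompletableFuture<AsyncResultSet>> futures = new ArrayList<>();
    for (List<String> row : rows) {
        futures.add(
            session.executeAsync(
                preparedStatement.bind(row.get(0), row.get(1), row.get(2))
                    .setConsistencyLevel(ConsistencyLevel.ANY))
                .toCompletableFuture());
    }
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    session.close();

Even then, holding millions of futures and firing them all at once may simply overload the cluster, so this alone is probably not the full answer.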

I also tried the BatchStatement methods provided by the driver. I know this approach is officially discouraged as a way to improve performance, and it has many limitations. For example, as far as I know, the number of statements in a batch cannot exceed 65535, and in the default configuration the batch data-length warning threshold is 5 KB and the error threshold is 50 KB. But I kept the number of statements below 65535 and raised those default limits:

    List<BoundStatement> boundStatements = new ArrayList<>();
    int count = 0;
    BatchStatement batchStatement = BatchStatement.newInstance(BatchType.UNLOGGED);
    // In the real code this runs inside an outer loop over chunks of rows
    // instead of breaking out directly.
    for (List<String> row : rows) {
        if (count >= 65535) {
            break;
        }
        boundStatements.add(
            preparedStatement.bind(row.get(0), row.get(1), row.get(2)));
        count += 1;
    }
    BatchStatement batch = batchStatement.addAll(boundStatements);
    session.execute(batch.setConsistencyLevel(ConsistencyLevel.ANY));
    // session.executeAsync(batch.setConsistencyLevel(ConsistencyLevel.ANY));
    session.close();

This also works, and it is actually more efficient than the asynchronous API; using the synchronous interface also guarantees data integrity. If the asynchronous API is used to execute the BatchStatement here, the incomplete-data problem mentioned above appears as well. But this method still doesn't meet my requirements: I need to run it with multiple threads, and when I do, it fails with: Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
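For reference, the PT2S in that error is the driver's default 2-second request timeout. Raising it only treats the symptom rather than the overload that causes it, but for completeness, a sketch of how it could be raised, assuming driver 4.x's programmatic config loader (needs java.time.Duration plus the driver's DriverConfigLoader and DefaultDriverOption classes):

    // Raise the per-request timeout from the default 2 s to 10 s.
    DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
            .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(10))
            .build();
    CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress(cassandraHost, 9042))
            .withLocalDatacenter("datacenter1")
            .withKeyspace(CqlIdentifier.fromCql(keyspace))
            .withConfigLoader(loader)
            .build();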

Summary: I've tried synchronous writes, asynchronous writes, and the batch-related methods, and each has issues I can't accept. I now need to know how to use the async API correctly so that no data is lost, and why my approach is wrong. As for the BatchStatement-related methods, I don't expect them to work out; it would be great if you could give me a workable suggestion. Thank you!

Instead of trying to write the data-loading code yourself, I would recommend adopting the DSBulk tool, which is heavily optimized for loading data into and unloading data from Cassandra. And it's open source, so you can even use it as a Java library.
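For example, loading a CSV export of your rows is a single command; the invocation looks something like this (data.csv and the ****** placeholders stand in for your file, contact point, keyspace, and table):

    dsbulk load -h '******' -k ****** -t ****** -url data.csv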

There are a few reasons for that:

  • Writing async code isn't easy: you need to make sure that you aren't sending too many requests over the same connection (Cassandra has a limit on the number of in-flight requests). For driver 3.x you can use something like this, and driver 4.x has built-in rate-limiting capabilities; a sketch of the bounding pattern follows this list.
  • Batching in Cassandra often leads to performance degradation when used incorrectly. A batch should be used only to submit data that belongs to the same partition; otherwise it causes higher load on the coordinator node, and you also need to implement custom routing. A per-partition grouping sketch also follows this list.
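A sketch of the bounding pattern from the first point (the Semaphore approach and the 512 cap are illustrative assumptions, not a driver API; needs java.util.concurrent.Semaphore):

    // Bound the number of in-flight async requests so the connection's
    // request limit is never exceeded.
    final int maxInFlight = 512;      // illustrative cap; tune for your cluster
    Semaphore inFlight = new Semaphore(maxInFlight);
    for (List<String> row : rows) {
        inFlight.acquire();           // blocks while maxInFlight requests are pending
        session.executeAsync(
                preparedStatement.bind(row.get(0), row.get(1), row.get(2)))
            .whenComplete((rs, err) -> inFlight.release());
    }
    inFlight.acquire(maxInFlight);    // drain: wait for the last requests to finish
    session.close();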
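And a sketch of the per-partition batching from the second point, assuming the id column from the question is the partition key (needs java.util.ArrayList, java.util.HashMap, and java.util.Map):

    // Group rows by partition key, then send one UNLOGGED batch per partition,
    // so every batch targets a single replica set.
    Map<String, List<BoundStatement>> byPartition = new HashMap<>();
    for (List<String> row : rows) {
        byPartition.computeIfAbsent(row.get(0), k -> new ArrayList<>())
                .add(preparedStatement.bind(row.get(0), row.get(1), row.get(2)));
    }
    for (List<BoundStatement> stmts : byPartition.values()) {
        session.execute(BatchStatement.newInstance(BatchType.UNLOGGED).addAll(stmts));
    }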

DSBulk does all of this very efficiently, as it was written by people who work with Cassandra every day in large-scale setups.

P.S. In your case, consistency level ANY means that the coordinator merely acknowledges receiving the data but doesn't guarantee it will actually be written (for example, if the coordinator crashes).
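If you need that guarantee, a stronger level such as LOCAL_QUORUM makes the coordinator wait until a majority of the local replicas have acknowledged each write, for example:

    // e.g. instead of ConsistencyLevel.ANY:
    preparedStatement.bind(row.get(0), row.get(1), row.get(2))
            .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);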
