
How to efficiently insert bulk data into Cassandra using Python?

I have a Python application, built with Flask, that allows importing of many data records (anywhere from 10k to 250k+ records at one time). Right now it inserts into a Cassandra database one record at a time, like this:

for transaction in transactions:
    # one synchronous round trip to the cluster per record
    self.transaction_table.insert_record(transaction)

This process is incredibly slow. Is there a best-practice approach I could use to more efficiently insert this bulk data?

You can use batch statements for this; an example and documentation are available in the DataStax documentation. You can also use some child workers and/or async queries on top of this, as shown in the sketch below.
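As a rough illustration, here is a minimal sketch with the DataStax cassandra-driver that groups inserts into small batches. The keyspace, table, and column names (my_keyspace, transactions, account_id, txn_id, amount) are assumptions, not taken from the question.

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Prepare the statement once, then bind values for each record
insert = session.prepare(
    "INSERT INTO transactions (account_id, txn_id, amount) VALUES (?, ?, ?)"
)

BATCH_SIZE = 100  # keep batches small; oversized batches stress the coordinator
batch = BatchStatement()
pending = 0
for txn in transactions:
    batch.add(insert, (txn["account_id"], txn["txn_id"], txn["amount"]))
    pending += 1
    if pending == BATCH_SIZE:
        session.execute(batch)
        batch = BatchStatement()
        pending = 0
if pending:
    session.execute(batch)  # flush the remaining statements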

In terms of best practices, it is more efficient if each batch only contains rows for one partition key. Otherwise a single coordinator node has to route writes for many different partition keys, and it would be faster to contact each owning node directly.

If each record has a different partition key, a single prepared statement executed concurrently by several workers may work out better; see the sketch below.
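A minimal sketch of that approach uses execute_concurrent_with_args from the driver's cassandra.concurrent module, which keeps a bounded number of async requests in flight. The table and column names are the same assumptions as above.

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO transactions (account_id, txn_id, amount) VALUES (?, ?, ?)"
)

params = [(t["account_id"], t["txn_id"], t["amount"]) for t in transactions]

# Up to `concurrency` requests are in flight at once; results come back
# as (success, result_or_exception) pairs in the same order as `params`.
results = execute_concurrent_with_args(
    session, insert, params, concurrency=50, raise_on_first_error=False
)
for success, result in results:
    if not success:
        print("insert failed:", result)  # result is the exception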

You may also want to consider using a TokenAware load balancing policy, which allows the driver to contact a replica that owns the data directly instead of routing the request through another coordinator node.
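A minimal sketch of enabling token-aware routing in the driver follows; the contact points and the data-center name "dc1" are assumptions, and newer driver versions expose the same setting through execution profiles.

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=["127.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("my_keyspace")  # hypothetical keyspace
# With prepared statements the driver can compute the token from the bound
# partition key and send each insert straight to a replica.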

The easiest solution is to generate CSV files from your data and import them with the COPY command. That should work well for up to a few million rows. For more complicated scenarios you could use the sstableloader tool; a sketch of the CSV route follows.
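As a rough sketch of the CSV route (the field names and file path are assumptions), you could write the records with Python's csv module and then load the file from cqlsh:

import csv

# Write one row per transaction; the column order must match the COPY column list.
with open("transactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for t in transactions:
        writer.writerow([t["account_id"], t["txn_id"], t["amount"]])

# Then, from cqlsh:
#   COPY my_keyspace.transactions (account_id, txn_id, amount) FROM 'transactions.csv';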
