
How to efficiently insert bulk data into Cassandra using Python?

I have a Python application, built with Flask, that allows importing of many data records (anywhere from 10k to 250k+ records at one time). Right now it inserts into a Cassandra database one record at a time, like this:

for transaction in transactions:
    # one synchronous round trip to the cluster per record
    self.transaction_table.insert_record(transaction)

This process is incredibly slow. Is there a best-practice approach I could use to more efficiently insert this bulk data?

You can use batch statements for this; an example and documentation are available in the DataStax documentation. You can also use some child workers and/or async queries on top of this, as shown in the sketch below.
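As a rough illustration, here is a minimal sketch with the DataStax cassandra-driver that groups inserts into small batches. The keyspace, table, and column names (my_keyspace, transactions, account_id, txn_id, amount) are assumptions, not taken from the question.

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Prepare the statement once, then bind values for each record
insert = session.prepare(
    "INSERT INTO transactions (account_id, txn_id, amount) VALUES (?, ?, ?)"
)

BATCH_SIZE = 100  # keep batches small; oversized batches stress the coordinator
batch = BatchStatement()
pending = 0
for txn in transactions:
    batch.add(insert, (txn["account_id"], txn["txn_id"], txn["amount"]))
    pending += 1
    if pending == BATCH_SIZE:
        session.execute(batch)
        batch = BatchStatement()
        pending = 0
if pending:
    session.execute(batch)  # flush the remaining statements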

In terms of best practices, it is more efficient if each batch only contains rows for one partition key. Otherwise a single coordinator node has to route writes for many different partition keys, and it would be faster to contact each owning node directly.

If each record has a different partition key, a single prepared statement executed concurrently by several workers may work out better; see the sketch below.
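A minimal sketch of that approach uses execute_concurrent_with_args from the driver's cassandra.concurrent module, which keeps a bounded number of async requests in flight. The table and column names are the same assumptions as above.

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO transactions (account_id, txn_id, amount) VALUES (?, ?, ?)"
)

params = [(t["account_id"], t["txn_id"], t["amount"]) for t in transactions]

# Up to `concurrency` requests are in flight at once; results come back
# as (success, result_or_exception) pairs in the same order as `params`.
results = execute_concurrent_with_args(
    session, insert, params, concurrency=50, raise_on_first_error=False
)
for success, result in results:
    if not success:
        print("insert failed:", result)  # result is the exception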

You may also want to consider using a TokenAware load balancing policy, which allows the driver to contact a replica that owns the data directly instead of routing the request through another coordinator node.
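A minimal sketch of enabling token-aware routing in the driver follows; the contact points and the data-center name "dc1" are assumptions, and newer driver versions expose the same setting through execution profiles.

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=["127.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("my_keyspace")  # hypothetical keyspace
# With prepared statements the driver can compute the token from the bound
# partition key and send each insert straight to a replica.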

The easiest solution is to generate CSV files from your data and import them with the COPY command. That should work well for up to a few million rows. For more complicated scenarios you could use the sstableloader tool; a sketch of the CSV route follows.
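As a rough sketch of the CSV route (the field names and file path are assumptions), you could write the records with Python's csv module and then load the file from cqlsh:

import csv

# Write one row per transaction; the column order must match the COPY column list.
with open("transactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for t in transactions:
        writer.writerow([t["account_id"], t["txn_id"], t["amount"]])

# Then, from cqlsh:
#   COPY my_keyspace.transactions (account_id, txn_id, amount) FROM 'transactions.csv';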
