
How do I run dsbulk unload and write directly to S3

I want to run a dsbulk unload command, but my Cassandra cluster has ~1 TB of data in the table I want to export. Is there a way to run the dsbulk unload command and stream the data into S3 instead of writing it to disk?

I'm running the following command in my dev environment, but obviously this just writes to disk on my machine:

bin/dsbulk unload -k myKeySpace -t myTable -url ~/data --connector.csv.compression gzip

DSBulk doesn't support this natively out of the box. In theory it could be implemented, since DSBulk is now open source, but somebody would need to do that work.

Update: as pointed out by Adam, a workaround is to use aws s3 cp and pipe DSBulk's output into it, like this:

dsbulk unload .... | aws s3 cp - s3://...

but there is a limitation: the unload will be performed in a single thread, so it could be much slower.
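For reference, a complete pipeline might look like the sketch below. The bucket and object key are placeholders, it assumes your DSBulk version defaults the CSV connector's url to standard output when -url is omitted, and compression is done in the pipe with plain gzip rather than --connector.csv.compression:

# Hedged sketch: s3://my-bucket/myTable/export.csv.gz is a placeholder destination.
# DSBulk is assumed to write CSV to stdout when -url is omitted; gzip compresses
# the stream, and "aws s3 cp -" uploads it to S3 without touching local disk.
bin/dsbulk unload -k myKeySpace -t myTable \
  | gzip \
  | aws s3 cp - s3://my-bucket/myTable/export.csv.gz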

In the short term you can use Apache Spark in local master mode with the Spark Cassandra Connector, something like this (for Spark 2.4):

spark-shell --packages com.datastax.spark:spark-cassandra-connector-assembly_2.11:2.5.1

and inside:

val data = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table_name", "keyspace" -> "keyspace_name"))
  .load()
data.write.format("json").save("s3a://....")
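Depending on how your Spark distribution was built, writing to an s3a:// URL may also require the hadoop-aws module and S3 credentials in the configuration. A hedged sketch of the spark-shell invocation, assuming a Hadoop 2.7-based Spark 2.4 build and credentials already exported as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY:

# Sketch only: the hadoop-aws version must match the Hadoop libraries bundled with your Spark build.
spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector-assembly_2.11:2.5.1,org.apache.hadoop:hadoop-aws:2.7.4 \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY"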
