
Spark Cassandra connector missing data while reading back

I am writing 3,000,000 rows with 8 columns to Cassandra using the Spark Cassandra connector (Python), but when I read the data back I only get 50,000 rows. When I check the row count in cqlsh, it also shows only 50,000 rows. Where is my data going? Is there an issue with the Spark Cassandra connector?

This is my Spark config:

from pyspark.sql import SparkSession

# Build a session that pulls in the Spark Cassandra connector package
spark = SparkSession.builder.appName("das_archive") \
    .config("spark.driver.memory", "25g") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config("spark.jars.packages",
            "datastax:spark-cassandra-connector:2.4.0-s_2.11") \
    .getOrCreate()

Write:

df.write.format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="shape1", keyspace="shape_db1") \
    .save()

Read:

load_options = {"table": "shape1", "keyspace": "shape_db1",
                "spark.cassandra.input.split.size_in_mb": "1000",
                "spark.cassandra.input.consistency.level": "ALL"}
data_frame = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(**load_options).load()

The most probable cause is that you don't have a correct primary key, so rows are being overwritten. You need to make sure that every row of input data is uniquely identified by the set of primary key columns.
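A quick way to confirm this from Spark is to compare the total row count with the number of distinct primary-key combinations before writing. A minimal sketch, assuming hypothetical key columns id and shape_type (replace them with whatever the actual PRIMARY KEY of shape1 is):

# Hypothetical primary key columns -- substitute the actual PRIMARY KEY of shape1.
pk_cols = ["id", "shape_type"]

total_rows = df.count()
distinct_keys = df.select(*pk_cols).distinct().count()

if distinct_keys < total_rows:
    # Rows sharing the same primary key values overwrite each other in Cassandra,
    # which is how 3,000,000 input rows can end up as only ~50,000 stored rows.
    print(f"{total_rows - distinct_keys} rows collide on the primary key")

If distinct_keys is far below total_rows, either add more columns to the table's primary key or include a genuinely unique column (for example a generated ID) so each input row maps to its own Cassandra row.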

P.S. If you're just loading data that is stored in something like CSV files, look at a tool like DSBulk, which is heavily optimized for loading/unloading data to/from Cassandra.
