
Spark Cassandra connector missing data while reading back

I am writing 3,000,000 rows with 8 columns to Cassandra using the Spark Cassandra connector (Python), but when I read the data back I only get 50,000 rows. When I check the row count in cqlsh, it also shows only 50,000 rows. Where is my data going? Is there an issue with the Spark Cassandra connector?

This is my Spark config:

from pyspark.sql import SparkSession

# Build a session that pulls in the Spark Cassandra connector package
spark = SparkSession.builder.appName("das_archive") \
    .config("spark.driver.memory", "25g") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config("spark.jars.packages",
            "datastax:spark-cassandra-connector:2.4.0-s_2.11") \
    .getOrCreate()

Write:

df.write.format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="shape1", keyspace="shape_db1") \
    .save()

Read:

load_options = {"table": "shape1", "keyspace": "shape_db1",
                "spark.cassandra.input.split.size_in_mb": "1000",
                "spark.cassandra.input.consistency.level": "ALL"}
data_frame = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(**load_options).load()

The most probable cause is that you don't have a correct primary key, so rows are being overwritten. You need to make sure that every row of input data is uniquely identified by the set of primary key columns.
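A quick way to confirm this from Spark is to compare the total row count with the number of distinct primary-key combinations before writing. A minimal sketch, assuming hypothetical key columns id and shape_type (replace them with whatever the actual PRIMARY KEY of shape1 is):

# Hypothetical primary key columns -- substitute the actual PRIMARY KEY of shape1.
pk_cols = ["id", "shape_type"]

total_rows = df.count()
distinct_keys = df.select(*pk_cols).distinct().count()

if distinct_keys < total_rows:
    # Rows sharing the same primary key values overwrite each other in Cassandra,
    # which is how 3,000,000 input rows can end up as only ~50,000 stored rows.
    print(f"{total_rows - distinct_keys} rows collide on the primary key")

If distinct_keys is far below total_rows, either add more columns to the table's primary key or include a genuinely unique column (for example a generated ID) so each input row maps to its own Cassandra row.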

P.S. If you're just loading data that is stored in something like CSV files, look at a tool like DSBulk, which is heavily optimized for loading/unloading data to/from Cassandra.
