
Save null Values in Cassandra using DataStax Spark Connector

I am trying to save streaming data into Cassandra using Spark and the Spark Cassandra Connector.

I did something like the following:

Creating a model class:

public class ContentModel {
    String id;

    String available_at; //may be null

  public ContentModel(String id, String available_at){
     this.id=id;
     this.available_at=available_at;
  }
}

Mapping the streaming content to the model:

JavaDStream<ContentModel> contentsToModel = myStream.map(new Function<String, ContentModel>() {
        @Override
        public ContentModel call(String content) throws Exception {

            String[] parts = content.split(",");
            return new ContentModel(parts[0], parts[1]);
        }
    });
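
One thing to watch: String.split(",") drops trailing empty fields, so parts[1] will throw if the second field is missing. A defensive variant of the mapper might look like this (a sketch; it splits with limit -1 and maps empty fields to null):

JavaDStream<ContentModel> contentsToModel = myStream.map(new Function<String, ContentModel>() {
        @Override
        public ContentModel call(String content) throws Exception {
            // limit -1 keeps trailing empty strings in the result array
            String[] parts = content.split(",", -1);
            // treat a missing or empty second field as null
            String availableAt = parts.length > 1 && !parts[1].isEmpty() ? parts[1] : null;
            return new ContentModel(parts[0], availableAt);
        }
    });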

Save:

CassandraStreamingJavaUtil.javaFunctions(contentsToModel)
        .writerBuilder("data", "contents", CassandraJavaUtil.mapToRow(ContentModel.class))
        .saveToCassandra();

If some values are null, I get the following error:

com.datastax.spark.connector.types.TypeConversionException: Cannot convert object null to struct.ValueRepr.

Is there a way to store null values using the Spark Cassandra Connector?

Cassandra doesn't have a concept of null: a column is either empty or filled. I solved this issue in Scala as follows: I used the map method, checked for null values, and replaced null with an empty string. That's it. It works really well.
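
In the Java API, that workaround might look like the following (a minimal sketch reusing the question's ContentModel and stream):

JavaDStream<ContentModel> noNulls = contentsToModel.map(new Function<ContentModel, ContentModel>() {
        @Override
        public ContentModel call(ContentModel m) throws Exception {
            // replace a null field with an empty string before writing
            return new ContentModel(m.id, m.available_at == null ? "" : m.available_at);
        }
    });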

In Scala, you can also use Option.

Could you tell us the versions of your dependencies (Spark, Connector, Cassandra, etc.)?

Yes, there is a way to store nulls with the Cassandra Connector. I got your example to work properly in a simple app with a few changes: implementing Serializable, converting your model properties to camel case, and adding the corresponding getters and setters. I am less familiar with the Java API (you really should use Scala when doing Spark; it makes things much easier), but I was under the impression that reflection on model classes is done at the getter/setter level. I could be wrong.

The Model

public class ModelClass implements Serializable {
    String id;

    String availableAt; //may be null

    public ModelClass(String id, String availableAt){
        this.id=id;
        this.availableAt=availableAt;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
       this.id = id;
    }

    public String getAvailableAt() {
        return availableAt;
    }

    public void setAvailableAt(String availableAt) {
        this.availableAt = availableAt;
    }
}

The Driver

public static void main(String ... args) {
    SparkConf conf = new SparkConf();
    conf.setAppName("Local App");
    conf.setMaster("local[*]");
    JavaSparkContext context = new JavaSparkContext(conf);

    List<ModelClass> modelList = new ArrayList<>();
    modelList.add(new ModelClass("Test", null));
    modelList.add(new ModelClass("Test2", "test"));
    JavaRDD<ModelClass> modelRDD = context.parallelize(modelList);
    javaFunctions(modelRDD).writerBuilder("test", "model", mapToRow(ModelClass.class))
            .saveToCassandra();
}

Produces

cqlsh:test> select * from model;

 id    | available_at
-------+--------------
  Test |         null
 Test2 |         test

It's important to know the implications of how you "write" nulls, though. Generally speaking, we want to avoid writing out nulls because of how Cassandra generates tombstones. If these are initial writes, you will want to treat them as "unset".

Globally treating all nulls as Unset

WriteConf also now contains a parameter ignoreNulls which can be set via a SparkConf key, spark.cassandra.output.ignoreNulls. The default is false, which causes nulls to be treated as in previous versions (inserted into Cassandra as is). When set to true, all nulls will be treated as unset. This can be used with DataFrames to skip null records and avoid tombstones.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#globally-treating-all-nulls-as-unset
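
For example, assuming a connector version that supports this key, it can be enabled when building the SparkConf (a sketch):

SparkConf conf = new SparkConf()
        .setAppName("Local App")
        .setMaster("local[*]")
        // treat nulls as "unset" on write, so they don't generate tombstones
        .set("spark.cassandra.output.ignoreNulls", "true");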

EDIT: I should clarify that, internally, Cassandra doesn't store an actual null value; the column is simply unset. But we can reason about nulls with Cassandra at the application level.
