
Spark Streaming Convert Dataset&lt;Row&gt; to Dataset&lt;CustomObject&gt; in Java

I've recently started working with Apache Spark and came across a requirement where I need to read a Kafka stream and feed the data into Cassandra. While doing so I ran into an issue: the stream is SQL/Dataset based while the Cassandra connector I was using works on RDDs (I may be wrong here, please do correct me), so I was struggling to get this working. I have made it work for now, but I'm not sure this is the right way to implement it.

Below is the code:

Schema

StructType getSchema() {
    StructField[] structFields = new StructField[]{
            new StructField("id", DataTypes.LongType, true, Metadata.empty()),
            new StructField("name", DataTypes.StringType, true, Metadata.empty()),
            new StructField("cat", DataTypes.StringType, true, Metadata.empty()),
            new StructField("tag", DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty())
    };
    return new StructType(structFields);
}

Stream reader
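kafkaDataset here comes from the Kafka source; roughly something like this (the broker address and topic name below are placeholders, not my real values):

// Hypothetical Kafka source setup for the kafkaDataset used below;
// "localhost:9092" and "deal-feed-topic" are placeholder values.
Dataset<Row> kafkaDataset = session.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "deal-feed-topic")
        .load();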

Dataset<Row> results = kafkaDataset.select(
        col("key").cast("string"),
        // parse the JSON payload in the Kafka value using the schema defined above
        from_json(col("value").cast("string"), getSchema()).as("value"),
        col("topic"),
        col("partition"),
        col("offset"),
        col("timestamp"),
        col("timestampType"));

results.select("value.*")
        .writeStream()
        .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
            @Override
            public void call(Dataset<Row> dataset, Long batchId) throws Exception {
                ObjectMapper mapper = new ObjectMapper();
                List<DealFeedSchema> list = new ArrayList<>();
                // collect the micro-batch to the driver and map each Row to a DealFeedSchema bean
                List<Row> rowList = dataset.collectAsList();
                if (!rowList.isEmpty()) {
                    rowList.forEach(row -> {
                        if (row == null) logger.info("Null DataSet");
                        else {
                            try {
                                list.add(mapper.readValue(row.json(), DealFeedSchema.class));
                            } catch (JsonProcessingException e) {
                                logger.error("error parsing Data", e);
                            }
                        }
                    });
                    // parallelize the beans back into an RDD and write them with the RDD-based connector API
                    JavaRDD<DealFeedSchema> rdd = new JavaSparkContext(session.sparkContext()).parallelize(list);
                    javaFunctions(rdd).writerBuilder(Constants.CASSANDRA_KEY_SPACE,
                            Constants.CASSANDRA_DEAL_TABLE_SPACE, mapToRow(DealFeedSchema.class)).saveToCassandra();
                }
            }
        })
        .start().awaitTermination();

Although this works fine, I need to know if there's a better way to do this. If there is, please let me know how to achieve it.

Thanks in advance. For those who are looking for a way, you can refer to this code as an alternative. :)

Just write data from Spark Structured Streaming without conversion to RDD - you only need to switch to Spark Cassandra Connector 2.5.0, which added this capability, together with much more.

When you use it, your code will look like the following (I don't have a Java example, but it should be similar to this):

val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "some_checkpoint_location")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
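In Java the equivalent should look roughly like this (untested sketch; it assumes streamingCountsDF is a Dataset&lt;Row&gt; and Spark Cassandra Connector 2.5.0+ is on the classpath, with keyspace, table, and checkpoint values mirroring the Scala example above):

import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

// Hypothetical Java equivalent of the Scala snippet above
StreamingQuery query = streamingCountsDF.writeStream()
        .outputMode(OutputMode.Update())
        .format("org.apache.spark.sql.cassandra")
        .option("checkpointLocation", "some_checkpoint_location")
        .option("keyspace", "test")
        .option("table", "sttest_tweets")
        .start();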

To convert Dataset&lt;Row&gt; to Dataset&lt;DealFeedSchema&gt; in Java:

1. Java Bean for DealFeedSchema


import java.util.List;

public class DealFeedSchema {
    private long id;
    private String name;
    private String cat;
    private List<String> tag;


    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getCat() {
        return cat;
    }

    public void setCat(String cat) {
        this.cat = cat;
    }

    public List<String> getTag() {
        return tag;
    }

    public void setTag(List<String> tag) {
        this.tag = tag;
    }
}

2. Load the test data

 Dataset<Row> dataFrame = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1L, "foo", "cat1", Arrays.asList("tag1", "tag2"))
        ), getSchema());
        dataFrame.show(false);
        dataFrame.printSchema();
        /**
         * +---+----+----+------------+
         * |id |name|cat |tag         |
         * +---+----+----+------------+
         * |1  |foo |cat1|[tag1, tag2]|
         * +---+----+----+------------+
         *
         * root
         *  |-- id: long (nullable = true)
         *  |-- name: string (nullable = true)
         *  |-- cat: string (nullable = true)
         *  |-- tag: array (nullable = true)
         *  |    |-- element: string (containsNull = true)
         */

3. Convert Dataset<Row> to Dataset<DealFeedSchema>

        Dataset<DealFeedSchema> dealFeedSchemaDataset = dataFrame.as(Encoders.bean(DealFeedSchema.class));
        dealFeedSchemaDataset.show(false);
        dealFeedSchemaDataset.printSchema();
        /**
         * +---+----+----+------------+
         * |id |name|cat |tag         |
         * +---+----+----+------------+
         * |1  |foo |cat1|[tag1, tag2]|
         * +---+----+----+------------+
         *
         * root
         *  |-- id: long (nullable = true)
         *  |-- name: string (nullable = true)
         *  |-- cat: string (nullable = true)
         *  |-- tag: array (nullable = true)
         *  |    |-- element: string (containsNull = true)
         */
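Once you have the typed dataset inside foreachBatch, you can also write it to Cassandra directly with the DataFrame writer instead of collecting to the driver and parallelizing an RDD. A rough sketch (assuming Spark Cassandra Connector 2.5.0+ and that the bean property names match the Cassandra column names; the Constants values are taken to be the keyspace and table name strings from the original code):

// Hypothetical alternative body for foreachBatch: write the typed batch directly,
// without collectAsList() / parallelize().
Dataset<DealFeedSchema> deals = dataset.as(Encoders.bean(DealFeedSchema.class));
deals.write()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", Constants.CASSANDRA_KEY_SPACE)
        .option("table", Constants.CASSANDRA_DEAL_TABLE_SPACE)
        .mode(SaveMode.Append)
        .save();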
