Spark Streaming: Convert Dataset<Row> to Dataset<CustomObject> in Java
I've recently started working with Apache Spark and came across a requirement where I need to read a Kafka stream and feed the data into Cassandra. While doing so I ran into an issue: the streams are SQL-based while the Cassandra connector works on RDDs (I may be wrong here, please do correct me), so I was struggling to get this working. Somehow I have made it work for now, but I'm not sure this is the right way to implement it.
Below is the code.
Schema:
StructType getSchema() {
    StructField[] structFields = new StructField[]{
            new StructField("id", DataTypes.LongType, true, Metadata.empty()),
            new StructField("name", DataTypes.StringType, true, Metadata.empty()),
            new StructField("cat", DataTypes.StringType, true, Metadata.empty()),
            new StructField("tag", DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty())
    };
    return new StructType(structFields);
}
Stream reader:
Dataset<Row> results = kafkaDataset.select(
        col("key").cast("string"),
        from_json(col("value").cast("string"), getSchema()).as("value"),
        col("topic"),
        col("partition"),
        col("offset"),
        col("timestamp"),
        col("timestampType"));

results.select("value.*")
        .writeStream()
        .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
            @Override
            public void call(Dataset<Row> dataset, Long batchId) throws Exception {
                ObjectMapper mapper = new ObjectMapper();
                List<DealFeedSchema> list = new ArrayList<>();
                List<Row> rowList = dataset.collectAsList();
                if (!rowList.isEmpty()) {
                    rowList.forEach(row -> {
                        if (row == null) {
                            logger.info("Null DataSet");
                        } else {
                            try {
                                list.add(mapper.readValue(row.json(), DealFeedSchema.class));
                            } catch (JsonProcessingException e) {
                                logger.error("error parsing Data", e);
                            }
                        }
                    });
                    JavaRDD<DealFeedSchema> rdd = new JavaSparkContext(session.sparkContext()).parallelize(list);
                    javaFunctions(rdd).writerBuilder(Constants.CASSANDRA_KEY_SPACE,
                            Constants.CASSANDRA_DEAL_TABLE_SPACE, mapToRow(DealFeedSchema.class)).saveToCassandra();
                }
            }
        })
        .start().awaitTermination();
Although this works fine, I need to know if there is a better way to do this; if there is, please let me know how to achieve it. Thanks in advance. For those who are looking for a way, you can refer to this code as an alternative. :)
Just write the data from Spark Structured Streaming without converting to RDD: you only need to switch to Spark Cassandra Connector 2.5.0, which added this capability along with many other improvements. When you use it, your code will look like the following (I don't have a Java example, but it should be similar to this):
val query = streamingCountsDF.writeStream
.outputMode(OutputMode.Update)
.format("org.apache.spark.sql.cassandra")
.option("checkpointLocation", "some_checkpoint_location")
.option("keyspace", "test")
.option("table", "sttest_tweets")
.start()
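For completeness, a Java sketch of the same sink, assuming Spark Cassandra Connector 2.5.0+ is on the classpath; the keyspace, table, and checkpoint values are the same placeholders used in the Scala snippet above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CassandraStreamSink {

    // Sketch only: writes a streaming Dataset straight to Cassandra via the
    // connector's "org.apache.spark.sql.cassandra" source, no RDD conversion.
    // Keyspace/table/checkpoint values are placeholders from the answer above.
    public static StreamingQuery writeToCassandra(Dataset<Row> streamingCountsDF) throws Exception {
        return streamingCountsDF.writeStream()
                .outputMode(OutputMode.Update())
                .format("org.apache.spark.sql.cassandra")
                .option("checkpointLocation", "some_checkpoint_location")
                .option("keyspace", "test")
                .option("table", "sttest_tweets")
                .start();
    }
}
```

Note that this requires the Spark and connector jars at runtime, so it is shown here as an illustrative sketch rather than a standalone program.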
1. Java Bean for DealFeedSchema
import java.util.List;

public class DealFeedSchema {
    private long id;
    private String name;
    private String cat;
    private List<String> tag;

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getCat() {
        return cat;
    }

    public void setCat(String cat) {
        this.cat = cat;
    }

    public List<String> getTag() {
        return tag;
    }

    public void setTag(List<String> tag) {
        this.tag = tag;
    }
}
2. Load the test data
Dataset<Row> dataFrame = spark.createDataFrame(Arrays.asList(
        RowFactory.create(1L, "foo", "cat1", Arrays.asList("tag1", "tag2"))
), getSchema());
dataFrame.show(false);
dataFrame.printSchema();
/**
* +---+----+----+------------+
* |id |name|cat |tag |
* +---+----+----+------------+
* |1 |foo |cat1|[tag1, tag2]|
* +---+----+----+------------+
*
* root
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
* |-- cat: string (nullable = true)
* |-- tag: array (nullable = true)
* | |-- element: string (containsNull = true)
*/
3. Convert Dataset<Row> to Dataset<DealFeedSchema>
Dataset<DealFeedSchema> dealFeedSchemaDataset = dataFrame.as(Encoders.bean(DealFeedSchema.class));
dealFeedSchemaDataset.show(false);
dealFeedSchemaDataset.printSchema();
/**
* +---+----+----+------------+
* |id |name|cat |tag |
* +---+----+----+------------+
* |1 |foo |cat1|[tag1, tag2]|
* +---+----+----+------------+
*
* root
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
* |-- cat: string (nullable = true)
* |-- tag: array (nullable = true)
* | |-- element: string (containsNull = true)
*/