Use Spark Structured Streaming with StreamingKMeans

I want to cluster a streaming dataset using Spark. I first tried KMeans, but calling its fit method throws a runtime exception saying it cannot be used with streaming data:

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();

Then I tried StreamingKMeans, but it seems this model only works with the legacy streaming API in Spark and accepts a DStream. Does anyone know a workaround for this, or another solution to this problem?

The code I've written so far is as follows:

        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", topic)
                .load()
                .selectExpr("CAST(value AS String)")
                .select(functions.from_json(new Column("value"), schema).as("data"))
                .select("data.*");

        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(features)
                .setOutputCol("features");
        df = assembler.transform(df);


        StreamingKMeans kmeans = new StreamingKMeans().setK(3).setDecayFactor(1.0);
        StreamingKMeansModel model = kmeans.predictOn(df);

Cannot resolve method 'predictOn(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>)'

In the end I found out that this is not possible, so I switched to DStream instead of Structured Streaming.
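
As a side note on the `setDecayFactor(1.0)` call in the code above: the Spark MLlib guide describes the streaming k-means update of each center as a weighted average of the old center and the mean of the new batch, c' = (c·n·a + x·m) / (n·a + m), where a is the decay factor. The plain-Java sketch below only illustrates that formula. It is not Spark's implementation and does not use Spark at all; the class name `DecayUpdateSketch` and the method `updateCenter` are made up for this example.

```java
import java.util.Arrays;

/** Plain-Java illustration of the documented update formula; not Spark code. */
final class DecayUpdateSketch {

    /**
     * Returns the updated center of one cluster after one mini-batch:
     *   (c * n * a + x * m) / (n * a + m)
     * c = current center, n = weight of the center (points seen so far),
     * a = decay factor, x = mean of the batch points assigned to this
     * cluster, m = number of those points.
     */
    static double[] updateCenter(double[] center, double weight,
                                 double[] batchMean, long batchCount,
                                 double decay) {
        if (center.length != batchMean.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double oldMass = weight * decay;      // how much the history still counts
        double total = oldMass + batchCount;
        if (total == 0.0) {
            return center.clone();            // nothing to average, keep the center
        }
        double[] updated = new double[center.length];
        for (int i = 0; i < center.length; i++) {
            updated[i] = (center[i] * oldMass + batchMean[i] * batchCount) / total;
        }
        return updated;
    }

    public static void main(String[] args) {
        double[] center = {0.0, 0.0};
        double[] batchMean = {4.0, 2.0};

        // Decay 1.0 (as in the question): 10 old and 10 new points count equally.
        System.out.println(Arrays.toString(
                updateCenter(center, 10.0, batchMean, 10, 1.0))); // [2.0, 1.0]

        // Decay 0.0: the history is ignored and the center jumps to the batch mean.
        System.out.println(Arrays.toString(
                updateCenter(center, 10.0, batchMean, 10, 0.0))); // [4.0, 2.0]
    }
}
```

So with a decay factor of 1.0 every point seen since the start carries the same weight, while values closer to 0 make the centers follow the most recent batches.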
