
Spark streaming data from a Kafka topic and writing into text files in an external path

I want to read data from a Kafka topic, group it by key values, and write the result into text files.

public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
        .builder()
        .appName("Sparkconsumer")
        .master("local[*]")
        .getOrCreate();
    SQLContext sqlContext = spark.sqlContext();
    SparkContext context = spark.sparkContext();

    Dataset<Row> lines = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "test-topic")
        .load();

    Dataset<Row> r = lines.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
    r.printSchema();
    r.createOrReplaceTempView("basicView");

    sqlContext.sql("select * from basicView")
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream()
        .outputMode("append")
        .format("console")
        .option("path", "usr//path")
        .start()
        .awaitTermination();
}

The following points in your code are misleading:

  • To read from Kafka and write into a file, you do not need SparkContext or SQLContext.
  • You are casting your key and value into a string twice.
  • The format of your output query should not be console if you want to store the data into a file.

An example can be looked up in the Spark Structured Streaming + Kafka Integration Guide and the Spark Structured Streaming Programming Guide:

public static void main(String[] args) throws Exception {

  SparkSession spark = SparkSession
    .builder()
    .appName("Sparkconsumer")
    .master("local[*]")
    .getOrCreate();

  Dataset<Row> lines = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe","test-topic")
    .load();

  Dataset<Row> r = lines
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    // do some more processing such as 'groupBy'
    ;

  r.writeStream
    .format("parquet")        // can be "orc", "json", "csv", etc.
    .outputMode("append")
    .option("path", "path/to/destination/dir")
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .start()
    .awaitTermination();
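
Since the question also asks to group the stream by key: when a streaming aggregation is written to a file sink, only the append output mode is supported, and append mode in turn requires a watermark on an event-time column. The sketch below is one way to fill in the 'groupBy' placeholder above, not part of the original answer; the timestamp column comes from the Kafka source schema, while the 10-minute watermark and 5-minute window are illustrative assumptions.

  import static org.apache.spark.sql.functions.col;
  import static org.apache.spark.sql.functions.window;

  // Reuses `lines` from the snippet above: a windowed count per key,
  // emitted once the watermark passes the end of each window.
  Dataset<Row> grouped = lines
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .withWatermark("timestamp", "10 minutes")   // required for append-mode aggregation
    .groupBy(window(col("timestamp"), "5 minutes"), col("key"))
    .count()
    // flatten the window struct, since the CSV sink cannot write struct columns
    .selectExpr("window.start AS window_start", "key", "`count`");

  grouped.writeStream()
    .format("csv")
    .outputMode("append")
    .option("path", "path/to/destination/dir")
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .start()
    .awaitTermination();

Note that the checkpointLocation option is mandatory for file sinks (unless a default is configured via spark.sql.streaming.checkpointLocation); without it the query will not start.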
