
Spark to process RDD chunk by chunk from JSON files and post to Kafka topic

I am new to Spark & Scala. I have a requirement to process a number of JSON files, say from an S3 location. This data is basically batch data that is kept around for reprocessing sometime later. Now my Spark job should process these files in such a way that it picks 5 raw JSON records at a time and sends them as a message to a Kafka topic. The reason for picking only 5 records is that the Kafka topic is processing both real-time and batch data simultaneously on the same topic, so the batch processing should not delay the real-time processing.

I need to process the whole JSON file sequentially, so I would pick only 5 records at a time, post a message to Kafka, then pick the next 5 records of the JSON file, and so on...

I have written a piece of code which reads from the JSON files and posts the records to a Kafka topic.

        val jsonRDD = sc.textFile(s3Location)

        var count = 0
        val buf = new StringBuilder

        jsonRDD.collect().foreach(line => {
            count += 1
            buf ++= line
            if (count == 5) {
                println(s"Printing 5 jsons $buf")
                SendMessagetoKakfaTopic(buf) // pseudo code for sending the message to the Kafka topic
                count = 0
                buf.setLength(0)             // clear only after the message has been sent
                Thread.sleep(10000)
            }
        })
        if (buf.nonEmpty) {
            println(s"Printing remaining jsons $buf")
            SendMessagetoKakfaTopic(buf)
        }

I believe there is a more efficient way of processing JSONs in Spark.

I should also be looking at other parameters like memory, resources, etc., since the data might exceed hundreds of gigabytes.

That looks like a case for Spark Streaming or (recommended) Spark Structured Streaming.

In either case you monitor a directory and process new files every batch interval (configurable).
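As a rough illustration, a Structured Streaming job could monitor the S3 prefix as a file source and throttle how many files are picked up per micro-batch. This is only a sketch; the path and the maxFilesPerTrigger value are assumptions, not something from the original question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("BatchJsonToKafka")
      .getOrCreate()

    // Monitor the S3 prefix; each line of every new file becomes one row
    // in the streaming DataFrame (column name: "value").
    val rawJson = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", 1)       // throttle: at most one new file per micro-batch
      .load("s3a://my-bucket/batch-json/")   // hypothetical location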


You could handle it using SparkContext.textFile (with wildcards) or SparkContext.wholeTextFiles. In either case, you'll eventually end up with an RDD[String] representing the records in your JSON files (with textFile, one record per line; with wholeTextFiles, one record per file).
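A minimal sketch of the two loading options (the s3a paths are placeholders):

    import org.apache.spark.rdd.RDD

    // One String record per line, across all files matching the wildcard
    val lines: RDD[String] = sc.textFile("s3a://my-bucket/batch-json/*.json")

    // One (filePath, fileContent) pair per file -- useful when each file
    // is a single multi-line JSON document
    val docs: RDD[String] = sc.wholeTextFiles("s3a://my-bucket/batch-json/").map(_._2)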

If your requirement is to process the files sequentially, 5-line chunk by 5-line chunk, you could make the transformation pipeline slightly more efficient by using RDD.toLocalIterator:

toLocalIterator: Iterator[T]

Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD.

See the RDD API.

With the Iterator of JSONs, you'd use grouped(5) (i.e. sliding with a step of 5) to take 5 elements at a time.

That would give you a pretty efficient pipeline.
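A minimal sketch of that pipeline, assuming a plain Kafka producer running on the driver (the broker address, topic name and sleep interval are placeholders borrowed from the question's pseudo code):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Stream records to the driver one partition at a time and send them
    // in non-overlapping chunks of 5 (grouped(5) == sliding(5, 5)).
    sc.textFile(s3Location).toLocalIterator.grouped(5).foreach { chunk =>
      producer.send(new ProducerRecord[String, String]("my-topic", chunk.mkString("\n")))
      Thread.sleep(10000)  // throttle so batch traffic does not starve real-time traffic
    }
    producer.close()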


I once again strongly recommend reading up on Structured Streaming in the Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher) (it's about reading, but writing is also supported).
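For completeness, a hedged sketch of the write side with Structured Streaming's Kafka sink, continuing from the rawJson stream above (requires the spark-sql-kafka-0-10 package; the broker, topic, checkpoint path and trigger interval are assumptions):

    import org.apache.spark.sql.streaming.Trigger

    val query = rawJson
      .selectExpr("CAST(value AS STRING) AS value")  // the Kafka sink expects a "value" column
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "my-topic")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/batch-json-to-kafka/")
      .trigger(Trigger.ProcessingTime("10 seconds"))  // pace the batch data
      .start()

    query.awaitTermination()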
