
Consuming big data with Kafka and Spark

I have a stream of JSON data provided over a WebSocket; its size varies between 1 MB and 60 MB per second.

I have to decode the data, then parse it, and finally write it to MySQL.

I have thought of two approaches:

1) Read the data from the socket, decode it, and send it to the consumer via Avro from the producer; then, in the consumer, get the data and write it to MySQL in a Spark map/reduce job.

2) Read the data from the socket and send it to the consumer from the producer as-is; then, in the consumer, get the data, decode it with Spark, and hand the parsed data to a Spark job that writes it to MySQL.

Do you have any ideas?

Producer

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package com.tan;


import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
/**
 *
 * @author Tan
 */
public class MainKafkaProducer {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws InterruptedException {
        // TODO code application logic here
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        //props.put("group.id", "mygroup");
        //props.put("max.partition.fetch.bytes", "100000000");
        //props.put("serializer.class", "kafka.serializer.StringEncoder");
        //props.put("partitioner.class","kafka.producer.DefaultPartitioner");
        //props.put("request.required.acks", "1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Read one JSON message from a local file (a stand-in for the websocket feed)
        // and publish it 100 times to the "mytopic" topic.
        String fileName = "/Users/Tan/Desktop/feed.json";
        try (BufferedReader file = new BufferedReader(new FileReader(fileName))) {
            String st = file.readLine();
            for (int i = 0; i < 100; i++) {
                ProducerRecord<String, String> record = new ProducerRecord<>("mytopic", st);
                producer.send(record);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        /*
        for(int i = 0; i < 100; i++)
        {
            ProducerRecord<String, String> record2 = new ProducerRecord<>("mytopic", "Hasan-" + i);
            producer.send(record2);
        }
        */


        producer.close();
    }

}
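Since the question describes a websocket feed but the producer above reads a local file, here is a minimal sketch of feeding the producer from a websocket instead. It assumes JDK 11+ (for the built-in java.net.http.WebSocket client), and the endpoint URL is hypothetical.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.Properties;
import java.util.concurrent.CompletionStage;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WebSocketToKafka {

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Buffer for messages that the websocket delivers in several frames.
        StringBuilder buffer = new StringBuilder();

        HttpClient.newHttpClient().newWebSocketBuilder()
                .buildAsync(URI.create("ws://example.com/feed"), new WebSocket.Listener() { // hypothetical endpoint
                    @Override
                    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
                        buffer.append(data);
                        if (last) {
                            // One complete JSON message received: forward it to Kafka as-is.
                            producer.send(new ProducerRecord<>("mytopic", buffer.toString()));
                            buffer.setLength(0);
                        }
                        webSocket.request(1); // ask for the next frame
                        return null;
                    }
                }).join();

        // Keep the JVM alive while the listener runs (simplified for this sketch).
        Thread.sleep(Long.MAX_VALUE);
    }
}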

Consumer

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */
package com.tan;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
/**
 *
 * @author Tan
 */
public class MainKafkaConsumer {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        SparkConf conf = new SparkConf()
                .setAppName(MainKafkaConsumer.class.getName())
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

        Set<String> topics = Collections.singleton("mytopic");
        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");

        JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(ssc, 
                String.class, String.class, 
                StringDecoder.class, StringDecoder.class, 
                kafkaParams, topics);

        // Print each consumed record's value; this is where the JSON would be decoded,
        // parsed, and written to MySQL.
        directKafkaStream.foreachRDD(rdd -> {
            rdd.foreach(record -> {
                System.out.println(record._2);
            });
        });
        /*
        directKafkaStream.foreachRDD(rdd -> {
            System.out.println("--- New RDD with " + rdd.partitions().size()
                    + " partitions and " + rdd.count() + " records");
            rdd.foreach(record -> {
                System.out.println(record._2);
            });
        });
        */



        ssc.start();
        ssc.awaitTermination();

    }

}
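The consumer above only prints each record. A minimal sketch of the last step the question asks about (parse the JSON and write it to MySQL) could replace the println block inside the streaming job; the table name feed_events, its payload column, and the JDBC URL are hypothetical, and the MySQL JDBC driver plus Jackson are assumed to be on the classpath.

// Sketch only: one JDBC connection per partition, batched inserts.
directKafkaStream.foreachRDD(rdd -> {
    rdd.foreachPartition(records -> {
        com.fasterxml.jackson.databind.ObjectMapper mapper =
                new com.fasterxml.jackson.databind.ObjectMapper();
        try (java.sql.Connection conn = java.sql.DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "password"); // hypothetical URL/credentials
             java.sql.PreparedStatement stmt =
                     conn.prepareStatement("INSERT INTO feed_events (payload) VALUES (?)")) {
            while (records.hasNext()) {
                String json = records.next()._2();
                // Parse (and implicitly validate) the JSON before persisting it.
                com.fasterxml.jackson.databind.JsonNode node = mapper.readTree(json);
                stmt.setString(1, node.toString());
                stmt.addBatch();
            }
            stmt.executeBatch();
        }
    });
});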

Your process is fine; the point is just the Avro conversion. Your data is not that big, 1 MB to 60 MB.

Here I have a similar process: read from an MQ, process the data, convert it to Avro, send it to Kafka, consume from Kafka, parse the data, and post it to another MQ.

Avro helps a lot when the data is huge, say >= 1 GB. But in some cases our data is really small, like < 10 MB. In that case Avro makes the processing a little slower, and there is no gain in network transmission.
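For reference, a minimal sketch of the Avro conversion being discussed, encoding one record to Avro binary so the payload size with and without Avro can be compared; the schema and field names are made up for illustration.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSizeCheck {

    // Made-up schema for illustration only.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Feed\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"payload\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("payload", "{\"example\":\"json message\"}");

        // Encode the record to Avro binary and report its size.
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        System.out.println("Avro-encoded size: " + out.size() + " bytes");
    }
}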

What I suggest: if your network is good enough that you don't need Avro, it is better to do it without Avro. To improve performance on the Spark side, configure the Kafka topic with a good number of partitions, because if you have just one partition, Spark will not parallelize correctly. Check this text, which can help you.
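A sketch of creating the topic with more partitions, assuming a Kafka version that ships the AdminClient API (the code in the question uses the older 0.8-style consumer, so this may not apply as-is); the partition count of 6 is just an example and should roughly match the parallelism you want on the Spark side.

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 1 (single local broker); adjust for your cluster.
            NewTopic topic = new NewTopic("mytopic", 6, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}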
