
Data skipped while writing Spark Streaming output to HDFS

I'm running a Spark Streaming application with a 10-second batch interval. Its job is to consume data from Kafka, transform it, and store it in HDFS based on the key, i.e. one file per unique key. I'm using Hadoop's saveAsHadoopFile() API to store the output. A file does get generated for every unique key, but the issue is that only one row gets stored per unique key, even though the DStream contains multiple rows for the same key.

For example, consider the following DStream, which has one unique key:

  key                  value
 =====   =====================================
 Key_1   183.33 70.0 0.12 1.0 1.0 1.0 11.0 4.0 
 Key_1   184.33 70.0 1.12 1.0 1.0 1.0 11.0 4.0 
 Key_1   181.33 70.0 2.12 1.0 1.0 1.0 11.0 4.0 
 Key_1   185.33 70.0 1.12 1.0 1.0 1.0 11.0 4.0 
 Key_1   185.33 70.0 0.12 1.0 1.0 1.0 11.0 4.0 

I see that only one row (instead of 5) gets stored in the HDFS file:

185.33 70.0 0.12 1.0 1.0 1.0 11.0 4.0

The following code is used to store the output into HDFS:

// `random` is an instance field, e.g. private final Random random = new Random();
dStream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> pairRDD) throws Exception {
        long timestamp = System.currentTimeMillis();
        int randomInt = random.nextInt();
        pairRDD.saveAsHadoopFile("hdfs://localhost:9000/application-" + timestamp + "-" + randomInt,
                String.class, String.class, RDDMultipleTextOutputFormat.class);
        return null;
    }
});

where the implementation of RDDMultipleTextOutputFormat is as follows:

public class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    public K generateActualKey(K key, V value) {
        return null; // suppress the key so only the value is written to the file
    }

    @Override
    public String generateFileNameForKeyValue(K key, V value, String name) {
        return key.toString();
    }
}

Please let me know if I'm missing anything. Thanks for your help.

Since the key is the same, the value gets replaced every time, so you end up with only the last value handed to Hadoop.
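The last-write-wins symptom can be reproduced outside Spark: writing values into a map under the same key keeps only the final one, whereas merging values per key keeps them all. A minimal plain-Java sketch of that distinction (the class and method names here are illustrative, not from the original code); in Spark the equivalent fix would be to aggregate before writing, e.g. `pairRDD.reduceByKey((a, b) -> a + "\n" + b)` ahead of saveAsHadoopFile, or to have generateFileNameForKeyValue return a per-partition path such as `key + "/" + name` so tasks stop overwriting each other's output:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyedWriteDemo {

    // Last-write-wins: mirrors what happens when every record with the same
    // key ends up at the same output path -- only one row survives.
    static Map<String, String> lastWriteWins(String[][] records) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] r : records) {
            out.put(r[0], r[1]); // same key: the previous value is replaced
        }
        return out;
    }

    // Merge-per-key: concatenates all values for a key, newline-separated,
    // so every row is preserved (the effect reduceByKey would have).
    static Map<String, String> mergePerKey(String[][] records) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] r : records) {
            out.merge(r[0], r[1], (a, b) -> a + "\n" + b);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Key_1", "183.33 70.0 0.12 1.0 1.0 1.0 11.0 4.0"},
            {"Key_1", "184.33 70.0 1.12 1.0 1.0 1.0 11.0 4.0"},
            {"Key_1", "185.33 70.0 0.12 1.0 1.0 1.0 11.0 4.0"}
        };
        System.out.println(lastWriteWins(records).get("Key_1")); // only the last row
        System.out.println(mergePerKey(records).get("Key_1"));   // all three rows
    }
}
```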
