
Hadoop Spark: How to get distinct elements in a JavaRDD?

I want to store the distinct elements of a JavaRDD collection to a file in Spark.

Using the distinct() method of the RDD, I couldn't achieve this.

My guess is that the RDD treats each element as an individual instance. How can we get distinct elements in this case?

The following is my code snippet. Can anyone please help?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;

public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Xml Spark Demo");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaSQLContext sqlContext = new JavaSQLContext(sc);

    // Load the XML file and parse each line with the custom ParseAgentFromXml function.
    JavaRDD<String> dataFromFile = sc.textFile(
            "/home/kedarnath/Rentals/inputData/temp-01.xml").map(
            new ParseAgentFromXml());

    // Need distinct values here
    dataFromFile.distinct().saveAsTextFile(
            "/home/kedarnath/Rentals/testOutputDistinct.txt");
}

Thanks in advance,

~Kedar

I am not sure if this would be the most efficient way of doing it from a performance perspective, but I would try dividing the process into two different steps: distinct and mapping to pairs. Consider the following example:

Original dataset:          Desired output (distinct elements)

Apple                      1, Apple
Tree                       2, Tree
Car                        3, Car
Priest                     4, Priest
Apple                      5, Phone
Tree
Apple
Phone
  • Distinct:

By using the distinct() transformation, you would obtain a new RDD dataset with all the distinct elements. In this case, it would return something like the following (a short code sketch is shown after the list):

Apple
Tree
Car
Priest
Phone
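
A minimal sketch of this first step, assuming a small in-memory dataset built with parallelize() in place of the XML input from the question, and a hypothetical /tmp output path:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Distinct Demo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The example dataset from above, built in memory for illustration.
        JavaRDD<String> data = sc.parallelize(Arrays.asList(
                "Apple", "Tree", "Car", "Priest", "Apple", "Tree", "Apple", "Phone"));

        // distinct() shuffles the data and keeps exactly one copy of each element.
        JavaRDD<String> distinctData = data.distinct();

        // Each partition is written as a part-xxxxx file inside this directory.
        distinctData.saveAsTextFile("/tmp/distinct-output");

        sc.stop();
    }
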
  • Map to pair:

The next step would be to associate a key with every value in the RDD dataset, transforming it into a key-value format. For this, the mapToPair() transformation could be used, producing the desired output shown below (a sketch follows the list):

1, Apple
2, Tree
3, Car
4, Priest
5, Phone
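
The answer leaves the exact pairing function open; one common way to generate the sequential keys shown above is zipWithIndex(), followed by mapToPair() to swap each tuple into (number, value) order. A sketch, continuing from the distinctData RDD of the previous snippet (the output path is again hypothetical):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;

    import scala.Tuple2;

    // zipWithIndex() pairs every element with its 0-based position in the RDD.
    JavaPairRDD<String, Long> withIndex = distinctData.zipWithIndex();

    // Swap to (index, value) and make the index 1-based to match the desired output.
    JavaPairRDD<Long, String> numbered = withIndex.mapToPair(
            new PairFunction<Tuple2<String, Long>, Long, String>() {
                @Override
                public Tuple2<Long, String> call(Tuple2<String, Long> pair) {
                    return new Tuple2<Long, String>(pair._2() + 1, pair._1());
                }
            });

    // Each line is saved in Tuple2's toString form, e.g. "(1,Apple)".
    numbered.saveAsTextFile("/tmp/numbered-output");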

Visit this page to get more information about the different available methods.
