Apache Spark (Scala) - print 1 entry of an RDD / pairRDD

When using an RDD, I have grouped the items within the RDD by key.

    val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey
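
For context, here is a minimal sketch of how such a pairRDD could be built; the Purchase case class, the sample data, and the sc SparkContext are hypothetical stand-ins for whatever oldRDD actually contains:

    // Hypothetical record type standing in for the elements of oldRDD
    case class Purchase(user: Int, product: Int)

    // sc is assumed to be an existing SparkContext (e.g. the spark-shell one)
    val oldRDD = sc.parallelize(Seq(Purchase(10, 100), Purchase(10, 101), Purchase(20, 200)))

    // Group every product bought by the same user under that user's key
    val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey()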

pairRDD is of type RDD[(Int, Iterable[Int])].

What I am having trouble with is simply accessing a particular element. What is the point of having a key when I seemingly can't access the item in the RDD by key?

At the minute I filter the RDD down to a single item, but then I still have an RDD, so I have to do a foreach on it just to print it out:

    val userNumber10 = pairRDD.filter(_._1 == 10)
    userNumber10.foreach(x => println("user number = " + x._1))

Alternatively, I can filter the RDD and then call take(1), which returns an Array of size 1:

    val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)

Going one step further, I can select the first element of that returned array:

    val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)(0)

This returns the pair as required. But clearly this is inconvenient, and I would hazard a guess that this is not how an RDD is meant to be used!

Why am I doing this, you may ask? Well, it came about because I simply wanted to "see" what was in my RDD for my own testing purposes. So, is there a way to access individual items in an RDD (more strictly, a pairRDD), and if so, how? If not, what is the purpose of a pairRDD?

Use the lookup function, which belongs to PairRDDFunctions. From the official documentation:

Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

https://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/PairRDDFunctions.html
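
As a rough sketch against the question's pairRDD (the key 10 mirrors the earlier filter example; lookup returns a plain Seq on the driver, so no further RDD actions are needed to print it):

    // lookup(10) returns Seq[Iterable[Int]]: every value stored under key 10,
    // which after groupByKey is a single grouped Iterable of products
    val productsForUser10 = pairRDD.lookup(10)
    println("user 10 products = " + productsForUser10.flatten.mkString(", "))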

And if you just want to see the contents of your RDD, you can simply call collect.
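
For example, a small sketch reusing the same pairRDD; collect pulls the entire RDD back to the driver as a local Array, so this is only sensible for small, test-sized data:

    // Materialise the whole RDD on the driver and print each (user, products) pair
    pairRDD.collect().foreach { case (user, products) =>
      println("user " + user + " -> products " + products.mkString(", "))
    }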
