Apache Spark (Scala) - print 1 entry of an RDD / pairRDD
When using an RDD I have grouped the items within the RDD by key:
val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey
pairRDD is of type RDD[(Int, Iterable[Int])].
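For context, a minimal sketch of how such a pairRDD might be built (the Purchase case class and the sample data are hypothetical, not from the question, and an existing SparkContext sc is assumed):

case class Purchase(user: Int, product: Int)  // hypothetical record type

val oldRDD = sc.parallelize(Seq(Purchase(10, 1), Purchase(10, 2), Purchase(11, 3)))
val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey()
// pairRDD now contains e.g. (10, Iterable(1, 2)) and (11, Iterable(3))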
What I am having trouble with is simply accessing a particular element. What is the point of having a key when I can't seemingly access the item in the RDD by key?
At the minute I filter the RDD down to a single item, but I still have an RDD, and as such I have to do a foreach on the RDD to print it out:
val userNumber10 = pairRDD.filter(_._1 == 10)
userNumber10.foreach(x => println("user number = " + x._1))
Alternatively, I can filter the RDD and then take(1), which returns an Array of size 1:
val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)
Alternatively, I can select the first element of that returned array:
val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)(0)
This returns the pair as required. But... clearly, this is inconvenient, and I would hazard a guess that this is not how an RDD is meant to be used!
Why am I doing this, you may ask! Well, it came about because I simply wanted to "see" what was in my RDD for my own testing purposes. So, is there a way to access individual items in an RDD (more strictly, a pairRDD), and if so, how? If not, what is the purpose of a pairRDD?
Use the lookup function, which belongs to PairRDDFunctions. From the official documentation:
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
https://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/PairRDDFunctions.html
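For example, a sketch assuming the pairRDD from the question (because of groupByKey, the value type is Iterable[Int], so lookup returns a Seq[Iterable[Int]] containing at most one element per key):

// Look up the grouped products for user 10 directly by key.
val productsForUser10: Seq[Iterable[Int]] = pairRDD.lookup(10)
productsForUser10.headOption.foreach { products =>
  println("user 10's products = " + products.mkString(", "))
}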
And if you just want to see the contents of your RDD, you simply call collect.
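For example (again a sketch against the pairRDD above; collect brings every element back to the driver, so only use it on RDDs small enough to fit in driver memory):

// Materialise the whole RDD on the driver and print each (user, products) pair.
pairRDD.collect().foreach { case (user, products) =>
  println("user " + user + " -> products " + products.mkString(", "))
}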