Apache Spark (Scala) - print 1 entry of an RDD / pairRDD

When using an RDD, I have grouped the items within the RDD by key.

    val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey
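
For context, here is a minimal sketch of how such a pairRDD could be built; the Purchase case class, the sample data, and the sc SparkContext are hypothetical stand-ins for whatever oldRDD actually contains:

    // Hypothetical record type standing in for the elements of oldRDD
    case class Purchase(user: Int, product: Int)

    // sc is assumed to be an existing SparkContext (e.g. the spark-shell one)
    val oldRDD = sc.parallelize(Seq(Purchase(10, 100), Purchase(10, 101), Purchase(20, 200)))

    // Group every product bought by the same user under that user's key
    val pairRDD = oldRDD.map(x => (x.user, x.product)).groupByKey()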

pairRDD is of type RDD[(Int, Iterable[Int])].

What I am having trouble with is simply accessing a particular element. What is the point of having a key when I seemingly can't access the item in the RDD by key?

At the minute I filter the RDD down to a single item, but then I still have an RDD, so I have to do a foreach on it just to print it out:

    val userNumber10 = pairRDD.filter(_._1 == 10)
    userNumber10.foreach(x => println("user number = " + x._1))

Alternatively, I can filter the RDD and then call take(1), which returns an Array of size 1:

    val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)

Going one step further, I can select the first element of that returned array:

    val userNumber10Array = pairRDD.filter(_._1 == 10).take(1)(0)

This returns the pair as required. But clearly this is inconvenient, and I would hazard a guess that this is not how an RDD is meant to be used!

Why am I doing this, you may ask? Well, it came about because I simply wanted to "see" what was in my RDD for my own testing purposes. So, is there a way to access individual items in an RDD (more strictly, a pairRDD), and if so, how? If not, what is the purpose of a pairRDD?

Use the lookup function, which belongs to PairRDDFunctions. From the official documentation:

Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

https://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/PairRDDFunctions.html
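
As a rough sketch against the question's pairRDD (the key 10 mirrors the earlier filter example; lookup returns a plain Seq on the driver, so no further RDD actions are needed to print it):

    // lookup(10) returns Seq[Iterable[Int]]: every value stored under key 10,
    // which after groupByKey is a single grouped Iterable of products
    val productsForUser10 = pairRDD.lookup(10)
    println("user 10 products = " + productsForUser10.flatten.mkString(", "))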

And if you just want to see the contents of your RDD, you can simply call collect.
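
For example, a small sketch reusing the same pairRDD; collect pulls the entire RDD back to the driver as a local Array, so this is only sensible for small, test-sized data:

    // Materialise the whole RDD on the driver and print each (user, products) pair
    pairRDD.collect().foreach { case (user, products) =>
      println("user " + user + " -> products " + products.mkString(", "))
    }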
