
Return row with highest value per key without losing whole row in RDD

I have started to play with PySpark RDDs and DataFrames. Coming from SQL, I am comfortable with DataFrames and the SQL module, but I am struggling to filter rows in a plain RDD without converting it to a DataFrame. In the example below I want to find, for each value of the first column, the row with the highest third column, and return either the whole row or just the second column, sorted by the first column. With a DataFrame I would use a window partitioned by the first column, rank each row, and filter on the rank (sketched after the desired output below).

Data = sc.parallelize([((12, u'IL'), -1.4944293272864724),
                       ((10, u'NM'), 14.230100203137535),
                       ((12, u'ND'), -9.687170853837522),
                       ((5, u'MO'), 18.73167803079034),
                       ((12, u'NH'), -3.329505034062821)])

Desired output

Data.collect()
[[5, u'MO', 18.73167803079034], [10, u'NM', 14.230100203137535], [12, u'IL', -1.4944293272864724]]

Alternatively

Data.collect()
[u'MO', u'NM', u'IL']
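
For reference, the DataFrame approach I have in mind looks roughly like this (a sketch, assuming an active SparkSession so that toDF works; the column names a, b, c are placeholders I chose):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# flatten the (key, state) tuple, convert to a DataFrame, and keep
# the top-ranked row per key, ordered by the key column
df = Data.map(lambda x: (x[0][0], x[0][1], x[1])).toDF(["a", "b", "c"])
w = Window.partitionBy("a").orderBy(col("c").desc())
df.withColumn("rn", row_number().over(w)) \
  .filter(col("rn") == 1) \
  .drop("rn") \
  .orderBy("a") \
  .show()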

RDDs have dedicated operators for this. To achieve what you want without losing the content of your RDD, you can proceed like so:

Sorted = Data.sortBy(lambda x: x[1], ascending=False)
Mapped = Sorted.map(lambda x: x[0][1])
Mapped.collect()

The output of the above sequence of instructions would be:

['MO', 'NM', 'IL', 'NH', 'ND']

You can adjust the second instruction (the map operator) to retrieve any element, not just the labels you mentioned.
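
For example, to keep the whole row flattened into a list rather than just the label (same pipeline, different map):

Sorted.map(lambda x: [x[0][0], x[0][1], x[1]]).collect()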

If you only want the first three elements, you can use take(3) instead of collect():

Mapped.take(3)

The output would then be:

['MO', 'NM', 'IL']

The sortBy method can be used:

Data.sortBy(lambda x: x[1], ascending=False).collect()
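
For the sample data this should return the original pairs sorted by value in descending order:

[((5, u'MO'), 18.73167803079034),
 ((10, u'NM'), 14.230100203137535),
 ((12, u'IL'), -1.4944293272864724),
 ((12, u'NH'), -3.329505034062821),
 ((12, u'ND'), -9.687170853837522)]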

To get only the required column, pass the result of sortBy to map:

Data.sortBy(lambda x: x[1], ascending=False).map(lambda x: x[0][1]).collect()

You can use reduceByKey() to find the row with the maximum value for each key, and then use sortByKey() to get the final sorted RDD. Here it is step by step, with the intermediate results shown:

>>> Data = sc.parallelize([((12, u'IL'), -1.4944293272864724),
...                        ((10, u'NM'), 14.230100203137535),
...                        ((12, u'ND'), -9.687170853837522),
...                        ((5, u'MO'), 18.73167803079034),
...                        ((12, u'NH'), -3.329505034062821)])

First, transform the RDD to have the first value as a key and the rest as the value:

>>> from pprint import pprint
>>> rdd1 = Data.map(lambda x: (x[0][0], (x[0][1], x[1])))
>>> pprint(rdd1.collect())
[(12, (u'IL', -1.4944293272864724)),
 (10, (u'NM', 14.230100203137535)),
 (12, (u'ND', -9.687170853837522)),
 (5, (u'MO', 18.73167803079034)),
 (12, (u'NH', -3.329505034062821))]

Use reduceByKey() to get the pair with the largest value for a given key:

>>> rdd2 = rdd1.reduceByKey(lambda x, y: x if x[1] > y[1] else y)
>>> pprint(rdd2.collect())
[(5, (u'MO', 18.73167803079034)),
 (10, (u'NM', 14.230100203137535)),
 (12, (u'IL', -1.4944293272864724))]

By coincidence the result is already sorted, but don't rely on that:

>>> rdd3 = rdd2.sortByKey()

Map to the desired output format and collect:

>>> rdd3.map(lambda x: list((x[0],) + x[1])).collect()
[[5, u'MO', 18.73167803079034], [10, u'NM', 14.230100203137535], [12, u'IL', -1.4944293272864724]]

In a single expression:

>>> Data.map(lambda x: (x[0][0], (x[0][1], x[1]))) \
...     .reduceByKey(lambda x, y: x if x[1] > y[1] else y) \
...     .sortByKey() \
...     .map(lambda x: list((x[0],) + x[1])) \
...     .collect()
[[5, u'MO', 18.73167803079034], [10, u'NM', 14.230100203137535], [12, u'IL', -1.4944293272864724]]
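
To get only the state labels instead (the alternative output in the question), change the final map to take the first element of each value:

>>> Data.map(lambda x: (x[0][0], (x[0][1], x[1]))) \
...     .reduceByKey(lambda x, y: x if x[1] > y[1] else y) \
...     .sortByKey() \
...     .map(lambda x: x[1][0]) \
...     .collect()
[u'MO', u'NM', u'IL']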
