List Transformation With Lambdas in Spark

I am attempting to take an RDD containing pairs of integer ranges, and transform it so that each pair has a third term which iterates through the possible values in the range. Basically, I've got this:

[[1,10], [11,20], [21,30]]

And I'd like to end up with this:

[[1,1,10], [2,1,10], [3,1,10], [4,1,10], [5,1,10]...]

The file I'd like to transform is very large, which is why I'm looking to do this with PySpark rather than just Python on a local machine (I've got a way to do it locally on a CSV file, but the process takes several hours given the file's size). So far, I've got this:

a = [[1,10], [11,20], [21,30]]
b = sc.parallelize(a)
c = b.map(lambda x: [range(x[0], x[1]+1), x[0], x[1]])
c.collect()

Which yields:

>>> c.collect()
[[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1, 10], [[11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 11, 20], [[21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 21, 30]]

I can't figure out the next step from here: how to iterate over the expanded range and pair each value with the range delimiters.

Any ideas?

EDIT 5/8/2017 3:00PM

The local Python technique that works on a CSV input is:

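import csv
import gzip

# Python 2 code: the csv module expects binary file modes there.
# Raw strings keep the backslashes in the Windows paths literal.
csvfile_expanded = gzip.open(r'C:\output.csv', 'wb')
ranges_expanded = csv.writer(csvfile_expanded, delimiter=',', quotechar='"')
csvfile = open(r'C:\input.csv', 'rb')
ranges = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in ranges:
    # Write one output row per value in the inclusive range.
    for i in range(int(row[0]), int(row[1]) + 1):
        ranges_expanded.writerow([i, row[0], row[1]])
csvfile.close()
csvfile_expanded.close()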

The PySpark script I'm questioning begins with the CSV file already having been loaded into HDFS and cast as an RDD.

Try this:

c = b.flatMap(lambda x: ([y, x[0], x[1]] for y in xrange(x[0], x[1]+1)))

flatMap() flattens the result, so you get one output record per element of the range instead of one list per input pair. Note also the outer ( ) around the expression: it is a generator expression, which, together with xrange, avoids materialising the entire range in executor memory.

Note: xrange() exists only in Python 2. If you are running Python 3, use range() instead, which is lazy there as well.
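If you want to check the expansion logic without a cluster, here is a minimal local sketch in Python 3. It does not use Spark at all: expand and local_flat_map are hypothetical helper names introduced only for this illustration, and local_flat_map is a plain-Python stand-in that mimics what flatMap does (apply a function to each record, then flatten one level).

```python
from itertools import chain

def expand(pair):
    """Return a generator of [value, start, end] rows for an inclusive range."""
    start, end = pair
    # Generator expression: rows are produced lazily, one at a time.
    return ([y, start, end] for y in range(start, end + 1))

def local_flat_map(func, records):
    """Plain-Python stand-in for flatMap (an assumption, not Spark itself)."""
    return chain.from_iterable(func(r) for r in records)

a = [[1, 10], [11, 20], [21, 30]]
rows = list(local_flat_map(expand, a))
print(rows[:3])   # [[1, 1, 10], [2, 1, 10], [3, 1, 10]]
print(len(rows))  # 30
```

On the RDD from the question, passing the same function as b.flatMap(expand) should give the equivalent result, since it does the same thing as the lambda above.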
