Column operation on Spark RDDs in Python

I have an RDD with MANY columns (e.g. hundreds), and most of my operations are on columns; for example, I need to create many intermediate variables from different columns.

What is the most efficient way to do this?

I create an RDD from a CSV file:

dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

For example, this will give me an RDD like below:

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
...... 
29, 94, 956, ..., 758 
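
Since line.split(",") returns strings, the fields presumably need to be cast to numbers before any arithmetic, e.g.:

# same placeholder path as above; cast every field to float so column arithmetic works
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: [float(v) for v in line.split(",")])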

I need to create a new column (or variable), calculatedvalue = 2ndCol + 19thCol, and produce a new RDD:

123, 523, 534, ..., 893, calculatedvalue 
536, 98, 1623, ..., 98472, calculatedvalue 
537, 89, 83640, ..., 9265, calculatedvalue 
7297, 98364, 9, ..., 735, calculatedvalue 
...... 
29, 94, 956, ..., 758, calculatedvalue

What is the best way of doing this?

Just a map would be enough:

rdd = sc.parallelize([(1,2,3,4), (4,5,6,7)])

# just replace my indices with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],))

newrdd.collect() # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
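
Applied to your dataRDD it would look something like this (just a sketch, assuming the 2nd and 19th columns sit at zero-based indices 1 and 18, and that the fields have already been converted to numbers):

# append the sum of the 2nd and 19th columns (indices 1 and 18) as a new last column;
# if the fields are still strings, + would concatenate them instead of adding
newDataRDD = dataRDD.map(lambda row: row + [row[1] + row[18]])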
