Column operation on Spark RDDs in Python

I have an RDD with MANY columns (e.g. hundreds), and most of my operations are on columns; for example, I need to create many intermediate variables from different columns.

What is the most efficient way to do this?

I create an RDD from a CSV file:

dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

For example, this will give me an RDD like below:

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
...... 
29, 94, 956, ..., 758 
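
Since line.split(",") returns strings, the fields presumably need to be cast to numbers before any arithmetic, e.g.:

# same placeholder path as above; cast every field to float so column arithmetic works
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: [float(v) for v in line.split(",")])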

I need to create a new column (or variable), calculatedvalue = 2ndCol + 19thCol, and produce a new RDD:

123, 523, 534, ..., 893, calculatedvalue 
536, 98, 1623, ..., 98472, calculatedvalue 
537, 89, 83640, ..., 9265, calculatedvalue 
7297, 98364, 9, ..., 735, calculatedvalue 
...... 
29, 94, 956, ..., 758, calculatedvalue

What is the best way of doing this?

Just a map would be enough:

rdd = sc.parallelize([(1,2,3,4), (4,5,6,7)])

# just replace my indices with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],))

newrdd.collect() # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
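
Applied to your dataRDD it would look something like this (just a sketch, assuming the 2nd and 19th columns sit at zero-based indices 1 and 18, and that the fields have already been converted to numbers):

# append the sum of the 2nd and 19th columns (indices 1 and 18) as a new last column;
# if the fields are still strings, + would concatenate them instead of adding
newDataRDD = dataRDD.map(lambda row: row + [row[1] + row[18]])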
