I have an RDD with many columns (e.g., hundreds), and most of my operations are on columns; for example, I need to create many intermediate variables from different columns.
What is the most efficient way to do this?
I create an RDD from a CSV file:
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))
For example, this will give me an RDD like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
I need to create a new column (variable) calculatedvalue = 2ndCol + 19thCol and build a new RDD that includes it:
123, 523, 534, ..., 893, calculatedvalue
536, 98, 1623, ..., 98472, calculatedvalue
537, 89, 83640, ..., 9265, calculatedvalue
7297, 98364, 9, ..., 735, calculatedvalue
......
29, 94, 956, ..., 758, calculatedvalue
What is the best way of doing this?
A single map is enough:
rdd = sc.parallelize([(1,2,3,4), (4,5,6,7)])
# just replace my index with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],))
newrdd.collect() # [(1,2,3,4,6), (4,5,6,7,12)]
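Note that the toy example above uses tuples of integers, whereas line.split(",") in the question yields lists of strings, so the fields would need converting before they can be added. Here is a minimal sketch of that case, assuming the fields are numeric and that "2nd" and "19th" mean zero-based indexes 1 and 18. The helper name add_calculated and its parameters are hypothetical, and the demonstration runs on plain Python lists so it can be tried without a Spark context:

```python
def add_calculated(row, i=1, j=18):
    """Return a new row with float(row[i]) + float(row[j]) appended.

    Assumes row is a list of numeric strings, as produced by line.split(",").
    """
    return list(row) + [float(row[i]) + float(row[j])]


# Stand-in rows shaped like the output of line.split(","): 20 string fields each.
rows = [[str(n) for n in range(20)], [str(n * 2) for n in range(20)]]

result = [add_calculated(r) for r in rows]
print(result[0][-1])  # "1" + "18" converted to floats: 19.0
print(result[1][-1])  # "2" + "36" converted to floats: 38.0

# With Spark, assuming dataRDD from the question, the same function
# could be passed to map:
#     newRDD = dataRDD.map(add_calculated)
```

If several intermediate variables are needed, it is probably cheaper to compute them all inside one such function and append them together, since each map returns a new RDD and every row gets copied once per step.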