
Column operation on Spark RDDs in Python

I have an RDD with many columns (e.g. hundreds), and most of my operations are on columns; for example, I need to create many intermediate variables from different columns.

What is the most efficient way to do this?

I create an RDD from a CSV file:

dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

For example, this will give me an RDD like the one below:

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
...... 
29, 94, 956, ..., 758 

I need to create a new column or variable, calculatedvalue = 2ndCol + 19thCol, and produce a new RDD:

123, 523, 534, ..., 893, calculatedvalue 
536, 98, 1623, ..., 98472, calculatedvalue 
537, 89, 83640, ..., 9265, calculatedvalue 
7297, 98364, 9, ..., 735, calculatedvalue 
...... 
29, 94, 956, ..., 758, calculatedvalue

What is the best way of doing this?

With just a map it would be enough:

rdd = sc.parallelize([(1,2,3,4), (4,5,6,7)])

# just replace my index with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],)) 

newrdd.collect() # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
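
Applied to the dataRDD from the question, the same idea looks roughly like this (a sketch, assuming the 2nd and 19th columns sit at indices 1 and 18; since split(",") leaves every field as a string, the two columns are cast to float before adding, and the name withCalculated is just illustrative):

# rows produced by split(",") are lists of strings, so append with list
# concatenation and cast the two columns before summing
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

withCalculated = dataRDD.map(
    lambda row: row + [float(row[1]) + float(row[18])]  # append calculatedvalue
)

withCalculated.take(5)  # each row now ends with the new calculatedvalue column

The only difference from the toy example above is that each row here is a list rather than a tuple, so the new value is appended with + [value] instead of + (value,).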
