
Column operation on Spark RDDs in Python

I have an RDD with many columns (e.g. hundreds), and most of my operations are on columns; for example, I need to create many intermediate variables from different columns.

What is the most efficient way to do this?

I create an RDD from a CSV file:

dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

For example, this will give me an RDD like the one below:

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
...... 
29, 94, 956, ..., 758 

I need to create a new column or variable, calculatedvalue = 2ndCol + 19thCol, and produce a new RDD:

123, 523, 534, ..., 893, calculatedvalue 
536, 98, 1623, ..., 98472, calculatedvalue 
537, 89, 83640, ..., 9265, calculatedvalue 
7297, 98364, 9, ..., 735, calculatedvalue 
...... 
29, 94, 956, ..., 758, calculatedvalue

What is the best way of doing this?

With just a map it would be enough:

rdd = sc.parallelize([(1,2,3,4), (4,5,6,7)])

# just replace my index with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],)) 

newrdd.collect() # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
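
Applied to the dataRDD from the question, the same idea looks roughly like this (a sketch, assuming the 2nd and 19th columns sit at indices 1 and 18; since split(",") leaves every field as a string, the two columns are cast to float before adding, and the name withCalculated is just illustrative):

# rows produced by split(",") are lists of strings, so append with list
# concatenation and cast the two columns before summing
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

withCalculated = dataRDD.map(
    lambda row: row + [float(row[1]) + float(row[18])]  # append calculatedvalue
)

withCalculated.take(5)  # each row now ends with the new calculatedvalue column

The only difference from the toy example above is that each row here is a list rather than a tuple, so the new value is appended with + [value] instead of + (value,).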
