
How to select certain columns from a csv file in pyspark based on the list of index of columns and then determine their distinct lengths

I have this code in pyspark, where I pass the index values of the columns as a list. Now I want to select the columns from the csv file for those corresponding indexes:

import sys

from pyspark import SparkContext

def ml_test(input_col_index):
    sc = SparkContext(master='local', appName='test')

    # number every line, then strip the row number off again
    inputData = sc.textFile('hdfs://localhost:/dir1') \
        .zipWithIndex() \
        .filter(lambda line_rownum: line_rownum[1] >= 0) \
        .map(lambda line_rownum: line_rownum[0])

if __name__ == '__main__':
    input_col_index = sys.argv[1]  # For example - ['1','2','3','4']
    ml_test(input_col_index)

Now, if I had a static or hardcoded set of columns to select from the csv file above, I could do that, but here the indexes of the desired columns are passed as a parameter. I also have to calculate the distinct length of each of the selected columns, which I know can be done for a single column with colmn_1 = input_data.map(lambda x: x[0]).distinct().collect(), but how do I do this for a set of columns that is not known in advance and is determined by the index list passed at runtime?

NOTE: I have to calculate the distinct length of the columns because I have to pass that length as a parameter to PySpark's RandomForest algorithm.

You can use list comprehensions.

# given a list of indices...
indices = [int(i) for i in input_col_index]

# select only those columns from each row
rdd = rdd.map(lambda x: [x[idx] for idx in indices])

# for all rows, choose longest columns
longest_per_column = rdd.reduce(
    lambda x, y: [max(a, b, key=len) for a, b in zip(x, y)])

# get lengths of longest columns
print([len(x) for x in longest_per_column])

The reduce function takes two lists, loops over their values in parallel, and builds a new list by keeping, for each column, whichever value was longer.
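As a quick illustration of that reduce step, here is a plain-Python sketch with two made-up rows (the values are placeholders, not from the real data):

row_a = ['abc', '12', 'xyz']     # hypothetical first row
row_b = ['ab', '1234', 'pqrs']   # hypothetical second row

# keep, per column, whichever string is longer
longest = [max(a, b, key=len) for a, b in zip(row_a, row_b)]
print(longest)                    # ['abc', '1234', 'pqrs']
print([len(x) for x in longest])  # [3, 4, 4]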

UPDATE: To pass the lengths into the RandomForest constructor, you can do something like this:

from pyspark.mllib.tree import RandomForest

column_lengths = [len(x) for x in longest_per_column]

model = RandomForest.trainRegressor(
    categoricalFeaturesInfo=dict(enumerate(column_lengths)),
    maxBins=max(column_lengths),
    # ...
)

I would recommend this simple solution.

Assume we have the following CSV file structure [1]:

"TRIP_ID","CALL_TYPE","ORIGIN_CALL","ORIGIN_STAND","TAXI_ID","TIMESTAMP","DAY_TYPE","MISSING_DATA","POLYLINE"
"1372636858620000589","C","","","20000589","1372636858","A","False","[[-8.618643,41.141412],[-8.618499,41.141376]]"

And you want to select only the columns CALL_TYPE, TIMESTAMP and POLYLINE. First you need to format your data, then just split it and select the columns you need. It's simple:

# assumes an existing SparkContext, e.g. the `sc` provided by the pyspark shell
raw_data = sc.textFile("data.csv")

# mark empty fields as "NA", turn the "," field separators into newlines and drop
# the quotes, then split into fields and keep CALL_TYPE, TIMESTAMP and POLYLINE
callType_days = raw_data.map(lambda x: x.replace('""','"NA"').replace('","', '\n').replace('"','')) \
    .map(lambda x: x.split()) \
    .map(lambda x: (x[1],x[5],x[8]))

callType_days.take(2)

The result will be:

[(u'CALL_TYPE', u'TIMESTAMP', u'POLYLINE'),
 (u'C',
  u'1372636858',
  u'[[-8.618643,41.141412],[-8.618499,41.141376]]')]

Afterwards, it's really easy to work with structured data like this.
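For example, to get back to the original question, the number of distinct values in each selected column (skipping the header tuple) could be computed with something like the following minimal sketch; it assumes the callType_days RDD built above:

# drop the header tuple, then count the distinct values in each column
header = callType_days.first()
data = callType_days.filter(lambda row: row != header)

distinct_counts = [data.map(lambda row: row[i]).distinct().count()
                   for i in range(len(header))]
print(distinct_counts)  # one distinct count per selected column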

[1]: Taxi Service Trajectory - Prediction Challenge, ECML PKDD 2015 Data Set
