
How to fetch the N(th) column from a CSV in Spark using only RDD, not DataFrame

I want to select a column from a CSV file in Spark using only RDD functions, not the DataFrame API. I have written the following code for it.

Some of the other columns have empty values.

rdd = spark.sparkContext.textFile(filename)
rdd_parc = rdd.map(lambda x: x.split(","))    # naive split on every comma
rdd1 = rdd_parc.map(lambda x: x[6] != "")     # yields booleans, not values

But this solution does not select column 6 of the CSV. split(",") also breaks the quoted field "SUMMERLEA, PARC" into two elements, so index 6 of each split list is not the column I want, and the last map produces booleans rather than the values themselves.

Sample data:

Lachine,H,0,,,,"SUMMERLEA, PARC",FR,2016,287826.19201900001,5033164.33193000030,-73.7169573579999,45.4382054
Lachine,H,0,,,,"SUMMERLEA, PARC",FR,2016,287826.19201900001,5033164.33193000030,-73.7169573579999,45.4382054

You can see that some values are empty. I want my answer to be

SUMMERLEA, PARC
SUMMERLEA, PARC
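
To see what goes wrong, here is a minimal sketch in plain Python (the variable names are illustrative, not from the original post) that splits the sample row above:

row = 'Lachine,H,0,,,,"SUMMERLEA, PARC",FR,2016,287826.19201900001,5033164.33193000030,-73.7169573579999,45.4382054'

parts = row.split(",")
print(len(parts))    # 14 -- one element too many: the quoted field was split in two
print(parts[6])      # '"SUMMERLEA' -- not the full column value

# For comparison, Python's csv module respects the quotes:
import csv
fields = next(csv.reader([row]))
print(len(fields))   # 13
print(fields[6])     # 'SUMMERLEA, PARC'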

Try converting to an RDD first by

# spark.read.csv parses quoted fields correctly, unlike a plain split(",")
rdd = spark.read.csv(filename, header=True).rdd

Then extract the column at index 6 by

rdd_parc = rdd.map(lambda x: x[6])

You will get the desired output.
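
Putting it together, a minimal end-to-end sketch (the SparkSession setup and file name are assumptions, and the file is assumed to have a header row since the answer passes header=True):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nth-column").getOrCreate()

filename = "parcs.csv"  # hypothetical path

# spark.read.csv parses quoted fields correctly, unlike a plain split(",")
rdd = spark.read.csv(filename, header=True).rdd

# Each element is a Row; indexing with [6] picks the column at index 6
rdd_parc = rdd.map(lambda x: x[6])

for value in rdd_parc.collect():
    print(value)  # SUMMERLEA, PARC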
