
Specify subset of elements in Spark RDD (Scala)

My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers, as in .map(x => (x(0),x(3),x(6)...))?

This is what I've tried so far (with success):

val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))

However, I need more than a few columns, and would like to avoid hard-coding them.

This is what I've tried so far (which I think would be better, but it has failed):

// Attempt 1
val colIndices = List(0, 3, 6, 10, 13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))

// Error output from attempt 1:
<console>:28: error: type mismatch;
 found   : List[Int]
 required: Int
       val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))

// Attempt 2
colIndices map peopleTups.lift

// Attempt 3
colIndices map peopleTups

// Attempt 4
colIndices.map(index => peopleTups.apply(index))

I found this question and tried it, but because I'm working with an RDD rather than an array, it didn't work: "How can I select a non-sequential subset elements from an array using Scala and Spark?"

You should map over the RDD instead of the indices, and apply the indices to each row inside the map.

val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))

val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)

val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]

selectedColumns.collect() 
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
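If the column names are known (for example from a header line), the index list can be derived from them instead of being typed by hand. Here is a minimal sketch of that idea, run on a local collection so it needs no cluster. The header, the wanted names, and the sample rows are all made up for illustration; they are not from the question.

```scala
// Assumed for illustration: this header and these column names are hypothetical.
val header = Array("id", "name", "age", "city", "zip")
val wanted = List("id", "city", "zip")

// Resolve each wanted name to its position once, before touching the data.
val indices: List[Int] = wanted.map { name =>
  val i = header.indexOf(name)
  require(i >= 0, s"unknown column: $name")
  i
}

// Hypothetical sample rows standing in for the already-split RDD records.
val rows = List(
  Array("1", "Ann", "30", "Oslo", "0150"),
  Array("2", "Bob", "41", "Rome", "00100")
)

// Per-row selection; with an RDD this would be rdd.map(row => indices.map(i => row(i))).
val selected = rows.map(row => indices.map(i => row(i)))
```

Only the small `indices` list is captured by the closure, so the name lookup is not repeated for every row.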

What about this?

val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices =  List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect

This should give:

res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))
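One thing to watch with `split(",")` on real data: it drops trailing empty fields, so a row can end up shorter than expected and `ss(_)` then throws on a high index. The sketch below shows the difference, assuming you want to keep empty fields and get `None` for a missing column. `selectSafe` is a hypothetical helper name, not a Spark or Scala API.

```scala
val indices = List(0, 3, 4)
val line = "a,b,c,,"

// The default split drops trailing empty fields, so only 3 elements remain.
val short = line.split(",")
// A negative limit keeps trailing empty fields, giving all 5 elements.
val full = line.split(",", -1)

// Hypothetical helper: returns None instead of throwing when an index is out of range.
def selectSafe(row: Array[String], idx: Seq[Int]): Seq[Option[String]] =
  idx.map(i => if (i >= 0 && i < row.length) Some(row(i)) else None)

val fromShort = selectSafe(short, indices)
val fromFull = selectSafe(full, indices)
```

With an RDD the same helper could be applied as `data.map(_.split(",", -1)).map(ss => selectSafe(ss, indices))`.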

