Extract only certain columns in Java Spark
I have a file with 10 columns. What's the most elegant way to extract only the first 3 columns, or specific columns?
For example, this is how my file looks:
john,smith,84,male,kansas
john,doe,48,male,california
tim,jones,22,male,delaware
And I want to extract it into this:
[john, smith, kansas]
[john, doe, california]
[tim, jones, delaware]
What I have is this, but it doesn't specifically choose the columns that I want:
JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(line -> Arrays.asList(line.split(",")))
         .foreach(person -> LOG.info(person.toString()));
I read the following two Stack Overflow posts, but I still can't decide how to do this.
EDIT: I ended up doing the following:
JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(line -> Arrays.asList(line.split(",")[0],
                                            line.split(",")[1],
                                            line.split(",")[4]))
         .foreach(person -> LOG.info(person.toString()));
Not the most elegant solution, but if you have a better way, please post it here. Thanks.
EDIT: Apologies, I just realized you were asking for a Java solution, but I've used Scala. Only the 3rd of my suggestions has an equivalent in Java (added at the bottom of the answer)... Spark is really much nicer in Scala though :-)
One way is to perform the split, then pattern match on the result to select the columns you want:
peopleRDD.cache().map(_.split(",") match { case Array(a,b,_,_,e) => List(a,b,e) })
Another (depending on which combinations of elements you want) is to use take and drop, using a val to avoid splitting repeatedly:
peopleRDD.cache().map{ line =>
val parts = line.split(",")
parts.take(2) ++ parts.drop(4)
}
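For what it's worth, the take/drop slicing above can be approximated in plain Java with subList - a minimal sketch outside Spark, just to show the slicing itself (the class name is mine, not from any API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TakeDropDemo {
    public static void main(String[] args) {
        String line = "john,smith,84,male,kansas";
        String[] parts = line.split(",");
        // Scala's parts.take(2) ++ parts.drop(4), expressed with subList:
        List<String> result = new ArrayList<>(Arrays.asList(parts).subList(0, 2));
        result.addAll(Arrays.asList(parts).subList(4, parts.length));
        System.out.println(result); // [john, smith, kansas]
    }
}
```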
(You can add a toList after the split if you want a List rather than an Array for each result element in the RDD.)
In fact the same approach can be used to simplify your original solution, e.g.:
peopleRDD.cache().map{ line =>
val parts = line.split(",")
  List(parts(0), parts(1), parts(4))
}
In Java 8, you can probably do the equivalent, which is a slight improvement as we avoid calling split repeatedly - something like:
peopleRDD.cache().map(line -> {
    String[] parts = line.split(",");
    return Arrays.asList(parts[0], parts[1], parts[4]);
});
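If you'd rather not hard-code each index, a small helper can pick arbitrary columns. This is a sketch assuming Java 8 streams; selectColumns is a hypothetical name, not part of any Spark API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ColumnSelector {
    // Hypothetical helper: pick arbitrary 0-based column indices from a CSV line.
    static List<String> selectColumns(String line, int... indices) {
        String[] parts = line.split(",");
        return IntStream.of(indices)
                        .mapToObj(i -> parts[i])
                        .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(selectColumns("john,smith,84,male,kansas", 0, 1, 4));
        // prints [john, smith, kansas]
    }
}
```

In the Spark job this would be used as peopleRDD.cache().map(line -> ColumnSelector.selectColumns(line, 0, 1, 4)), keeping the split in one place.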