
Extract only certain columns in Java Spark

I have a file with 10 columns. What's the most elegant way to extract only the first 3 columns, or specific columns?

For example, this is what my file looks like:

john,smith,84,male,kansas
john,doe,48,male,california
tim,jones,22,male,delaware

And I want to extract it into this:

[john, smith, kansas]
[john, doe, california]
[tim, jones, delaware]

What I have is this, but it doesn't specifically choose the columns that I want:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(lines.split(",")))
                 .foreach(person -> LOG.info(person.toString()));

I read the following two Stack Overflow posts, but I still can't decide how to do this.

EDIT: I ended up doing the following:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(lines.split(",")[0],
                                             lines.split(",")[1],
                                             lines.split(",")[4]))
         .foreach(person -> LOG.info(person.toString()));

Not the most elegant solution, but if you have a better way, please post it here. Thanks.

EDIT: Apologies, I just realized you were asking for a Java solution, but I've used Scala. Only the 3rd of my suggestions has an equivalent in Java (added at the bottom of the answer)... Spark is really much nicer in Scala though :-)

One way is to perform the split, then pattern match on the result to select the columns you want:

peopleRDD.cache().map(_.split(",") match { case Array(a,b,_,_,e) => List(a,b,e) }) 

Another (depending on which combinations of elements you want) is to use take and drop, using a val to avoid splitting repeatedly:

peopleRDD.cache().map{ line => 
    val parts = line.split(",") 
    parts.take(2) ++ parts.drop(4)
}

(You can add a toList after the split if you want a List rather than an Array for each result element in the RDD.)

In fact, the same approach can be used to simplify your original solution, e.g.:

peopleRDD.cache().map{ line => 
  val parts = line.split(",")
  List(parts(0), parts(1), parts(4))
}

In Java 8, you can probably do the equivalent, which is a slight improvement as we avoid calling split repeatedly - something like:

peopleRDD.cache().map(line -> {
  String[] parts = line.split(",");
  return Arrays.asList(parts[0], parts[1], parts[4]);
});
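
For reference, here's a minimal, self-contained sketch of that last Java version wired into a runnable program. The class name ExtractColumns, the local[*] master and the people.csv input path are placeholder assumptions, not part of the original question:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExtractColumns {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("extract-columns").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // placeholder path for the sample data shown in the question
      JavaRDD<String> peopleRDD = sc.textFile("people.csv");

      // split each line once, then keep only columns 0, 1 and 4
      JavaRDD<List<String>> selected = peopleRDD.map(line -> {
        String[] parts = line.split(",");
        return Arrays.asList(parts[0], parts[1], parts[4]);
      });

      // prints [john, smith, kansas] etc., one list per line
      selected.collect().forEach(person -> System.out.println(person));
    }
  }
}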
