
Extract only certain columns in Java Spark

I have a file with 10 columns. What's the most elegant way to extract only the first 3 columns, or specific columns?

For example, this is what my file looks like:

john,smith,84,male,kansas
john,doe,48,male,california
tim,jones,22,male,delaware

And I want to extract it into this:

[john, smith, kansas]
[john, doe, california]
[tim, jones, delaware]

What I have is this, but it doesn't specifically choose the columns that I want:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(lines.split(",")))
                 .foreach(person -> LOG.info(person.toString()));

I read the following two Stack Overflow posts, but I still can't decide how to do this.

EDIT: I ended up doing the following:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(lines.split(",")[0],
                                             lines.split(",")[1],
                                             lines.split(",")[4]))
         .foreach(person -> LOG.info(person.toString()));

Not the most elegant solution, but if you have a better way, please post it here. Thanks.

EDIT: Apologies, I just realized you were asking for a Java solution, but I've used Scala. Only the 3rd of my suggestions has an equivalent in Java (added at the bottom of the answer)... Spark is really much nicer in Scala though :-)

One way is to perform the split, then pattern match on the result to select the columns you want:

peopleRDD.cache().map(_.split(",") match { case Array(a,b,_,_,e) => List(a,b,e) }) 

Another (depending on which combinations of elements you want) is to use take and drop, using a val to avoid splitting repeatedly:

peopleRDD.cache().map{ line => 
    val parts = line.split(",") 
    parts.take(2) ++ parts.drop(4)
}

(You can add a toList after the split if you want a List rather than an Array for each result element in the RDD.)

In fact, the same approach can be used to simplify your original solution, e.g.:

peopleRDD.cache().map{ line => 
  val parts = line.split(",")
  List(parts(0), parts(1), parts(4))
}

In Java 8, you can probably do the equivalent, which is a slight improvement as we avoid calling split repeatedly - something like:

peopleRDD.cache().map(line -> {
  String[] parts = line.split(",");
  return Arrays.asList(parts[0], parts[1], parts[4]);
});
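
For reference, here's a minimal, self-contained sketch of that last Java version wired into a runnable program. The class name ExtractColumns, the local[*] master and the people.csv input path are placeholder assumptions, not part of the original question:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExtractColumns {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("extract-columns").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // placeholder path for the sample data shown in the question
      JavaRDD<String> peopleRDD = sc.textFile("people.csv");

      // split each line once, then keep only columns 0, 1 and 4
      JavaRDD<List<String>> selected = peopleRDD.map(line -> {
        String[] parts = line.split(",");
        return Arrays.asList(parts[0], parts[1], parts[4]);
      });

      // prints [john, smith, kansas] etc., one list per line
      selected.collect().forEach(person -> System.out.println(person));
    }
  }
}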
