
Spark (Java): Get Filename/Content pairs from a list of file names

I'm currently working on a Spark project in Java, and I ran into a problem that I'm not sure how to solve. I'm unfamiliar with the various join/union methods, so I figure one of those is the answer.

I currently want to input a list of filenames (or paths) and get back a JavaPairRDD object consisting of pairs of filenames/paths and text content.

I know I can use standard Java to get the text content and just input a List of filename-content tuples, but I feel there has to be a "Spark" way of doing this.

I also know there is a wholeTextFiles method, but that only grabs everything in a directory, and I'm not sure that will be the format I get (I might use Amazon S3, for example, and I'm not sure I can make any assumptions about a directory structure there).
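For reference, my understanding of what wholeTextFiles would give me, assuming my inputs did all sit under one directory (the path below is a placeholder):

import org.apache.spark.api.java.JavaPairRDD;

// Assuming `sc` is an existing JavaSparkContext; wholeTextFiles yields one
// (path, entire-file-content) pair per file under the given directory.
JavaPairRDD<String, String> byFile = sc.wholeTextFiles("s3a://my-bucket/docs/");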

Furthermore, I am aware that I can parallelize each file separately in a loop, but how do I join these back together?

List<String> docs = ...; // List<String> of document filenames
JavaRDD<String> documents = sc.parallelize(docs);
JavaPairRDD<String, String> pairs = documents.???
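One Spark-side idea would be to do the read inside a mapToPair call, so each executor fetches its own files. A rough sketch, assuming every worker can actually reach these paths (local disk, NFS, and so on); the variable names are mine:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Read each file on the executor that owns the partition. PairFunction.call
// is declared to throw Exception, so the checked IOException from
// readAllBytes is allowed inside the lambda.
JavaPairRDD<String, String> pairs = documents.mapToPair(path ->
        new Tuple2<>(path, new String(Files.readAllBytes(Paths.get(path)),
                                      StandardCharsets.UTF_8)));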

Thanks in advance.

Edit: I am tempted to create a JavaPairRDD of <Filename, JavaRDD<String> filecontents>, but I'm not sure how to proceed from there. I'm also wary of this because it just sounds wrong (i.e. am I overriding the parallelism somehow?).

I know I could have Spark create a JavaRDD object from each document, convert them to List objects, and then feed them in as tuples, but is there a Spark-specific way of doing this?

Edit 2: Apparently, I misunderstood how text files are loaded into a JavaRDD object. They don't load the entire string in as one object; they break it up by line. This makes me rethink my approach, as I do need things to break across lines for various reasons. So I'm thinking I have to go with the "hackish" approach of using Spark to load the file and then converting it back into a List. However, I'll leave the question up in case somebody has a clever solution for this.
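For concreteness, the "hackish" route I mean would look roughly like this; it pulls each file's lines back to the driver, and the rejoin assumes the exact line terminators don't matter:

import java.util.List;

// textFile() splits the file by line, so collect the lines back to the
// driver and rejoin them into one string. This loses the original line
// endings and funnels everything through the driver, defeating most of
// the parallelism.
List<String> lines = sc.textFile(path).collect();
String content = String.join("\n", lines);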

I'm going to switch to wholeTextFiles() instead, as I am running into more and more issues trying to get the data into the proper format.

Namely, I don't actually want the files to be broken up into lines; I want to break them up in a special way myself.
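A sketch of what that could look like; the directory and the "###" record delimiter are made up for illustration:

import org.apache.spark.api.java.JavaPairRDD;

// Each pair is (path, full file content), so the content can be split any
// way I like while the filename stays attached to the result.
JavaPairRDD<String, String> files = sc.wholeTextFiles("s3a://my-bucket/docs/");
JavaPairRDD<String, String[]> records = files.mapValues(content -> content.split("###"));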

If you go the wholeTextFiles() way, wouldn't it read the whole data at once and then parallelize it over your standalone Spark cluster / workers? Your driver code would need to run with more memory.
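If driver memory does turn out to be the limit, it can be raised at submit time. A minimal example, where the 4g value, class name, and jar name are placeholders:

spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar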

In Scala, you can get the file name of an RDD created from a Spark stream or from the SparkContext like this:

import org.apache.spark.rdd.RDD

object GetFileNameFromStream extends java.io.Serializable {
  // toDebugString returns the RDD's lineage, which includes the input
  // path for RDDs created from files.
  def getFileName(file: RDD[String]): String = {
    file.toDebugString
  }
}
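In Java the same trick works, since JavaRDD exposes the method too; the path below is a placeholder:

// toDebugString() shows the RDD's lineage, including the source path for
// file-based RDDs.
String lineage = sc.textFile("s3a://my-bucket/docs/part-0.txt").toDebugString();
System.out.println(lineage);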
