
Spark (Java): Get Filename/Content pairs from a list of file names

I'm currently working on a Spark project in Java, and I ran into a problem that I'm not sure how to solve. I'm unfamiliar with the various join/union methods, so I figure one of those is the answer.

I currently want to input a list of filenames (or paths) and get back a JavaPairRDD object consisting of pairs of filenames/paths and text content.

I know I can use standard Java to get the text content and just input a List of filename-content tuples, but I feel there has to be a "Spark" way of doing this.

I also know there is a wholeTextFiles method, but that only grabs everything in a directory, and I'm not sure that will be the format I get (I might use Amazon S3, for example, and I'm not sure I can make any assumptions about a directory structure there).
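For reference, my understanding of what wholeTextFiles would give me, assuming my inputs did all sit under one directory (the path below is a placeholder):

import org.apache.spark.api.java.JavaPairRDD;

// Assuming `sc` is an existing JavaSparkContext; wholeTextFiles yields one
// (path, entire-file-content) pair per file under the given directory.
JavaPairRDD<String, String> byFile = sc.wholeTextFiles("s3a://my-bucket/docs/");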

Furthermore, I am aware that I can parallelize each file separately in a loop, but how do I join these back together?

List<String> docs = ...; // List<String> of document filenames
JavaRDD<String> documents = sc.parallelize(docs);
JavaPairRDD<String, String> pairs = documents.???
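One Spark-side idea would be to do the read inside a mapToPair call, so each executor fetches its own files. A rough sketch, assuming every worker can actually reach these paths (local disk, NFS, and so on); the variable names are mine:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Read each file on the executor that owns the partition. PairFunction.call
// is declared to throw Exception, so the checked IOException from
// readAllBytes is allowed inside the lambda.
JavaPairRDD<String, String> pairs = documents.mapToPair(path ->
        new Tuple2<>(path, new String(Files.readAllBytes(Paths.get(path)),
                                      StandardCharsets.UTF_8)));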

Thanks in advance.

Edit: I am tempted to create a JavaPairRDD of <Filename, JavaRDD<String> filecontents>, but I'm not sure how to proceed from there. I'm also wary of this because it just sounds wrong (i.e. am I overriding the parallelism somehow?).

I know I could have Spark create a JavaRDD object from each document, convert them to List objects, and then feed them in as tuples, but is there a Spark-specific way of doing this?

Edit 2: Apparently, I misunderstood how text files are loaded into a JavaRDD object. They don't load the entire string in as one object; they break it up by line. This makes me rethink my approach, as I do need things to break across lines for various reasons. So I'm thinking I have to go with the "hackish" approach of using Spark to load the file and then converting it back into a List. However, I'll leave the question up in case somebody has a clever solution for this.
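For concreteness, the "hackish" route I mean would look roughly like this; it pulls each file's lines back to the driver, and the rejoin assumes the exact line terminators don't matter:

import java.util.List;

// textFile() splits the file by line, so collect the lines back to the
// driver and rejoin them into one string. This loses the original line
// endings and funnels everything through the driver, defeating most of
// the parallelism.
List<String> lines = sc.textFile(path).collect();
String content = String.join("\n", lines);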

I'm going to switch to wholeTextFiles() instead, as I am running into more and more issues trying to get the data into the proper format.

Namely, I don't actually want the files to be broken up into lines; I want to break them up in a special way myself.
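A sketch of what that could look like; the directory and the "###" record delimiter are made up for illustration:

import org.apache.spark.api.java.JavaPairRDD;

// Each pair is (path, full file content), so the content can be split any
// way I like while the filename stays attached to the result.
JavaPairRDD<String, String> files = sc.wholeTextFiles("s3a://my-bucket/docs/");
JavaPairRDD<String, String[]> records = files.mapValues(content -> content.split("###"));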

If you go the wholeTextFiles() way, wouldn't it read the whole data at once and then parallelize it over your standalone Spark cluster / workers? Your driver code would need to run with more memory.
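If driver memory does turn out to be the limit, it can be raised at submit time. A minimal example, where the 4g value, class name, and jar name are placeholders:

spark-submit --driver-memory 4g --class com.example.MyApp my-app.jar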

In Scala, you can get the file name of an RDD created from a Spark stream or from the SparkContext like this:

import org.apache.spark.rdd.RDD

object GetFileNameFromStream extends java.io.Serializable {
  // toDebugString returns the RDD's lineage, which includes the input
  // path for RDDs created from files.
  def getFileName(file: RDD[String]): String = {
    file.toDebugString
  }
}
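In Java the same trick works, since JavaRDD exposes the method too; the path below is a placeholder:

// toDebugString() shows the RDD's lineage, including the source path for
// file-based RDDs.
String lineage = sc.textFile("s3a://my-bucket/docs/part-0.txt").toDebugString();
System.out.println(lineage);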
