
spark textFile load file instead of lines

In Spark, we can use textFile to load a file into an RDD of lines and then operate on those lines, as follows:

val lines = sc.textFile("xxx")
val counts = lines.filter(line => lines.contains("a")).count()

However, in my situation, I would like to load the file into blocks instead, because the data in my files looks like the block below. Blocks are separated by empty lines in the files.

user: 111
book: 222
comments: like it!

Therefore, I hope the textFile function or some other solution can help me load the file as blocks, perhaps with something like the following:

val blocks = sc.textFile("xxx", 3 line)

Has anyone faced this situation before? Thanks.

I suggest you implement your own file reader function on top of HDFS. Look at the textFile function: it is built on top of the hadoopFile function and uses TextInputFormat:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
}

But this TextInputFormat can be customized via Hadoop properties, as described in this answer. In your case the delimiter could be:

conf.set("textinputformat.record.delimiter", "\n\n")
