
Fast file writing in Scala?

I have a Scala program that iterates through a graph and writes data out line by line to a text file. It is essentially an edge list file for use with GraphX.

The biggest slowdown is actually creating this text file; we're talking maybe a million records written to it. Is there a way to parallelize this task, or to make it faster in some other way, for example by holding the data in memory first?

More info: I am using a Hadoop cluster to iterate through the graph, and here is the code I currently use to create the text file and write it to HDFS:

    val fileName = dbPropertiesFile + "-edgelist-" + System.currentTimeMillis()
    val path = new Path("/home/user/graph/" + fileName + ".txt")
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://host001:8020")

    val fs = FileSystem.newInstance(conf)
    val os = fs.create(path)
    while (edges.hasNext) {
      val current = edges.next()
      os.write(current.inVertex().id().toString.getBytes())
      os.write(" ".getBytes())
      os.write(current.outVertex().id().toString.getBytes())
      os.write("\n".toString.getBytes())
    }
    fs.close()

Writing files to HDFS is never fast. Your tags suggest that you are already using Spark anyway, so you might as well take advantage of it.

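    // Assumes `spark` is your SparkSession; toDF needs its implicits in scope.
    import spark.implicits._

    spark.sparkContext
      .makeRDD(edges.toStream, 20)  // makeRDD takes (data, numSlices)
      .map(e => e.inVertex.id.toString -> e.outVertex.id.toString)
      .toDF
      .write
      .option("delimiter", " ")
      .csv(path.toString)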

This splits your input into 20 partitions (you can control that number with the numeric argument to makeRDD above) and writes them in parallel to 20 separate part files in HDFS, which together make up your resulting file.
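To make the "one chunk per partition" idea concrete without a cluster, here is a minimal plain-Scala sketch that mimics it on the local file system. It is not what Spark does internally, only an illustration: `PartFiles`, `writeParts`, and the `part-00000` naming are hypothetical names chosen for this example, and the edges are assumed to be already reduced to pairs of `Long` ids.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PartFiles {
  // Hypothetical helper: writes `edges` as "in out" lines into at most
  // `numParts` part files under `dir`, one Future per part file.
  // Returns the files written, in chunk order.
  def writeParts(edges: Seq[(Long, Long)], dir: Path, numParts: Int): Seq[Path] = {
    Files.createDirectories(dir)
    // Ceiling division, so that no more than numParts chunks are produced.
    val chunkSize = math.max(1, (edges.size + numParts - 1) / numParts)
    val writes = edges.grouped(chunkSize).zipWithIndex.toVector.map { case (chunk, i) =>
      Future {
        val file = dir.resolve(f"part-$i%05d")
        // A buffered writer avoids one small write call per field.
        val writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)
        try chunk.foreach { case (inId, outId) => writer.write(s"$inId $outId\n") }
        finally writer.close()
        file
      }
    }
    // Block until every part file has been written and closed.
    Await.result(Future.sequence(writes), Duration.Inf)
  }
}
```

Calling `PartFiles.writeParts(edgePairs, Files.createTempDirectory("edgelist"), 20)` would leave up to 20 part files in that directory, which is the same layout you should expect under the output path of the Spark job: a directory of part files rather than a single `.txt` file.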
