
How to append to text files in HDFS using the Hadoop client in Scala?

I want to write text files to HDFS. The path to which a file has to be written is generated dynamically. If a file path (including the file name) is new, the file should be created and the text written to it. If the file path already exists, the string must be appended to the existing file.

I used the following code. File creation works fine, but I cannot append text to existing files.

import java.io.{BufferedWriter, OutputStreamWriter}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.json4s.JValue
import org.json4s.jackson.JsonMethods.{compact, render} // or org.json4s.native.JsonMethods

// `Time` and `generateFilePath` are defined elsewhere in the application.
def writeJson(uri: String, json: JValue, time: Time): Unit = {
  val path = new Path(generateFilePath(json, time))
  val conf = new Configuration()
  conf.set("fs.defaultFS", uri)
  conf.set("dfs.replication", "1")
  conf.set("dfs.support.append", "true")
  conf.set("dfs.client.block.write.replace-datanode-on-failure.enable", "false")

  val message = compact(render(json)) + "\n"
  try {
    val fileSystem = FileSystem.get(conf)
    if (fileSystem.exists(path)) {
      // Append to the existing file.
      println("File exists.")
      val outputStream = fileSystem.append(path)
      val bufferedWriter = new BufferedWriter(new OutputStreamWriter(outputStream))
      bufferedWriter.write(message)
      bufferedWriter.close()
      println("Appended to file in path : " + path)
    } else {
      // Create a new file, overwriting if one appears in the meantime.
      println("File does not exist.")
      val outputStream = fileSystem.create(path, true)
      val bufferedWriter = new BufferedWriter(new OutputStreamWriter(outputStream))
      bufferedWriter.write(message)
      bufferedWriter.close()
      println("Created file in path : " + path)
    }
  } catch {
    case e: Exception =>
      e.printStackTrace()
  }
}

Hadoop version: 2.7.0

Whenever an append has to be done, the following error is generated:

org.apache.hadoop.ipc.RemoteException(java.lang.ArrayIndexOutOfBoundsException)

I can see 3 possibilities:

  1. Probably the easiest is to use the external commands provided by the hdfs client sitting on your Hadoop cluster, see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html , or even the WebHDFS REST functionality: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html (sketches of both follow after this list).
  2. If you don't want to use hdfs commands, then you might use the hdfs API provided by the hadoop-hdfs library: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs/2.7.1
  3. Use Spark if you want a clean Scala solution, eg http://spark.apache.org/docs/latest/programming-guide.html or https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html (a short Spark sketch also follows below).
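
For option 1, here is a minimal sketch of shelling out to the hdfs CLI from Scala. The local and HDFS paths are hypothetical placeholders, and it assumes the `hdfs` binary is on the PATH of the machine running the code:

import scala.sys.process._

object HdfsCliAppend {
  // `hdfs dfs -appendToFile <localsrc> <dst>` appends the contents of a local
  // file to the file at <dst> on HDFS.
  def appendViaCli(localFile: String, hdfsPath: String): Int =
    Seq("hdfs", "dfs", "-appendToFile", localFile, hdfsPath).!

  def main(args: Array[String]): Unit = {
    // Placeholder paths for illustration only.
    val exitCode = appendViaCli("/tmp/message.json", "/data/2015/12/01/log.json")
    println(s"appendToFile exited with code $exitCode")
  }
}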
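
Still for option 1, WebHDFS append is a two-step request: the NameNode answers the first POST with a redirect to a DataNode, and the bytes are sent to that second URL. A rough sketch with plain `HttpURLConnection`, where the host, the default 50070 HTTP port, and the user name are my assumptions for a plain Hadoop 2.x setup:

import java.net.{HttpURLConnection, URL}

object WebHdfsAppend {
  // Two-step append as described in the WebHDFS documentation:
  //   1. POST .../webhdfs/v1/<path>?op=APPEND to the NameNode without following
  //      redirects, and read the DataNode URL from the Location header.
  //   2. POST the actual bytes to that DataNode URL.
  def append(namenode: String, path: String, data: Array[Byte], user: String): Int = {
    val nnUrl = new URL(s"http://$namenode:50070/webhdfs/v1$path?op=APPEND&user.name=$user")
    val nnConn = nnUrl.openConnection().asInstanceOf[HttpURLConnection]
    nnConn.setRequestMethod("POST")
    nnConn.setInstanceFollowRedirects(false)
    val dataNodeUrl = nnConn.getHeaderField("Location") // triggers the request
    nnConn.disconnect()

    val dnConn = new URL(dataNodeUrl).openConnection().asInstanceOf[HttpURLConnection]
    dnConn.setRequestMethod("POST")
    dnConn.setDoOutput(true)
    dnConn.getOutputStream.write(data)
    dnConn.getOutputStream.close()
    dnConn.getResponseCode // 200 on success
  }
}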
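
For option 3, a minimal Spark sketch, assuming a local run and a placeholder HDFS URI. Note that `saveAsTextFile` writes a new directory of part files rather than appending to a single existing file, so each batch needs its own output path:

import org.apache.spark.{SparkConf, SparkContext}

object SparkWriteToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("write-json-to-hdfs").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each element of the RDD becomes one line in the output part files.
    val messages = Seq("""{"event":"a"}""", """{"event":"b"}""")
    sc.parallelize(messages)
      .saveAsTextFile("hdfs://localhost:9000/data/2015/12/01") // placeholder URI and path

    sc.stop()
  }
}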
