
Writing Files on Hadoop Line by Line Using Python

I'm working with files whose lines have varying schemas, so I need to parse each line and make decisions based on it, which means I need to write files to HDFS line by line.

Is there a way to achieve that in Python?

You can use IOUtils from sc._gateway.jvm to stream from one Hadoop file (or a local file) to a file on Hadoop.

# Runs inside a PySpark session where `sc` is the SparkContext.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils

# Handle to the filesystem configured as the default (HDFS here).
fs = FileSystem.get(Configuration())

# Open the source file and create the destination file on HDFS.
f = fs.open(Path("/user/test/abc.txt"))
output_stream = fs.create(Path("/user/test/a1.txt"))

# Copy all bytes from input to output; this overload also closes both streams.
IOUtils.copyBytes(f, output_stream, Configuration())
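
The snippet above streams a whole file in one call. Since the question asks about writing line by line, here is a minimal sketch of doing that through the same sc._gateway.jvm gateway: read the input through a java.io.BufferedReader, decide per line, and write each kept line to the FSDataOutputStream returned by fs.create. The paths and the per-line condition are hypothetical placeholders, and it assumes py4j's standard conversion of a Python bytearray to a Java byte[].

# A minimal sketch, assuming a PySpark session with `sc` available.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

fs = FileSystem.get(Configuration())
in_stream = fs.open(Path("/user/test/abc.txt"))       # hypothetical input path
out_stream = fs.create(Path("/user/test/lines.txt"))  # hypothetical output path
try:
    # Wrap the HDFS input stream in Java readers to get line-oriented reads.
    reader = sc._gateway.jvm.java.io.BufferedReader(
        sc._gateway.jvm.java.io.InputStreamReader(in_stream))
    line = reader.readLine()
    while line is not None:
        if line:  # hypothetical per-line decision goes here
            # py4j converts a Python bytearray to a Java byte[].
            out_stream.write(bytearray(line + "\n", "utf-8"))
        line = reader.readLine()
finally:
    out_stream.close()
    in_stream.close()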
