
Writing Files on Hadoop Line by Line Using Python

I'm working with files whose lines have varying schemas, so I need to parse each line and make decisions based on it, which means I need to write files to HDFS line by line.

Is there a way to achieve that in Python?

You can use IOUtils from sc._gateway.jvm to stream from one Hadoop file (or a local file) to a file on Hadoop.

# Runs inside a PySpark session where `sc` is the SparkContext.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils

# Handle to the filesystem configured as the default (HDFS here).
fs = FileSystem.get(Configuration())

# Open the source file and create the destination file on HDFS.
f = fs.open(Path("/user/test/abc.txt"))
output_stream = fs.create(Path("/user/test/a1.txt"))

# Copy all bytes from input to output; this overload also closes both streams.
IOUtils.copyBytes(f, output_stream, Configuration())
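
The snippet above streams a whole file in one call. Since the question asks about writing line by line, here is a minimal sketch of doing that through the same sc._gateway.jvm gateway: read the input through a java.io.BufferedReader, decide per line, and write each kept line to the FSDataOutputStream returned by fs.create. The paths and the per-line condition are hypothetical placeholders, and it assumes py4j's standard conversion of a Python bytearray to a Java byte[].

# A minimal sketch, assuming a PySpark session with `sc` available.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

fs = FileSystem.get(Configuration())
in_stream = fs.open(Path("/user/test/abc.txt"))       # hypothetical input path
out_stream = fs.create(Path("/user/test/lines.txt"))  # hypothetical output path
try:
    # Wrap the HDFS input stream in Java readers to get line-oriented reads.
    reader = sc._gateway.jvm.java.io.BufferedReader(
        sc._gateway.jvm.java.io.InputStreamReader(in_stream))
    line = reader.readLine()
    while line is not None:
        if line:  # hypothetical per-line decision goes here
            # py4j converts a Python bytearray to a Java byte[].
            out_stream.write(bytearray(line + "\n", "utf-8"))
        line = reader.readLine()
finally:
    out_stream.close()
    in_stream.close()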
