Reading / Writing Files from hdfs using python with subprocess, Pipe, Popen gives error
I am trying to read (open) and write files in HDFS from a Python script, but I am getting errors. Can someone tell me what is wrong here?
Code (full): sample.py
#!/usr/bin/python
from subprocess import Popen, PIPE
print "Before Loop"
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)
print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)
print "After Loop 2"
for line in cat.stdout:
    line += "Blah"
    print line
    print "Inside Loop"
    put.stdin.write(line)
cat.stdout.close()
cat.wait()
put.stdin.close()
put.wait()
When I execute:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./sample.py -mapper './sample.py' -input sample.txt -output fileRead
it runs fine, but I cannot find the file modifiedfile.txt that should have been created in HDFS.
When I execute:
hadoop fs -getmerge ./fileRead/ file.txt
in file.txt I get:
Before Loop
Before Loop
After Loop 1
After Loop 1
After Loop 2
After Loop 2
Can someone tell me what I am doing wrong? I don't think it is reading from sample.txt at all.
Try changing your put subprocess to take the stdout of your cat directly, by changing this:
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)
into this:
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
This way put inherits the read end of cat's pipe directly, so the data flows between the two hadoop commands without having to pass through your Python loop at all.
Full script:
#!/usr/bin/python
from subprocess import Popen, PIPE
print "Before Loop"
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)
print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
put.communicate()
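As a side note (not part of the original answer), when chaining pipes like this the Python subprocess documentation recommends closing the parent's copy of cat.stdout after starting put, so that cat can receive SIGPIPE if put exits early. A minimal sketch of that variant, using the same file names as the question:

#!/usr/bin/python
from subprocess import Popen, PIPE

# same pipeline; closing the parent's pipe handle lets "hadoop fs -cat"
# receive SIGPIPE if "hadoop fs -put" exits first
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"], stdout=PIPE)
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
cat.stdout.close()  # the parent no longer needs this end of the pipe
put.communicate()   # wait for put to finish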
Your sample.py is probably not a correct mapper. A mapper typically accepts its input on stdin and writes its results to stdout, e.g., blah.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:  # print("Blah\n".join(sys.stdin) + "Blah\n")
    line += "Blah"
    print(line)
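If the intent is to append "Blah" to the end of each line rather than on a line of its own (an assumption; the code above keeps each line's trailing newline, so "Blah" lands on the following line), a small variant strips the newline first and is invoked the same way:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # strip the trailing newline so "Blah" is appended to the same line
    print(line.rstrip("\n") + "Blah")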
Usage:
$ hadoop ... -file ./blah.py -mapper './blah.py' -input sample.txt -output fileRead
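Once the job completes, the output part files can be inspected directly in HDFS (assuming the default part-file naming), e.g.:

$ hadoop fs -cat ./fileRead/part-*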