How to save a file in Hadoop with Python
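Question:
I am just starting to learn Hadoop, but I need to save a lot of files into it using Python. I can't figure out what I am doing wrong. Can anyone help me with this?
Below is my code. I think the HDFS path is correct, because I didn't change it in the settings during installation. pythonfile.txt is on my desktop (and so is the Python code, which I run from the command line).
Code: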
import hadoopy
import os

hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())])

main()
Output:
When I run the code above, all I get is an entry named python itself.
iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r-- 1 Brian supergroup 236 2014-10-28 11:30 /python
This is a pretty typical task for the subprocess module. The solution looks like this:
from subprocess import PIPE, Popen
# replace both placeholders with real paths
put = Popen(["hadoop", "fs", "-put", "<path/to/file>", "<path/to/hdfs/file>"], stdin=PIPE, bufsize=-1)
put.communicate()
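Note that communicate() only waits for the command to finish; it does not raise an error if the put fails. Below is a minimal sketch of one way to surface failures, assuming Python 3. The helper name hdfs_put and its base_cmd parameter are hypothetical, introduced only for illustration, and are not part of the original answer.

```python
import subprocess

def hdfs_put(local_path, hdfs_path, base_cmd=("hadoop", "fs", "-put")):
    """Copy local_path to hdfs_path; raise if the command exits non-zero.

    hdfs_put and base_cmd are hypothetical names used for this sketch.
    """
    cmd = list(base_cmd) + [local_path, hdfs_path]
    # Capture stdout/stderr so the error text can go into the exception.
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode != 0:
        raise RuntimeError(
            "command failed ({}): {}".format(
                proc.returncode, proc.stderr.decode(errors="replace")
            )
        )
    return proc.returncode
```

Making the command prefix a parameter keeps the helper usable with either "hadoop fs" or "hdfs dfs", and lets it be exercised on a machine without Hadoop installed.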
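Full example
Let's assume you are on a server and have a verified connection to hdfs (e.g. you have already called a .keytab).
You have just created a csv from a pandas.DataFrame and want to put it into hdfs.
You can then upload the file to hdfs as follows: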
import os
import pandas as pd
from subprocess import PIPE, Popen
# define path to saved file
file_name = "saved_file.csv"
# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)
# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)
# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)
# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()
The csv file will then exist at /user/<your-user-name>/saved_file.csv.
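Note: if you create this file from a Python script called within Hadoop, the intermediate csv file may be stored on some random node. Since this file is (presumably) no longer needed, it is best practice to remove it so that you don't pollute the nodes every time the script is called. You can simply add os.remove(file_name) as the last line of the script above to solve this.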
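The cleanup described in the note can be sketched as a small helper that removes the local file in a finally clause, so the file is deleted whether or not the upload command succeeds. This is only an illustration: put_then_remove and base_cmd are hypothetical names, and you may prefer to keep the local file when the upload fails.

```python
import os
import subprocess

def put_then_remove(local_path, hdfs_path, base_cmd=("hadoop", "fs", "-put")):
    """Run the put command, then delete the local copy.

    Returns the exit code of the put command. The local file is removed
    even if the command fails or raises (hypothetical helper, sketch only).
    """
    try:
        return subprocess.call(list(base_cmd) + [local_path, hdfs_path])
    finally:
        os.remove(local_path)
```

A non-zero return value means the put command reported a failure, so the caller should check it before assuming the data is in hdfs.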
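I have a feeling that you are writing into a file called '/python', while you intend it to be the directory in which the file is stored.
What does
hdfs dfs -cat /python
show you?
If it shows the file contents, all you need to do is edit your hdfs_path to include the file name (you should delete /python first with -rm). Otherwise, use pydoop (pip install pydoop) and do the following: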
import pydoop.hdfs as hdfs
from_path = '/tmp/infile.txt'
to_path ='hdfs://localhost:9000/python/outfile.txt'
hdfs.put(from_path, to_path)
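For the first option (editing hdfs_path so that it includes the file name), here is a minimal sketch of building the target path from a directory and a local file. The helper name hdfs_target is hypothetical; the directory and file name are the ones from the question.

```python
import os
import posixpath

def hdfs_target(hdfs_dir, local_path):
    """Return hdfs_dir joined with the base name of local_path.

    posixpath is used because HDFS paths always use forward slashes,
    whatever the local operating system is.
    """
    return posixpath.join(hdfs_dir, os.path.basename(local_path))

print(hdfs_target('hdfs://localhost:9000/python', 'pythonfile.txt'))
# prints hdfs://localhost:9000/python/pythonfile.txt
```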
I found this answer here:
import subprocess

def run_cmd(args_list):
    """
    run linux commands
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err

#Run Hadoop ls command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
# out is bytes under Python 3, so decode it before splitting
lines = out.decode().split('\n')
#Run Hadoop get command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])
#Run Hadoop put command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])
#Run Hadoop copyFromLocal command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])
#Run Hadoop copyToLocal command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])
#Usage: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently
#Run Hadoop remove file command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])
#rm -r
#HDFS Command to remove the entire directory and all of its content from #HDFS.
#Usage: hdfs dfs -rm -r <path>
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])
#Check if a file exist in HDFS
#Usage: hadoop fs -test -[defsz] URI
#Options:
#-d: if the path is a directory, return 0.
#-e: if the path exists, return 0.
#-f: if the path is a file, return 0.
#-s: if the path is not empty, return 0.
#-z: if the file is zero length, return 0.
#Example:
#hadoop fs -test -e filename
hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
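The existence check above can also be wrapped in a function that returns a boolean. This is a minimal sketch relying only on the exit-code convention listed in the comments (0 means the test passed). The name hdfs_path_exists and the base_cmd parameter are hypothetical and not part of the original answer.

```python
import subprocess

def hdfs_path_exists(path, base_cmd=("hdfs", "dfs", "-test", "-e")):
    """Return True if the test command exits with 0, False otherwise.

    hdfs_path_exists and base_cmd are hypothetical names for this sketch.
    """
    ret = subprocess.call(list(base_cmd) + [path])
    return ret == 0
```

Swapping "-e" for "-d" or "-f" in base_cmd would test for a directory or a regular file instead, following the option list above.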