
How to save a file in hadoop with python

Question:

I am starting to learn Hadoop, but I need to save a lot of files into it using Python. I cannot seem to figure out what I am doing wrong. Can anyone help me with this?

Below is my code. I think the HDFS_PATH is correct, as I didn't change it in the settings while installing. The pythonfile.txt is on my desktop (as is the Python code being run from the command line).

Code:

import hadoopy
import os
hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt',open('pythonfile.txt').read())])

main()

Output: when I run the above code, all I get is a directory in python itself.

iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 Brian supergroup        236 2014-10-28 11:30 /python

This is a pretty typical task for the subprocess module. The solution looks like this:

from subprocess import PIPE, Popen

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()

Full Example

Let's assume you're on a server and have a verified connection with HDFS (e.g. you have already authenticated with a .keytab).

You have just created a csv from a pandas.DataFrame and want to put it into HDFS.

You can then upload the file to HDFS as follows:

import os
import pandas as pd
from subprocess import PIPE, Popen


# define path to saved file
file_name = "saved_file.csv"

# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)

# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)

# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)

# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

The csv file will then exist at /user/<your-user-name>/saved_file.csv.

Note: if you created this file from a Python script called in Hadoop, the intermediate csv file may be stored on some random nodes. Since this file is (presumably) no longer needed, it is best practice to remove it so as not to pollute the nodes every time the script is called. You can simply add os.remove(file_name) as the last line of the above script to solve this issue.
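
For illustration, a minimal variant of the end of the script above that only removes the local copy once the upload has finished (the returncode check is an addition, not part of the original answer):

# upload the csv, then delete the local copy only if the put succeeded
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()
if put.returncode == 0:
    os.remove(file_name)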

I have a feeling that you're writing into a file called '/python', while you intend it to be the directory in which the file is stored.

What does

hdfs dfs -cat /python

show you?

If it shows the file contents, all you need to do is edit your hdfs_path to include the file name (you should delete /python first with -rm); a minimal sketch of that fix is shown after the pydoop example below. Otherwise, use pydoop (pip install pydoop) and do this:

import pydoop.hdfs as hdfs

from_path = '/tmp/infile.txt'
to_path = 'hdfs://localhost:9000/python/outfile.txt'
hdfs.put(from_path, to_path)
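
As mentioned above, if /python turns out to be a file, a minimal sketch of the first fix (stay with hadoopy, but delete /python and write to a path that includes the file name) could look like this; the subprocess call to -rm and the exact target path are illustrative assumptions based on the question's setup:

import subprocess
import hadoopy

# remove the file that was accidentally written at /python
subprocess.call(['hdfs', 'dfs', '-rm', '/python'])

# write to a path that includes the file name; writetb stores the
# (key, value) pairs as a SequenceFile at exactly this path
hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt'
with open('pythonfile.txt') as f:
    hadoopy.writetb(hdfs_path, [('pythonfile.txt', f.read())])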

I found this answer here:

import subprocess

def run_cmd(args_list):
    """
    Run a shell command and return its exit code, stdout and stderr.
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err

#Run Hadoop ls command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.decode().split('\n')  # decode: communicate() returns bytes under Python 3


#Run Hadoop get command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])


#Run Hadoop put command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])


#Run Hadoop copyFromLocal command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])

#Run Hadoop copyToLocal command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])


#Run Hadoop remove file command in Python
#Shell equivalent: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])


#rm -r
#HDFS Command to remove the entire directory and all of its content from #HDFS.
#Usage: hdfs dfs -rm -r <path>

(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])




#Check if a file exist in HDFS
#Usage: hadoop fs -test -[defsz] URI


#Options:
#-d: if the path is a directory, return 0.
#-e: if the path exists, return 0.
#-f: if the path is a file, return 0.
#-s: if the path is not empty, return 0.
#-z: if the file is zero length, return 0.
#Example:

#hadoop fs -test -e filename

hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
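
As a usage sketch (not part of the original answer), the return code from run_cmd can be turned into an exception so that a failed HDFS command stops the script instead of failing silently; run_cmd_checked below is a hypothetical helper name:

def run_cmd_checked(args_list):
    """Run an HDFS command via run_cmd and raise if it exits with a non-zero code."""
    ret, out, err = run_cmd(args_list)
    if ret != 0:
        raise RuntimeError('HDFS command failed: {0}'.format(' '.join(args_list)))
    return out

# example: upload a file and fail loudly if the put does not succeed
run_cmd_checked(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])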
