
Issue in using hadoop fs -put command in python code to transfer file from local filesystem to hdfs

I am using this code in Python, which reads records from a file, does some processing on them, and then writes the outcome to a new file. Then I transfer the file from my local filesystem to HDFS:

from os import system

# Read the pipe-delimited records from the input file
columns = []
read = open('file_read.txt', 'r')

for line in read:
    fields = line.split('|')
    columns.append(fields)

category = [-1, 1, 2, 3, 4, 5, 6]
out = open('file_write.txt', 'w')

# Reshape each record and write it to the output file
for line in columns:
    out.write('{0}|{1}|{2}|{3}'.format(line[0], line[1], line[5], line[6].rstrip().replace('-', '')))
    for val in category:
        if int(line[4]) == val:
            out.write('|{0}'.format(line[2]))
        else:
            out.write('|')
    for val in category:
        if int(line[4]) == val:
            out.write('|{0}'.format(line[3]))
        else:
            out.write('|')
    out.write('\n')

# Copy the finished file into HDFS
str = "HADOOP_USER_NAME=hdfs hadoop fs -put file_write.txt /folder1/folder2/"
result = system(str)

The problem is that during the transfer some of the last few records get lost from the file. The file that ends up in HDFS has about 10 records fewer than the file on my local filesystem. I have also tried -moveFromLocal, with the same result. However, if I execute either of the above commands from the terminal, the complete file gets moved; the issue only appears when I run the command from within the Python script.

Why is this happening and what can I do to resolve it?

UPDATE: The missing records occur only if I execute the part above the hadoop fs -put command. If I skip that part and just move an existing file, no data is lost. I tried to check whether some special character was being inserted that might cause the loss of the last few records, but couldn't find one (I went through the file looking for them).
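One thing that may be worth ruling out (only a guess based on the code shown above): file_write.txt is never closed before system() runs, so the last few records may still be sitting in Python's write buffer when hadoop fs -put copies the file from disk. A condensed sketch of the same flow with the output file flushed and closed before the transfer (the per-category columns are left out for brevity):

from os import system

# Reading and reshaping as in the original script, condensed
columns = []
with open('file_read.txt', 'r') as read:
    for line in read:
        columns.append(line.split('|'))

# The `with` block guarantees file_write.txt is flushed and closed
# before the hadoop command reads it from the local disk.
with open('file_write.txt', 'w') as out:
    for line in columns:
        out.write('{0}|{1}|{2}|{3}\n'.format(
            line[0], line[1], line[5],
            line[6].rstrip().replace('-', '')))

result = system("HADOOP_USER_NAME=hdfs hadoop fs -put file_write.txt /folder1/folder2/")
print("exit status: {0}".format(result))  # 0 means the command reported success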

I cannot reproduce the issue:

$ < /dev/urandom tr -dc "\n [:alnum:]" | head -c10000000 > test.txt
$ cat python_hdfs.py 
from os import system

str = "HADOOP_USER_NAME=hdfs hadoop fs -put test.txt /tmp/"
print system(str)
$ cat test.txt | wc -l
155682
$ python python_hdfs.py 
0
$ hadoop fs -cat /tmp/test.txt | wc -l
155682

Maybe config related?

  • Is the exit status of the system call 0? Are you on Linux or Windows? (A quick way to check the status from Python is sketched below, after these comments.)
  • How big is the file? Does it happen only with this specific file, or with other files as well?
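For reference, a minimal sketch of how the exit status and error output of the hadoop fs -put call could be captured from Python, using subprocess instead of os.system (the command and destination path are taken from the question):

import subprocess

# Run the command through the shell so the HADOOP_USER_NAME=... prefix
# behaves the same way it does with os.system()
proc = subprocess.Popen(
    "HADOOP_USER_NAME=hdfs hadoop fs -put file_write.txt /folder1/folder2/",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
out, err = proc.communicate()

print("exit status: {0}".format(proc.returncode))  # 0 means hadoop reported success
print("stderr: {0}".format(err))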
