
How to watch a directory in HDFS for incoming files using Python? (Python script is executed by a Docker container; no cronjob in HDFS)

Scenario: My Python script runs in a Docker container deployed in Rancher (a Kubernetes cluster), so the container is always running. I want to implement a method that watches a directory in my HDFS for incoming files. When new files arrive, the script should execute further actions (preprocessing steps to wrangle the data). Once the new files have been processed they should be deleted, and the script should then wait for the next incoming files and process them as well. Therefore it should not be a cronjob in HDFS; I need the code inside the script that the Docker container executes. Currently I am using the HDFS CLI to connect to my HDFS. For Java I found INotify, but I need to do this with Python.

Does anybody know a Python library or some other way to get this going?
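Since HDFS exposes no inotify-style API to Python, one common approach is simply to poll the directory in a loop. Below is a minimal sketch using the HdfsCLI package's WebHDFS client (`pip install hdfs`); the NameNode URL, the `hdfs` user, and the `/data/incoming` path are placeholder assumptions you would replace with your cluster's values:

```python
import time
from typing import Set


def new_files(previous: Set[str], current: Set[str]) -> Set[str]:
    """Names present in the current listing but not in the previous one."""
    return current - previous


def watch(client, watch_dir: str, interval: int = 10) -> None:
    """Poll `watch_dir` forever; process and then delete each new file."""
    seen: Set[str] = set(client.list(watch_dir))
    while True:
        current = set(client.list(watch_dir))
        for name in sorted(new_files(seen, current)):
            path = f"{watch_dir}/{name}"
            # ... run your preprocessing / data-wrangling steps on `path` ...
            client.delete(path)  # remove the file once it has been processed
        # Processed files were deleted, so only files that were already
        # present and are still present need to remain in `seen`.
        seen = seen & current
        time.sleep(interval)


if __name__ == "__main__":
    # Assumed connection details; adjust to your cluster.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hdfs")
    watch(client, "/data/incoming")
```

Because the container is always running, the `while True` loop replaces a cronjob; the polling interval is a trade-off between latency and NameNode load.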

# Schedule this script in crontab at a 1- or 5-minute interval, as required.
# Update the parameters (HDFSLocation, file name, etc.) to suit your setup.
# Extend the script to trigger an alert (send mail / launch another script)
# when newHDFSFileCount > previousHDFSFileCount.

import os
import subprocess

# Parameters
cwd = os.getcwd()
countFile = os.path.join(cwd, "HDFSFileCount.txt")
HDFSLocation = "/tmp"
previousHDFSFileCount = 0
newHDFSFileCount = 0

# Calculate the new HDFS file count.
# `hadoop fs -ls` prints a header line "Found N items" followed by one line
# per entry, so the entry count is the number of output lines minus one.
out, _ = subprocess.Popen(
    ["hadoop", "fs", "-ls", HDFSLocation],
    stdout=subprocess.PIPE,
).communicate()
lines = out.decode().splitlines()
newHDFSFileCount = max(len(lines) - 1, 0)

# Load the previous count, or initialise it on the first run.
if os.path.exists(countFile):
    with open(countFile) as f:
        previousHDFSFileCount = int(f.read().strip() or 0)
else:
    with open(countFile, "w") as f:
        f.write(str(newHDFSFileCount))
    previousHDFSFileCount = newHDFSFileCount

if newHDFSFileCount > previousHDFSFileCount:
    # New files have arrived: persist the new count and trigger your
    # alert or follow-up processing here.
    with open(countFile, "w") as f:
        f.write(str(newHDFSFileCount))
