
How to watch a Directory in HDFS for incoming Files using Python? (Python Script is executed by Docker Container; No Cronjob in HDFS)

Scenario: My Python script runs in a Docker container deployed in Rancher (a Kubernetes cluster), so the container is always running. I want to implement a method that watches a directory in my HDFS for incoming files. When new files arrive, the script should execute further actions (preprocessing steps to wrangle the data). Once the new files have been processed, they should be deleted. After that, the script should wait for new incoming files and process them as well. It should therefore not be a cron job in HDFS; I need the code in the script that is executed by the Docker container. Currently I am using hdfs cli to connect to my HDFS. For Java I found INotify, but I need to do it with Python.

Does anybody know a Python library or some other way to get this going?

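# Schedule this script in crontab to run every 1 or 5 minutes, depending on your requirements.
# Update the parameters (HDFSLocation, file, etc.) as required.
# Extend the alert block to send a mail or trigger another script
# when newHDFSFileCount > previousHDFSFileCount.

import os
import subprocess

# Parameters
cwd = os.getcwd()
file = 'HDFSFileCount.txt'
fileWithPath = os.path.join(cwd, file)
HDFSLocation = "/tmp"
previousHDFSFileCount = 0
newHDFSFileCount = 0

# Calculate the new HDFS file count. For a non-empty directory, `hadoop fs -ls`
# prints "Found N items" as its first line; for an empty one it prints nothing.
out = subprocess.run(['hadoop', 'fs', '-ls', HDFSLocation],
                     stdout=subprocess.PIPE, text=True).stdout
lines = out.splitlines()
if lines and lines[0].startswith('Found '):
    newHDFSFileCount = int(lines[0].split()[1])

# Read the count stored by the previous run; on the first run, store the current count.
if os.path.exists(fileWithPath):
    with open(fileWithPath, "r") as f:
        previousHDFSFileCount = int(f.read().strip() or 0)
else:
    with open(fileWithPath, "w") as f:
        f.write(str(newHDFSFileCount))
    previousHDFSFileCount = newHDFSFileCount

if newHDFSFileCount > previousHDFSFileCount:
    # New files have arrived: trigger the alert or follow-up script here.
    print("New files detected:", newHDFSFileCount - previousHDFSFileCount)

# Store the current count so the next run compares against it,
# even if files were removed in the meantime.
if newHDFSFileCount != previousHDFSFileCount:
    with open(fileWithPath, "w") as f:
        f.write(str(newHDFSFileCount))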
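
The script above relies on cron, while the question asks for a loop that lives inside the always-running container. A minimal sketch of that variant follows, assuming the `hadoop` command is available inside the container and that its `-ls -C` option (print paths only) is supported by the installed Hadoop version. The names `poll_once`, `watch`, `process_file`, and the 60-second interval are hypothetical and only illustrate the idea; files ending in `._COPYING_` are skipped on the assumption that they are uploads still in progress via `hadoop fs -put`.

```python
import subprocess
import time


def list_hdfs_files(hdfs_dir):
    """Return the paths found under hdfs_dir via `hadoop fs -ls -C`."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", "-C", hdfs_dir],
        stdout=subprocess.PIPE, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def delete_hdfs_file(path):
    """Delete a single HDFS file via `hadoop fs -rm`."""
    subprocess.run(["hadoop", "fs", "-rm", path], check=True)


def process_file(path):
    """Hypothetical placeholder for the preprocessing step."""
    print("processing", path)


def poll_once(hdfs_dir, list_files=list_hdfs_files,
              process=process_file, delete=delete_hdfs_file):
    """Process and then delete every finished file currently in hdfs_dir."""
    handled = []
    for path in list_files(hdfs_dir):
        # Assumption: a "._COPYING_" suffix marks an upload that is not finished yet.
        if path.endswith("._COPYING_"):
            continue
        process(path)
        delete(path)  # only reached if processing did not raise
        handled.append(path)
    return handled


def watch(hdfs_dir, interval_seconds=60):
    """Poll hdfs_dir forever, sleeping between polls."""
    while True:
        poll_once(hdfs_dir)
        time.sleep(interval_seconds)


# Usage inside the container (blocks forever):
# watch("/tmp")
```

Because processed files are deleted, no state file is needed: whatever is in the directory on the next poll is new. Passing the list, process, and delete steps as parameters also makes it possible to swap the `hadoop` calls for another client later without touching the loop.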



 