Python: Reading and processing multiple gzip files on a remote server

Problem Statement:

I have multiple (1000+) *.gz files on a remote server. I have to read these files and check for certain strings; if a string matches, I have to return the file name. I have tried the following code. It works, but it does not seem efficient, as a huge amount of I/O is involved. Can you please suggest a more efficient way to do this?

My Code:

import gzip
import os
import time
import paramiko
import multiprocessing
from bisect import insort

synchObj = multiprocessing.Manager()
hostname = '192.168.1.2'
port = 22
username = 'may'
password = 'Apa$sW0rd'

def miniAnalyze():
    ifile_list = synchObj.list([])  # A synchronized list to store the names of files containing a matched string.

    def analyze_the_file(file_single):
       strings = ("error 72", "error 81",)  # Hard-coded strings that need to be searched for.
       filename = os.path.basename(file_single)  # find returns full remote paths; keep only the base name.
       try:
          ssh = paramiko.SSHClient()
          # Code to FTP the file to the local system from the remote machine.
          # ...
          path_f = '/home/user/may/' + filename

          # Read the gzip file on the local system after the FTP is done.

          with gzip.open(path_f, 'rb') as f:
            contents = f.read()
            if any(s in contents for s in strings):
                print "File " + str(path_f) + " is a hit."
                insort(ifile_list, filename)  # Record the file name if there is a match.
            os.remove(path_f)  # Delete the local copy whether or not it matched.
       except Exception, ae:
          print "Error while analyzing file: " + str(ae)

       finally:
           if ifile_list:
             print "The error is at " + str(ifile_list)
           ftp.close()  # ftp comes from the elided FTP code above.
           ssh.close()


    def assign_to_proc():
        # Glob files matching a pattern on the remote machine and hand each one to analyze_the_file via multiprocessing.
        apath = '/home/remotemachine/log/'
        apattern = '"*.gz"'
        first_command = 'find {path} -name {pattern}'
        command = first_command.format(path=apath, pattern=apattern)

        try:
            ssh = paramiko.SSHClient()
            ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            ssh.connect(hostname, username=username, password=password)
            stdin, stdout, stderr = ssh.exec_command(command)
            while not stdout.channel.exit_status_ready():  # Wait for the remote find to finish.
                time.sleep(2)
            filelist = stdout.read().splitlines()

            jobs = []

            for ifle in filelist:
                # One process per file: with 1000+ files this launches 1000+ concurrent processes.
                p = multiprocessing.Process(target=analyze_the_file, args=(ifle,))
                jobs.append(p)
                p.start()

            for job in jobs:
                job.join()


        except Exception, fe:
            print "Error while getting file names: " + str(fe)

        finally:
            ssh.close()

    assign_to_proc()  # Without this call, miniAnalyze() never actually runs anything.


if __name__ == '__main__':
    miniAnalyze()

The above code is slow: there is a lot of I/O involved in transferring the GZ files to the local system. Kindly help me find a better way to do it.

Execute a remote OS command such as zgrep, and process the command's results locally. This way, you won't have to transfer the whole file contents to your local machine.
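A minimal sketch of that approach, assuming the same host, credentials, and remote log directory as in the question (the helper name find_matching_files is hypothetical). zgrep -l prints only the names of the gzip files that contain a match, so the decompression and search run on the remote machine and only matching file names travel back over SSH:

import paramiko

hostname = '192.168.1.2'
username = 'may'
password = 'Apa$sW0rd'

def find_matching_files(patterns, remote_dir='/home/remotemachine/log/'):
    # Build one remote command: find every *.gz file and zgrep it in place.
    #   -l      print each matching file name once, instead of the matching lines
    #   -e PAT  one -e per search string; a file counts if any pattern matches
    pattern_args = ' '.join('-e "%s"' % p for p in patterns)
    command = 'find %s -name "*.gz" -exec zgrep -l %s {} +' % (remote_dir, pattern_args)

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname, username=username, password=password)
    try:
        stdin, stdout, stderr = ssh.exec_command(command)
        return sorted(stdout.read().splitlines())  # read() blocks until the remote command finishes
    finally:
        ssh.close()

if __name__ == '__main__':
    for path in find_matching_files(("error 72", "error 81")):
        print "File " + path + " is a hit."

This removes the FTP transfer, the local gzip.open, and the per-file processes entirely: the search happens where the files already live, in a single remote command.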
