Extracting specific data from a text file in Python

If I have a text file containing:

 Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:11             0.0.0.0:0              LISTENING       12   dns.exe
  TCP    0.0.0.0:95             0.0.0.0:0              LISTENING       589  lsass.exe
  TCP    0.0.0.0:111            0.0.0.0:0              LISTENING       888  svchost.exe
  TCP    0.0.0.0:123            0.0.0.0:0              LISTENING       123  lsass.exe
  TCP    0.0.0.0:449            0.0.0.0:0              LISTENING       2    System

Is there a way to extract ONLY the process names, such as dns.exe, lsass.exe, etc.?

I tried using split() to get the text that follows the string LISTENING. Then I took what was left (12 dns.exe, 589 lsass.exe, etc.) and checked the length of each string: if the len() of 12 dns.exe was between 17 and 20, for example, I would take a substring at fixed positions. I only accounted for the length of the PID (anywhere from 1 to 4 digits) and forgot that the length of each process name also varies (there are hundreds of them). Is there a simpler way to do this, or am I out of luck?

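You can use a pandas DataFrame for this instead of splitting each line by hand:

import pandas
parsed_file = pandas.read_csv("filename", sep=r"\s+", skiprows=1, names=["Proto", "Local Address", "Foreign Address", "State", "PID", "Process"])

This reads the file into a DataFrame. The columns are separated by runs of spaces rather than commas, so sep has to be set. The header row is skipped and replaced with your own column names, because it has no label for the process column and some of its labels contain spaces. You can then filter for the rows containing dns.exe, etc.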


Here is a more general replacement for read_csv if you want more control. I've assumed your columns are all tab-separated, but feel free to change the splitting character:

import pandas as pd

with open('filename', 'r') as logs:
    logs.readline()  # skip the header so you can define your own
    columns = ["Proto", "Local Address", "Foreign Address", "State", "PID", "Process"]
    # strip the trailing newline so it does not end up in the Process column
    formatted_logs = pd.DataFrame([dict(zip(columns, line.rstrip('\n').split('\t'))) for line in logs])

Then you can just filter the rows by

formatted_logs = formatted_logs[formatted_logs['Process'].isin(['dns.exe','lsass.exe', ...])]

If you want just the process names, it is even simpler. Just do

processes = formatted_logs['Process'] # returns a Series object that can be iterated through

split should work just fine, so long as you ignore the header in your file:

processes = []

with open("file.txt", "r") as f:
    lines = f.readlines()

    # Loop through all lines, ignoring header.
    # Add last element to list (i.e. the process name)
    for l in lines[1:]:
        processes.append(l.split()[-1])

print(processes)

Result:

['dns.exe', 'lsass.exe', 'svchost.exe', 'lsass.exe', 'System']
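
If the file might contain blank lines (for example a trailing empty line), split() returns an empty list for them and [-1] raises an IndexError. Here is a minimal sketch that guards against this, assuming the header is always the first line. The function name process_names and the inline sample are only illustrative:

```python
def process_names(lines):
    """Return the last whitespace-separated field of each data line."""
    names = []
    for line in lines[1:]:       # skip the header row
        fields = line.split()    # splits on any run of spaces or tabs
        if fields:               # ignore blank lines
            names.append(fields[-1])
    return names


sample = """\
 Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:11             0.0.0.0:0              LISTENING       12   dns.exe
  TCP    0.0.0.0:449            0.0.0.0:0              LISTENING       2    System

"""

print(process_names(sample.splitlines()))  # ['dns.exe', 'System']
```

With a real file, pass f.readlines() instead of sample.splitlines().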

You could simply use re.split :

import re

rx = re.compile(" +")
l = rx.split("       12   dns.exe") #  => ['', '12', 'dns.exe']
pid = l[1]   # '12'
name = l[2]  # 'dns.exe'

It splits the string on any run of spaces; the second element is the PID and the third is the process name.
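
A regular expression can also be applied to a whole line without cutting off the leading columns first. This is a sketch, assuming every data line ends with a numeric PID followed by a process name that contains no spaces. The pattern and the helper name pid_and_name are my own and not part of the answer above:

```python
import re

# A numeric PID, whitespace, then a name without spaces at the end of the line.
TAIL = re.compile(r"(\d+)\s+(\S+)\s*$")


def pid_and_name(line):
    """Return (pid, name) for a data line, or None if the line does not match."""
    match = TAIL.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


line = "  TCP    0.0.0.0:95             0.0.0.0:0              LISTENING       589  lsass.exe"
print(pid_and_name(line))  # (589, 'lsass.exe')
```

Lines that do not end in that shape, such as the header row, give None, so they can be skipped without counting lines.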

You could also use plain split and process the file one line at a time, like this:

def getAllExecutables(textFile):
    execFiles = []
    with open(textFile) as f:
        fln = f.readline()
        while fln:
            pidname = str.strip(list(filter(None, fln.split(' ')))[-1]) #splitting the line, removing empty entry, stripping unnecessary chars, take last element
            if (pidname[-3:] == 'exe'): #check if the pidname ends with exe
                execFiles.append(pidname) #if it does, adds it
            fln = f.readline() #read the next line
    return  execFiles

exeFiles = getAllExecutables('file.txt')
print(exeFiles)

Some remarks on the code above:

  1. filter(None, ...) removes the empty strings that split(' ') produces between consecutive spaces.
  2. str.strip removes surrounding whitespace (such as \n) from the element.
  3. [-1] takes the last element of the line after the split.
  4. The last 3 characters of that element are compared with exe. If they match, the element is added to the resulting list, so entries such as System are left out.

Results:

['dns.exe', 'lsass.exe', 'svchost.exe', 'lsass.exe']

with open(txtfile) as txt:
    lines = [line for line in txt]
process_names = [line.split()[-1] for line in lines[1:]]

This opens your input file and reads all the lines into a list. Next, the list is iterated over starting at the second element (because the first is the header row) and each line is split() . The last item in the resulting list is then added to process_names .
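
Since one process can own several ports (lsass.exe appears twice in the sample), you may want each name only once. This sketch keeps the first occurrence of each name in order. unique_process_names is a hypothetical helper, and it assumes Python 3.7+ so that dict insertion order is guaranteed:

```python
def unique_process_names(lines):
    """Return process names in first-seen order, without duplicates."""
    seen = {}
    for index, line in enumerate(lines):
        fields = line.split()
        if index == 0 or not fields:       # skip the header row and blank lines
            continue
        seen.setdefault(fields[-1], None)  # dicts preserve insertion order
    return list(seen)


sample = [
    " Proto  Local Address          Foreign Address        State           PID",
    "  TCP    0.0.0.0:95             0.0.0.0:0              LISTENING       589  lsass.exe",
    "  TCP    0.0.0.0:111            0.0.0.0:0              LISTENING       888  svchost.exe",
    "  TCP    0.0.0.0:123            0.0.0.0:0              LISTENING       123  lsass.exe",
]

print(unique_process_names(sample))  # ['lsass.exe', 'svchost.exe']
```

The function accepts any iterable of lines, so an open file object can be passed directly inside the with block.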
