
Python: text log file processing and transposing rows to columns

I am new to Python and stuck on a log file in text format. It has the repetitive structure shown below, and I need to extract the data from its rows and transpose it into columns, depending on the data. For example:

The first 50 lines are garbage, like the first six lines shown below:

    ------------------------------------------------------------- 
Logging to file  xyz.
Char 
1,
 3 
r
 =

 ---------------------------------------------- 
Pid                             0 
Name                                   SAB=1, XYZ=3
---------------------------------------------- 
a              1 
b              2 
c              3
---------------------------------------------- 
Pid                             0 
Name                                   SAB=1, XYZ=3, P_NO=546467
---------------------------------------------- 
Test_data_1              00001 
Test_data_2              FOXABC 
Test_data_3         SHEEP123
Country             US
---------------------------------------------- 
Pid                             0 
Name                                   SAB=1
---------------------------------------------- 
Sno                  893489423

Log FileFormat

The log continues like this for another million lines.

The required output looks like this:

Required output format

PID, Name,       a,b,c
0, "SAB=1, XYZ=3", 1,2,3

PID, Name         , Test_data_1, Test_data_2, Test_data_3, Country
0, "SAB=1, XYZ=3, P_NO=546467", 00001, FOXABC, SHEEP123, US

Pid, Name, Sno
0, SAB=1, 893489423

I tried to write some code but could not get the desired result. My attempt is below:

# file_name is the path to the log file (not defined in this snippet)
with open(file_name, 'r') as fn:
    for i, line in enumerate(fn):
        # Skip the first 50 garbage lines, then look only at the "Name" rows
        if i >= 50 and "Name " in line:
            last_tag = line.split(",")[-1]          # last "KEY=VALUE" pair on the row
            last_element = last_tag.split("=")[0]   # its key
            print(last_element)

Any help would be really appreciated.

Newly Discovered Structure

RBY Structure

The solution I came up with is a bit messy, but it works:

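import sys
import re
try:
    import StringIO             #Python 2
except ImportError:
    import io as StringIO       #Python 3: io.StringIO offers the same interface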


ifile = open(sys.argv[1],'r')   #Input log file as command-line argument
ofile = open(sys.argv[1][:-4]+"_formatted.csv",'w') #Output CSV; [:-4] assumes the input name ends in a 3-character extension such as .txt

stringOut = ""

i = 0
flagReturn = True
j = 0

reVal = re.compile(r"Pid\s+(.*)\nName\s+(.*)\n-+<br>(.*)<br>") #Regex pattern for separating the Pid & Name from the variables
reVar = re.compile(r"(.*)[ ]+(.*)") #Regex pattern for getting vars and their values
reVarStr = re.compile(r">>> [0-9]+.(.*)=(.*)") #Regex pattern for Struct members
reVarStrMatch = re.compile(r"Struct(.*)has(.*)members:") #Regex pattern for Struct check


for lines in ifile: #Iterate lazily so a million-line log is not loaded into memory at once
    if(i>8): #Omitting the first 9 lines of Garbage values
        if(lines.strip()=="----------------------------------------------"): #Checking for separation between PID & Name group and the Var group
            j+=1 #variable keeping track of whether we are inside the vars section or not (between two rows of hyphens)
            flagReturn = not flagReturn #To print the variables in single line to easily separate them with regex pattern reVal

        if(not flagReturn):
            stringTmp = lines.strip()+"<br>" #adding break to the end of each vars line in order for easier separation
        else:
            stringTmp = lines #if not vars then save each line as is

        stringOut += stringTmp #concatenating each lines to form the searchable string

    i+=1 #incrementing for omitting lines (useless after i=8)

    if(j==2):   #Once a complete set of PIDs, Names and Vars have been collected
        j=0     #Reset j
        matchObj = reVal.match(stringOut) #Match for PID, Name & Vars
        line1 = "Pid,Name,"
        line2 = matchObj.group(1).strip()+",\""+matchObj.group(2)+"\","
        buf = StringIO.StringIO(matchObj.group(3).replace("<br>","\n"))
        structFlag = False
        for line in buf.readlines(): #Separate each vars and add to the respective strings for writing to file
            if(not (reVarStrMatch.match(line) is None)):
                structFlag = True
            elif(structFlag and (not (reVarStr.match(line) is None))):
                matchObjVars = reVarStr.match(line)
                line1 += matchObjVars.group(1).strip()+","
                line2 += matchObjVars.group(2).strip()+","

            else:
                structFlag = False
                matchObjVars = reVar.match(line)
                try:
                    line1 += matchObjVars.group(1).strip()+","
                    line2 += matchObjVars.group(2).strip()+","
                except AttributeError: #reVar did not match, so matchObjVars is None
                    line1 += line.strip()+","
                    line2 += " ,"

        ofile.write(line1[:-1]+"\n") #[:-1] drops the trailing comma
        ofile.write(line2[:-1]+"\n")
        ofile.write("\n")
        stringOut = "" #Resetting the string

ofile.close()
ifile.close()   

EDIT This is what I came up with to include the new pattern as well.

I suggest you do the following:

  1. Run the parser script on a copy of the log file and see where it fails next.
  2. Identify and write down the new pattern that broke the parser.
  3. Delete all data matching the newly identified pattern from the copy.
  4. Repeat from Step 1 until all patterns have been identified.
  5. Create a separate regular expression for each type of pattern, and apply each one in its own function that appends to the output strings.
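Steps 1 and 2 can be partly automated. The following is only a rough sketch, assuming every known line type can be described by one regular expression; `KNOWN_PATTERNS`, `first_unknown_line` and the `skip` parameter are hypothetical names, and the expressions are loosely adapted from the script above rather than checked against the real log.

```python
import re

# Hypothetical pattern table: one compiled regex per known line type.
# Extend it each time a new pattern is identified in the log.
KNOWN_PATTERNS = {
    "separator": re.compile(r"^-+$"),
    "struct_header": re.compile(r"^Struct.*has.*members:$"),
    "struct_member": re.compile(r"^>>> [0-9]+.(.*)=(.*)$"),
    "key_value": re.compile(r"^(\S+)\s+(.*\S)$"),
}


def first_unknown_line(lines, skip=9):
    """Return (line_number, text) of the first line no pattern matches, or None."""
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if number <= skip or not text:
            continue  # leading garbage lines and blank lines are ignored
        if not any(p.match(text) for p in KNOWN_PATTERNS.values()):
            return number, text
    return None
```

Calling `first_unknown_line(open(path))` on the copy of the log reports where the next unidentified pattern starts, which is the information Step 2 asks you to write down.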

EDIT 2

structFlag = False
RBYFlag = False
for line in buf.readlines(): #Separate each vars and add to the respective strings for writing to file
    if(not (reVarStrMatch.match(line) is None)):
        structFlag = True
    elif(structFlag and (not (reVarStr.match(line) is None))):
        matchObjVars = reVarStr.match(line)
        if(matchObjVars.group(1).strip()=="RBY" and not RBYFlag):
            line1 += matchObjVars.group(1).strip()+"," #The first RBY member opens a single RBY column
            line2 += matchObjVars.group(2).strip()+"**"
            RBYFlag = True
        elif(matchObjVars.group(1).strip()=="RBY"):
            line2 += matchObjVars.group(2).strip()+"**" #Further RBY values are joined with "**"
        else:
            if(RBYFlag):
                line2 = line2[:-2]+"," #Drop the trailing "**" and close the RBY column
                RBYFlag = False
            line1 += matchObjVars.group(1).strip()+","
            line2 += matchObjVars.group(2).strip()+","
    else:
        structFlag = False
        if(RBYFlag):
            line2 = line2[:-2]+"," #Drop the trailing "**" and close the RBY column
            RBYFlag = False
        matchObjVars = reVar.match(line)
        try:
            line1 += matchObjVars.group(1).strip()+","
            line2 += matchObjVars.group(2).strip()+","
        except AttributeError: #reVar did not match, so matchObjVars is None
            line1 += line.strip()+","
            line2 += " ,"
if(RBYFlag):
    line2 = line2[:-2]+"," #Close an RBY column that runs to the end of the record

NOTE This loop has become very bloated. It would be better to move the detection of the line type into a separate function that returns a value indicating the type.
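One possible shape for that refactoring is sketched below. It is not the code I ran: `classify` and `collect` are hypothetical names, the kind labels are made up for illustration, and the regular expressions are the ones from the script. It builds lists instead of comma-terminated strings, so the trailing separators do not have to be trimmed afterwards.

```python
import re

# Patterns taken from the script above; the kind labels returned by
# classify() ("struct_header", "struct_member", "plain") are hypothetical.
RE_STRUCT_HEADER = re.compile(r"Struct(.*)has(.*)members:")
RE_STRUCT_MEMBER = re.compile(r">>> [0-9]+.(.*)=(.*)")
RE_PLAIN = re.compile(r"(.*)[ ]+(.*)")


def classify(line, in_struct):
    """Return (kind, name, value) for one variable line."""
    if RE_STRUCT_HEADER.match(line):
        return "struct_header", None, None
    member = RE_STRUCT_MEMBER.match(line) if in_struct else None
    if member:
        return "struct_member", member.group(1).strip(), member.group(2).strip()
    plain = RE_PLAIN.match(line)
    if plain:
        return "plain", plain.group(1).strip(), plain.group(2).strip()
    return "plain", line.strip(), " "  # no value found: keep the name only


def collect(var_lines):
    """Build parallel name/value lists; consecutive RBY members share one column."""
    names, values = [], []
    in_struct = False
    rby_open = False
    for line in var_lines:
        kind, name, value = classify(line, in_struct)
        if kind == "struct_header":
            in_struct = True
            continue
        in_struct = kind == "struct_member"
        is_rby = kind == "struct_member" and name == "RBY"
        if is_rby and rby_open:
            values[-1] += "**" + value  # join repeated RBY values with "**"
        else:
            names.append(name)
            values.append(value)
        rby_open = is_rby
    return names, values
```

The two lists can then be written with `",".join(names)` and `",".join(values)` after the `Pid` and `Name` fields. For an invented struct such as `Struct foo has 3 members:` followed by `>>> 1.RBY=10`, `>>> 2.RBY=20` and `>>> 3.X=7`, this yields the columns `RBY` and `X` with the values `10**20` and `7`.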

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 