
Python: text log file processing and transposing rows to columns

I am new to Python and stuck on a log file in text format. It has the repetitive structure shown below, and I need to extract the data from the rows and turn it into columns, depending on the data. For example:

The first 50 lines are garbage, like the first six lines below:

    ------------------------------------------------------------- 
Logging to file  xyz.
Char 
1,
 3 
r
 =

 ---------------------------------------------- 
Pid                             0 
Name                                   SAB=1, XYZ=3
---------------------------------------------- 
a              1 
b              2 
c              3
---------------------------------------------- 
Pid                             0 
Name                                   SAB=1, XYZ=3, P_NO=546467
---------------------------------------------- 
Test_data_1              00001 
Test_data_2              FOXABC 
Test_data_3         SHEEP123
Country             US
---------------------------------------------- 
Pid                             0 
Name                                   SAB=1
---------------------------------------------- 
Sno                  893489423

Log file format

------------ Continues for another million lines.

The required output looks like this:

Required output format

PID, Name,       a,b,c
0, "SAB=1, XYZ=3", 1,2,3

PID, Name         , Test_data_1, Test_data_2, Test_data_3, Country
0, "SAB=1, XYZ=3, P_NO=546467", 00001, FOXABC, SHEEP123, US

Pid, Name, Sno
0, SAB=1, 893489423

I tried to write some code but failed to get the desired result. My attempt is below:

'''
fn = open(file_name, 'r')
for i, line in enumerate(fn):
    if i >= 50 and "Name " in line:   # skip the first 50 lines, then look at the Name rows
        last_tag = line.split(",")[-1]
        last_element = last_tag.split("=")[0]
        print(last_element)
'''

Any help would be really appreciated.

Newly discovered structure

RBY structure

The solution I came up with is a bit messy, but it works. Check it out below:

import io
import re
import sys


ifile = open(sys.argv[1], 'r')   # input log file, passed as a command-line argument
ofile = open(sys.argv[1][:-4] + "_formatted.csv", 'w')   # output CSV; assumes a three-letter extension such as .txt

stringOut = ""      # text of the record collected so far

i = 0               # index of the current input line
flagReturn = True   # False while inside the variable section of a record
j = 0               # separator rows seen in the current record

# Raw strings keep the backslashes intact for the regex engine.
reVal = re.compile(r"Pid\s+(.*)\nName\s+(.*)\n-+<br>(.*)<br>")   # separates Pid and Name from the variables
reVar = re.compile(r"(.*)[ ]+(.*)")   # a variable and its value
reVarStr = re.compile(r">>> [0-9]+.(.*)=(.*)")   # a struct member
reVarStrMatch = re.compile(r"Struct(.*)has(.*)members:")   # the line that opens a struct


for lines in ifile:   # iterate lazily; the log may run to a million lines
    if i > 8:   # omit the first 9 lines of garbage (adjust to the real length of the preamble)
        if lines.strip() == "----------------------------------------------":   # separator between the Pid/Name group and the variable group
            j += 1   # tracks whether we are inside the variable section (between two rows of hyphens)
            flagReturn = not flagReturn   # variables are joined on one line so that reVal can capture them

        if not flagReturn:
            stringTmp = lines.strip() + "<br>"   # mark the end of each variable line for easier separation
        else:
            stringTmp = lines   # outside the variable section, keep the line as is

        stringOut += stringTmp   # build up the searchable string

    i += 1   # only needed for skipping the preamble

    if j == 2:   # a complete set of Pid, Name and variables has been collected
        j = 0
        matchObj = reVal.match(stringOut)   # raises AttributeError below if the record has an unexpected shape
        line1 = "Pid,Name,"
        line2 = matchObj.group(1).strip() + ",\"" + matchObj.group(2) + "\","
        buf = io.StringIO(matchObj.group(3).replace("<br>", "\n"))
        structFlag = False
        for line in buf.readlines():   # split each variable line into a name (line1) and a value (line2)
            if reVarStrMatch.match(line) is not None:
                structFlag = True
            elif structFlag and reVarStr.match(line) is not None:
                matchObjVars = reVarStr.match(line)
                line1 += matchObjVars.group(1).strip() + ","
                line2 += matchObjVars.group(2).strip() + ","

            else:
                structFlag = False
                matchObjVars = reVar.match(line)
                try:
                    line1 += matchObjVars.group(1).strip() + ","
                    line2 += matchObjVars.group(2).strip() + ","
                except AttributeError:   # no value on this line: keep the name and leave the cell blank
                    line1 += line.strip() + ","
                    line2 += " ,"

        ofile.write(line1[:-1] + "\n")   # drop the trailing comma
        ofile.write(line2[:-1] + "\n")
        ofile.write("\n")
        stringOut = ""   # reset for the next record

ofile.close()
ifile.close()
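As an alternative to building the CSV lines by string concatenation, here is a minimal sketch that relies on the standard `csv` module for quoting, so a Name such as `SAB=1, XYZ=3` is quoted automatically. It assumes the layout shown in the question: sections delimited by rows of hyphens, a two-line `Pid`/`Name` section followed by a section of `key value` lines. The helper names (`split_pair`, `sections`, `parse_records`, `to_csv`) are made up for this illustration, and the struct lines handled by the script above are not covered.

```python
import csv
import io


def split_pair(line):
    """Split 'key   value' on the first run of whitespace; the value may be empty."""
    parts = line.split(None, 1)
    return parts[0], (parts[1] if len(parts) > 1 else "")


def sections(lines):
    """Yield lists of non-empty lines, cut at every row made only of hyphens."""
    current = []
    for raw in lines:
        line = raw.strip()
        if line and set(line) == {"-"}:
            yield current
            current = []
        elif line:
            current.append(line)
    yield current  # the last section has no closing row of hyphens


def is_head(section):
    """True for a two-line section that starts with Pid and Name."""
    return (len(section) == 2
            and split_pair(section[0])[0] == "Pid"
            and split_pair(section[1])[0] == "Name")


def parse_records(lines):
    """Yield (header, row) pairs, one per Pid/Name section and the section after it."""
    head = None
    for section in sections(lines):
        if head is not None:
            pairs = [split_pair(item) for item in section]
            header = ["Pid", "Name"] + [key for key, _ in pairs]
            row = [split_pair(head[0])[1], split_pair(head[1])[1]]
            row += [value for _, value in pairs]
            yield header, row
            head = None
        elif is_head(section):
            head = section
        # anything else (for example the garbage preamble) is ignored


def to_csv(lines):
    """Return the records as CSV text, with a blank line after each record."""
    out = io.StringIO()
    writer = csv.writer(out)
    for header, row in parse_records(lines):
        writer.writerow(header)
        writer.writerow(row)
        writer.writerow([])
    return out.getvalue()
```

Because `parse_records` is a generator over any iterable of lines, it can be fed an open file object directly, which avoids loading a million-line log into memory. Values stay strings, so leading zeros such as `00001` are preserved.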

EDIT: This is what I came up with to include the new pattern as well.

I suggest you do the following:

  1. Run the parser script on a copy of the log file and see where it fails next.
  2. Identify and write down the new pattern that broke the parser.
  3. Delete all data in the newly identified pattern.
  4. Repeat from step 1 until all patterns have been identified.
  5. Create an individual regular expression for each type of pattern and call each one from its own function to write to the string.
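Steps 1 and 2 can be partly automated. The sketch below lists every line that none of the known patterns covers, so new structures show up before the parser trips over them. The two struct regexes are copied from the script above; the `plain_pair` pattern, the dictionary `KNOWN_PATTERNS` and the function `find_unknown_lines` are assumptions made for this illustration, not part of the original script.

```python
import re

# Patterns the parser already understands. Extend this dictionary
# each time a new structure is identified in the log.
KNOWN_PATTERNS = {
    "struct_header": re.compile(r"Struct(.*)has(.*)members:"),
    "struct_member": re.compile(r">>> [0-9]+.(.*)=(.*)"),
    "plain_pair": re.compile(r"\S+[ ]+\S.*"),  # assumed shape: name, spaces, value
}


def find_unknown_lines(lines, patterns=KNOWN_PATTERNS):
    """Return (line_number, text) for each data line that matches no known pattern."""
    unknown = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue  # blank lines and rows of hyphens are structure, not data
        if not any(pattern.match(line) for pattern in patterns.values()):
            unknown.append((number, line))
    return unknown
```

Calling `find_unknown_lines(open("copy_of_log.txt"))` on a copy of the log (the file name is a placeholder) gives the line numbers to inspect when writing down the next pattern.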

EDIT 2

        structFlag = False
        RBYFlag = False   # True while consecutive RBY members are being merged into one cell
        for line in buf.readlines():   # split each variable line into a name (line1) and a value (line2)
            if reVarStrMatch.match(line) is not None:
                structFlag = True
            elif structFlag and reVarStr.match(line) is not None:
                matchObjVars = reVarStr.match(line)
                if matchObjVars.group(1).strip() == "RBY" and not RBYFlag:
                    # first RBY member: add the column once and start a "**"-joined value
                    line1 += matchObjVars.group(1).strip() + ","
                    line2 += matchObjVars.group(2).strip() + "**"
                    RBYFlag = True
                elif matchObjVars.group(1).strip() == "RBY":
                    # later RBY members extend the same cell
                    line2 += matchObjVars.group(2).strip() + "**"
                else:
                    if RBYFlag:
                        line2 = line2[:-2] + ","   # replace the trailing "**" with the cell separator
                        RBYFlag = False
                    line1 += matchObjVars.group(1).strip() + ","
                    line2 += matchObjVars.group(2).strip() + ","

            else:
                structFlag = False
                if RBYFlag:
                    line2 = line2[:-2] + ","   # replace the trailing "**" with the cell separator
                    RBYFlag = False
                matchObjVars = reVar.match(line)
                try:
                    line1 += matchObjVars.group(1).strip() + ","
                    line2 += matchObjVars.group(2).strip() + ","
                except AttributeError:   # no value on this line: keep the name and leave the cell blank
                    line1 += line.strip() + ","
                    line2 += " ,"

        if RBYFlag:   # the record ended on an RBY member
            line2 = line2[:-2] + ","

NOTE: This loop has become very bloated. It would be better to create a separate function that identifies the type of each line and returns a value accordingly.
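One possible shape for that refactoring is sketched here. A `classify` function decides what kind of line it is looking at and returns the name and value, and the loop only decides where the value goes. The regexes are the ones from the script; the function names `classify` and `build_columns`, the list-based return value, and the sample lines in the usage note are assumptions for illustration, since the real struct lines are only known through those regexes.

```python
import re

RE_STRUCT_HEADER = re.compile(r"Struct(.*)has(.*)members:")
RE_STRUCT_MEMBER = re.compile(r">>> [0-9]+.(.*)=(.*)")
RE_PLAIN = re.compile(r"(.*)[ ]+(.*)")


def classify(line, in_struct):
    """Return (kind, name, value) for one variable line."""
    if RE_STRUCT_HEADER.match(line):
        return "struct_header", None, None
    member = RE_STRUCT_MEMBER.match(line) if in_struct else None
    if member:
        return "member", member.group(1).strip(), member.group(2).strip()
    plain = RE_PLAIN.match(line)
    if plain:
        return "plain", plain.group(1).strip(), plain.group(2).strip()
    return "bare", line.strip(), " "  # a name without a value gets a blank cell


def build_columns(lines):
    """Return (names, values); consecutive RBY members are merged with '**'."""
    names, values = [], []
    in_struct = False
    rby_open = False  # True while the last cell is an RBY cell that can still grow
    for line in lines:
        kind, name, value = classify(line, in_struct)
        if kind == "struct_header":
            in_struct = True
            continue
        in_struct = (kind == "member")  # a non-member line ends the struct
        if kind == "member" and name == "RBY":
            if rby_open:
                values[-1] += "**" + value
            else:
                names.append(name)
                values.append(value)
                rby_open = True
        else:
            names.append(name)
            values.append(value)
            rby_open = False
    return names, values
```

With hypothetical input such as `["a    1", "Struct S has 3 members:", ">>> 1.RBY=10", ">>> 2.RBY=20", ">>> 3.Z=7", "Country   US"]`, the RBY values end up in a single `10**20` cell. Keeping names and values in lists also removes the need to trim trailing separators, and the lists can be passed straight to `csv.writer.writerow`.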
