繁体   English   中英

使用Python解析文本文件

[英]Parsing text files using Python

我是Python的新手,我希望用它来解析文本文件。 该文件具有以下格式的250-300行:

---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----

我需要将以下信息存储到该文件中所有条目的另一个文件(excel或文本)中

UserName/ID  Previous Status New Status Date Time

所以我的结果文件看起来应该是这样的

Mark Grey/mark.grey@gmail.com  Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com  NaN  Available 14/07/2010 16:32:39

提前致谢,

任何帮助将非常感激

为了帮助您入门:

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)
with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous")
                # etc.
            ])
        else:
            # Match attempt failed

会得到一系列比赛的部分。 然后,我建议您使用csv module以标准格式存储结果。

import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date)

这假设了空白,并假设每一行都符合模式。 如果要优雅地处理脏输入,可能需要添加错误检查(例如检查pat.match()不返回None )。

感兴趣的两种RE模式似乎是......:

p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) (\S+) (\S+) ----$'

所以我会这样做:

import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)

r1 = re.compile(p1)
r2 = re.compile(p2)
data = []

with open('somefile.txt') as f:
    for line in f:
        m = p1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = p2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = m.groups()
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)

如果我描述的两种模式不够通用,当然可能需要进行调整,但我认为这种通用方法很有用。 虽然Stack Overflow上的许多Python用户非常不喜欢RE,但我发现它们对于这种实用的临时文本处理非常有用。

也许其他人想要将RE用于荒谬的用途,例如对CSV,HTML,XML的临时解析,以及许多其他类型的结构化文本格式,其中存在非常好的解析器,这可能是不喜欢的。 而且,其他任务远远超出了RE的“舒适区”,而是需要像pyparsing这样的可靠的通用解析器系统。 或者在另一个极端超简单的任务完成得很好用简单的字符串(例如,我记得最近的SO问题,其使用的if re.search('something', s):而不是if 'something' in s: !)。

但是对于相当广泛的任务(不包括最简单的一端,以及在另一端解析结构化或有些复杂的语法)RE是合适的,使用它们确实没有错,我建议所有程序员至少要学习RE的基础知识。

亚历克斯提到了pyparsing,所以这里是一个针对同样问题的pyparsing方法:

from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR,RPAR,AT = map(Suppress,"()@")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR + 
            (statechange1 | statechange2) +
            AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d,mon,yr = map(int, tokens.date.split('/'))
    h,m,s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)
linefmt.setParseAction(convertFields)

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s  %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields

打印:

Mark Grey/mark.grey@gmail.com  Busy       Available  2010-07-14 16:32:36
Silvia Pablo/spablo@gmail.com  NULL       Available  2010-07-14 16:32:39

pyparsing允许您将名称附加到结果字段(就像Tom Pietzcker的RE样式答案中的命名组一样),以及用于处理或操作已解析操作的分析时操作 - 请注意将单独的日期和时间字段转换为一个真正的日期时间对象,在解析后已经转换并准备好进行处理,没有额外的麻烦。

这是一个修改过的循环,只是转出解析的标记和每行的命名字段:

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()

打印:

['Mark Grey ', 'mark.grey@gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey@gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo@gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo@gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available

我怀疑,当您继续处理此问题时,您会发现输入文本格式的其他变体,指定用户状态的更改方式。 在这种情况下,你只需添加一个定义,就像statechange1statechange2 ,并将其插入到linefmt与他人。 我觉得pyparsing的解析器定义结构可以帮助开发人员在事情发生变化后回到解析器,并轻松扩展他们的解析程序。

好吧,如果我要解决这个问题,我可能会先将每个条目拆分成自己的单独字符串。 这看起来可能是面向行的,所以inputfile.split('\\n')可能就足够了。 从那里我可能会创建一个正则表达式来匹配每个可能的状态更改,子组包含每个重要的字段。

非常感谢你的所有评论。 它们非常有用。 我使用目录功能编写了代码。 它的作用是读取文件,并为每个用户创建一个输出文件,并提供所有状态更新。 这是下面粘贴的代码。

#Script to extract info from individual data files and print out a data file combining info from these files

import os
import commands

dataFileDir="data/";

#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};

#User id  to user name mapping to check for duplicate names
usrId2Name={};

#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):

    userName="";
    for i in range(mailInd-1,0,-1):

        if info[i].endswith("-") or info[i].endswith("+"):
            break;

        userName=info[i]+" "+userName;

    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");

    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];

    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):

    print line;
    info=line.split(" ");

    mailInd=-1;userId="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd="-1";
    statusMsg="-NONE-";

    #Find indices of email id and "@" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")):
            mailInd=i;
        if(info[i]=="@"):
            timeStartInd=i;

        if(info[i]=="became"):
            becameInd=i;

    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""

    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);

    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);

    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";

    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;

    #Store status messages keyed by user email ids
    timeDict={};

    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];

    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId=="-NONE-"):

        if not(userId in infoDict.keys()):
            infoDict[userId]={};

        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;


#Print for each user a file containing status
def printStatusFiles(dataFileDir):


    volNum=0;

    for userName in usrName2Id:
        volNum=volNum+1;

        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");

        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");

        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");


#Read and store data from individual data files
def readDataFiles(dataFileDir):

    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
    #for i in range(0,1):

        file=files[i];

        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue

        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();

        #Process lines
        for line in lines:

            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();

            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;


            extractLineInfo(line);


print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM