繁体   English   中英

使用python使用相同的输入文件解析多个xml

[英]Parse multiple xml with the same input file with python

我的输入文件如下所示,单个文件中的记录多达100k

<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>

我已经将列表理解和Xpath用作解析值的逻辑

def parsexml():
 net=[]
 tree = ET.parse('pain1.xml')
 root = tree.getroot()



 grp1x = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/MsgId')]
 grp1y = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
 grp1 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/Nm')]
 grp2 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
 grp3 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/NbOfTxs')]
 grp4 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CtrlSum')]
 grp5 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/StrtNm')]
 grp6 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/BldgNb')]
 grp7 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/PstCd')]
 grp8 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/TwnNm')]
 grp9 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/Ctry')]
 grp10 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtInfId')]
 grp11 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtMtd')]
 grp12 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/BtchBookg')]
 grp13 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/ReqdExctnDt')]
 grp14 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/Nm')]
 grp15 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/StrtNm')]
 grp16 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/BldgNb')]
 grp17 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/PstCd')]
 grp18 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/TwnNm')]
 grp19 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/Ctry')]
 grp20 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAcct/Id/Othr/Id')]
 grp21 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAgt/FinInstnId/BICFI')]
 grp22 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/InstrId')]
 grp23 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/EndToEndId')]
 grp24 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt')]
 grp25= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt[@Ccy="JPY"]')]
 grp26 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/ChrgBr')]
 grp27 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAgt/FinInstnId/BICFI')]
 grp28 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/Nm')]
 grp29 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[1]')]
 grp30 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[2]')]
 grp31 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[3]')]
 grp32 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[4]')]
 grp33 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAcct/Id/Othr/Id')]
 grp34 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Purp/Cd')]
 grp35 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Tp/CdOrPrtry/Cd')]
 grp36 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Nb')]
 grp37= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/RltdDt')]


  net = ",".join(grp1x+grp1y+grp1 + grp2 + grp3 + grp4 +grp5+grp6+grp7+grp8+grp9+grp10+grp11+grp12+grp13+grp14+grp15+grp16+grp17+grp18+grp19+grp20+grp21+grp22+grp23+grp24+grp25+grp26+grp27+grp28+grp29+grp30+grp31+grp32+grp33+grp34+grp35+grp36+grp37)
 return net 

我在下面出现错误

Traceback (most recent call last):
  File "C:\Python27\parsefunc.py", line 10, in <module>
    tree = ET.parse('pain1.xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
    parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: junk after document element: line 2, column 0

解析后我需要的输出如下所示

ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08

有没有比使用元素树的列表理解更好的方法,或者如何以上述方式解析并获取输出以解析同一文件中的其他xml?

更新资料

我可以使用Parfait建议的新方法在一行中解析和生成文件,但是当我尝试为多个xml实现以下解决方案时,仍然遇到相同的错误

导入sys导入lxml.etree作为ET

net = []

tree = ET.parse('pain001.xml')
root = tree.getroot()

line= tree.xpath('//text()')

line = map(lambda line: line.strip(), line)
net = filter(bool, line)
#str_list = filter(None, str_list)
#net = root.xpath('//*') 
net = ",".join(net)

这不是一个好方法。 如果文件太大,则会消耗进程内存。 如果文件始终具有相同的结构,则可以直接逐行处理并进行输出。 您也可以直接为一行构造输出,而不是创建列表。

考虑文档中所有子项的XPath表达式,该表达式返回元素标签和文本的列表:

net = tree.xpath('//*')

但是,要遍历每个重复的子根<pain001>并迁移到行和列的csv格式,请考虑子根出现的每个节点的迭代并提取相应的标记和文本。

import os, sys
import csv
import lxml.etree as ET

# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))

# ITERATE THROUGH ALL XML FILES
for item in os.listdir(cd):
    if item.endswith(".xml"):
        tree = ET.parse(os.path.join(cd,item))

        subroot = tree.xpath("//CstmrCdtTrfInitn")

        with open(os.path.join(cd,'MultipleXPaths.csv'), 'ab') as m:
            writer = csv.writer(m)    

            for i in range(1,len(subroot)+1):        
                nodes = tree.xpath('//CstmrCdtTrfInitn[{0}]//*'.format(i))

                cols = []
                rows = []
                for elem in nodes:
                    cols.append(elem.tag)
                    rows.append(elem.text.replace('\n','').strip())

                if i == 1:
                    print ', '.join(cols)+"\n"
                    writer.writerow(cols)    

                print ', '.join(rows)+"\n"
                writer.writerow(rows)

控制台打印输出 (但CSV文件中的列和行)

GrpHdr, MsgId, CreDtTm, NbOfTxs, CtrlSum, InitgPty, Nm, PstlAdr, StrtNm, 
BldgNb, PstCd, TwnNm, Ctry, PmtInf, PmtInfId, PmtMtd, BtchBookg, 
ReqdExctnDt, Dbtr, Nm, PstlAdr, StrtNm, BldgNb, PstCd, TwnNm, Ctry, 
DbtrAcct, Id, Othr, Id, DbtrAgt, FinInstnId, BICFI, CdtTrfTxInf, PmtId, 
InstrId, EndToEndId, Amt, InstdAmt, ChrgBr, CdtrAgt, FinInstnId, BICFI, 
Cdtr, Nm, PstlAdr, AdrLine, AdrLine, AdrLine, AdrLine, CdtrAcct, Id, 
Othr, Id, Purp, Cd, RmtInf, Strd, RfrdDocInf, Tp, CdOrPrtry, Cd, Nb, RltdDt

, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC 
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086, 
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New 
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01, 
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn 
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , , 
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08

, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC 
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086, 
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New 
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01, 
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn 
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,    
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08

ET.parse('pain001.xml')失败,因为该文件不是真正的xml文件。 但是它每行确实有一个xml文档,这很好,因为这意味着您不必将整个文档加载到内存中就可以对其进行处理。

您可以继续执行您的操作,但可以将其放在for xmltext in open('somefile'):循环中的for xmltext in open('somefile'):但同时也可以减少工作量。 我有点兴奋,因为我在使用ElementTree时是在lxml编写的,但是您可以切换或修改脚本。 想法是为列表中的每个字段写出XPath选择器,然后使用该列表为每一行提取数据。 确定敲打每个。

import lxml.etree
import csv

# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm') # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]

# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open('pain.xml') as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip(): # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM