Parse multiple XML documents from the same input file with Python

I have an input file like the one below, with up to 100k records in a single file:

<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>

I used list comprehensions with XPath expressions to extract the values:

import xml.etree.ElementTree as ET

def parsexml():
 net=[]
 tree = ET.parse('pain1.xml')
 root = tree.getroot()



 grp1x = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/MsgId')]
 grp1y = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
 grp1 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/Nm')]
 grp2 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
 grp3 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/NbOfTxs')]
 grp4 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CtrlSum')]
 grp5 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/StrtNm')]
 grp6 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/BldgNb')]
 grp7 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/PstCd')]
 grp8 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/TwnNm')]
 grp9 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/Ctry')]
 grp10 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtInfId')]
 grp11 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtMtd')]
 grp12 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/BtchBookg')]
 grp13 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/ReqdExctnDt')]
 grp14 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/Nm')]
 grp15 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/StrtNm')]
 grp16 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/BldgNb')]
 grp17 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/PstCd')]
 grp18 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/TwnNm')]
 grp19 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/Ctry')]
 grp20 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAcct/Id/Othr/Id')]
 grp21 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAgt/FinInstnId/BICFI')]
 grp22 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/InstrId')]
 grp23 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/EndToEndId')]
 grp24 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt')]
 grp25= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt[@Ccy="JPY"]')]
 grp26 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/ChrgBr')]
 grp27 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAgt/FinInstnId/BICFI')]
 grp28 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/Nm')]
 grp29 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[1]')]
 grp30 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[2]')]
 grp31 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[3]')]
 grp32 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[4]')]
 grp33 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAcct/Id/Othr/Id')]
 grp34 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Purp/Cd')]
 grp35 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Tp/CdOrPrtry/Cd')]
 grp36 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Nb')]
 grp37= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/RltdDt')]


 net = ",".join(grp1x+grp1y+grp1 + grp2 + grp3 + grp4 +grp5+grp6+grp7+grp8+grp9+grp10+grp11+grp12+grp13+grp14+grp15+grp16+grp17+grp18+grp19+grp20+grp21+grp22+grp23+grp24+grp25+grp26+grp27+grp28+grp29+grp30+grp31+grp32+grp33+grp34+grp35+grp36+grp37)
 return net

I am getting the error below:

Traceback (most recent call last):
  File "C:\Python27\parsefunc.py", line 10, in <module>
    tree = ET.parse('pain1.xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
    parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: junk after document element: line 2, column 0

The output I need after parsing is shown below:

ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08

Is there a better approach than list comprehensions with ElementTree? How can I parse the file so that every XML document in it produces a line of output like the above?

Update

Using the new approach suggested by Parfait, I was able to parse a document and produce its output as a single line, but I still get the same error when I try the solution below on a file containing more than one XML document:

import sys
import lxml.etree as ET

net = []

tree = ET.parse('pain001.xml')
root = tree.getroot()

line= tree.xpath('//text()')

line = map(lambda line: line.strip(), line)
net = filter(bool, line)
#str_list = filter(None, str_list)
#net = root.xpath('//*') 
net = ",".join(net)

This is not a good approach: if the file is too big, you will blow up your process memory. If the file always has the same structure, you can process it line by line and write the output as you go. You can also build the output for each line directly instead of collecting everything in a list.
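A minimal sketch of that line-by-line idea, assuming every non-empty line holds one complete pain001 document as in the question. It uses the standard library's xml.etree.ElementTree under Python 3; the names FIELDS and iter_rows are illustrative rather than part of any library, and the field list is deliberately shortened.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Paths are relative to each <pain001> root; extend the tuple as needed.
FIELDS = (
    'CstmrCdtTrfInitn/GrpHdr/MsgId',
    'CstmrCdtTrfInitn/GrpHdr/CreDtTm',
    'CstmrCdtTrfInitn/GrpHdr/InitgPty/Nm',
    'CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt',
)

def iter_rows(lines, fields=FIELDS):
    """Yield one list of field values per non-empty line of XML."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        root = ET.fromstring(line)  # each line is its own document
        # findtext falls back to the default when a path is missing
        yield [root.findtext(path, default='').strip() for path in fields]

# Two invented records standing in for the real input file.
sample = (
    '<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/1</MsgId>'
    '<CreDtTm>2012-09-28T14:07:00</CreDtTm><InitgPty><Nm>ABC Corporation</Nm>'
    '</InitgPty></GrpHdr><PmtInf><CdtTrfTxInf><Amt><InstdAmt Ccy="JPY">100'
    '</InstdAmt></Amt></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>\n'
) * 2

out = io.StringIO()
csv.writer(out).writerows(iter_rows(io.StringIO(sample)))
print(out.getvalue())
```

With a real file, an open file object can be passed to iter_rows in place of the StringIO, so only one record is held in memory at a time.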

Consider an XPath expression that selects all descendant elements in the document; it returns a list of elements from which you can read the tags and text:

net = tree.xpath('//*')

However, to iterate through each repeating subroot <pain001> and migrate to a CSV format of rows and columns, iterate over each occurrence of the subroot node and extract the corresponding tags and text.

import os, sys
import csv
import lxml.etree as ET

# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))

# ITERATE THROUGH ALL XML FILES
for item in os.listdir(cd):
    if item.endswith(".xml"):
        tree = ET.parse(os.path.join(cd,item))

        subroot = tree.xpath("//CstmrCdtTrfInitn")

        with open(os.path.join(cd,'MultipleXPaths.csv'), 'ab') as m:
            writer = csv.writer(m)    

            for i in range(1,len(subroot)+1):        
                # parenthesize so [i] indexes the i-th match in the whole document
                nodes = tree.xpath('(//CstmrCdtTrfInitn)[{0}]//*'.format(i))

                cols = []
                rows = []
                for elem in nodes:
                    cols.append(elem.tag)
                    # elem.text is None for container elements without text
                    rows.append((elem.text or '').replace('\n','').strip())

                if i == 1:
                    print ', '.join(cols)+"\n"
                    writer.writerow(cols)    

                print ', '.join(rows)+"\n"
                writer.writerow(rows)

CONSOLE PRINT OUTPUT (but cols and rows in csv file)

GrpHdr, MsgId, CreDtTm, NbOfTxs, CtrlSum, InitgPty, Nm, PstlAdr, StrtNm, 
BldgNb, PstCd, TwnNm, Ctry, PmtInf, PmtInfId, PmtMtd, BtchBookg, 
ReqdExctnDt, Dbtr, Nm, PstlAdr, StrtNm, BldgNb, PstCd, TwnNm, Ctry, 
DbtrAcct, Id, Othr, Id, DbtrAgt, FinInstnId, BICFI, CdtTrfTxInf, PmtId, 
InstrId, EndToEndId, Amt, InstdAmt, ChrgBr, CdtrAgt, FinInstnId, BICFI, 
Cdtr, Nm, PstlAdr, AdrLine, AdrLine, AdrLine, AdrLine, CdtrAcct, Id, 
Othr, Id, Purp, Cd, RmtInf, Strd, RfrdDocInf, Tp, CdOrPrtry, Cd, Nb, RltdDt

, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC 
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086, 
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New 
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01, 
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn 
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , , 
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08

, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC 
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086, 
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New 
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01, 
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn 
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,    
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08
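
The empty cells in the output above come from container elements such as GrpHdr or PstlAdr, which have child elements but no text of their own. One possible refinement, sketched here with the standard library's xml.etree.ElementTree under Python 3 rather than the lxml code above, keeps only leaf elements. The helper name leaf_items is made up for illustration, and the sample record is a shortened, invented one.

```python
import xml.etree.ElementTree as ET

def leaf_items(record):
    """Return (tag, text) pairs for elements that have no child elements."""
    return [
        (elem.tag, (elem.text or '').strip())
        for elem in record.iter()
        if len(elem) == 0  # len() of an element is its number of children
    ]

record = ET.fromstring(
    '<CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId>'
    '<InitgPty><Nm>ABC Corporation</Nm></InitgPty></GrpHdr></CstmrCdtTrfInitn>'
)
cols, row = zip(*leaf_items(record))
print(', '.join(cols))
print(', '.join(row))
```

This keeps one column per value-carrying element, at the cost of dropping the container names that show where each value sits in the tree.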

ET.parse('pain001.xml') fails because the file isn't really an XML file: an XML document can have only one root element, and this file has one per line. But it does have an XML document per line, which is good because that means you don't have to load the entire file into memory to process it.
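
The error can be reproduced with two tiny documents. This is a sketch using the standard library parser under Python 3, and the sample strings are invented for illustration. Parsing both lines as one document fails at the second root element, while parsing each line separately succeeds.

```python
import xml.etree.ElementTree as ET

two_docs = (
    '<pain001><MsgId>A</MsgId></pain001>\n'
    '<pain001><MsgId>B</MsgId></pain001>\n'
)

# Parsing the whole text at once fails: a second root element is not allowed.
try:
    ET.fromstring(two_docs)
    error = None
except ET.ParseError as exc:
    error = str(exc)
print(error)

# Parsing line by line works, because each line is a complete document.
msg_ids = [
    ET.fromstring(line).findtext('MsgId')
    for line in two_docs.splitlines()
    if line.strip()
]
print(msg_ids)
```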

You could just continue what you are doing, but put it in a for xmltext in open('somefile'): loop, and you can also reduce the total amount of work while you are at it. I'm kinda dope slapping myself because I wrote this in lxml while you use ElementTree, but you could either switch over or modify the script. The idea is to write out XPath selectors for each field in a list and then use that list to pull data for each line. Sure beats typing each one out.

import sys
import lxml.etree
import csv

# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm') # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]

# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open('pain.xml') as painxml:
        # process each record
        for index, line in enumerate(painxml, 1): # 1-based line numbers
            if not line.strip(): # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception as e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}\n".format(index, str(e)))
                raise
