Parse multiple XML documents in the same input file with Python
My input file looks like the following, with up to 100k records in a single file:
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
I have used list comprehensions and XPath as the logic for parsing the values:
import xml.etree.ElementTree as ET
def parsexml():
    net = []
    tree = ET.parse('pain1.xml')
    root = tree.getroot()
    grp1x = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/MsgId')]
    grp1y = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
    grp1 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/Nm')]
    grp2 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
    grp3 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/NbOfTxs')]
    grp4 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CtrlSum')]
    grp5 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/StrtNm')]
    grp6 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/BldgNb')]
    grp7 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/PstCd')]
    grp8 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/TwnNm')]
    grp9 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/Ctry')]
    grp10 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtInfId')]
    grp11 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtMtd')]
    grp12 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/BtchBookg')]
    grp13 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/ReqdExctnDt')]
    grp14 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/Nm')]
    grp15 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/StrtNm')]
    grp16 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/BldgNb')]
    grp17 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/PstCd')]
    grp18 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/TwnNm')]
    grp19 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/Ctry')]
    grp20 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAcct/Id/Othr/Id')]
    grp21 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAgt/FinInstnId/BICFI')]
    grp22 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/InstrId')]
    grp23 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/EndToEndId')]
    grp24 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt')]
    grp25 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt[@Ccy="JPY"]')]
    grp26 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/ChrgBr')]
    grp27 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAgt/FinInstnId/BICFI')]
    grp28 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/Nm')]
    grp29 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[1]')]
    grp30 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[2]')]
    grp31 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[3]')]
    grp32 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[4]')]
    grp33 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAcct/Id/Othr/Id')]
    grp34 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Purp/Cd')]
    grp35 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Tp/CdOrPrtry/Cd')]
    grp36 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Nb')]
    grp37 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/RltdDt')]
    net = ",".join(grp1x+grp1y+grp1 + grp2 + grp3 + grp4 +grp5+grp6+grp7+grp8+grp9+grp10+grp11+grp12+grp13+grp14+grp15+grp16+grp17+grp18+grp19+grp20+grp21+grp22+grp23+grp24+grp25+grp26+grp27+grp28+grp29+grp30+grp31+grp32+grp33+grp34+grp35+grp36+grp37)
    return net
I get the error below:
Traceback (most recent call last):
  File "C:\Python27\parsefunc.py", line 10, in <module>
    tree = ET.parse('pain1.xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
    parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: junk after document element: line 2, column 0
The output I need after parsing looks like this:
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
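Is there a better approach than list comprehensions with ElementTree, or how can I parse the other XML documents in the same file and get output in the format above?
Update
With the new approach suggested by Parfait I can parse and generate the output for a single line, but when I try the following solution for multiple XML documents I still hit the same error: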
import sys
import lxml.etree as ET
net = []
tree = ET.parse('pain001.xml')
root = tree.getroot()
line= tree.xpath('//text()')
line = map(lambda line: line.strip(), line)
net = filter(bool, line)
#str_list = filter(None, str_list)
#net = root.xpath('//*')
net = ",".join(net)
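This is not a good approach. If the file is too large, it will consume the process's memory. If the file always has the same structure, you can process it line by line and write the output directly. You can also build the output for each line directly instead of creating lists.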
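A minimal sketch of that line-by-line idea, assuming every line of the input holds one complete <pain001> document, as in the question's sample. It uses only the standard library; the function names line_to_row and convert are hypothetical, chosen for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def line_to_row(xml_line):
    """Parse one single-line XML document and return its non-empty text values."""
    root = ET.fromstring(xml_line)
    # itertext() walks the tree in document order and yields text and tail strings
    return [t.strip() for t in root.itertext() if t.strip()]


def convert(lines, out):
    """Write one CSV row per non-blank input line; only one record is held at a time."""
    writer = csv.writer(out)
    count = 0
    for line in lines:
        if not line.strip():  # skip blank lines between records
            continue
        writer.writerow(line_to_row(line))
        count += 1
    return count


if __name__ == "__main__":
    sample = [
        "<pain001><GrpHdr><MsgId>ABC/1</MsgId> <NbOfTxs>2</NbOfTxs></GrpHdr></pain001>\n",
        "\n",
        "<pain001><GrpHdr><MsgId>ABC/2</MsgId> <NbOfTxs>5</NbOfTxs></GrpHdr></pain001>\n",
    ]
    buf = io.StringIO()
    convert(sample, buf)
    print(buf.getvalue())
```

With a real file, an open file object can be passed as lines, and a file opened with open('out.csv', 'w', newline='') as out.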
Consider an XPath expression for all descendants in the document, which returns a list of elements whose tags and text can be read:
net = tree.xpath('//*')
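However, to iterate over each repeated sub-root <pain001> and migrate to a CSV format of rows and columns, consider iterating over each occurrence of the sub-root node and extracting the corresponding tags and text.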
import os, sys
import csv
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# ITERATE THROUGH ALL XML FILES
for item in os.listdir(cd):
    if item.endswith(".xml"):
        tree = ET.parse(os.path.join(cd, item))
        subroot = tree.xpath("//CstmrCdtTrfInitn")
        # Python 3: append in text mode with newline='' for the csv module
        with open(os.path.join(cd, 'MultipleXPaths.csv'), 'a', newline='') as m:
            writer = csv.writer(m)
            # ITERATE OVER EACH MATCHED SUB-ROOT ELEMENT DIRECTLY
            for i, sub in enumerate(subroot, start=1):
                nodes = sub.xpath('.//*')
                cols = []
                rows = []
                for elem in nodes:
                    cols.append(elem.tag)
                    # elem.text is None for elements that carry no text
                    rows.append((elem.text or '').replace('\n', '').strip())
                if i == 1:
                    print(', '.join(cols) + "\n")
                    writer.writerow(cols)
                print(', '.join(rows) + "\n")
                writer.writerow(rows)
Console print output (in the CSV file these appear as columns and rows):
GrpHdr, MsgId, CreDtTm, NbOfTxs, CtrlSum, InitgPty, Nm, PstlAdr, StrtNm,
BldgNb, PstCd, TwnNm, Ctry, PmtInf, PmtInfId, PmtMtd, BtchBookg,
ReqdExctnDt, Dbtr, Nm, PstlAdr, StrtNm, BldgNb, PstCd, TwnNm, Ctry,
DbtrAcct, Id, Othr, Id, DbtrAgt, FinInstnId, BICFI, CdtTrfTxInf, PmtId,
InstrId, EndToEndId, Amt, InstdAmt, ChrgBr, CdtrAgt, FinInstnId, BICFI,
Cdtr, Nm, PstlAdr, AdrLine, AdrLine, AdrLine, AdrLine, CdtrAcct, Id,
Othr, Id, Purp, Cd, RmtInf, Strd, RfrdDocInf, Tp, CdOrPrtry, Cd, Nb, RltdDt
, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086,
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01,
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08
, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086,
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01,
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08
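The blank cells in the output above come from container elements such as GrpHdr or PstlAdr, which have children but no text of their own. One possible refinement, sketched here with the standard library and a hypothetical helper named leaf_columns, keeps only leaf elements so that the header and each row line up without empty columns:

```python
import xml.etree.ElementTree as ET


def leaf_columns(record):
    """Return (tags, values) for the leaf elements under record, in document order."""
    tags, values = [], []
    for elem in record.iter():
        if len(elem) == 0:  # no child elements, so this is a leaf
            tags.append(elem.tag)
            values.append((elem.text or "").strip())
    return tags, values


record = ET.fromstring(
    "<CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId>"
    "<InitgPty><Nm>ABC Corporation</Nm></InitgPty></GrpHdr></CstmrCdtTrfInitn>"
)
tags, values = leaf_columns(record)
print(tags)
print(values)
```

Repeated tags such as Nm or Id still appear more than once in the header, so a real export may want to qualify them with their parent path; that choice is left open here.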
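ET.parse('pain001.xml') fails because the file is not really an XML file: it contains more than one root element. It does, however, have one XML document per line, which is good, because it means you don't have to load the whole file into memory to process it.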
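The failure can be reproduced in miniature. The sketch below uses the standard library's ElementTree and made-up two-line data; the helper name parses_as_one_document is hypothetical. Parsing the whole text as one document raises ParseError, while parsing each line separately succeeds.

```python
import xml.etree.ElementTree as ET

# Two complete documents, one per line, mimicking the layout of the question's file
data = "<pain001><MsgId>A</MsgId></pain001>\n<pain001><MsgId>B</MsgId></pain001>\n"


def parses_as_one_document(text):
    """Return True if text is a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError as err:
        print("whole text rejected:", err)
        return False


whole_ok = parses_as_one_document(data)

# Line by line, each string has exactly one root element and parses cleanly
ids = [ET.fromstring(line).findtext("MsgId") for line in data.splitlines() if line.strip()]
print(whole_ok, ids)
```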
You can keep doing what you are doing, but put it inside a for xmltext in open('somefile'): loop.
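But you can also reduce the work a bit. I got a little carried away and wrote this with lxml while you are using ElementTree, but you can switch libraries or adapt the script. The idea is to write out the XPath selector for each field in a list and then use that list to extract the data from each line. That certainly beats typing each one out.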
import sys
import csv
import lxml.etree
# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm') # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
# open result csv file (newline='' is what the csv module expects on Python 3)
with open('pain.csv', 'w', newline='') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open('pain.xml') as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip(): # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception as e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}\n".format(index, str(e)))
                raise
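Since the answer mentions that the script can be switched back to ElementTree, here is a hedged sketch of the same selector-list idea using only the standard library. findtext with a default returns an empty string when a field is missing, whereas indexing an empty XPath result with [0] would raise IndexError. The SELECTORS tuple is shortened for illustration and extract_row is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Paths relative to <CstmrCdtTrfInitn>; a shortened list for illustration only
SELECTORS = (
    "GrpHdr/MsgId",
    "GrpHdr/CreDtTm",
    "PmtInf/CdtTrfTxInf/Amt/InstdAmt",
    "PmtInf/CdtTrfTxInf/Purp/Cd",
)


def extract_row(xml_line, selectors=SELECTORS):
    """Return one value per selector for a single-line <pain001> document."""
    pain001 = ET.fromstring(xml_line)
    elem = pain001.find("CstmrCdtTrfInitn")
    if elem is None:
        raise ValueError("no CstmrCdtTrfInitn element in record")
    # findtext returns the default ('') when the path matches nothing
    return [(elem.findtext(s, default="") or "").strip() for s in selectors]


sample = (
    "<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId>"
    "<CreDtTm>2012-09-28T14:07:00</CreDtTm></GrpHdr><PmtInf><CdtTrfTxInf>"
    '<Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt></CdtTrfTxInf></PmtInf>'
    "</CstmrCdtTrfInitn></pain001>"
)
row = extract_row(sample)
print(row)
```

The sample record deliberately omits Purp/Cd, so the last column comes back empty rather than aborting the whole run.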