Parse multiple xml with the same input file with python

Question

I have the input file as below going upto 100k Records in a SINGLE file

<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>

I have used list comprehension with Xpath as my logic to parse the value

def parsexml():
 net=[]
 tree = ET.parse('pain1.xml')
 root = tree.getroot()



 grp1x = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/MsgId')]
 grp1y = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
 grp1 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/Nm')]
 grp2 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
 grp3 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/NbOfTxs')]
 grp4 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CtrlSum')]
 grp5 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/StrtNm')]
 grp6 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/BldgNb')]
 grp7 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/PstCd')]
 grp8 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/TwnNm')]
 grp9 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/Ctry')]
 grp10 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtInfId')]
 grp11 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtMtd')]
 grp12 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/BtchBookg')]
 grp13 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/ReqdExctnDt')]
 grp14 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/Nm')]
 grp15 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/StrtNm')]
 grp16 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/BldgNb')]
 grp17 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/PstCd')]
 grp18 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/TwnNm')]
 grp19 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/Ctry')]
 grp20 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAcct/Id/Othr/Id')]
 grp21 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAgt/FinInstnId/BICFI')]
 grp22 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/InstrId')]
 grp23 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/EndToEndId')]
 grp24 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt')]
 grp25= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt[@Ccy="JPY"]')]
 grp26 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/ChrgBr')]
 grp27 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAgt/FinInstnId/BICFI')]
 grp28 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/Nm')]
 grp29 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[1]')]
 grp30 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[2]')]
 grp31 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[3]')]
 grp32 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[4]')]
 grp33 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAcct/Id/Othr/Id')]
 grp34 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Purp/Cd')]
 grp35 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Tp/CdOrPrtry/Cd')]
 grp36 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Nb')]
 grp37= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/RltdDt')]


  net = ",".join(grp1x+grp1y+grp1 + grp2 + grp3 + grp4 +grp5+grp6+grp7+grp8+grp9+grp10+grp11+grp12+grp13+grp14+grp15+grp16+grp17+grp18+grp19+grp20+grp21+grp22+grp23+grp24+grp25+grp26+grp27+grp28+grp29+grp30+grp31+grp32+grp33+grp34+grp35+grp36+grp37)
 return net

I am getting error below

Traceback (most recent call last):
  File "C:\Python27\parsefunc.py", line 10, in <module>
    tree = ET.parse('pain1.xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
    parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: junk after document element: line 2, column 0

The output which I need is after parsing is shown below

ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08

Is there a better approach than List Comprehension with element tree or how can I parse and get the output in the above manner to parse the other xml in the same file

Update

I was able to parse and produce in a single line with a new approach suggested by Parfait,but still am getting the same error when I tried to implement the solution below for more than one xml

import sys import lxml.etree as ET

net = []

tree = ET.parse('pain001.xml')
root = tree.getroot()

line= tree.xpath('//text()')

line = map(lambda line: line.strip(), line)
net = filter(bool, line)
#str_list = filter(None, str_list)
#net = root.xpath('//*') 
net = ",".join(net)

Answer 1

This is not a good approach. If your file is too big you will blow up your process memory. If your file has always the same structure, you can directly treat line by line and make the output. You can also directly construct your output for a line instead of making a list.

Answer 2

Consider XPath expression of all children in document which returns a list of element tags and text:

net = tree.xpath('//*')

However, to iterate through each repeating subroot <pain001> and migrate to a csv format of rows and columns, consider iteration of each node occurrence of subroot and extract corresponding tags and text.

import os, sys
import csv
import lxml.etree as ET

# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))

# ITERATE THROUGH ALL XML FILES
for item in os.listdir(cd):
    if item.endswith(".xml"):
        tree = ET.parse(os.path.join(cd,item))

        subroot = tree.xpath("//CstmrCdtTrfInitn")

        with open(os.path.join(cd,'MultipleXPaths.csv'), 'ab') as m:
            writer = csv.writer(m)    

            for i in range(1,len(subroot)+1):        
                nodes = tree.xpath('//CstmrCdtTrfInitn[{0}]//*'.format(i))

                cols = []
                rows = []
                for elem in nodes:
                    cols.append(elem.tag)
                    rows.append(elem.text.replace('\n','').strip())

                if i == 1:
                    print ', '.join(cols)+"\n"
                    writer.writerow(cols)    

                print ', '.join(rows)+"\n"
                writer.writerow(rows)

CONSOLE PRINT OUTPUT (but cols and rows in csv file)

GrpHdr, MsgId, CreDtTm, NbOfTxs, CtrlSum, InitgPty, Nm, PstlAdr, StrtNm, 
BldgNb, PstCd, TwnNm, Ctry, PmtInf, PmtInfId, PmtMtd, BtchBookg, 
ReqdExctnDt, Dbtr, Nm, PstlAdr, StrtNm, BldgNb, PstCd, TwnNm, Ctry, 
DbtrAcct, Id, Othr, Id, DbtrAgt, FinInstnId, BICFI, CdtTrfTxInf, PmtId, 
InstrId, EndToEndId, Amt, InstdAmt, ChrgBr, CdtrAgt, FinInstnId, BICFI, 
Cdtr, Nm, PstlAdr, AdrLine, AdrLine, AdrLine, AdrLine, CdtrAcct, Id, 
Othr, Id, Purp, Cd, RmtInf, Strd, RfrdDocInf, Tp, CdOrPrtry, Cd, Nb, RltdDt

, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC 
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086, 
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New 
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01, 
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn 
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , , 
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08

, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC 
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086, 
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New 
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01, 
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn 
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,    
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08

Answer 3

ET.parse('pain001.xml') fails because the file isn't really an xml file. But it does have an xml document per line, which is good because that means you don't have to load the entire document into memory to process it.

You could just continue what you are doing, but put it in a for xmltext in open('somefile'): loop but you can also reduce the total amount of work while you are at it. I'm kinda dope slapping myself because I wrote this in lxml while you use ElementTree but you could either switch over or modify the script. The idea is to write out XPath selectors for each field in a list and then use that list to pull data for each line. Sure beats typing each one out.

import lxml.etree
import csv

# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm') # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]

# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open('pain.xml') as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip(): # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception, e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}".format(index, str(e)))
                raise

Parse multiple xml with the same input file with python

Question

3 answers

solution1
0 2015-11-26 15:50:47

solution2
0 2015-11-26 17:34:11

solution3
0 ACCPTED 2015-11-28 19:05:55

Parse multiple xml with the same input file with python

Question

3 answers

solution1 0 2015-11-26 15:50:47

solution2 0 2015-11-26 17:34:11

solution3 0 ACCPTED 2015-11-28 19:05:55

solution1
0 2015-11-26 15:50:47

solution2
0 2015-11-26 17:34:11

solution3
0 ACCPTED 2015-11-28 19:05:55