简体   繁体   中英

Parse large XML file in Python

Need some help with this. We have a large XML files from an API that I need to convert into a SQL Server table. Using pyodbc to load the table into SQL is no problem but I'm struggling to parse the 16mb XML file. The file is well formed - it is wrapped inside a single root. I've been looking for a solution for weeks and haven't been able to trial-and-error my way through it. Any assistance is appreciated.

In short, the goal is to output all values contained in the XML file in tabular format where each record is listed in its' own row. The Address and Space sections of the XML require proper handling for the sub elements. I've got code that works but I don't think this is the "proper" way to parse the values:

import xml.etree.ElementTree as ET
tree = ET.parse('Z:\\FileLoc\\DealData.xml')
root = tree.getroot()
    for prop in deals[0]:
    for addr1 in prop[4].iter(tag="Street"):
        for addr2 in prop[4].iter(tag="City"):
            for addr3 in prop[4].iter(tag="State"):
                for addr4 in prop[4].iter(tag="PostalCode"):
                    for addr5 in prop[4].iter(tag="Country"):
                        for spaces in prop[5]:
                                print(deals.attrib.get("DealID"), prop.attrib.get("ID"),
                                   prop[0].text[0:5], prop[0].text[6:],
                                   addr1.text, addr2.text, addr3.text, addr4.text, addr5.text,
                                   spaces.attrib.get("ID"))

Here's one example of the XML I'm working with:

    <?xml version="1.0" encoding="UTF-8"?>
<DealExport>
    <jobid>8558e207709c0730</jobid>
    <Deals>
        <Deal DealID="706880">
            <Properties>
                <Property ID="69320">
                    <Name>41381_BuildingName - Bldg 1</Name>
                    <CreatedAt>2015-10-28T04:09:07Z</CreatedAt>
                    <UpdatedAt>2017-12-20T01:09:51Z</UpdatedAt>
                    <Symbols>
                        <Type>SourceID</Type>
                        <Value>41381~41381</Value>
                    </Symbols>
                    <Address>
                        <City>Calgary</City>
                        <State>AB</State>
                        <Street>1234 Five Street</Street>
                        <PostalCode>H0H 0H0</PostalCode>
                        <Country>Canada</Country>
                    </Address>
                    <Spaces>
                        <Space ID="848294">
                            <Suite>800</Suite>
                            <SpaceAvailable>11225</SpaceAvailable>
                            <FloorName>8</FloorName>
                            <FloorPosition>7</FloorPosition>
                            <FloorRentableArea/>
                            <CreatedAt>2016-04-20T13:47:13Z</CreatedAt>
                            <UpdatedAt>2017-10-02T09:17:20Z</UpdatedAt>
                            <Symbols>
                                <Type>SourceID</Type>
                                <Value>800</Value>
                            </Symbols>
                        </Space>
                    </Spaces>
                </Property>
            </Properties>
            <DealType>new</DealType>
            <LastModified>2017-12-20T22:36:21Z</LastModified>
            <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
            <Stage>dead_deal</Stage>
            <Probability>0</Probability>
            <CompetitiveSet/>
            <MoveInDate>2017-12-01</MoveInDate>
            <Tenant>
                <Name>Current Occupant Ltd.</Name>
                <Industry>financial services</Industry>
                <Affiliation/>
                <TenantID>1013325</TenantID>
                <TenantWebsite/>
                <TenantCurrentLocation/>
                <TenantRSFFrom>6000</TenantRSFFrom>
                <TenantRSFTo>7000</TenantRSFTo>
            </Tenant>
            <Brokers></Brokers>
            <TenantContacts></TenantContacts>
            <DealStages>
                <DealStage ID="1533265">
                    <Stage>proposal</Stage>
                    <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
                    <LastModified>2017-06-06T15:09:22Z</LastModified>
                    <StartDate>2017-06-06T15:09:22Z</StartDate>
                    <DurationInDays>197</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>17699</UserId>
                        <FirstName>Person</FirstName>
                        <LastName>Last-Name</LastName>
                        <Email>employee@email.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174388">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:01Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:01Z</LastModified>
                    <StartDate>2017-12-20T22:36:01Z</StartDate>
                    <DurationInDays>0</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174390">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:21Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:21Z</LastModified>
                    <StartDate>2017-12-20T22:36:21Z</StartDate>
                    <DurationInDays>99</DurationInDays>
                    <DeadDealReasons>
                        <DeadDealReason>price</DeadDealReason>
                    </DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
            </DealStages>
            <DealComments></DealComments>
            <DealTerms>
                <DealTerm ID="4278580">
                    <ProposalType>landlord</ProposalType>
                    <DiscountRate>0.08</DiscountRate>
                    <RentableArea>6200</RentableArea>
                    <LeaseType>triple net</LeaseType>
                    <LeaseTerm>60</LeaseTerm>
                    <CommencementDate>2018-01-01</CommencementDate>
                    <TenantImprovements></TenantImprovements>
                    <BuildingImprovements></BuildingImprovements>
                    <FreeRents></FreeRents>
                    <CreatedDate>2017-06-06T15:16:23Z</CreatedDate>
                    <SecurityDeposit/>
                    <NER>20714.1144670763</NER>
                    <NEROverride/>
                    <DateEntered>2017-06-06</DateEntered>
                    <OpEx>
                        <BaseOpEx/>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </OpEx>
                    <RealEstateTaxes>
                        <BaseRealEstateTaxes>7.45</BaseRealEstateTaxes>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </RealEstateTaxes>
                    <NPVperSqFt>103367.15936951</NPVperSqFt>
                    <TotalNPV>640876388.09096</TotalNPV>
                    <MiscDescription/>
                    <NetCashFlow>642137720.0</NetCashFlow>
                    <TotalIncome>502200.0</TotalIncome>
                    <Concessions>64480.0</Concessions>
                    <Payback>1</Payback>
                    <IRR/>
                    <Latest>true</Latest>
                    <BaseRents>
                        <BaseRent ID="45937180">
                            <BeginIn>1</BeginIn>
                            <Rent>15.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>36</Duration>
                        </BaseRent>
                        <BaseRent ID="45937181">
                            <BeginIn>37</BeginIn>
                            <Rent>18.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>24</Duration>
                        </BaseRent>
                    </BaseRents>
                    <RentEscalations></RentEscalations>
                    <OtherCredits></OtherCredits>
                    <Commissions>
                        <Commission ID="260593">
                            <RepType>landlord</RepType>
                            <Firm>LandLord</Firm>
                            <Amount>2.9</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                        <Commission ID="260594">
                            <RepType>landlord</RepType>
                            <Firm>Payee</Firm>
                            <Amount>7.5</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                    </Commissions>
                    <Rights></Rights>
                    <ProposalID>186607</ProposalID>
                    <ProposalEnteredDate>2017-06-06</ProposalEnteredDate>
                    <RecoveryIncomes>
                        <RecoveryIncome>
                            <RecoveryType>ReimbursableExpenseCam</RecoveryType>
                            <ExpenseAmount>12.87</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>12.87</RecoveryAmount>
                        </RecoveryIncome>
                        <RecoveryIncome>
                            <RecoveryType>RealEstateTax</RecoveryType>
                            <ExpenseAmount>7.45</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>7.45</RecoveryAmount>
                        </RecoveryIncome>
                    </RecoveryIncomes>
                </DealTerm>
            </DealTerms>
            <Budgets></Budgets>
            <Appraisals></Appraisals>
            <Tours></Tours>
        </Deal>

For the best of my knowledge, there is no problem working with large files in XML. You can use xml.etree.ElementTree for your data. I tested the following code and it works (parsing and printing):

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

def deepParsing(node, pre=None):
    if pre is None:
        pre = '-'
    print(' ')
    print(pre + '> Tag:',node.tag)
    print(pre + '> Attributes:', node.attrib)
    print(pre + '> Text:', node.text)

    pre += '-'
    for key in node:
        deepParsing(key, pre)
deepParsing(root)

Your data was printed on console according to the given tree.

Your example is missing closing tags:

    </Deals>
</DealExport>

at the end of the xml.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM