简体   繁体   English

用 Python 解析大型 XML 文件

[英]Parse large XML file in Python

Need some help with this.需要一些帮助。 We have a large XML files from an API that I need to convert into a SQL Server table.我们有一个来自 API 的大型 XML 文件,我需要将其转换为 SQL Server 表。 Using pyodbc to load the table into SQL is no problem but I'm struggling to parse the 16mb XML file.使用 pyodbc 将表加载到 SQL 中没有问题,但我正在努力解析 16mb XML 文件。 The file is well formed - it is wrapped inside a single root.该文件格式良好 - 它包裹在单个根中。 I've been looking for a solution for weeks and haven't been able to trial-and-error my way through it.数周以来,我一直在寻找解决方案,但一直无法反复试验。 Any assistance is appreciated.任何帮助表示赞赏。

In short, the goal is to output all values contained in the XML file in tabular format where each record is listed in its' own row.简而言之,目标是以表格格式输出 XML 文件中包含的所有值,其中每条记录都列在其自己的行中。 The Address and Space sections of the XML require proper handling for the sub elements. XML 的地址和空间部分需要对子元素进行适当处理。 I've got code that works but I don't think this is the "proper" way to parse the values:我有有效的代码,但我认为这不是解析值的“正确”方法:

import xml.etree.ElementTree as ET
tree = ET.parse('Z:\\FileLoc\\DealData.xml')
root = tree.getroot()
    for prop in deals[0]:
    for addr1 in prop[4].iter(tag="Street"):
        for addr2 in prop[4].iter(tag="City"):
            for addr3 in prop[4].iter(tag="State"):
                for addr4 in prop[4].iter(tag="PostalCode"):
                    for addr5 in prop[4].iter(tag="Country"):
                        for spaces in prop[5]:
                                print(deals.attrib.get("DealID"), prop.attrib.get("ID"),
                                   prop[0].text[0:5], prop[0].text[6:],
                                   addr1.text, addr2.text, addr3.text, addr4.text, addr5.text,
                                   spaces.attrib.get("ID"))

Here's one example of the XML I'm working with:这是我正在使用的 XML 的一个示例:

    <?xml version="1.0" encoding="UTF-8"?>
<DealExport>
    <jobid>8558e207709c0730</jobid>
    <Deals>
        <Deal DealID="706880">
            <Properties>
                <Property ID="69320">
                    <Name>41381_BuildingName - Bldg 1</Name>
                    <CreatedAt>2015-10-28T04:09:07Z</CreatedAt>
                    <UpdatedAt>2017-12-20T01:09:51Z</UpdatedAt>
                    <Symbols>
                        <Type>SourceID</Type>
                        <Value>41381~41381</Value>
                    </Symbols>
                    <Address>
                        <City>Calgary</City>
                        <State>AB</State>
                        <Street>1234 Five Street</Street>
                        <PostalCode>H0H 0H0</PostalCode>
                        <Country>Canada</Country>
                    </Address>
                    <Spaces>
                        <Space ID="848294">
                            <Suite>800</Suite>
                            <SpaceAvailable>11225</SpaceAvailable>
                            <FloorName>8</FloorName>
                            <FloorPosition>7</FloorPosition>
                            <FloorRentableArea/>
                            <CreatedAt>2016-04-20T13:47:13Z</CreatedAt>
                            <UpdatedAt>2017-10-02T09:17:20Z</UpdatedAt>
                            <Symbols>
                                <Type>SourceID</Type>
                                <Value>800</Value>
                            </Symbols>
                        </Space>
                    </Spaces>
                </Property>
            </Properties>
            <DealType>new</DealType>
            <LastModified>2017-12-20T22:36:21Z</LastModified>
            <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
            <Stage>dead_deal</Stage>
            <Probability>0</Probability>
            <CompetitiveSet/>
            <MoveInDate>2017-12-01</MoveInDate>
            <Tenant>
                <Name>Current Occupant Ltd.</Name>
                <Industry>financial services</Industry>
                <Affiliation/>
                <TenantID>1013325</TenantID>
                <TenantWebsite/>
                <TenantCurrentLocation/>
                <TenantRSFFrom>6000</TenantRSFFrom>
                <TenantRSFTo>7000</TenantRSFTo>
            </Tenant>
            <Brokers></Brokers>
            <TenantContacts></TenantContacts>
            <DealStages>
                <DealStage ID="1533265">
                    <Stage>proposal</Stage>
                    <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
                    <LastModified>2017-06-06T15:09:22Z</LastModified>
                    <StartDate>2017-06-06T15:09:22Z</StartDate>
                    <DurationInDays>197</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>17699</UserId>
                        <FirstName>Person</FirstName>
                        <LastName>Last-Name</LastName>
                        <Email>employee@email.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174388">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:01Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:01Z</LastModified>
                    <StartDate>2017-12-20T22:36:01Z</StartDate>
                    <DurationInDays>0</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174390">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:21Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:21Z</LastModified>
                    <StartDate>2017-12-20T22:36:21Z</StartDate>
                    <DurationInDays>99</DurationInDays>
                    <DeadDealReasons>
                        <DeadDealReason>price</DeadDealReason>
                    </DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
            </DealStages>
            <DealComments></DealComments>
            <DealTerms>
                <DealTerm ID="4278580">
                    <ProposalType>landlord</ProposalType>
                    <DiscountRate>0.08</DiscountRate>
                    <RentableArea>6200</RentableArea>
                    <LeaseType>triple net</LeaseType>
                    <LeaseTerm>60</LeaseTerm>
                    <CommencementDate>2018-01-01</CommencementDate>
                    <TenantImprovements></TenantImprovements>
                    <BuildingImprovements></BuildingImprovements>
                    <FreeRents></FreeRents>
                    <CreatedDate>2017-06-06T15:16:23Z</CreatedDate>
                    <SecurityDeposit/>
                    <NER>20714.1144670763</NER>
                    <NEROverride/>
                    <DateEntered>2017-06-06</DateEntered>
                    <OpEx>
                        <BaseOpEx/>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </OpEx>
                    <RealEstateTaxes>
                        <BaseRealEstateTaxes>7.45</BaseRealEstateTaxes>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </RealEstateTaxes>
                    <NPVperSqFt>103367.15936951</NPVperSqFt>
                    <TotalNPV>640876388.09096</TotalNPV>
                    <MiscDescription/>
                    <NetCashFlow>642137720.0</NetCashFlow>
                    <TotalIncome>502200.0</TotalIncome>
                    <Concessions>64480.0</Concessions>
                    <Payback>1</Payback>
                    <IRR/>
                    <Latest>true</Latest>
                    <BaseRents>
                        <BaseRent ID="45937180">
                            <BeginIn>1</BeginIn>
                            <Rent>15.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>36</Duration>
                        </BaseRent>
                        <BaseRent ID="45937181">
                            <BeginIn>37</BeginIn>
                            <Rent>18.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>24</Duration>
                        </BaseRent>
                    </BaseRents>
                    <RentEscalations></RentEscalations>
                    <OtherCredits></OtherCredits>
                    <Commissions>
                        <Commission ID="260593">
                            <RepType>landlord</RepType>
                            <Firm>LandLord</Firm>
                            <Amount>2.9</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                        <Commission ID="260594">
                            <RepType>landlord</RepType>
                            <Firm>Payee</Firm>
                            <Amount>7.5</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                    </Commissions>
                    <Rights></Rights>
                    <ProposalID>186607</ProposalID>
                    <ProposalEnteredDate>2017-06-06</ProposalEnteredDate>
                    <RecoveryIncomes>
                        <RecoveryIncome>
                            <RecoveryType>ReimbursableExpenseCam</RecoveryType>
                            <ExpenseAmount>12.87</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>12.87</RecoveryAmount>
                        </RecoveryIncome>
                        <RecoveryIncome>
                            <RecoveryType>RealEstateTax</RecoveryType>
                            <ExpenseAmount>7.45</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>7.45</RecoveryAmount>
                        </RecoveryIncome>
                    </RecoveryIncomes>
                </DealTerm>
            </DealTerms>
            <Budgets></Budgets>
            <Appraisals></Appraisals>
            <Tours></Tours>
        </Deal>

For the best of my knowledge, there is no problem working with large files in XML.据我所知,处理 XML 中的大文件没有问题。 You can use xml.etree.ElementTree for your data.您可以将 xml.etree.ElementTree 用于您的数据。 I tested the following code and it works (parsing and printing):我测试了以下代码并且它有效(解析和打印):

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

def deepParsing(node, pre=None):
    if pre is None:
        pre = '-'
    print(' ')
    print(pre + '> Tag:',node.tag)
    print(pre + '> Attributes:', node.attrib)
    print(pre + '> Text:', node.text)

    pre += '-'
    for key in node:
        deepParsing(key, pre)
deepParsing(root)

Your data was printed on console according to the given tree.您的数据根据​​给定的树打印在控制台上。

Your example is missing closing tags:您的示例缺少结束标记:

    </Deals>
</DealExport>

at the end of the xml.在 xml 的末尾。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM