
Using Python with bs4 (lxml) to edit text inside an XML tag

I am new to Python, BS4 and the lxml parser.

I am trying to delete the final three characters from an XML postcode tag to anonymise data.

The current code runs without any errors, yet the last three characters are not deleted in the output XML file.

XML mock data:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Please note that this file is properly formed, and serves as an example of a file that will load into the ILR DC system.  The data is anonymised and does not refer to a real-world provider, learning delivery or learner.  Based on the ILR specification, version 2, dated April 2018-->
<Message xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="ESFA/ILR/2018-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ESFA/ILR/2018-19">
    <Header>
        <CollectionDetails>
            <Collection>ILR</Collection>
            <Year>1819</Year>
            <FilePreparationDate>2018-01-07</FilePreparationDate>
        </CollectionDetails>
        <Source>
            <ProtectiveMarking>OFFICIAL-SENSITIVE-Personal</ProtectiveMarking>
            <UKPRN>99999999</UKPRN>
            <SoftwareSupplier>SupplierName</SoftwareSupplier>
            <SoftwarePackage>SystemName</SoftwarePackage>
            <Release>1</Release>
            <SerialNo>01</SerialNo>
            <DateTime>2018-06-26T11:14:05</DateTime>
            <!-- This and the next element only appear in files generated by FIS -->
            <ReferenceData>Version5.0, LARS 2017-08-01</ReferenceData>
            <ComponentSetVersion>1</ComponentSetVersion>
        </Source>
    </Header>
    <SourceFiles>
        <!-- The SourceFiles group only appears in files generated by FIS -->
        <SourceFile>
            <SourceFileName>ILR-LLLLLLLL1819-20180626-144401-01.xml</SourceFileName>
            <FilePreparationDate>2018-06-26</FilePreparationDate>
            <SoftwareSupplier>Software Systems Inc.</SoftwareSupplier>
            <SoftwarePackage>GreatStuffMIS</SoftwarePackage>
            <Release>1</Release>
            <SerialNo>01</SerialNo>
            <DateTime>2018-06-26T11:14:05</DateTime>
        </SourceFile>
    </SourceFiles>
    <LearningProvider>
        <UKPRN>99999999</UKPRN>
    </LearningProvider>
    <!-- 16 yr old learner undertaking full time 16-19 (excluding apprenticeships) funded programme -->
    <Learner>
        <LearnRefNumber>16Learner</LearnRefNumber>
        <PMUKPRN>87654321</PMUKPRN>
        <CampId>1234ABCD</CampId>
        <ULN>1061484016</ULN>
        <FamilyName>Smith</FamilyName>
        <GivenNames>Jane</GivenNames>
        <DateOfBirth>1999-02-27</DateOfBirth>
        <Ethnicity>31</Ethnicity>
        <Sex>F</Sex>
        <LLDDHealthProb>2</LLDDHealthProb>
        <Accom>5</Accom>
        <PlanLearnHours>440</PlanLearnHours>
        <PlanEEPHours>100</PlanEEPHours>
        <MathGrade>NONE</MathGrade>
        <EngGrade>D</EngGrade>
        <PostcodePrior>BR1 7SS</PostcodePrior>
        <Postcode>BR1 7SS</Postcode>
        <AddLine1>The Street</AddLine1>
        <AddLine2>ToyTown</AddLine2>
        <LearnerFAM>
            <LearnFAMType>LSR</LearnFAMType>
            <LearnFAMCode>55</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>EDF</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>MCF</LearnFAMType>
            <LearnFAMCode>3</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>FME</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>PPE</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>

Current code:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Open the original XML file and parse it with BeautifulSoup
with open("ILR_mock_data.xml", "r") as infile:
    xml_text = infile.read()

soup = BeautifulSoup(xml_text, 'xml')

# Postcode (delete the last 3 characters)
for postcode_tag in soup.find_all("Postcode"):
    postcode_tag.string[:-3]


with open("SEND_ME_TO_RCU.xml", "w") as outfile:
    outfile.write(soup.prettify())

Where the XML has

<Postcode>BR1 7SS</Postcode>

I would like the new postcode to be

<Postcode>BR1</Postcode>

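I fixed the problem by assigning the sliced string back to the tag (the original loop computed the slice but discarded the result):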

for pripostcode_tag in soup.find_all("PostcodePrior"):   
    pripostcode_tag.string = pripostcode_tag.string[:-3]
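
To illustrate why the original loop changed nothing, here is a minimal standalone sketch using plain Python strings only (no bs4). The helper name `trim_inward_code` is hypothetical and not part of the original post. It also shows that a plain `[:-3]` on "BR1 7SS" leaves a trailing space, so stripping it is probably needed to get exactly "BR1".

```python
def trim_inward_code(postcode: str) -> str:
    """Drop the last three characters and any space left behind (hypothetical helper)."""
    return postcode.strip()[:-3].rstrip()


postcode = "BR1 7SS"

# A bare slice expression builds a new string and throws it away;
# Python strings are immutable, so the original value is unchanged.
postcode[:-3]
unchanged = postcode

# Keeping the result requires an assignment.
# Note that the plain slice leaves the separating space in place.
sliced = postcode[:-3]
trimmed = trim_inward_code(postcode)

print(repr(unchanged), repr(sliced), repr(trimmed))
```

With bs4 the same idea would presumably become `tag.string = trim_inward_code(tag.string)` inside the loop, applied to both `Postcode` and `PostcodePrior`.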

The code below uses a simplified version of the XML. The same approach should carry over to the OP's XML, provided the search path accounts for the default namespace declared on its root element. It does not use any external library.

import xml.etree.ElementTree as ET

xml_sample = '''<r><Postcode>ACBDEF</Postcode></r>'''
root = ET.fromstring(xml_sample)
post_codes = root.findall('.//Postcode')
for pc in post_codes:
  pc.text = pc.text[:-3]
ET.dump(root)

Output:

<r><Postcode>ACB</Postcode></r>
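
The OP's file declares a default namespace (`xmlns="ESFA/ILR/2018-19"`), and in ElementTree an unqualified path such as `.//Postcode` does not match elements in that namespace. The sketch below is one possible adaptation, not the answerer's code: the cut-down sample document, the `ilr` prefix and the extra `rstrip()` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = "ESFA/ILR/2018-19"  # default namespace taken from the OP's <Message> element

# Cut-down stand-in for the OP's file (assumption: only the relevant tags are kept)
xml_sample = f'''<Message xmlns="{NS}">
  <Learner>
    <PostcodePrior>BR1 7SS</PostcodePrior>
    <Postcode>BR1 7SS</Postcode>
  </Learner>
</Message>'''

# Serialise with xmlns="..." again instead of generated ns0: prefixes
ET.register_namespace("", NS)
root = ET.fromstring(xml_sample)

# An unqualified search finds nothing, because the elements are namespaced
unqualified = root.findall(".//Postcode")

# Map a prefix of our choosing ("ilr" is arbitrary) to the namespace URI
ns = {"ilr": NS}
for tag_name in ("Postcode", "PostcodePrior"):
    for element in root.findall(f".//ilr:{tag_name}", ns):
        if element.text:
            # Drop the last three characters, then the space left behind
            element.text = element.text.strip()[:-3].rstrip()

result = ET.tostring(root, encoding="unicode")
print(result)
```

For a real file, `ET.parse(path)` followed by `tree.write(out_path, encoding="UTF-8", xml_declaration=True)` would take the place of `fromstring` and `tostring`. Note that ElementTree discards XML comments by default when parsing.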
