How to extract information from XML with Beautiful Soup?

I have a list of XML BeautifulSoup tag elements:

[
 <Entry>
    <EffectiveDate>
    <DateFormattedForTHForm>07/01/2022</DateFormattedForTHForm>
    </EffectiveDate>
    <ExpirationDate>
    <DateFormattedForTHForm>07/01/2023</DateFormattedForTHForm>
    </ExpirationDate>
    <FormDescription>Notification Of Settlement</FormDescription>
    <FormNumber>WC 99 06 04</FormNumber>
 </Entry>,
 
 <Entry>
 <AccountContactRole>
 <AccountContact>
    <Contact>
    <DisplayName>Mallesham Yamulla</DisplayName>
    <FEINOrSSN>123-45-6789</FEINOrSSN>
    <formsMaskedSSN_and_NoMaskFEIN>**-***-8834</formsMaskedSSN_and_NoMaskFEIN>
    <PrimaryAddress>
    <AddressLine1>A</AddressLine1>
    <AddressLine123>B</AddressLine123>
    <CityStateZip>ENID, OK 73703</CityStateZip>
    <Country>IND</Country>
    <AddressLine2 xsi:nil="true"/>
    <AddressLine3 xsi:nil="true"/>
    </PrimaryAddress>
    </Contact>
    </AccountContact>
    </AccountContactRole>
 </Entry>
 
 ]

I would like to loop through this list of Entry XML elements and, for each one, get every tag name together with the information it contains. Any tag that is empty (holds no information) should be ignored.

From the first entry, the following tags need to be extracted, since they hold information:

[<DateFormattedForTHForm>07/01/2022</DateFormattedForTHForm>,
<DateFormattedForTHForm>07/01/2023</DateFormattedForTHForm>,
<FormDescription>Notification Of Settlement</FormDescription>,
<FormNumber>WC 99 06 04</FormNumber>]

From the second entry:

    <DisplayName>Mallesham Yamulla</DisplayName>
    <FEINOrSSN>123-45-6789</FEINOrSSN>
    <formsMaskedSSN_and_NoMaskFEIN>**-***-8834</formsMaskedSSN_and_NoMaskFEIN>
    <PrimaryAddress>
    <AddressLine1>A</AddressLine1>
    <AddressLine123>B</AddressLine123>
    <CityStateZip>ENID, OK 73703</CityStateZip>
    <Country>IND</Country>

(With the list of XML BeautifulSoup tag elements in a variable xTagList,) you could try something like this

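from bs4 import BeautifulSoup

bsParser = 'html.parser'  # or 'xml'
# Optional: re-parse each entry first, which should fix some formatting
# xTagList = [BeautifulSoup(str(x), bsParser) for x in xTagList]

# For each entry, keep only leaf tags (no child tags) that contain non-blank text
wCont_xstrs = ['\n'.join([
    str(d) for d in x.descendants
    if hasattr(d, 'find_all') and not d.find_all() and d.get_text().strip()
]) for x in xTagList]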

to get a list of html/xml strings, one per entry.
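To try the filter in isolation, here is a minimal, self-contained sketch. The sample string is a trimmed-down stand-in for the second entry in the question, and the helper name leaf_tags_with_text is made up for illustration. It uses isinstance(d, Tag) where the snippet above uses hasattr(d, 'find_all'); both are meant to skip plain text nodes. It is parsed with html.parser only because that parser ships with Python, so tag names come back lowercased.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed-down sample modeled on the second entry in the question
sample = """
<Entry>
  <Contact>
    <DisplayName>Mallesham Yamulla</DisplayName>
    <PrimaryAddress>
      <AddressLine1>A</AddressLine1>
      <Country>IND</Country>
      <AddressLine2 xsi:nil="true"/>
    </PrimaryAddress>
  </Contact>
</Entry>
"""

# html.parser lowercases tag names, hence 'entry' rather than 'Entry'
xTagList = BeautifulSoup(sample, 'html.parser').find_all('entry')


def leaf_tags_with_text(entry):
    """Return descendant tags that have no child tags and hold non-blank text."""
    return [
        d for d in entry.descendants
        if isinstance(d, Tag) and not d.find_all() and d.get_text().strip()
    ]


wCont_xstrs = ['\n'.join(str(t) for t in leaf_tags_with_text(x)) for x in xTagList]
print(wCont_xstrs[0])
```

The container tags (Contact, PrimaryAddress) are dropped because they have child tags, and AddressLine2 is dropped because it has no text.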


With bsParser = 'xml', wCont_xstrs looks like

[
<DateFormattedForTHForm>07/01/2022</DateFormattedForTHForm>
<DateFormattedForTHForm>07/01/2023</DateFormattedForTHForm>
<FormDescription>Notification Of Settlement</FormDescription>
<FormNumber>WC 99 06 04</FormNumber>
,
<DisplayName>Mallesham Yamulla</DisplayName>
<FEINOrSSN>123-45-6789</FEINOrSSN>
<formsMaskedSSN_and_NoMaskFEIN>**-***-8834</formsMaskedSSN_and_NoMaskFEIN>
<AddressLine1>A</AddressLine1>
<AddressLine123>B</AddressLine123>
<CityStateZip>ENID, OK 73703</CityStateZip>
<Country>IND</Country>
]

[By the way, if your XML had namespaces (as well-formed XML usually does), they would be lost after using the xml parser. Using the html parser preserves namespaces, but it introduces another issue, as you will see below.]


With bsParser = 'html.parser' (and probably any parser other than xml), wCont_xstrs looks like

[
<dateformattedforthform>07/01/2022</dateformattedforthform>
<dateformattedforthform>07/01/2023</dateformattedforthform>
<formdescription>Notification Of Settlement</formdescription>
<formnumber>WC 99 06 04</formnumber>
,
<displayname>Mallesham Yamulla</displayname>
<feinorssn>123-45-6789</feinorssn>
<formsmaskedssn_and_nomaskfein>**-***-8834</formsmaskedssn_and_nomaskfein>
<addressline1>A</addressline1>
<addressline123>B</addressline123>
<citystatezip>ENID, OK 73703</citystatezip>
<country>IND</country>
]

(Notice how capitalization has been lost from the tag names.)
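If lowercased names are a problem and the xml parser (which BeautifulSoup delegates to lxml) is not available, one alternative is the standard library's xml.etree.ElementTree, which keeps tag case. This is a different technique from the one in this answer, shown only as a rough sketch. It assumes the xsi prefix is declared on the root element, because ElementTree rejects undeclared prefixes; the declaration and the helper name leaf_items are additions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the xsi namespace is declared; the question's snippet omits it
sample = """<Entry xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <PrimaryAddress>
    <AddressLine1>A</AddressLine1>
    <Country>IND</Country>
    <AddressLine2 xsi:nil="true"/>
  </PrimaryAddress>
</Entry>"""


def leaf_items(entry):
    """Return (tag, text) pairs for leaf elements that hold non-blank text."""
    return [
        (el.tag, el.text.strip())
        for el in entry.iter()
        if len(el) == 0 and el.text and el.text.strip()
    ]


root = ET.fromstring(sample)
pairs = leaf_items(root)
print(pairs)
```

Here len(el) == 0 plays the role of "not d.find_all()" in the BeautifulSoup version: it is true only for elements without child elements.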


If you want a list of bs4 objects, you can do something like

wCont_xtags = [BeautifulSoup(x, bsParser) for x in wCont_xstrs]

UNLESS you're using bsParser = 'xml', because then you need to wrap them in some tag first, like

wCont_xtags = [BeautifulSoup(f'<Entry>{x}</Entry>', bsParser).Entry for x in wCont_xstrs]
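Since the question asks for each tag name together with its information, the re-parsed objects can be turned into (name, text) pairs. The following is a minimal sketch, assuming one html.parser-style string of the kind stored in wCont_xstrs; the variable names xstr and pairs are made up for illustration. A list of pairs is used rather than a dict because a name such as DateFormattedForTHForm can occur more than once in an entry.

```python
from bs4 import BeautifulSoup

# One string of the kind stored in wCont_xstrs (html.parser output, lowercase names)
xstr = (
    '<formdescription>Notification Of Settlement</formdescription>\n'
    '<formnumber>WC 99 06 04</formnumber>'
)

soup = BeautifulSoup(xstr, 'html.parser')

# Every tag here is a leaf, so find_all() with no arguments returns all of them
pairs = [(t.name, t.get_text(strip=True)) for t in soup.find_all()]
print(pairs)
```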
