How to extract data from XML when some child tags and the structure are unknown?

Question

The XML looks like this:

<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"</URLLink></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"></URLLink></Paragraph>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
            <UnorderedList>
                <ListItem>Don't do this</ListItem>
                <ListItem>Don't do that</ListItem>
                <ListItem>Don't do blablabla</ListItem>
            </UnorderedList>
    </ContainerBlockElement>
</Section>

I want to extract all the data inside each ContainerBlockElement as text, but the child tags and the structure are different every time.

Expected output:

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla

Update: I have now added a new element at the end of the XML above.

<ContainerBlockElement>
    <Paragraph>Apply the newer update in: <URLLink LinkURL="www.newerupdate.com"></URLLink></Paragraph>
</ContainerBlockElement>

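@ACHRAF With the current answer, the output comes out in the wrong order. The approach is order-sensitive and cannot be used to process different XML files.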

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Apply the newer update in: www.newerupdate.com
Don't do this
Don't do that
Don't do blablabla

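The expected output should follow the order in the XML. In addition, the program should be able to tell which items belong to the same ContainerBlockElement. (For example, I need "Follow these rules:", "Don't do this", "Don't do that", and "Don't do blablabla" in the same array.)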

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
Apply the newer update in: www.newerupdate.com

Answer 1

First, your example contains an error in the URLLink element: the start tag is never closed.

<URLLink LinkURL="www.software1.com"</URLLink>

should be

<URLLink LinkURL="www.software1.com"/>

Complete example:

<Section>
    <ContainerBlockElement>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
        <UnorderedList>
            <ListItem>
                <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
            </ListItem>
        </UnorderedList>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
    </ContainerBlockElement>

    <ContainerBlockElement>
        <Paragraph>Follow these rules:</Paragraph>
            <UnorderedList>
                <ListItem>Don't do this</ListItem>
                <ListItem>Don't do that</ListItem>
                <ListItem>Don't do blablabla</ListItem>
            </UnorderedList>
    </ContainerBlockElement>
</Section>
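
As a quick check of the correction above, here is a minimal sketch that feeds both forms of the URLLink element to the standard-library parser. The helper name is_well_formed and the shortened Paragraph strings are illustrative assumptions, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Form used in the question: the URLLink start tag is missing its closing ">".
broken = '<Paragraph>Download: <URLLink LinkURL="www.software1.com"</URLLink></Paragraph>'
# Corrected form: a self-closing empty element.
fixed = '<Paragraph>Download: <URLLink LinkURL="www.software1.com"/></Paragraph>'

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

The same check can be run on a whole file before trying to extract anything from it.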

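To extract the data, you can do this:

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# Collect the parents whose direct children carry the text. The order of these
# three queries determines the order of the output.
results = (root.findall('ContainerBlockElement/UnorderedList/ListItem')
           + root.findall('ContainerBlockElement')
           + root.findall('ContainerBlockElement/UnorderedList'))

for elem in results:
    for e in elem:
        # e.text is None for an element with no text at all, so guard before strip().
        text = (e.text or '').strip()
        if not text:
            continue
        url_link = e.find('./URLLink')
        if url_link is None:
            print(text)
        else:
            # Append the link target stored in the LinkURL attribute.
            print(text + " " + url_link.attrib['LinkURL'])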

Output:

Download the software1 from: www.software1.com
Download the software2 from: www.software2.com
Apply the update in: www.update.com
Follow these rules:
Don't do this
Don't do that
Don't do blablabla
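
Regarding the update in the question (output must follow document order and stay grouped per ContainerBlockElement), one possible variation is to walk each container's descendants in document order instead of concatenating several findall() results. This is only a sketch: the function name extract_groups and the inlined SAMPLE string are assumptions for illustration, and text that appears after a URLLink inside the same parent (tail text) is ignored.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Section>
  <ContainerBlockElement>
    <UnorderedList><ListItem>
      <Paragraph>Download the software1 from: <URLLink LinkURL="www.software1.com"/></Paragraph>
    </ListItem></UnorderedList>
    <UnorderedList><ListItem>
      <Paragraph>Download the software2 from: <URLLink LinkURL="www.software2.com"/></Paragraph>
    </ListItem></UnorderedList>
  </ContainerBlockElement>
  <ContainerBlockElement>
    <Paragraph>Apply the update in: <URLLink LinkURL="www.update.com"/></Paragraph>
  </ContainerBlockElement>
  <ContainerBlockElement>
    <Paragraph>Follow these rules:</Paragraph>
    <UnorderedList>
      <ListItem>Don't do this</ListItem>
      <ListItem>Don't do that</ListItem>
      <ListItem>Don't do blablabla</ListItem>
    </UnorderedList>
  </ContainerBlockElement>
  <ContainerBlockElement>
    <Paragraph>Apply the newer update in: <URLLink LinkURL="www.newerupdate.com"/></Paragraph>
  </ContainerBlockElement>
</Section>"""

def extract_groups(root):
    """Return one list of text lines per ContainerBlockElement, in document order."""
    groups = []
    for container in root.findall('ContainerBlockElement'):
        lines = []
        # iter() visits the container and all its descendants in document order,
        # so no knowledge of the nesting structure is needed.
        for el in container.iter():
            if el.tag == 'URLLink':
                continue  # the link is reported together with its parent
            text = (el.text or '').strip()
            link = el.find('URLLink')  # direct child only
            if link is not None:
                text = (text + ' ' + link.get('LinkURL', '')).strip()
            if text:
                lines.append(text)
        groups.append(lines)
    return groups

groups = extract_groups(ET.fromstring(SAMPLE))
for lines in groups:
    print(lines)
```

Each inner list corresponds to one ContainerBlockElement, so the "Follow these rules:" heading and its list items end up in the same list.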

Answer 2

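In addition to the correction mentioned in @ACHRAF's answer, I would also suggest using lxml instead of ElementTree, because lxml has better XPath support:

from lxml import etree

doc = etree.parse('file.xml')
for entry in doc.xpath('//Paragraph'):
    # URL of a URLLink that is a direct child of this Paragraph, if any.
    link_target = entry.xpath('./URLLink/@LinkURL')
    # All text nodes inside an UnorderedList that follows this Paragraph as a sibling.
    ul_target = entry.xpath('./following-sibling::UnorderedList//text()')

    link = link_target[0] if link_target else ''
    ul = " ".join(ul_target) if ul_target else ''

    print(entry.text, link, ul)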

Output:

Download the software1 from:  www.software1.com 
Download the software2 from:  www.software2.com 
Apply the update in:  www.update.com 
Follow these rules:  
                 Don't do this 
                 Don't do that 
                 Don't do blablabla

Answer 3

To get the elements that have actual text or a URLLink, use this XPath:

/Section/ContainerBlockElement//*[URLLink or text()[normalize-space()]]

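* matches any element node.

[URLLink or text()[normalize-space()]] is a predicate that keeps only the elements that have a direct URLLink child or a direct text() node that is not whitespace only.

Then use Python to extract the text() and the URLLink.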
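
A minimal sketch of that last step, assuming lxml is installed. The function name extract_groups and the shortened SAMPLE document are illustrative assumptions. The XPath from this answer is applied in two steps (first the containers, then the same predicate relative to each container) so that the lines stay grouped per ContainerBlockElement. Within one element, the text pieces are emitted first and the link targets after them.

```python
from lxml import etree

SAMPLE = """<Section>
  <ContainerBlockElement>
    <Paragraph>Follow these rules:</Paragraph>
    <UnorderedList>
      <ListItem>Don't do this</ListItem>
      <ListItem>Don't do that</ListItem>
      <ListItem>Don't do blablabla</ListItem>
    </UnorderedList>
  </ContainerBlockElement>
  <ContainerBlockElement>
    <Paragraph>Apply the newer update in: <URLLink LinkURL="www.newerupdate.com"/></Paragraph>
  </ContainerBlockElement>
</Section>"""

# Same predicate as in the XPath above, written relative to one container.
CONTENT = './/*[URLLink or text()[normalize-space()]]'

def extract_groups(root):
    """Return one list of text lines per ContainerBlockElement, in document order."""
    groups = []
    for container in root.xpath('/Section/ContainerBlockElement'):
        lines = []
        for el in container.xpath(CONTENT):
            # Direct text nodes of the element, with whitespace-only nodes dropped.
            parts = [t.strip() for t in el.xpath('text()') if t.strip()]
            # Link targets of direct URLLink children.
            links = el.xpath('URLLink/@LinkURL')
            lines.append(' '.join(parts + list(links)))
        groups.append(lines)
    return groups

groups = extract_groups(etree.fromstring(SAMPLE))
for lines in groups:
    print(lines)
```

Because XPath node sets come back in document order, the result does not depend on how deeply the text is nested inside each container.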
