Iterate over XML tags and get an element's XPath in Python

Question

I want to iterate over every "p" tag in an XML document and get the current element's XPath, but I can't find anything that does it.

The kind of code I tried:

from bs4 import BeautifulSoup

xml_file = open("./data.xml", "rb")
soup = BeautifulSoup(xml_file, "lxml")

for i in soup.find_all("p"):
    print(i.xpath) # xpath doesn't work here (None)
    print("\n")

Here is a sample XML file that I am trying to parse:

<?xml version="1.0" encoding="UTF-8"?>

<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
    </body>
</article>

I would like my code to output:

/article/body/p[0]
/article/body/p[1]

Answer 1

You can use lxml's getpath() to get the XPath of an element:

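from lxml import etree
tree = etree.parse("./data.xml")  # file name taken from the question
root = tree.getroot()
result = root.xpath('//*[. = "XML"]')  # any XPath query works here, e.g. '//p'
for r in result:
    print(tree.getpath(r))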
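
As a minimal sketch of getpath() applied to the sample document from the question (the helper name p_paths and the inlined XML constant are only for illustration, and lxml is assumed to be installed):

```python
from lxml import etree

# Sample document from the question, inlined so the sketch is self-contained.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
    </body>
</article>
"""


def p_paths(xml_bytes):
    """Return the XPath of every <p> element, in document order."""
    root = etree.fromstring(xml_bytes)
    tree = root.getroottree()
    return [tree.getpath(p) for p in root.iter("p")]


for path in p_paths(XML):
    print(path)
# /article/body/p[1]
# /article/body/p[2]
```

Note that the positions start at 1, not 0 as in the question's desired output, because XPath positions are 1-based.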

You can also select elements with an XPath query, or use the fast_iter function below for very large files:

import io
from lxml import etree

# `xml` is assumed to be a bytes object holding the XML document
doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)



def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)

For more reference, see https://newbedev.com/efficient-way-to-iterate-through-xml-elements

Answer 2

Here's how to do it with Python's ElementTree module.

It uses a simple list to track the iterator's current path through the XML. Whenever you want the XPath for an element, call gen_xpath() to turn that list into the XPath for that element, with logic for handling same-named siblings (absolute position).

from xml.etree import ElementTree as ET

# A list of elements pushed and popped by the iterator's start and end events
path = []


def gen_xpath():
    '''Start at the root of `path` and figure out if the next child is alone, or is one of many siblings named the same.  If the next child is one of many same-named siblings determine its position.

    Returns the full XPath up to the element in the iterator this function was called.
    '''
    full_path = '/' + path[0].tag

    for i, parent_elem in enumerate(path[:-1]):
        next_elem = path[i+1]

        pos = -1         # acts as counter for all children named the same as next_elem
        next_pos = None  # the position we care about

        for child_elem in parent_elem:
            if child_elem.tag == next_elem.tag:
                pos += 1

            # Compare etree.Element identity
            if child_elem is next_elem:
                next_pos = pos

            # `is not None` matters here: a position of 0 is falsy
            if next_pos is not None and pos > 0:
                # We know where next_elem is, and that there are many same-named siblings, no need to count others
                break

        # Use next_elem's pos only if there are other same-named siblings
        if pos > 0:
            full_path += f'/{next_elem.tag}[{next_pos}]'
        else:
            full_path += f'/{next_elem.tag}'

    return full_path


# Iterate the XML
for event, elem in ET.iterparse('input.xml', ['start', 'end']):
    if event == 'start':
        path.append(elem)
        if elem.tag == 'p':
            print(gen_xpath())

    if event == 'end':
        path.pop()

When I run that on this modified sample XML, input.xml:

<?xml version="1.0" encoding="UTF-8"?>
<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
        <section>
            <p>Parafoo</p>
        </section>
    </body>
</article>

I get:

/article/body/p[0]
/article/body/p[1]
/article/body/section/p
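
The positions above are 0-based, as the question asked, whereas XPath itself counts from 1. If you need paths that an XPath engine will accept, here is a rough sketch of a variant that parses the whole document first and emits 1-based positions. It does not depend on sibling elements already being present at a 'start' event, which iterparse does not guarantee. The function name xpaths_for and the inlined SAMPLE string are made up for this example, and it assumes the document fits in memory and uses no namespaces.

```python
import xml.etree.ElementTree as ET


def xpaths_for(root, tag):
    """Return 1-based XPaths for every `tag` element under `root`, in document order."""
    results = []

    def walk(elem, path):
        if elem.tag == tag:
            results.append(path)
        # Count how many children share each tag name.
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        # Track the running position of each child among its same-named siblings.
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            # Add a position only when same-named siblings exist.
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            walk(child, f"{path}/{step}")

    walk(root, "/" + root.tag)
    return results


SAMPLE = """<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
        <section>
            <p>Parafoo</p>
        </section>
    </body>
</article>"""

for path in xpaths_for(ET.fromstring(SAMPLE), "p"):
    print(path)
# /article/body/p[1]
# /article/body/p[2]
# /article/body/section/p
```

The trade-off is memory: this keeps the full tree, while the iterparse version above can stream.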

2 answers

Answer 1: score 2, accepted, 2021-12-30 14:43:15
Answer 2: score 0, 2021-12-30 21:51:25
