How can I get the line of the text where an XML tag is found in Python using bs4 or lxml?

I have an XML document and I want to get the line at which the tag extracted by BeautifulSoup or lxml is found. Is there a way to do that?

For BeautifulSoup, the line number is stored in the sourceline attribute of the Tag class. It is populated by the html.parser and html5lib tree builders; the lxml-based builders do not fill it in.
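A minimal sketch of the BeautifulSoup side, assuming bs4 4.8.1 or later and the built-in html.parser backend. The markup string is made up for illustration. With html.parser, sourceline is the 1-based line of the tag's opening "<":

```python
from bs4 import BeautifulSoup

# Illustrative markup; there is no leading newline, so <a> sits on line 1.
markup = "<a>\n  <b>\n    <c></c>\n  </b>\n</a>"

soup = BeautifulSoup(markup, "html.parser")

# find_all(True) matches every tag, in document order.
lines = {tag.name: tag.sourceline for tag in soup.find_all(True)}
for name, line in lines.items():
    print(name, line)
```

Tag.sourcepos holds the matching column offset if you need it as well.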

For lxml this is also possible through the sourceline attribute. Here is an example:

#!/usr/bin/python3
from lxml import etree
xml = '''
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
'''
root = etree.fromstring(xml)

for e in root.iter():
    print(e.tag, e.sourceline)

Output:

a 2
b 3
c 4
d 7

If you want to look at the implementation: sourceline is a property that calls xmlGetLineNo, lxml's binding of the libxml2 function of the same name, which in turn wraps xmlGetLineNoInternal (where the actual logic lives inside libxml2).

You can find the line number of the closing tag as well: count the line endings in the serialized text of that tag's subtree and add the count to the tag's sourceline.

xml.etree.ElementTree can also be extended to record the line number at which the parser found each element (the parser being the xmlparser object from the xml.parsers.expat module).
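One way to do this with only the standard library is to drive expat directly and feed its events into an ElementTree TreeBuilder. This is a sketch under a few assumptions: parse_with_lines is a made-up name, line numbers are kept in a separate dict rather than on the elements, and namespaces are not handled (expat reports raw prefix:name tags here).

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_with_lines(text):
    """Return (root, lines), where lines maps each Element to its start line."""
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()
    lines = {}

    def start(tag, attrs):
        # TreeBuilder.start() returns the element it just opened.
        element = builder.start(tag, attrs)
        # During the callback, CurrentLineNumber is the 1-based line
        # on which the start tag begins.
        lines[element] = parser.CurrentLineNumber

    parser.StartElementHandler = start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(text, True)
    return builder.close(), lines


xml = '''
<a>
  <b>
    <c>
    </c>
  </b>
  <d>
  </d>
</a>
'''
root, lines = parse_with_lines(xml)
found = {e.tag: lines[e] for e in root.iter()}
for tag, line in found.items():
    print(tag, line)
```

Keeping the numbers in a dict avoids relying on private parser methods or on setting attributes on Element objects.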

Try using the enumerate() function.

For example, if we have the following HTML:

html = """
<!DOCTYPE html>
<html>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
</body>
</html>"""

and we want to find the line number of the <h1> tag ( <h1>My Heading</h1> ):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for (index, value) in enumerate(
    # Skip empty lines, so the count below ignores blank lines in the markup
    (x for x in str(soup).splitlines() if x != ""),
    start=1,
):
    # str.find() returns the index of the first match, or -1 if there is none.
    # Index 1 means the line starts with "<h1" ("h1" right after the "<").
    if value.find("h1") == 1:
        print(f"Line: {index}.\t Found: '{value}' ")
        break

Output:

Line: 4.     Found: '<h1>My Heading</h1>' 
