
How to find the distance between 2 elements in HTML using BeautifulSoup

The goal is to find the distance between 2 tags, e.g. the first a tag whose href attribute is an external link and the title tag, using BeautifulSoup.

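import re
from bs4 import BeautifulSoup

html = '<title>stackoverflow</title><a href="https://stackoverflow.com">test</a>'
soup = BeautifulSoup(html, 'html.parser')
ext_link = soup.find('a', href=re.compile("^https?:", re.IGNORECASE))
title = soup.title
# abs_distance_between_tags is the function I am looking for; it does not exist yet
dist = abs_distance_between_tags(ext_link, title)
print(dist)
30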

How would I do this without using regex?

Note that the order of the tags may be different, and there may be more than one match (although we only take the first one, using find()).

I could not find a method in BeautifulSoup that returns the locations/positions in the html of the matches.

As you noted, it does not seem like you can get the exact character position of an element in BeautifulSoup.

Maybe this answer can help you along:

AFAIK, lxml only offers sourceline, which is insufficient. Cf. the API documentation: "Original line number as found by the parser or None if unknown."

But expat provides the exact offset in the file: CurrentByteIndex.

  • Fetched from the start_element handler, it returns the offset of the tag's start (i.e. '<').
  • Fetched from the char_data handler, it returns the offset of the data's start (i.e. 'B' in your example).
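The expat approach described above can be sketched roughly as follows. This is only an illustration under some assumptions: expat is an XML parser, so the markup has to be well-formed XML (the `<root>` wrapper is added here for that reason), and `tag_offsets` is a hypothetical helper name, not part of any library. Offsets are byte offsets into the UTF-8 encoded input.

```python
import xml.parsers.expat


def tag_offsets(markup: str) -> dict:
    """Map each tag name to the byte offset of its first opening '<'.

    Hypothetical helper: relies on expat's CurrentByteIndex, read inside
    the start-element handler.
    """
    offsets = {}
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        # Keep only the first occurrence of each tag name.
        offsets.setdefault(name, parser.CurrentByteIndex)

    parser.StartElementHandler = start_element
    # Parse bytes so that the reported indexes refer to this exact buffer.
    parser.Parse(markup.encode("utf-8"), True)
    return offsets


# The <root> wrapper is an assumption: expat rejects documents that have
# more than one top-level element.
doc = ('<root><title>stackoverflow</title>'
       '<a href="https://stackoverflow.com">test</a></root>')
offsets = tag_offsets(doc)
distance = abs(offsets["a"] - offsets["title"])
print(distance)
```

Real-world HTML is often not well-formed XML, in which case expat raises an ExpatError, so this only helps for XHTML-like input.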

Beautiful Soup 4 now supports Tag.sourceline and Tag.sourcepos.

Reference: https://beautiful-soup-4.readthedocs.io/en/latest/#line-numbers
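A minimal sketch of how these two attributes could be combined into a character distance, assuming the html.parser backend, where the numbers refer to the position of the tag's initial '<' (other parsers report slightly different positions, per the documentation linked above). `absolute_offset` is a hypothetical helper, not a Beautiful Soup function; it turns a (line, column) pair into an offset into the original string.

```python
import re

from bs4 import BeautifulSoup


def absolute_offset(html: str, tag) -> int:
    """Convert tag.sourceline (1-based) and tag.sourcepos (0-based column)
    into a character offset into html. Hypothetical helper."""
    lines = html.split("\n")
    # Length of every line before the tag's line, plus one newline each.
    preceding = sum(len(line) + 1 for line in lines[: tag.sourceline - 1])
    return preceding + tag.sourcepos


html = ('<title>stackoverflow</title>\n'
        '<a href="https://stackoverflow.com">test</a>')
soup = BeautifulSoup(html, "html.parser")
ext_link = soup.find("a", href=re.compile(r"^https?:", re.IGNORECASE))
title = soup.title

dist = abs(absolute_offset(html, ext_link) - absolute_offset(html, title))
print(dist)
```

Because the result is taken as an absolute value, the order in which the two tags appear in the document does not matter.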

