
Get xpath to tag which contains certain text

I am trying to find the XPath to some text on a webpage. If you go to https://www.york.ac.uk/teaching/cws/wws/webpage1.html and get the XPath of "EXERCISE", it looks like "html body html table tbody tr td div h4". If you right-click "EXERCISE" on that page and inspect it, you can see the path at the bottom of the code panel (in Chrome).

I have tried numerous approaches, none of which gave the desired result. This is the closest I got:

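from bs4 import BeautifulSoup as BS

# page holds the HTML source of the URL above
soup = BS(page, 'html.parser')
# tag.attrs is the attribute dict; tag.attributes would look up a child tag named "attributes"
tags = [{"name": tag.name, "text": tag.text, "attributes": tag.attrs} for tag in soup.find_all()]
s = ''
for t in tags:
    if "EXERCISE" in t['text']:
        s = s + t['name'] + " "
print(s)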

To start with, I need to get "html body html table tbody tr td div h4", but eventually, with more complicated pages, I need to get the tag attributes as well.

Thanks!

If you know that the tag you want will always contain exactly the text "EXERCISE" (without the quotes, and with no variation in letter case, white space, etc.), then you can just use .find on the exact text. If you do need to allow for white space variations and the like, you can pass a regular expression instead.
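As a rough sketch of the regular-expression variant, assuming a small inline HTML snippet in place of the real page (the snippet and the pattern are made up for illustration), .find accepts a compiled pattern for string=:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page.
html = "<html><body><div><h4>\n  Exercise \n</h4></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Match "EXERCISE" regardless of letter case and surrounding white space.
pattern = re.compile(r"^\s*exercise\s*$", re.IGNORECASE)
thetag = soup.find(string=pattern)

print(repr(thetag))        # the matching text node, white space included
print(thetag.parent.name)  # name of the tag that directly holds the text
```

The result is still a text node, so everything that works on the exact-match result works on this one as well.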

From there, you can use .parents to get the object's ancestors: the element that contains it, the element that contains that element, and so on up to the top of the document. Then just extract the tag names, reverse the list, and join everything together.

thetag = soup.find(string="EXERCISE")
parent_tags = [ p.name for p in list(thetag.parents) ]
print('/'.join(parent_tags[::-1]))

Output:

[document]/html/body/hmtl/table/tr/td/div/h4

If you don't want that "[document]" at the start, you can drop it in any number of ways. For example, use these lines instead of the last two:

parent_tags = [ p.name for p in list(thetag.parents)[:-1] ]
print('/' + '/'.join(parent_tags[::-1]))

Output:

/html/body/hmtl/table/tr/td/div/h4
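For the follow-up requirement of including tag attributes, one possible sketch is shown below. It assumes a made-up inline HTML snippet and a hypothetical helper named describe (not part of Beautiful Soup) that renders each ancestor as name[@attr="value"]:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a more complicated page.
html = (
    '<html><body><div id="main" class="box wide">'
    "<h4>EXERCISE</h4></div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def describe(tag):
    """Render a tag as name[@attr="value"]...; illustrative helper only."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class come back as a list.
        if isinstance(value, list):
            value = " ".join(value)
        parts.append(f'[@{key}="{value}"]')
    return "".join(parts)


thetag = soup.find(string="EXERCISE")
# .parents runs from the innermost tag outwards; drop the final [document] entry.
ancestors = list(thetag.parents)[:-1]
path = "/" + "/".join(describe(p) for p in reversed(ancestors))
print(path)  # /html/body/div[@id="main"][@class="box wide"]/h4
```

Which attributes to include, and how to format them, is a design choice; restricting the output to id and class may be enough for many pages.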

The CSS selector :contains(EXERCISE):not(:has(:contains(EXERCISE))) selects the innermost tag that contains the string "EXERCISE".

Then we use the find_parents() method to find all parents of this tag and print their names:

import requests
from bs4 import BeautifulSoup

url = 'https://www.york.ac.uk/teaching/cws/wws/webpage1.html'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

t = soup.select_one(':contains(EXERCISE):not(:has(:contains(EXERCISE)))')
# you can also use this:
# t = soup.find(string="EXERCISE").find_parent()

# let's print the path
tag_names = [t.name, *[parent.name for parent in t.find_parents()]]
print(' > '.join(tag_names[::-1]))

Prints:

[document] > hmtl > body > table > tr > td > div > p > p > p > p > h4
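Newer releases of Soup Sieve (the selector engine behind select and select_one) deprecate :contains() in favour of :-soup-contains(). A minimal sketch of the same idea with the newer name follows, assuming Soup Sieve 2.1 or later and a made-up inline HTML snippet instead of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed snippet standing in for the real page.
html = "<html><body><div><p>intro</p><h4>EXERCISE</h4></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Innermost element whose text contains "EXERCISE":
# it matches, but none of its descendants do.
selector = ':-soup-contains("EXERCISE"):not(:has(:-soup-contains("EXERCISE")))'
t = soup.select_one(selector)

names = [t.name] + [parent.name for parent in t.find_parents()]
print(" > ".join(reversed(names)))  # [document] > html > body > div > h4
```

Because the p element in this snippet is closed, html.parser does not nest the following tags inside it, which is why no run of p ancestors appears here.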

Using lxml:

import requests
from lxml import etree

url = 'https://www.york.ac.uk/teaching/cws/wws/webpage1.html'

# send a browser-like User-Agent with the request
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

parser = etree.HTMLParser()
root = etree.fromstring(page.content, parser)
tree = etree.ElementTree(root)

# all elements whose text is exactly "EXERCISE"
e = root.xpath('.//*[text()="EXERCISE"]')
print(tree.getpath(e[0]))

Output:

/html/body/hmtl/table/tr/td/div[2]/h4
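Beautiful Soup has no direct counterpart to getpath, but a positional path in a similar style can be approximated by counting same-named siblings. The sketch below assumes a made-up inline HTML snippet and a hypothetical helper named step; it is not guaranteed to match lxml's output on every document:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two sibling div elements.
html = (
    "<html><body><div>first</div>"
    "<div><h4>EXERCISE</h4></div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def step(tag):
    """Return 'name', or 'name[n]' when same-named siblings exist (illustrative helper)."""
    before = tag.find_previous_siblings(tag.name)
    after = tag.find_next_siblings(tag.name)
    if before or after:
        return f"{tag.name}[{len(before) + 1}]"
    return tag.name


text_node = soup.find(string="EXERCISE")
tags = list(text_node.parents)[:-1]  # innermost first, [document] dropped
xpath = "/" + "/".join(step(t) for t in reversed(tags))
print(xpath)  # /html/body/div[2]/h4
```

The index is added only where it is needed to tell siblings apart, which mirrors the div[2] step in the output above.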


 