Getting the XPath to a tag that contains some text

Question

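I am trying to find the XPath of some text on a web page. If you visit https://www.york.ac.uk/teaching/cws/wws/webpage1.html and try to get the XPath of "EXERCISE", it would be something like "html body html table tbody tr td div h4". If you go to that page, right-click "EXERCISE" and inspect it, you can see that path at the bottom of the code pane (in Chrome).

I have tried many approaches. None of them produced the desired result. This is the closest I have got: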

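from bs4 import BeautifulSoup as BS

# page holds the HTML source of the URL above
soup = BS(page, 'html.parser')
# tag.attrs is the dict of a tag's attributes (bs4 has no tag.attributes)
tags = [{"name": tag.name, "text": tag.text, "attributes": tag.attrs} for tag in soup.find_all()]
s = ''
for t in tags:
    if "EXERCISE" in t['text']:
        s = s + t['name'] + " "
print(s)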

To start with, I need to get "html body html table tbody tr td div h4", but eventually, for more complex pages, I will also need the tag attributes.

Thanks!

Answer 1

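If you know that the tag you want always has the exact text "EXERCISE" (no quotes, no different capitalization, no extra spaces, etc.), then you can use .find on that exact text. You can also use a regular expression if you want to allow for variations in whitespace and the like.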
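As a rough sketch of the regular-expression variant mentioned above: the markup below is a made-up stand-in for the real page, and the pattern is only one possible choice (here it ignores case and surrounding whitespace).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = "<html><body><div><h4>  Exercise </h4><p>Do the exercise below.</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Match a text node that is exactly "exercise", ignoring case and
# surrounding whitespace. The anchors keep longer sentences from matching.
pattern = re.compile(r"^\s*exercise\s*$", re.IGNORECASE)
text_node = soup.find(string=pattern)

print(repr(text_node))
print(text_node.parent.name)
```

Because the pattern is anchored at both ends, the paragraph that merely mentions the word is skipped and only the heading text is returned.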

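From there, you can use .parents to get a list of the object's ancestors, meaning the element that contains it, the element that contains that one, and so on, up to the top of the document. Then just extract the tag names, reverse the list, and join everything together.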

thetag = soup.find(string="EXERCISE")
parent_tags = [ p.name for p in list(thetag.parents) ]
print('/'.join(parent_tags[::-1]))

Output:

[document]/html/body/hmtl/table/tr/td/div/h4

If you don't want the "[document]" at the beginning, there are several ways to remove it, for example by using these lines in place of the last two:

parent_tags = [ p.name for p in list(thetag.parents)[:-1] ]
print('/' + '/'.join(parent_tags[::-1]))

Output:

/html/body/hmtl/table/tr/td/div/h4
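The question also asks about tag attributes on more complex pages. One way to extend the .parents approach is sketched below. The helper names `describe` and `path_with_attributes` and the sample markup are made up for illustration, and writing each attribute as an `[@name="value"]` predicate is just one possible output format.

```python
from bs4 import BeautifulSoup


def describe(tag):
    """Return the tag name followed by its attributes as XPath-style predicates."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class come back as lists.
        if isinstance(value, list):
            value = " ".join(value)
        parts.append(f'[@{key}="{value}"]')
    return "".join(parts)


def path_with_attributes(soup, text):
    """Build a path from the document root down to the tag holding `text`."""
    node = soup.find(string=text)
    if node is None:
        return None
    # .parents runs from the innermost ancestor up to the BeautifulSoup object;
    # that object has no parent, so this filter drops "[document]".
    ancestors = [p for p in node.parents if p.parent is not None]
    return "/" + "/".join(describe(p) for p in reversed(ancestors))


# Hypothetical sample markup standing in for a more complex page.
html = '<html><body><div id="main" class="box wide"><h4 align="center">EXERCISE</h4></div></body></html>'
soup = BeautifulSoup(html, "html.parser")
result = path_with_attributes(soup, "EXERCISE")
print(result)
```

The predicates do not encode sibling positions, so on a page with several similar tags the resulting path may match more than one element.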

Answer 2

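The CSS selector :contains(EXERCISE):not(:has(:contains(EXERCISE))) selects the innermost tag that contains the string "EXERCISE".

Then we use the find_parents() method to find all ancestors of that tag and print their names: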

import requests
from bs4 import BeautifulSoup

url = 'https://www.york.ac.uk/teaching/cws/wws/webpage1.html'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

t = soup.select_one(':contains(EXERCISE):not(:has(:contains(EXERCISE)))')
# you can also use this:
# t = soup.find(text="EXERCISE").find_parent()

# let's print the path
tag_names = [t.name, *[parent.name for parent in t.find_parents()]]
print(' > '.join(tag_names[::-1]))

Prints:

[document] > hmtl > body > table > tr > td > div > p > p > p > p > h4
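Newer releases of soupsieve (the selector engine behind select_one) spell this pseudo-class :-soup-contains and have deprecated the bare :contains form. The sketch below shows the same innermost-tag idea with that spelling on a small made-up document, so it runs without a network request. The sample markup is an assumption, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = "<html><body><div><p>Intro</p><h4>EXERCISE</h4></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Tags that contain the text but have no descendant that also contains it,
# i.e. the innermost tag wrapping "EXERCISE".
selector = ":-soup-contains(EXERCISE):not(:has(:-soup-contains(EXERCISE)))"
innermost = soup.select_one(selector)

# find_parents() yields the ancestors from the nearest one outwards,
# ending with the BeautifulSoup object, whose name is "[document]".
names = [innermost.name] + [parent.name for parent in innermost.find_parents()]
path = " > ".join(reversed(names))
print(path)
```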

Answer 3

Using lxml:

url = 'https://www.york.ac.uk/teaching/cws/wws/webpage1.html'

import requests
from lxml import etree
parser = etree.HTMLParser()
page  = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})

root = etree.fromstring(page.content,parser)

tree = etree.ElementTree(root)
e = root.xpath('.//*[text()="EXERCISE"]')
print(tree.getpath(e[0]))

Output:

/html/body/hmtl/table/tr/td/div[2]/h4
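The text()="EXERCISE" test above only matches when the text is exactly that string. If the page might pad the text with whitespace, the XPath normalize-space() function is one option. A minimal sketch on made-up markup (the sample HTML is an assumption, not the real page):

```python
from lxml import etree

# Hypothetical sample markup; note the padding around the heading text.
html = "<html><body><div><p>Intro</p></div><div><h4> EXERCISE </h4></div></body></html>"
root = etree.fromstring(html, etree.HTMLParser())
tree = etree.ElementTree(root)

# normalize-space() trims and collapses whitespace in an element's first
# text node before the comparison is made.
matches = root.xpath('.//*[normalize-space(text())="EXERCISE"]')

# getpath() adds a positional index such as div[2] wherever a tag has
# same-named siblings.
paths = [tree.getpath(el) for el in matches]
print(paths)
```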

Answer 1
0 2019-07-31 19:19:10

Answer 2
0 2019-07-31 19:35:25

Answer 3
0 accepted 2019-07-31 19:36:56
