Extract all text from HTML after a specific tag?
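I want to extract the text of an HTML file that comes after the second occurrence of a specific tag.
I have already tried regular expressions and bs4, but I can't figure out what is going wrong. The regex only ever gives me the match itself, without the rest of the HTML file, and bs4 doesn't work for me because I don't know how to specify the end of the file.
Simplified: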
<html>
<veryspecific tag>
abc
</veryspecific tag>
<stuff that comes before>
</stuff that comes before>
<...
<veryspecific tag>
abc
</veryspecific tag>
<other tags that come after>
something
</other tags that come after>
</...>
<other tags that come after2>
something
</other tags that come after2>
</html>
#I tried splitting it, so I can take the last part which should contain the end of the file, starting from the latest occurrence, but it did not work:
htmltxt.split(r'abc.*$')
# I also tried to get the last tag and try to "while" over the 2 to get the text:
last_tag = html_parsed.findall('a')[-1]
while specific_tag != last_tag:
text = ...
specific_tag = specific_tag.next
I found the required tag and can extract it, but I also need the rest of the file. Is there a simple, Pythonic way to do this?
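A note on the `split` attempt above: `str.split` treats its argument as a literal separator, not a regular expression, so a pattern such as `r'abc.*$'` is searched for character by character and never matches. The sketch below shows one plain-string way to get everything after the second occurrence, using `maxsplit`. The sample string and the tag names `before` and `after` are made up for illustration, and the regex-based tag stripping is a crude shortcut that will not hold up on real-world HTML.

```python
import re

# Hypothetical sample; "veryspecific" mirrors the simplified HTML in the question.
htmltxt = (
    "<html><veryspecific>abc</veryspecific><before>x</before>"
    "<veryspecific>abc</veryspecific><after>something</after></html>"
)

# str.split takes a literal separator, so the regex-looking pattern
# is not found and the string comes back unsplit.
unsplit = htmltxt.split(r'abc.*$')

# maxsplit=2 cuts at the first two closing tags only;
# parts[2] is everything after the second occurrence.
parts = htmltxt.split("</veryspecific>", 2)
tail = parts[2]

# Crude tag removal for the sketch; an HTML parser is the safer choice.
text_after = re.sub(r"<[^>]+>", "", tail)
print(text_after)
```

If the last occurrence is wanted instead of the second, `htmltxt.rpartition("</veryspecific>")[2]` returns the tail in one call.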
Here is a suggestion using BeautifulSoup:
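from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'html.parser')  # assumption: htmltxt is the HTML string; parser choice is illustrative
# Second occurrence of the tag, then every tag that starts after it
mark = soup.find('veryspecific').find_next('veryspecific')
all_other_tags = mark.find_all_next(name=True)
print(''.join(i.text for i in all_other_tags))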
It gives me this output:
something
something
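One caveat with collecting `.text` from every following tag: when the tags after the marker are nested, a parent's `.text` already contains its children's text, so that text ends up in the result twice. A possible variant, sketched below, collects the text nodes themselves with `find_all_next(string=True)`. The sample document and the tag names `before`, `after`, and `after2` are invented for illustration, and `html.parser` is just one parser choice. Because `find_all_next` also walks into the marker tag's own children, the sketch filters out strings that sit inside the marker.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; all tag names are placeholders.
html = """
<html>
<veryspecific>abc</veryspecific>
<before>ignored</before>
<div>
<veryspecific>abc</veryspecific>
<after>something <b>nested</b></after>
</div>
<after2>more</after2>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Second occurrence of the marker tag.
mark = soup.find_all("veryspecific")[1]

# Text nodes that follow the marker's opening tag, skipping the
# ones that live inside the marker itself (identity check on parents).
pieces = [
    s.strip()
    for s in mark.find_all_next(string=True)
    if all(parent is not mark for parent in s.parents)
]

# Drop whitespace-only nodes and join the rest.
text_after = " ".join(p for p in pieces if p)
print(text_after)
```

There is no need to mark the end of the file: `find_all_next` simply runs until the parsed document has no more elements.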