Extract all text from HTML after a specific tag?

Question

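I want to extract the text of an HTML file that comes after the second occurrence of a specific tag.

I have already tried regular expressions and bs4, but I can't tell what is going wrong. The regex only ever returns the match itself, without the rest of the HTML file, and bs4 just doesn't work for me because I don't know how to specify the end of the file.

Simplified: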

<html>
    <veryspecific tag>
       abc
    </veryspecific tag>

    <stuff that comes before>
    </stuff that comes before>
    <...

       <veryspecific tag>
       abc
       </veryspecific tag>

       <other tags that come after>
       something
       </other tags that come after>
    </...>

    <other tags that come after2>
    something
    </other tags that come after2>
</html>

#I tried splitting it, so I can take the last part which should contain the end of the file, starting from the latest occurrence, but it did not work:

htmltxt.split(r'abc.*$')


# I also tried to get the last tag and try to "while" over the 2 to get the text:

last_tag = html_parsed.findall('a')[-1]

while specific_tag != last_tag:
   text = ...
   specific_tag = specific_tag.next

I have found the tag I need and can extract it, but I also need the rest of the file. Is there a simple, Pythonic way to do this?

Answer 1

Here is a suggestion using BeautifulSoup:

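# soup is the parsed document, e.g. BeautifulSoup(htmltxt, 'html.parser')
mark = soup.find('veryspecific').find_next('veryspecific')  # second occurrence of the tag
all_other_tags = mark.find_all_next(name=True)  # every tag after it, in document order

print(''.join(i.text for i in all_other_tags))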

It gave me this output:

       something

    something
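
One caveat with the snippet above: `find_all_next(name=True)` returns every following tag, nested ones included, and a parent's `.text` already contains its children's text, so text inside nested tags can show up more than once. As a rough sketch of another way to do it, the version below uses the standard library's `html.parser` instead of BeautifulSoup and keeps each text node once, starting after the second closing `veryspecific` tag. The names `TextAfterNthTag` and `text_after`, and the sample markup, are made up for illustration; the sample is a cleaned-up stand-in for the HTML in the question, not the real file.

```python
from html.parser import HTMLParser


class TextAfterNthTag(HTMLParser):
    """Collect the text that follows the n-th closing occurrence of a tag."""

    def __init__(self, tag, occurrence=2):
        super().__init__()
        self.tag = tag.lower()        # HTMLParser reports tag names in lower case
        self.occurrence = occurrence
        self.closed = 0               # number of </tag> end tags seen so far
        self.chunks = []              # text nodes found after the marker

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.closed += 1

    def handle_data(self, data):
        # Keep non-blank text only once the marker tag has been closed often enough.
        if self.closed >= self.occurrence and data.strip():
            self.chunks.append(data.strip())


def text_after(html, tag, occurrence=2):
    """Return the text after the given occurrence of `tag`, one text node per line."""
    parser = TextAfterNthTag(tag, occurrence)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample, loosely modelled on the simplified HTML in the question.
sample = """
<html>
  <veryspecific>abc</veryspecific>
  <before>ignored</before>
  <div>
    <veryspecific>abc</veryspecific>
    <after><b>something</b> nested</after>
  </div>
  <after2>something else</after2>
</html>
"""

print(text_after(sample, "veryspecific"))
```

For this sample it prints `something`, `nested` and `something else` on separate lines, with the nested `<b>` text appearing only once. No "end of file" has to be specified: the parser simply keeps collecting until the input runs out.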

Solution 1
1 2019-10-23 05:27:36
