Extract text from an HTML table if it contains certain words

Question

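Python beginner here. There may be a command for this that I don't know about, but I couldn't find a solution online. In my Python setup I have an HTML file as a string. The file looks like this: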

<table>
This is Table 1
</table>

<table>
This is Table 2
</table>

<table>
This is Table 3
</table>

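I want to extract the text between <table> and </table>, but only if the table contains certain strings. So here I want only the table that says Table 2.

I tried splitting the document on table, but that got messy because the result also includes the parts between </table> and <table>. I know the re.search command, but I don't know how to combine it with an if statement.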

re.search(r"<table>(.*)</table>", text)
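
The regex idea can be combined with an if test roughly as follows. This is only a minimal sketch, assuming the tables are not nested and the opening tag carries no attributes; the names html_text, table_pattern and tables_containing are illustrative and do not come from the original post.

```python
import re

# Sample input mirroring the structure shown in the question.
html_text = """<table>
This is Table 1
</table>

<table>
This is Table 2
</table>

<table>
This is Table 3
</table>"""

# Non-greedy (.*?) stops at the first </table>;
# re.DOTALL lets "." match the newlines inside each table.
table_pattern = re.compile(r"<table>(.*?)</table>", re.DOTALL)


def tables_containing(source, keyword):
    """Return the stripped body of every <table> block that contains keyword."""
    matches = []
    for body in table_pattern.findall(source):
        if keyword in body:  # the "if" part of the question
            matches.append(body.strip())
    return matches


print(tables_containing(html_text, "Table 2"))  # ['This is Table 2']
```

Using findall with a non-greedy group avoids the problem of a single greedy match swallowing everything from the first <table> to the last </table>. For real-world HTML a parser is usually the safer choice.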

Answer 1

Use the lxml parser to solve this.

from lxml import html

text = '''<table>This is Table 1</table>

<table>This is Table 2</table>

<table>This is Table 3</table>'''

tree = html.fromstring(text)
# Select the text of every <table> whose text contains 'Table 2'
print(tree.xpath("//table[contains(text(), 'Table 2')]/text()"))

The output looks like this:

['This is Table 2']
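
One caveat: in XPath 1.0, contains(text(), ...) only tests the first direct text node of the table, so it may miss text that sits inside nested tr, td or b elements. The sketch below is one possible way around that, doing the filtering in Python with text_content(). The sample markup and the helper name tables_with_any are assumptions for illustration, not part of the original answer.

```python
from lxml import html

# Assumed sample: the same three tables, but with the text nested in rows and cells.
text = '''<table><tr><td>This is <b>Table 1</b></td></tr></table>
<table><tr><td>This is <b>Table 2</b></td></tr></table>
<table><tr><td>This is <b>Table 3</b></td></tr></table>'''


def tables_with_any(source, keywords):
    """Return the full text of each <table> that contains any of the keywords."""
    root = html.fromstring(source)
    results = []
    for table in root.xpath("//table"):
        # text_content() joins the text of the table and all its descendants.
        content = table.text_content().strip()
        if any(word in content for word in keywords):
            results.append(content)
    return results


print(tables_with_any(text, ["Table 2", "Table 3"]))
```

Passing a list of keywords covers the "certain strings" case from the question without building a longer XPath expression.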

Answer 2

One idea is to parse the HTML with BeautifulSoup. You can then access tags simply, like this:

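from bs4 import BeautifulSoup

# text: your HTML string (assumed name); find() returns None if there is no <tr>
soup = BeautifulSoup(text, "html.parser")
row = soup.find('tr') # Extract and return first occurrence of tr
print(row)            # Print row with HTML formatting
print("=========Text Result==========")
print(row.get_text()) # Print row as text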

You can then get the inner HTML and compare it with your string. This assumes you can access the HTML through BeautifulSoup. I got this from https://www.pluralsight.com/guides/web-scraping-with-beautiful-soup
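
Applied to the tables in the question, the "get the text and compare it" step might look like the sketch below. It assumes the built-in html.parser backend; the names html_text and matching_tables are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample input mirroring the structure shown in the question.
html_text = """<table>
This is Table 1
</table>

<table>
This is Table 2
</table>

<table>
This is Table 3
</table>"""


def matching_tables(source, keyword):
    """Return the text of every <table> whose text contains keyword."""
    soup = BeautifulSoup(source, "html.parser")
    results = []
    for table in soup.find_all("table"):
        # get_text(strip=True) drops the surrounding whitespace and newlines.
        content = table.get_text(strip=True)
        if keyword in content:
            results.append(content)
    return results


print(matching_tables(html_text, "Table 2"))  # ['This is Table 2']
```

find_all("table") returns every table rather than only the first match, which is what find() would give, so each one can be checked in turn.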

2 solutions

Solution 1
1 2019-07-17 18:06:12

Solution 2
0 2019-07-17 18:00:02
