How to extract information from HTML text using Python

Question

I may have a document that contains information like the following:

<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>

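This block essentially repeats x times, and I need to extract the information from each one. However, the number of <p> tags varies, so there could be, say, 7 separate <p> tags before the next <h1> tag appears. I am also using BeautifulSoup to help with this.

I can extract the data, but I can't work out a rule that says: for each <h1> tag, grab however many tags follow it until the next <h1> tag.

So each time an <h1> tag appears, it marks the start of a new record.

Hope this makes sense, thanks!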

Answer 1

What data structure do you want to store it in?

You can use Python's str.split() method and split on "<h1>", which gives you something like the following:

text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""

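# Split on the opening tag. Because the text starts with "<h1>", the first
# piece is an empty string, so drop empty pieces and trim stray whitespace.
textChunks = [chunk.strip() for chunk in text.split("<h1>") if chunk.strip()]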

textChunks then looks like:

["""Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>""",
 """Some More Text</h1>
       <p>Grab this</p>"""]
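
As a rough sketch of what iterating over those chunks could look like, the snippet below turns each chunk into a small dict. The helper name `chunk_to_record` and the `title`/`details` keys are made up for illustration, and the regular expressions assume markup as simple and regular as the sample; anything messier is better left to a real HTML parser.

```python
import re

text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""


def chunk_to_record(chunk):
    """Turn one '<h1>'-delimited chunk into a dict (hypothetical helper)."""
    # Everything before the closing </h1> is the heading text.
    heading, _, rest = chunk.partition("</h1>")
    # Collect the contents of every <p>...</p> that follows the heading.
    paragraphs = re.findall(r"<p>(.*?)</p>", rest, flags=re.DOTALL)
    # Strip nested inline tags such as <i> from each paragraph.
    details = [re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs]
    return {"title": heading.strip(), "details": details}


# Skip the empty piece produced by the leading "<h1>".
records = [chunk_to_record(c) for c in text.split("<h1>") if c.strip()]

for record in records:
    print(record["title"], record["details"])
```

Each record holds however many <p> entries happened to follow its heading, so the varying paragraph count needs no special handling.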

You can then handle each chunk separately, either by iterating over the list or by using BeautifulSoup.
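
If you would rather let BeautifulSoup do the grouping, one possible approach is to walk the siblings that follow each <h1> and stop at the next <h1>. This is only a sketch: it assumes the <h1> and <p> tags are direct siblings, as in the sample, and the `records` list with its `title`/`details` keys is an illustrative choice rather than anything the question specifies.

```python
from bs4 import BeautifulSoup

html = """<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>
<h1>Some More Text</h1>
<p>Grab this</p>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for h1 in soup.find_all("h1"):
    details = []
    # find_next_siblings() with no arguments yields the tags that follow
    # this <h1> at the same nesting level, in document order.
    for sibling in h1.find_next_siblings():
        if sibling.name == "h1":
            break  # the next heading starts a new record
        if sibling.name == "p":
            details.append(sibling.get_text(strip=True))
    records.append({"title": h1.get_text(strip=True), "details": details})

for record in records:
    print(record["title"], record["details"])
```

Because the inner loop stops at the next <h1>, it does not matter whether a record has one <p> or seven.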

Solution 1
0 2018-09-26 14:58:33