How to extract information from HTML text using Python

Question

I might have a document with the following information:

<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>

This block repeats an arbitrary number of times, and I need to extract the information from each repetition. However, the number of <p> tags varies, so there could be, say, 7 separate ones before the next <h1> tag appears. I am also using BeautifulSoup to help with this.

I can extract this data, but I cannot work out a rule that, for every <h1> tag, extracts however many tags follow it until the next <h1> tag.

So every time an <h1> tag appears, it marks the start of a new record.

Hope this makes sense, thanks!

Answer 1

What sort of data structure are you hoping to store this in?

You could use Python's str.split() method and split on "<h1>", which would give you something that looks like this:

text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""

textChunks = text.split("<h1>")

Then textChunks would look something like this (the real list also starts with an empty string, because the text begins with "<h1>"):

["""Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>""",
 """Some More Text</h1>
       <p>Grab this</p>"""]
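A minimal sketch of tidying the split result, assuming the same sample text as above. str.split() keeps an empty first element when the text starts with the separator, and each chunk keeps the whitespace around it, so both are cleaned up here. The name text_chunks is only illustrative.

```python
text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""

# Drop the empty piece before the first "<h1>" and trim the
# whitespace left at the edges of every remaining chunk.
text_chunks = [chunk.strip() for chunk in text.split("<h1>") if chunk.strip()]

for chunk in text_chunks:
    # The first line of each chunk is the heading text plus "</h1>".
    print(chunk.splitlines()[0])
```

Each element of text_chunks now corresponds to one record: the heading text followed by that record's <p> tags.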

You can then handle each chunk separately by looping through the list, or by parsing it with BeautifulSoup.
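As a rough sketch of the "every <h1> starts a new record" rule from the question, here is one way to build the records in a single pass. It is not the approach from the answer: it swaps in the standard library's html.parser so it runs without extra installs, and it assumes each record is one <h1> followed by any number of <p> tags. The class name RecordParser and the "title"/"paragraphs" keys are made up for illustration.

```python
from html.parser import HTMLParser


class RecordParser(HTMLParser):
    """Hypothetical helper: groups <p> text under the preceding <h1>."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per <h1>
        self._target = None  # "title", "p", or None when outside both

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # A new <h1> always opens a new record.
            self.records.append({"title": "", "paragraphs": []})
            self._target = "title"
        elif tag == "p" and self.records:
            # A <p> belongs to the most recent record.
            self.records[-1]["paragraphs"].append("")
            self._target = "p"

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._target = None

    def handle_data(self, data):
        if not self.records or self._target is None:
            return  # whitespace between tags, or text before the first <h1>
        if self._target == "title":
            self.records[-1]["title"] += data
        else:
            # Text inside nested tags such as <i> also lands here.
            self.records[-1]["paragraphs"][-1] += data


text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""

parser = RecordParser()
parser.feed(text)
parser.close()
records = parser.records

for record in records:
    print(record["title"], record["paragraphs"])
```

Because a record is only closed when the next <h1> arrives, the number of <p> tags per record can vary freely. A list of dicts is just one possible answer to the data-structure question above.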

Answer 1
0 2018-09-26 14:58:33
