How to extract information from html text with Python

I might have a document with the following information:

<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>

This block will repeat an arbitrary number of times, and I need to extract the information from each repetition. However, the number of <p> tags varies, so there could be 7 separate ones before the next h1 tag appears. I am also using beautifulsoup to help with this.

I can extract this data, but I cannot work out a rule that, for every h1 tag, extracts all the tags that follow it until the next h1 tag.

So every time an h1 tag appears, it starts a new record.

Hope this makes sense, thanks!

What sort of data structure are you hoping to store this in?

You could use Python's str.split() method and split on "<h1>", which would give you something that looks like this:

text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""

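# text starts with "<h1>", so split() yields an empty first element;
# strip each chunk and drop the empty ones
textChunks = [chunk.strip() for chunk in text.split("<h1>") if chunk.strip()]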

Then textChunks would look something like:

["""Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>""",
 """Some More Text</h1>
       <p>Grab this</p>"""]

You can then process each chunk separately by looping through the list, or by using beautifulsoup.
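
Since the question already uses beautifulsoup, here is a minimal sketch of doing the grouping there directly instead of splitting the raw string. It assumes each h1 tag and the p tags that belong to it are siblings (share the same parent), as in the sample above. The function name group_by_h1 and the "title" / "paragraphs" keys are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>
<h1>Some More Text</h1>
<p>Grab this</p>"""


def group_by_h1(html):
    """Return one record per <h1>, holding the text of the <p> tags after it."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for h1 in soup.find_all("h1"):
        paragraphs = []
        # Walk the tags that follow this <h1> at the same nesting level.
        for sibling in h1.find_next_siblings():
            if sibling.name == "h1":
                break  # the next record starts here
            if sibling.name == "p":
                paragraphs.append(sibling.get_text(strip=True))
        records.append({"title": h1.get_text(strip=True),
                        "paragraphs": paragraphs})
    return records


records = group_by_h1(html)
for record in records:
    print(record)
```

Each record keeps however many p tags follow its h1, so it does not matter whether there is one or seven. If the real pages wrap the headings and paragraphs in different parent elements, the sibling walk would need adjusting.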
