Splitting an HTML document using lxml.html

I have an HTML document containing multiple chapters of text, where the H1 tag is the chapter separator. How can I split such a document into HTML snippets, where each snippet starts with the h1 tag of the corresponding "chapter"? I thought of prettifying the HTML and then iterating over the content line by line, but that's kind of a hack. Is there a better solution using lxml?

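import lxml.html

# htmltext is the HTML source string from the question
tree = lxml.html.document_fromstring(htmltext)
for element in tree.iter('h1'):  # visit every <h1> in document order
    for subelement in element:   # children nested inside the <h1>
        pass  # do stuff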

That'll find the elements that are h1 tags, and then you can iterate through each one's subelements. You could also just take all the text inside the element as a string and work with it that way. Whatever you want to do. lxml (http://lxml.de/) is awesome and I would recommend it. I had to update code that was already using it, and I just kept the website open for reference whenever I had a question :)
