Nokogiri: extract nodes from HTML

I need to extract nodes from HTML (not just the inner text, so that I can preserve the formatting for further manual investigation). I wrote the code below, but because of how traverse works, I get duplicates in the new HTML file.

This is the real HTML to parse: http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm

Basically, I need to extract Item 10 and the part between "Executive Officers of the Registrant" and the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not. I need to get the nodes rather than just the text because I want to preserve the tables, so that in my next step I can parse any tables in these sections.

Sample HTML:

html = "
<BODY>
<P>Dont need this </P>  
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"

I want to get:

html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"

Extraction should start when the start_keyword appears and end when the end_keyword appears.

There are multiple sections I need to extract from one HTML document. The keywords can appear in nodes with different names.

doc.at_css('body').traverse do |node|
    inMySection  = false

    if node.text.match(/#{start_keyword}/)
        inMySection = true
    elsif node.text.match(/#{end_keyword}/)
        inMySection = false
    end
    if inMySection
        #Extract the nodes
    end
end

I also tried to use XPath to achieve this, without success, after referring to these posts:

XPath axis, get all following nodes until

XPath to find all following siblings up until the next sibling of a particular type

The problem is not Nokogiri but your algorithm. You've put your flag inMySection inside the loop, which means that at each step it is reset to false, and you lose the fact that it was previously set to true.
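To make the scoping issue concrete, here is a minimal sketch in plain Ruby, with no Nokogiri involved. The texts array is a hypothetical stand-in for the text of each visited node, and in_section is just the snake_case spelling of inMySection; neither comes from the original code.

```ruby
# Hypothetical stand-in for the text of each node, in document order.
texts = ['Dont need this', 'Start', 'Text To Extract 1', 'End', 'After']

# Buggy: the flag is re-created on every iteration, so it never stays true
# past the one node that matches the start keyword.
buggy = texts.select do |text|
  in_section = false
  in_section = true if text.match?(/Start/)
  in_section = false if text.match?(/End/)
  in_section
end

# Fixed: the flag lives outside the block and keeps its value between nodes.
in_section = false
fixed = texts.select do |text|
  if text.match?(/Start/)
    in_section = true
  elsif text.match?(/End/)
    in_section = false
  end
  in_section
end

p buggy
p fixed
```

With the flag inside the block, only the element that itself matches the start keyword is selected; with the flag outside, everything from the start keyword up to (but not including) the end keyword is selected.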

Based on your sample HTML input and output, the following snippet works:

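require 'nokogiri'

# html is the sample string from the question
nodes = Nokogiri::HTML(html)
# The flag is declared outside the block so it keeps its value between nodes
inMySection  = false
nodes.at_xpath('//body').traverse do |node|
  if node.text.match(/Start/)
    inMySection = true
  elsif node.text.match(/End/)
    inMySection = false
  end
  # Drop every node visited while outside the section
  node.remove unless inMySection
end
print nodes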
