
Using Ruby and Nokogiri to parse HTML using HTML comments as markers

How can I use Ruby to extract information from a table consisting of rows like the one below? Is it possible to detect the comments using Nokogiri?

<!-- Begin Topic Entry 4134 --> 
    <tr> 
        <td align="center" class="row2"><img src='style_images/ip.boardpr/f_norm.gif' border='0'  alt='New Posts' /></td> 
        <td align="center" width="3%" class="row1">&nbsp;</td> 
        <td class="row2"> 
            <table class='ipbtable' cellspacing="0"> 
                <tr> 

<td valign="middle"><a href='http://www.xxx.com/index.php?showtopic=4134&amp;view=getnewpost'><img src='style_images/ip.boardpr/newpost.gif' border='0'  alt='Goto last unread' title='Goto last unread' hspace=2></a></td> 

                    <td width="100%"> 
                    <div style='float:right'></div> 
                    <div> <a href="http://www.xxx.com/index.php?showtopic=4134&amp;hl=">EXTRACT LINK 1</a>  </div> 
                    </td> 
                </tr> 
            </table> 
            <span class="desc">EXTRACT DESCRIPTION</span> 
        </td> 
        <td class="row2" width="15%"><span class="forumdesc"><a href="http://www.xxx.com/index.php?showforum=19" title="Living">EXTRACT LINK 2</a></span></td> 
        <td align="center" class="row1" width='10%'><a href='http://www.xxx.com/index.php?showuser=1642'>Mr P</a></td> 
        <td align="center" class="row2"><a href="javascript:who_posted(4134);">1</a></td> 
        <td align="center" class="row1">46</td> 
        <td class="row1"><span class="desc">Today, 12:04 AM<br /><a href="http://www.xxx.com/index.php?showtopic=4134&amp;view=getlastpost">Last post by:</a> <b><a href='http://www.xxx.com/index.php?showuser=1649'>underft</a></b></span></td> 
    </tr> 
<!-- End Topic Entry 4134 -->

Try using XPath instead:

require 'nokogiri'
html_doc = Nokogiri::HTML("<html><body><!-- Begin Topic Entry 4134 --></body></html>")
# Select every comment node in the document.
html_doc.xpath('//comment()')

You could implement a Nokogiri SAX parser. This takes less work than it might seem at first sight. You get events for elements (together with their attributes) and for comments.

Within your parser, you should remember the state, for example @currently_interested = true, to know which parts to keep and which to ignore.

