How do I omit script elements from HTML using XPath in Nokogiri in Ruby on Rails?

Question

Say I start with everything inside the body element:

Nokogiri::HTML( doc ).xpath( "/html/body/node()" ).to_html

which contains some <script> and <noscript> elements. How do I get rid of these?

Answer 1

You may need to change the XPath expression to:

Nokogiri::HTML( doc ).xpath( "/html/body/node()[not(self::script or self::noscript)]" ).to_html

Answer 2

#!/usr/bin/env ruby

require 'nokogiri'

html = <<EOT
<html>
  <head>
    <script>
      <!-- dummy script !>
    </script>
  </head>
  <body>
    <script><!-- dummy script !></script>
    <noscript>dummy script</noscript>
  </body>
</html>
EOT

doc = Nokogiri::HTML(html)

Here's the gist of it:

doc.at('body').search('script,noscript').remove

puts doc.to_html

>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
>> <script>
>>       <!-- dummy script !>
>>     </script>
>> </head>
>> <body>
>>     
>>   </body>
>> </html>

For simplicity, I'm using Nokogiri's support for CSS selectors rather than XPath.

doc.at('body').search('script,noscript').remove

looks for the first occurrence of the <body> tag, then looks inside for all <script> and <noscript> tags, removing them.

The gap between the resulting <body> tags is the result of the newlines in the text nodes that trailed the removed tags.
