
How to remove a node using Nokogiri

I have an HTML structure like this:

<div>
  This is
  <p> very
    <script>
      some code
    </script>
  </p>
   important.
</div>

I know how to get a Nokogiri::XML::NodeSet from this:

dom.xpath("//div")

I now want to filter out any script tag:

dom.xpath("//script")

So I can get something like:

<div>
  This is
  <p> very</p>
   important.
</div>

So that I can call div.text to get:

"This is very important."

I tried walking over all the child nodes, both recursively and iteratively, and matching each one so I could skip the nodes I don't want, but I ended up with either too much or too little whitespace. I'm quite sure there's a cleaner, more idiomatic Ruby way.

What would be a good way to do this?

1st problem

To remove all the script nodes:

require 'nokogiri'

html = "<div>
  This is
  <p> very
    <script>
      some code
    </script>
  </p>
   important.
</div>"

doc = Nokogiri::HTML(html)

doc.xpath("//script").remove

p doc.text
#=> "\n  This is\n   very\n    \n  \n   important.\n"

Thanks to @theTinMan for his tip (calling remove on one NodeSet instead of each Node).

2nd problem

To remove the unneeded whitespace, you can use:

  • strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
  • gsub to replace each run of multiple whitespace characters with a single space


p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
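The two steps above can be wrapped in a small helper. This is only a sketch: normalize_whitespace is a hypothetical name, not a Nokogiri or Ruby built-in, and it takes a plain string so it can be tried without parsing any HTML. It also shows why [[:space:]] may be a safer choice than String#split when the text contains a non-breaking space (U+00A0, the character an &nbsp; entity decodes to).

```ruby
# Hypothetical helper (not part of Ruby or Nokogiri): collapse every run of
# whitespace into a single space, then trim both ends.
def normalize_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# The string that doc.text returned once the script node was removed.
raw = "\n  This is\n   very\n    \n  \n   important.\n"

p normalize_whitespace(raw)
#=> "This is very important."

# For plain ASCII whitespace, split/join gives the same result.
p raw.split.join(' ')
#=> "This is very important."

# [[:space:]] also matches a non-breaking space (U+00A0) in a UTF-8 string,
# which String#split does not treat as a separator.
nbsp = "very\u00A0 important"
p normalize_whitespace(nbsp)
#=> "very important"
```

If the input is known to contain only ASCII whitespace, split.join(' ') is a reasonable shorter alternative.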

NodeSet provides a remove method, which makes it easy to remove whatever matched your selector:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div><p>foo</p><p>bar</p></div>
  </body>
</html>
EOT

doc.search('p').remove
puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     <div></div>
# >>   </body>
# >> </html>

Applied to your sample input:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div>
  This is
  <p> very
    <script>
      some code
    </script>
  </p>
   important.
</div>
EOT

doc.search('script').remove
puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <div>
# >>   This is
# >>   <p> very
# >>     
# >>   </p>
# >>    important.
# >> </div>
# >> </body></html>

At that point the text in the <div> is:

doc.at('div').text # => "\n  This is\n   very\n    \n  \n   important.\n"

Normalizing that is easy:

doc.at('div').text.gsub(/[\n ]+/,' ').strip # => "This is very important."
