Say I start with everything inside the body element:
Nokogiri::HTML( doc ).xpath( "/html/body/node()" ).to_html
which contains some <script> and <noscript> elements. How do I get rid of these?
You may want to change the XPath expression to:
Nokogiri::HTML( doc ).xpath( "/html/body/node()[not(self::script or self::noscript)]" ).to_html
#!/usr/bin/env ruby
require 'nokogiri'
html = <<EOT
<html>
<head>
<script>
<!-- dummy script !>
</script>
</head>
<body>
<script><!-- dummy script !></script>
<noscript>dummy script</noscript>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
Here's the gist of it:
doc.at('body').search('script,noscript').remove
puts doc.to_html
>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
>> <script>
>> <!-- dummy script !>
>> </script>
>> </head>
>> <body>
>>
>> </body>
>> </html>
For simplicity, I'm using Nokogiri's support for CSS selectors rather than XPath.
doc.at('body').search('script,noscript').remove looks for the first occurrence of the <body> tag, then looks inside it for all <script> and <noscript> tags, removing them.
The gap between the resulting <body> tags is the result of the line breaks in the text nodes that followed the removed tags.