Strip style attributes with Nokogiri

Question

I'm scraping an HTML page with Nokogiri and I want to strip out all style attributes.
How can I achieve this? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to remove by blacklist, not by whitelist.)

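require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open
html = URI.open(url)
doc = Nokogiri::HTML(html.read)
doc.css('.post').each do |post|
  puts post.to_s
end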

=> <p><span style="font-size: x-large">bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

I want it to be:

=> <p><span>bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

Answer 1

require 'nokogiri'

html = '<p class="post"><span style="font-size: x-large">bla bla</span></p>'
doc = Nokogiri::HTML(html)
doc.xpath('//@style').remove
puts doc.css('.post')
#=> <p class="post"><span>bla bla</span></p>

Edited to show that you can just call NodeSet#remove instead of having to use .each(&:remove).

Note that if you have a DocumentFragment instead of a Document, Nokogiri has a longstanding bug where searching from a fragment does not work as you would expect. The workaround is to use:

doc.xpath('@style|.//@style').remove

Answer 2

This works with both a document and a document fragment:

doc = Nokogiri::HTML::DocumentFragment.parse(...)

or

doc = Nokogiri::HTML(...)

To delete all the 'style' attributes, you can do:

doc.css('*').remove_attr('style')

Answer 3

I tried the answer from Phrogz but could not get it to work (I was using a document fragment, though I'd have thought it should work the same).

The "//" at the start didn't seem to check all nodes as I would expect. In the end I did something a bit more long-winded, but it worked. For the record, in case anyone else has the same trouble, here is my solution (dirty though it is):

doc = Nokogiri::HTML::Document.new
body_dom = doc.fragment( my_html )

# strip out any attributes we don't want
body_dom.xpath( './/*[@align]|*[@align]' ).each do |tag|
    tag.attributes["align"].remove
end

Answer 1: 18, accepted, 2011-05-23 22:26:25
Answer 2: 8, 2014-10-08 01:50:24
Answer 3: 3, 2012-07-11 10:03:26
