Cleaning up HTML using Nokogiri

Question

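I'm trying to clean up some CMS-entered HTML that has extraneous paragraph tags and br tags everywhere. The Sanitize gem has proved very useful for this, but I've run into one specific problem.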

The problem is when there is a br tag directly after the opening paragraph tag or directly before the closing one, e.g.

<p>
  <br />
  Some text here
  <br />
  Some more text
  <br />
</p>

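I would like to strip out the extraneous first and last br tags, but not the middle one.

I'm very much hoping I can use a Sanitize transformer to do this, but I can't seem to find the right matcher to achieve it.

Any help would be much appreciated.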

Answer 1

Here's how to find the <br> nodes that are direct children of a <p>:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
  <br />
  Some text here
  <br />
  Some more text
  <br />
</p>
EOT

doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]

Once you know you can find them, it's easy to remove specific ones:

br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n  \n  Some text here\n  <br>\n  Some more text\n  \n</p>\n"

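Note that Nokogiri removed the <br> nodes, but the whitespace-only Text nodes next to them (containing "\n") are still there as siblings. Browsers collapse that whitespace and won't display the line ends, but if you want tidier output, here's how to remove those too. Only whitespace-only siblings are removed, so neighbouring text such as "Some text here" is preserved. Run this against a freshly parsed doc, because the previous snippet has already removed two of the three <br> nodes:

br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  # Remove only whitespace-only neighbours; keep text nodes that carry content.
  [br.previous_sibling, br.next_sibling].each do |sibling|
    sibling.remove if sibling && sibling.text? && sibling.content.strip.empty?
  end
  br.remove
end
doc.to_html
# => "<p>\n  Some text here\n  <br>\n  Some more text\n  </p>\n"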

Answer 2

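# Sanitize transformer that strips <br> elements from the start of each <p>.
# A plain nil check replaces ActiveSupport's `present?`, so this runs without Rails.
# Note: a whitespace-only text node in front of the <br> stops the match.
initial_linebreak_transformer = lambda { |env|
  node = env[:node]
  next unless node && node.element? && node.name.downcase == 'p'

  # Keep unlinking for as long as the first child is a <br>.
  while (first_child = node.children.first) && first_child.name.downcase == 'br'
    first_child.unlink
  end
}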


Answer 1
1, accepted, 2014-09-16 18:22:24

Answer 2
0, 2014-09-16 20:34:08
