Cleaning up HTML with Nokogiri

Question

I'm trying to clean up some CMS-entered HTML that has extraneous paragraph tags and br tags everywhere. The Sanitize gem has proved very useful for this, but I am stuck on one particular issue.

The problem occurs when a br tag sits directly after an opening paragraph tag or directly before a closing one, e.g.

<p>
  <br />
  Some text here
  <br />
  Some more text
  <br />
</p>

I would like to strip out the extraneous first and last br tags, but not the middle one.

I'm very much hoping I can use a Sanitize transformer to do this, but I can't seem to find the right matcher to achieve it.

Any help would be much appreciated.

Answer 1

Here's how to locate the <br> nodes that are direct children of a <p>:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
  <br />
  Some text here
  <br />
  Some more text
  <br />
</p>
EOT

doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]

Once we know we can find them, it's easy to remove specific ones:

br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n  \n  Some text here\n  <br>\n  Some more text\n  \n</p>\n"

Notice that Nokogiri removed them, but the Text nodes that are their immediate siblings, containing the "\n" and indentation, are left behind. A browser will collapse that whitespace and not display the line-ends, but if you want tidier output, here's how to remove those also:

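# Assumes doc has been re-parsed from the original HTML
br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  # Remove only whitespace-only Text siblings, so real text such as
  # "Some text here" is not deleted along with the <br>
  [br.previous_sibling, br.next_sibling].each do |sibling|
    sibling.remove if sibling && sibling.text? && sibling.blank?
  end
  br.remove
end
doc.to_html
# => "<p>\n  Some text here\n  <br>\n  Some more text\n  </p>\n"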

Answer 2

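# Sanitize transformer: strips leading <br> elements from each <p>.
# Note: it only matches when the <br> is the very first child; a
# whitespace Text node in front of the <br> prevents the match.
initial_linebreak_transformer = lambda { |options|
  node = options[:node]
  # Plain nil check instead of ActiveSupport's present?
  if node && node.element? && node.name.downcase == 'p'
    first_child = node.children.first
    # Guard against empty paragraphs, where first_child is nil
    if first_child && first_child.name.downcase == 'br'
      first_child.unlink
      # Recurse to handle several consecutive leading <br> tags
      initial_linebreak_transformer.call options
    end
  end
}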

Answer 1: accepted, 1 vote, 2014-09-16 18:22:24
Answer 2: 0 votes, 2014-09-16 20:34:08
