How to make Nokogiri transparently return un/encoded HTML entities untouched?

How can I use Nokogiri while leaving HTML entities and non-ASCII characters (such as German umlauts) untouched?

I.e.:

# this is fine
node = Nokogiri::HTML.fragment('<p>&ouml;</p>')
node.to_s # => '<p>&ouml;</p>'

# this is not
node = Nokogiri::HTML.fragment('<p>ö</p>')
node.to_s # => '<p>&ouml;</p>'

# this is what I need
node = Nokogiri::HTML.fragment('<p>ö</p>')
node.to_s # => '<p>ö</p>'

I've tried both PARSE_OPTIONS and the :save_with options, but could not find a way to make Nokogiri transparently behave as shown above.

Any pointers?

OK, my question has been answered by Aaron via Twitter / gist:

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML::Document.new
doc.encoding = 'UTF-8'

# We added a contextual fragment method for the 1.4.2 release. This *might*
# work in 1.4.1. If you want to mess with 1.4.2, build from my github, or
# grab one of our nightly builds:
#
# $ sudo gem install nokogiri -s http://tenderlovemaking.com/
#
# Also, libxml2 had a bug with encoding when handling UTF-8 fragments, so I
# suggest you also upgrade to libxml2 2.7.7.
#
# Hope that helps!
puts doc.fragment('<p>ö</p>')
