
REXML and encoding

Can anyone please explain this result for me?

#!/usr/bin/env ruby
# encoding: utf-8

require 'rexml/document'

doc = REXML::Document.new(DATA)
puts "doc: #{doc.encoding}"
REXML::XPath.each(doc, '//item') do |item|
  puts "  #{item}: #{item.to_s.encoding}"
end

__END__
<doc>
  <item>Test</item>
  <item>Über</item>
  <item>8</item>
</doc>

Output:

doc: UTF-8
  <item>Test</item>: US-ASCII
  <item>Über</item>: UTF-8
  <item>8</item>: US-ASCII

It looks as if REXML ignores the document encoding and autodetects an encoding for each item. Am I doomed to call encode('UTF-8') on every string I pull out of REXML, even though the original document is UTF-8? What is happening here?

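You're calling Node.to_s() on your Element, which serializes the whole element as markup (hence the <item>...</item> in your output) rather than returning its text. To get the actual text, add Element.get_text() to your chain and call Text.to_s() on the result: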

puts "  #{item}: #{item.get_text.to_s.encoding}"

Output:

doc: UTF-8
  <item>Test</item>: UTF-8
  <item>Über</item>: UTF-8
  <item>8</item>: UTF-8
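
As for why the original output mixes US-ASCII and UTF-8, here is one plausible explanation that this thread does not confirm: if the serializer builds its markup by appending to an empty buffer that starts out as US-ASCII (the default source encoding on Ruby 1.9 when a file has no encoding comment), plain Ruby string rules produce exactly that pattern. The sketch below shows only the core String behaviour and does not call REXML; `serialize` is a hypothetical stand-in, not a REXML method.

```ruby
# encoding: utf-8

# Hypothetical stand-in for a serializer (NOT REXML's actual code):
# it appends pieces to a buffer that starts out empty and US-ASCII.
def serialize(text)
  buffer = String.new.force_encoding(Encoding::US_ASCII)
  # Appending ASCII-only data keeps the buffer's own encoding.
  # Appending non-ASCII UTF-8 data switches the buffer to UTF-8.
  buffer << "<item>" << text << "</item>"
  buffer
end

%w[Test Über 8].each do |text|
  markup = serialize(text)
  puts "  #{markup}: #{markup.encoding}"
end
```

If that is what is going on, the US-ASCII label is mostly cosmetic: an ASCII-only string compares equal to, and concatenates cleanly with, its UTF-8 counterpart, so re-encoding every value should rarely be necessary.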


 