
Adding backslash to fix character encoding in Ruby string

I'm sure this is very easy but I'm getting tied in a knot with all these backslashes.

I have some data that I'm scraping (politely) from a website. Occasionally a sentence comes to me looking something like this:

u00a362 000? you must be joking

Which should of course be '£62 000? you must be joking'. A short test in irb deciphered it.

ruby-1.9.2-p180 :001 > string = "u00a3"
  => "u00a3" 
ruby-1.9.2-p180 :002 > string = "\u00a3"
  => "£" 

Of course: add a backslash and it will be decoded. I created the following with the help of this question:

puts str.gsub('u00', '\\u00') 

which resulted in \u00a3 being output. This is all well and good, but I want it to be £ in the string itself. Just puts-ing it isn't enough.

It's no good doing gsub('u00a3', '£') as there will doubtless be other characters I'm missing.

Thanks for any help.
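A likely reason the backslash trick works in irb but not with gsub: the \uXXXX escape is handled by the Ruby parser when it reads a double-quoted literal, so a backslash added at run time is just another character. The short sketch below illustrates that difference; the variable names are made up for illustration.

# Written in source code, the escape becomes one character at parse time.
literal = "\u00a3"

# Inserting a backslash at run time only yields the six characters \u00a3;
# nothing re-reads the string as Ruby source afterwards.
built = 'u00a3'.gsub('u00', '\\u00')

puts literal.length   # 1
puts built.length     # 6
puts literal == built # false

So the decoding has to be done explicitly, by turning the hex digits into a code point.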

Try the Iconv library for converting the incoming string. You might also take a look at the stringex gem. It has methods to "go the other way" but it may provide the mappings you're looking for. That said, if you've got bad encoding it can be impossible to get it right.

Warning, the following is not really pretty.

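str = "u00a362 000? you must be joking"
# Wrap each u00XX sequence in markers, then split so every sequence becomes its own element.
split_unicode = str.gsub(/(u00[0-9a-f]{2})/i, "split_here\\1split_here").split(/split_here/)
final = split_unicode.map do |elem|
  if elem =~ /\Au00[0-9a-f]{2}\z/i
    # Read the two hex digits as a code point and pack it into a UTF-8 character.
    [elem.sub(/u00/i, '').hex].pack("U*")
  else
    elem
  end
end
puts final.join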

So the idea here is to find u00xx values and read the xx part as a hexadecimal number. From there, we can use the pack method to output the right Unicode characters.
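To make the two steps concrete, here is a minimal sketch of the conversion on its own, using String#hex and Array#pack. The helper name hex_to_char is hypothetical and only there for illustration.

# hex_to_char is an illustrative name, not part of the answer above.
def hex_to_char(hex_digits)
  code_point = hex_digits.hex   # "a3" => 163
  [code_point].pack("U*")       # 163  => the UTF-8 string for U+00A3
end

char = hex_to_char("a3")
puts char             # £
puts char.bytes.inspect # [194, 163]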

It can also be crunched into a horrible one-liner!

puts str.gsub(/(u00[0-9a-f]{2})/i, "split_here\\1split_here").split(/split_here/).map { |elem| elem =~ /\Au00[0-9a-f]{2}\z/i ? [elem.sub(/u00/i, '').hex].pack("U*") : elem }.join

There might be a better solution (I hope!) but this one works.
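One possibly simpler variant, offered as a sketch rather than a definitive fix: gsub accepts a block, so each match can be replaced directly without the split markers. The method name decode_u00 is made up here, and the sketch assumes the input is plain ASCII or UTF-8 and that only u00XX sequences (two hex digits) need decoding.

# decode_u00 is a hypothetical helper name used only for this sketch.
def decode_u00(str)
  # $1 holds the two hex digits captured by the regex for the current match.
  str.gsub(/u00([0-9a-f]{2})/i) { [$1.hex].pack("U") }
end

fixed = decode_u00("u00a362 000? you must be joking")
puts fixed   # £62 000? you must be joking

The result is a new string, so the £ is in the value itself and not only in what puts prints. Like the versions above, it would also rewrite any ordinary text that happens to contain u00 followed by two hex digits.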

 