Is there a way to clean a file of "Invalid byte sequence in UTF-8" errors in Ruby?

Question

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:

tr -cd '^[:print:]' < original.xml > clean.xml

Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.

The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:

require 'nokogiri'

data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
data = []
doc.xpath(".//job").each do |node|
  # Build one hash per <job>, keyed by the normalized child element names
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
  end
  data.push(hash)
end

Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"
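For anyone who wants to reproduce this without the customer file, here is a minimal sketch. The sample string is made up for illustration; it simply contains a lone 0xE9 byte, which is not a valid UTF-8 sequence on its own. Regexp-based methods such as gsub are the kind of call that raises the error.

```ruby
# Made-up sample: "caf" followed by a lone 0xE9 byte (Latin-1 "é"),
# which is invalid when the string is tagged as UTF-8.
bad = "caf\xE9 <job>"

is_valid = bad.valid_encoding?   # the string claims UTF-8 but is broken

error_message = nil
begin
  bad.gsub(/ /, "_")             # regexp matching refuses broken strings
rescue ArgumentError => e
  error_message = e.message
end

puts "valid_encoding? => #{is_valid}"
puts "gsub raised: #{error_message}"
```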

Here are all the helpful suggestions I've tried, all of which have failed.

Use Coder
```
Coder.clean!(data_string, "UTF-8")
```

Force Encoding

data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

Convert to UTF-16 and back to UTF-8

data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
data_string.encode!('UTF-8', 'UTF-16')

Use valid_encoding?
```
 data_string.chars.select{|i| i.valid_encoding?}.join
```
No characters are removed; generates "invalid byte sequence" errors.

Specify encoding on opening the file

I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):

@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    # This encoding could not read the file; move on with an empty string
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")

  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end

Use Regexp to remove non-printables:

data_string.gsub!(/[^[:print:]]/,"")
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\\|;:'",<.>/\\?]/)

For all of the above, the results are the same: either "invalid byte sequence" errors occur, or the file is cut off halfway through after reading only 4400 rows.

So why does the Linux "tr" command work perfectly, while NONE of these suggestions can do the job in Ruby on Rails?

What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found, I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):

data_string.gsub!(/[Crazy Control Characters]/,"")

But the purist in me insists there should be a more elegant, general solution.

Answer 1

Ruby 2.1 has a new method called String#scrub, which is exactly what you need.

If the string is invalid byte sequence then replace invalid bytes with given replacement character, else returns self. If block is given, replace invalid bytes with returned value of the block.

Check the documentation for more information.
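A small sketch of the three forms described above: default replacement, explicit replacement string, and block. The sample string is hypothetical; it is plain ASCII with one stray 0xFF byte in the middle.

```ruby
# Hypothetical sample: ASCII text with one stray byte (0xFF) in the middle.
dirty = "abc\xFFdef"

replaced = dirty.scrub        # default: each invalid run becomes U+FFFD
removed  = dirty.scrub("")    # explicit replacement: drop invalid bytes
# Block form: the block receives the invalid bytes and returns the replacement.
marked   = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts replaced
puts removed
puts marked
```

The original string is left untouched here; scrub! is the in-place variant.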

Answer 2

I found this on Stack Overflow in an answer to another question, and it worked fine for me too. Assuming data_string is your XML:

data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
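One thing worth checking before relying on this: with 'binary' as the source encoding, every byte above 0x7F is undefined in the conversion, so legitimate multi-byte characters get removed along with the invalid ones. A minimal sketch, using a made-up sample string and comparing against scrub (Ruby 2.1+):

```ruby
# Made-up sample: "caf", a valid "é" (bytes C3 A9), a stray 0xFF byte, "!".
sample = "caf\xC3\xA9\xFF!"

# Treating the input as binary: all non-ASCII bytes are dropped.
via_binary = sample.dup.encode!('UTF-8', 'binary',
                                invalid: :replace, undef: :replace, replace: '')

# scrub only drops the bytes that are not valid UTF-8.
via_scrub = sample.scrub("")

puts via_binary
puts via_scrub
```

If the feed is known to be pure ASCII apart from the garbage bytes, the two give the same result; otherwise the binary route loses accented characters.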

Answer 3

Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"):

data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)

This helped me once.
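A rough sketch of what this does, with hypothetical sample strings. Every byte is a defined character in ISO-8859-1, so the conversion never raises; it gives the right result when the data really is Latin-1, but re-encodes text that was already valid UTF-8 into mojibake.

```ruby
# Case 1: the bytes really are Latin-1 ("é" stored as the single byte E9).
latin1_bytes = "caf\xE9"
from_latin1 = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("utf-8")

# Case 2: the bytes were already valid UTF-8 ("é" stored as C3 A9).
utf8_bytes = "caf\xC3\xA9"
from_utf8 = utf8_bytes.dup.force_encoding("ISO-8859-1").encode("utf-8")

puts from_latin1   # the intended text
puts from_utf8     # each original byte became its own character
```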

Answer 4

Thanks for the responses. I did find something that works by testing all sorts of combinations of different tools. I hope this is helpful to other people who have shared the same frustration.

data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines and carriage returns).

My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order; otherwise gsub! throws the "invalid byte sequence" error.

Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:

Coder.clean!(data_string,'UTF-8')

or

data_string.scrub!("")

This worked perfectly for me.
