
Is there a way to clean a file of “invalid byte sequence in UTF-8” errors in Ruby?

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:

tr -cd '^[:print:]' < original.xml > clean.xml

Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.

The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:

require "nokogiri"

data_source = ARGV[0]
data_file = File.open(data_source)
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
data = []
doc.xpath(".//job").each do |node|
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
  end
  data.push(hash)
end

Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"

Here are all the helpful suggestions I've tried, all of which have failed.

  1. Use Coder

    Coder.clean!(data_string, "UTF-8")
  2. Force Encoding

    data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
  3. Convert to UTF-16 and back to UTF-8

     data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
     data_string.encode!('UTF-8', 'UTF-16')
  4. Use valid_encoding?

     data_string.chars.select{|i| i.valid_encoding?}.join

    No characters are removed; generates "invalid byte sequence" errors.

  5. Specify encoding on opening the file

I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):

@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")

  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end
  6. Use Regexp to remove non-printables:

    data_string.gsub!(/[^[:print:]]/,"")
    data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)

For all of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.

So why does the Linux "tr" command work perfectly, yet NONE of these suggestions can do the job in Ruby on Rails?
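One clue (my own demonstration, not part of the original post): Ruby's regexp engine refuses to scan a string whose bytes are invalid for its declared encoding, so the gsub-based attempts above raise before they get a chance to delete anything, while tr operates on raw bytes and never checks:

s = "abc\xE2def"            # stray byte; the literal is tagged UTF-8
s.valid_encoding?            # => false
s.gsub(/[^[:print:]]/, "")   # raises ArgumentError: invalid byte sequence in UTF-8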

What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):

data_string.gsub!(/[Crazy Control Characters]/,"")
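For anyone repeating that inspection, here is a small diagnostic sketch (mine, not the original poster's) that reads the file as raw bytes and prints the lines that are not valid UTF-8:

File.open("original.xml", "rb") do |f|
  f.each_line.with_index(1) do |line, lineno|
    next if line.dup.force_encoding("UTF-8").valid_encoding?
    puts "line #{lineno}: #{line.inspect}"  # inspect escapes the offending bytes
  end
end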

But the purist in me insists there should be a more elegant, general solution.

Ruby 2.1 has a new method, String#scrub, which is exactly what you need.

If the string contains invalid byte sequences, it replaces the invalid bytes with the given replacement string and returns the result; otherwise it returns self. If a block is given, invalid bytes are replaced with the return value of the block.

Check the documentation for more information.
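A minimal sketch of String#scrub in use (Ruby >= 2.1; the sample string is my own, not the asker's data):

data_string = "valid \xE2 text".force_encoding("UTF-8")
data_string.valid_encoding?  # => false
data_string.scrub("")        # => "valid  text"       (invalid bytes dropped)
data_string.scrub            # => "valid \uFFFD text" (default replacement character)
data_string.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "valid <e2> text"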

I found this on Stack Overflow in an answer to a different question, and it too worked fine for me. Assuming data_string is your XML:

data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
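One caveat worth knowing (my observation, not the answerer's): transcoding from 'binary' (ASCII-8BIT) treats every byte above 0x7F as undefined, so this strips all non-ASCII characters, valid ones included:

s = "caf\xC3\xA9 \xE2"  # valid UTF-8 "café" followed by one stray byte
s.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: "")
# => "caf "  -- the é is removed along with the stray \xE2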

Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"):

data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)

This helped me once.
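The reason this never raises (my explanation, not the answerer's): every byte value is a defined character in ISO-8859-1, so the transcode cannot fail. The trade-off is that multi-byte UTF-8 sequences in the input come out as mojibake instead of being removed:

"caf\xE9".force_encoding("ISO-8859-1").encode("UTF-8")      # => "café"  (genuine Latin-1 input)
"caf\xC3\xA9".force_encoding("ISO-8859-1").encode("UTF-8")  # => "cafÃ©" (UTF-8 input, mangled)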

Thanks for the responses. I did find something that works by testing all sorts of combinations of different tools. I hope this is helpful to other people who have shared the same frustration.

data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines).

My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order, otherwise gsub! throws the "invalid byte sequence" error.

Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:

Coder.clean!(data_string,'UTF-8')

or

data_string.scrub!("")

This worked perfectly for me.
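Putting the accepted answer together with the question's import code, an end-to-end sketch (the ARGV file argument and the .//job XPath come from the question; everything else follows the answer above):

require "nokogiri"

data_string = File.read(ARGV[0])

# 1) Remove invalid UTF-8 byte sequences first...
data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
# 2) ...then strip control characters other than newlines; running this
#    gsub! before step 1 would raise "invalid byte sequence in UTF-8".
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")

doc = Nokogiri::XML.parse(data_string)
data = doc.xpath(".//job").map do |node|
  node.element_children.each_with_object({}) do |e, h|
    h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
  end
end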
