
Ruby/Rails CSV parsing, invalid byte sequence in UTF-8

I am trying to parse a CSV file generated from an Excel spreadsheet.

Here is my code:

require 'csv'
file = File.open("input_file")
csv = CSV.parse(file)

But I get this error:

ArgumentError: invalid byte sequence in UTF-8

I think the error occurs because Excel encodes the file as ISO 8859-1 (Latin-1) rather than UTF-8.

Can someone help me with a workaround for this issue, please?

Thanks in advance.

You need to tell Ruby that the file is in ISO-8859-1. Change your file open line to this:

file = File.open("input_file", "r:ISO-8859-1")

The second argument tells Ruby to open the file read-only and to treat its contents as ISO-8859-1.
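
As a minimal sketch of this fix, assuming a small Latin-1 sample written to a temporary file (the sample data and the names `latin1_bytes`, `sample`, `rows`, and `utf8_rows` are made up for illustration), the following reads the file with the declared encoding and then converts the parsed fields to UTF-8:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: 0xE9 and 0xFC are the Latin-1 bytes for "é" and "ü".
# On their own they are not valid UTF-8, which is what triggers the ArgumentError.
latin1_bytes = "name,city\ncaf\xE9,Z\xFCrich\n".b

sample = Tempfile.new(['latin1', '.csv'])
sample.binmode
sample.write(latin1_bytes)
sample.close

# Open read-only and declare the external encoding so Ruby does not assume UTF-8.
rows = File.open(sample.path, "r:ISO-8859-1") do |file|
  CSV.parse(file.read)
end

# The parsed fields are tagged as ISO-8859-1; convert them if the rest of the
# application expects UTF-8.
utf8_rows = rows.map { |row| row.map { |field| field.encode("UTF-8") } }
```

The explicit `encode("UTF-8")` step is optional; it only matters when the values are later mixed with UTF-8 strings.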

Specify the encoding with the encoding option:

CSV.foreach(file.path, headers: true, encoding: 'iso-8859-1:utf-8') do |row|
  ...
end
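
A hedged illustration of what the `'iso-8859-1:utf-8'` value does: the part before the colon is the encoding of the bytes on disk, and the part after it is the encoding the fields are converted to. The sample file and the names `csv_file` and `names` below are assumptions for the sake of the example:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample: a header row plus one Latin-1 encoded record.
csv_file = Tempfile.new(['people', '.csv'])
csv_file.binmode
csv_file.write("name,city\nJos\xE9,M\xE1laga\n".b)
csv_file.close

names = []
# External encoding (on disk): ISO-8859-1. Internal encoding (in Ruby): UTF-8.
CSV.foreach(csv_file.path, headers: true, encoding: 'iso-8859-1:utf-8') do |row|
  names << row['name']
end
```

With this form the rows arrive already transcoded, so no separate `encode` call is needed afterwards.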

You can supply the source encoding directly in the file mode parameter:

CSV.foreach("file.csv", "r:windows-1250") do |row|
  # your code
end

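Save the file as UTF-8 unless you need to save it differently for some reason, in which case you can specify the encoding when reading the file.

Add "r:ISO-8859-1" as the second argument: File.open("input_file", "r:ISO-8859-1")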
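
If re-saving the file from the spreadsheet application is not convenient, the conversion to UTF-8 can also be scripted. This is only a sketch, assuming the source encoding is already known; `convert_to_utf8` is a hypothetical helper, not part of Ruby or the CSV library, and the sample file is made up:

```ruby
require 'tempfile'

# Hypothetical helper: rewrite a file as UTF-8, given its current encoding.
def convert_to_utf8(source_path, target_path, source_encoding: "ISO-8859-1")
  # Read raw bytes, then label them with the encoding they were written in.
  text = File.read(source_path, mode: "rb").force_encoding(source_encoding)
  # Transcode to UTF-8 and write the result without newline translation.
  File.write(target_path, text.encode("UTF-8"), mode: "wb")
end

# Made-up sample file containing the Latin-1 byte for "é".
source = Tempfile.new(['latin1', '.csv'])
source.binmode
source.write("caf\xE9\n".b)
source.close

target_path = "#{source.path}.utf8.csv"
convert_to_utf8(source.path, target_path)
converted = File.read(target_path, encoding: "UTF-8")
```

After a one-off conversion like this, the original `CSV.parse` code can read the new file without any encoding arguments.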

I had this same problem, and at first I just used Google Spreadsheets and downloaded the sheet as a CSV. That was the easiest solution.

Then I came across this gem:

https://github.com/singlebrook/utf8-cleaner

Now I don't need to worry about this issue at all. Hope this helps!

If you have only one file (or a few), so there is no need to handle the encoding automatically for whatever input arrives, and the contents are visible as plain text (txt, csv, etc.) separated by, e.g., semicolons, you can manually create a new file with a .csv extension, paste the contents of your file into it, and then parse it as usual.

Keep in mind that this is a workaround. Still, when you need to parse just one big Excel file on Linux, converted to some flavour of CSV, it saves the time you would otherwise spend experimenting with all those fancy encodings.

