Ruby / Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

Question

ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which seems to suggest the URLs are not valid? (They are all standard 'http://www.exmaple.com/page' and work in my browser.)

Have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.

The code:

require 'csv'
require 'nokogiri'
require 'open-uri'

CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    URI.parse(URI.encode(url.chomp))
    begin
    page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace) 
      csv << [url, desc]

    end
  end
end
end

puts "scraping done!"

The error message:

/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
    from bbb.rb:13:in `block (2 levels) in <main>'
    from bbb.rb:11:in `foreach'
    from bbb.rb:11:in `block in <main>'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
    from bbb.rb:10:in `<main>'

Answer 1

Two things:

You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.
The issue seems to be the encoding of listOfURLs.xls. Ruby assumes that the file is UTF-8 encoded; if it is not, or if it contains bytes that are not valid UTF-8, you can get that error.
You should double-check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
```
File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |row|
  puts row
end
```
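
To double-check the file as suggested above, a minimal sketch might look like the following. The helper name `invalid_utf8_lines` and the throwaway sample file are assumptions made for illustration, not part of the original code. It reads each line as raw bytes, relabels a copy as UTF-8, and uses `String#valid_encoding?` to flag the lines that would trigger the error.

```ruby
require 'tempfile'

# Hypothetical helper: returns [line_number, raw_line] pairs for every
# line in the file whose bytes are not valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  File.open(path, 'rb') do |file|
    file.each_line.with_index(1) do |line, number|
      # Lines read in binary mode are tagged ASCII-8BIT; relabel a copy
      # of the same bytes as UTF-8 and ask Ruby whether they are valid.
      candidate = line.dup.force_encoding('UTF-8')
      bad << [number, line] unless candidate.valid_encoding?
    end
  end
  bad
end

# Demonstration with a temporary file (assumed sample data):
# the second line contains the ISO-8859-1 byte for "é".
sample = Tempfile.new('urls')
sample.binmode
sample.write("http://www.example.com/page\n")
sample.write("http://www.example.com/caf\xE9\n".b)
sample.close

invalid_utf8_lines(sample.path).each do |number, line|
  puts "line #{number} is not valid UTF-8: #{line.inspect}"
end
```

If this reports any lines, the file is probably in another encoding (or is a real binary Excel file rather than plain text), and reading it with an explicit encoding is the likely way forward.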

Some good info about invalid byte sequences in UTF-8

Update:

An example:

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  # Read the URL list as ISO-8859-1 so its bytes are never rejected as invalid UTF-8
  File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |line|
    url = line.chomp            # strip the trailing newline before using the URL
    URI.parse(URI.encode(url))  # raises if the URL is malformed
    page = Nokogiri.HTML(open(url))
    page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
      csv << [url, desc]
    end
  end
end

Answer 2

I'm a bit late to the party here, but this should work for anyone who runs into the same problem in the future: csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
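
The approach in this answer can be sketched as follows. The helper name `read_as_utf8` and the temporary sample file are assumptions for illustration. The idea is to read the raw bytes, label them as ISO-8859-1, and transcode the result to UTF-8 before any further processing. The sketch assumes the file really is ISO-8859-1; with a different source encoding the characters would come out wrong even though no error is raised.

```ruby
require 'tempfile'

# Hypothetical helper: read a file's raw bytes, treat them as the given
# source encoding (ISO-8859-1 by default), and return a UTF-8 string.
def read_as_utf8(path, source_encoding = 'ISO-8859-1')
  File.binread(path).force_encoding(source_encoding).encode('UTF-8')
end

# Demonstration with a temporary file (assumed sample data) containing
# the ISO-8859-1 byte for "é", which is not valid UTF-8 on its own.
sample = Tempfile.new('urls')
sample.binmode
sample.write("http://www.example.com/caf\xE9\n".b)
sample.close

text = read_as_utf8(sample.path)
puts text.encoding         # UTF-8
puts text.valid_encoding?  # true
text.each_line { |line| puts line.chomp }
```

Converting the whole file once up front keeps the rest of the script working with ordinary UTF-8 strings, at the cost of loading the entire file into memory.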


Answer 1
2 2014-07-20 23:09:50

Answer 2
1 2015-12-11 14:04:52
