Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

Ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which seems to suggest the URLs are not valid? (They are all standard 'http://www.exmaple.com/page' URLs and work in my browser.)

I have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.

The code:

require 'csv'
require 'nokogiri'
require 'open-uri'

CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end

puts "scraping done!"

The error message:

/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
    from bbb.rb:13:in `block (2 levels) in <main>'
    from bbb.rb:11:in `foreach'
    from bbb.rb:11:in `block in <main>'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
    from bbb.rb:10:in `<main>'

Two things:

  1. You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.

  2. The issue seems to be the encoding of the file listOfURLs.xls: Ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains invalid UTF-8 byte sequences, you can get that error.

    You should double check that the file is encoded in UTF-8 and doesn't contain any illegal characters.

    If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:

     File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') { |row| puts row }
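
To double check the file as suggested above, one option is to read the raw bytes and ask Ruby whether each line is valid UTF-8. This is only a rough sketch: the helper name invalid_utf8_lines and the sample URLs are made up for illustration, and it assumes the file is small enough to read into memory.

```ruby
require 'tempfile'

# Hypothetical helper (not from the original post): returns the 1-based
# numbers of the lines whose bytes do not form valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  # binread returns the raw bytes, so nothing is transcoded or raised here.
  File.binread(path).each_line.with_index(1) do |line, number|
    bad << number unless line.force_encoding('UTF-8').valid_encoding?
  end
  bad
end

# Demo with made-up data: the second line contains a lone 0xE9 byte,
# which is "é" in ISO-8859-1 but not a valid UTF-8 sequence.
Tempfile.create('urls') do |f|
  f.binmode
  f.write("http://www.example.com/page\n".b)
  f.write("http://www.example.com/caf\xE9\n".b)
  f.flush
  p invalid_utf8_lines(f.path)   # prints [2]
end
```

If this reports any lines, the file is not clean UTF-8 and either needs to be re-saved as UTF-8 or read with an explicit encoding.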

Some good info about invalid byte sequences in UTF-8

Update:

An example:

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
    # Read the list as ISO-8859-1 instead of the default UTF-8
    csv_doc = File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |url|
        URI.parse(URI.encode(url.chomp))
        begin
            page = Nokogiri.HTML(open(url))
            page.css('.bio media-content').each do |scrape|
                desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
                csv << [url, desc]
            end
        end
    end
end

I'm a bit late to the party here, but this should work for anyone who runs into the same problem in the future: csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
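
The idea behind that one-liner is to label the bytes with their real encoding and then transcode them to UTF-8. Here is a small sketch on an in-memory string instead of a file. It assumes the data really is ISO-8859-1, and the helper name latin1_to_utf8 is made up for illustration.

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1 and transcode
# them to UTF-8. dup avoids changing the caller's string.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

raw = "caf\xE9".b   # 0xE9 is "é" in ISO-8859-1, but invalid on its own in UTF-8
text = latin1_to_utf8(raw)

puts text.encoding          # UTF-8
puts text.valid_encoding?   # true
```

Note that force_encoding only changes the label on the bytes, while encode is the step that actually rewrites them, so the order of the two calls matters.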
