Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)
Ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which seems to suggest the URLs are not valid? (They are all standard 'http://www.exmaple.com/page' URLs and work in my browser.)
I have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
:write_headers => true,
:headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('listOfURLs.xls') do |url|
URI.parse(URI.encode(url.chomp))
begin
page = Nokogiri.HTML(open(url))
page.css('.bio media-content').each do |scrape|
desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
csv << [url, desc]
end
end
end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
Two things:
You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.
The issue seems to be the encoding of the file listOfURLs.xls. Ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains byte sequences that are not valid UTF-8, you can get that error.
You should double check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
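One way to do that double check from Ruby itself is sketched below. The helper name invalid_utf8_lines is made up for this illustration and is not part of the original code; it only relies on String#force_encoding and String#valid_encoding? from the standard library.

```ruby
# Hypothetical helper (not from the original post): returns the 1-based
# numbers of the lines in `path` whose bytes are not valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  # Read raw bytes so nothing is assumed about the file's encoding up front.
  File.binread(path).each_line.with_index(1) do |line, number|
    # Tag the bytes as UTF-8 and ask Ruby whether they form valid UTF-8.
    bad << number unless line.force_encoding('UTF-8').valid_encoding?
  end
  bad
end
```

Calling invalid_utf8_lines('listOfURLs.xls') would list the offending lines. An empty array means every line is valid UTF-8, in which case the problem lies elsewhere.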
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |row|
  puts row
end
Some good info about invalid byte sequences in UTF-8
Update:
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  # Read the URL list as ISO-8859-1 rather than the default UTF-8
  File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |line|
    url = line.chomp # drop the trailing newline before using the URL
    URI.parse(URI.encode(url))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
I'm a bit late to the party here, but this should work for anyone who runs into the same problem in the future: csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
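To show what the force_encoding / encode approach does, here is a minimal sketch. It assumes the input really is ISO-8859-1; the helper name latin1_to_utf8 and the sample URL are invented for the example.

```ruby
# Hypothetical helper (not from the original post): reinterpret raw bytes as
# ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(bytes)
  # dup so the caller's string is left untouched; force_encoding only changes
  # the encoding label, encode actually rewrites the bytes.
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# In ISO-8859-1, "é" is the single byte 0xE9, which is not valid UTF-8 on its own.
raw = "http://www.example.com/caf\xE9".b
url = latin1_to_utf8(raw)

puts url.encoding
puts url.valid_encoding?
```

Every byte value maps to some character in ISO-8859-1, so this conversion does not raise. The catch is that if the file is actually in a different encoding, the result will be valid UTF-8 but may contain the wrong characters.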