I am not a programmer and know very little Ruby. I have a scraping program that fetches product info from a website. I am trying to add a rescue clause to handle HTTP 404 errors, so that the scraper does not stop but continues to the next product.
I need to add the rescue to the code below:
def initialize(id, log = nil, timeout_threshold = nil)
  @log_buffer = nil
  # prepare internal logging stream
  @log = (!log.nil? and log.is_a?(Logger)) ? log : Logger.new(@log_buffer=StringIO.new)
  begin
    # store instance url address
    @id = id.to_s
    @url = Link::base_url + 'en-US/item_' + @id + '.htm'
    # set remote timeout threshold
    @timeout_threshold = (timeout_threshold.to_i > 0) ? timeout_threshold.to_i : 15
    @timeout = false
    @expired = false
    if url_verify
      Timeout::timeout(@timeout_threshold) {
        Mechanize.html_parser = Nokogiri::HTML
        @@agent = Agent.instance
        ###TODO: [optional?] login
        ###TODO: [optional?] or login iff pricing not present?
        ###TODO: Agent.get(user login page)
        ###TODO: Agent.fill in user/pswd
        ###TODO: Agent.submit
        @html = @@agent.get(url)
        @log.info("Alamode Product #{@id.to_s}: Load #{url.to_s}")
        @specification = parse_specifications
        @quantity, @mapped_quantity = parse_quantities
        @price = parse_price
        @valid = true
        # check parsed page
        if @specification.size.zero? and @quantity.size.zero?
          @valid = false
          @expired = true
          @log.warn("Alamode Product #{@id.to_s}: #{url.to_s} unscrappable (product no longer available?)")
        else
          @log.info("stAlamode Product #{@id.to_s}: #{url.to_s} successfully parsed")
          @log.info(" QTY #{@mapped_quantity.to_s}")
        end
      }
    else
      # return error message
      @valid = false
      @log.error("Alamode Product #{@id.to_s}: #{url.to_s} is not a properly formatted URI address")
    end
  rescue Timeout::Error
    @valid = false
    @timeout = true
    @log.error("Alamode Product #{@id.to_s}: #{url.to_s} did not respond within allocated time")
  end
end
Ruby allows you to stack rescue clauses.
begin
  ...
rescue YourErrorName
  ...
rescue Timeout::Error
  ...
end
Within the new clause you can then either exit quietly (doing nothing, though logging the result is better) or start scraping the next id. I am not familiar with Nokogiri, so you will have to figure out the error name yourself ;) Good luck!
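To make the stacked-rescue idea concrete, here is a minimal sketch that uses only the standard library. The names PageNotFoundError, fetch_product and scrape are hypothetical stand-ins and not part of the original program. In the real code the page is loaded through Mechanize, which, as far as I know, raises Mechanize::ResponseCodeError (with a response_code reader) for a 404, so that would likely be the class to name in the new clause; please verify this against the Mechanize version you have installed.

```ruby
require 'logger'
require 'stringio'
require 'timeout'

# Hypothetical stand-in for the HTTP error raised by the scraping library.
class PageNotFoundError < StandardError
  attr_reader :response_code

  def initialize(url, response_code = '404')
    @response_code = response_code
    super("#{response_code} => #{url}")
  end
end

# Hypothetical fetcher: pretends that product 2 no longer exists.
def fetch_product(id)
  url = "http://example.com/en-US/item_#{id}.htm"
  raise PageNotFoundError.new(url) if id == 2

  "<html>product #{id}</html>"
end

# Loads one product and returns a status symbol instead of letting
# the error escape and stop the whole run.
def scrape(id, log, timeout_threshold = 15)
  Timeout.timeout(timeout_threshold) { fetch_product(id) }
  log.info("Product #{id}: successfully loaded")
  :ok
rescue PageNotFoundError => e
  # New clause: log the 404 and carry on.
  log.warn("Product #{id}: skipped (HTTP #{e.response_code})")
  :not_found
rescue Timeout::Error
  # Existing clause, unchanged in spirit.
  log.error("Product #{id}: did not respond within allocated time")
  :timeout
end

log_buffer = StringIO.new
log = Logger.new(log_buffer)

# The loop continues past the failing product.
results = [1, 2, 3].map { |id| [id, scrape(id, log)] }.to_h
```

In the initialize method from the question, the equivalent change would be one extra rescue clause placed next to the existing rescue Timeout::Error, setting @valid = false and logging the failure in the same way.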