
How to fail gracefully and get notified if screen scraping fails in Ruby on Rails

I am working on a Rails 3 project that relies heavily on screen scraping, mainly with Nokogiri, to collect data. I'm aggregating essentially the same data from many different sources, and I will keep adding more sources over time. However, I am acutely aware that screen scraping can be notoriously unreliable.

So I am interested in how other people have handled verifying the scraped data and getting notified when scraping fails.

My current plan is as follows.

  1. I am going to have validations on my model for most of the fields. If they fail, bad data won't get into my system, although logging the failure in a meaningful way is still a problem.

  2. I was thinking of some kind of counter so that, after so many failures from a particular source, I somehow turn that source off. I'm not sure how to keep track of that. I guess the only way is to have a field on my Source model that counts failures and can be reset.

  3. Logging is the 800-pound gorilla I'm not sure how to deal with. I could just write to the standard logs, but if something fails I'd like to store the entire HTML so I can figure out what happened. I also need to notify myself somehow so I can address the issues. I thought of creating a model for all this and storing it in the database. If I did, I'd probably have to store the HTML on S3 or something similar. I'm running this on Heroku, which influences what I can do.

  4. Set up begin and rescue blocks around every field. I was trying to figure out a nicer Ruby way to code this so I don't end up with a page of them. Some fields are just a straight doc.at_css("#whatever"), but quite a number require various formatting or calculations, so I think it makes sense to rescue those so I can log what went wrong. The other option is to let the exception bubble up and catch it when I try to create the model.
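
One possible way to avoid a page of begin/rescue blocks is to route every field through a single helper that rescues and records the failure. This is only a minimal sketch: FieldExtractor, field, values, and errors are hypothetical names, and a plain Hash stands in for the parsed document so the example runs without Nokogiri.

```ruby
# Hypothetical helper (not part of Nokogiri or Rails): runs each field's
# extraction inside one shared rescue and remembers what went wrong.
class FieldExtractor
  attr_reader :values, :errors

  def initialize(doc)
    @doc = doc
    @values = {}
    @errors = {}
  end

  # Yields the document to the block. On failure the field is set to nil
  # and the exception class and message are stored under the field name.
  def field(name)
    @values[name] = yield(@doc)
  rescue StandardError => e
    @errors[name] = "#{e.class}: #{e.message}"
    @values[name] = nil
  end

  def ok?
    @errors.empty?
  end
end

# Stand-in document. With Nokogiri the blocks would instead call
# something like d.at_css("#price").text on the parsed page.
doc = { "title" => "  Widget  ", "price" => "n/a" }

extractor = FieldExtractor.new(doc)
extractor.field(:title) { |d| d.fetch("title").strip }
extractor.field(:price) { |d| Float(d.fetch("price")) } # raises ArgumentError
extractor.field(:sku)   { |d| d.fetch("sku") }          # raises KeyError

puts extractor.values.inspect
puts extractor.errors.keys.inspect
```

After all fields have run, errors says which fields broke and why, so it could be logged together with the raw HTML before deciding whether to build the model at all.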

Anyway, I'm sure I'm not thinking of everything, which is why I'm trying to find out how other people have handled this problem.

Our team does something similar to this, so here are some ideas:

  • We use a really high-level begin/rescue around a transaction to make sure we don't get into weird half-loaded states:
    begin
      ActiveRecord::Base.transaction do
        # ...try to load a data source...
      end
    rescue
      # ...error handling...
    end
  • Email/page yourself when certain errors occur. We use exception_notifier, but if you're on Heroku the Exceptional plugin also seems like a good option. I've also heard of people having success with Hoptoad.

  • Capturing state is VERY important for troubleshooting. Something that has worked quite well for us is Gmail. Our loaders effectively have two phases:

    1. Capture data and send it to our Gmail account.
    2. Log into Gmail, download the latest data, and parse it.

The second phase is the complex one, and if it fails a developer can simply log into the Gmail account and inspect the failed message. This process has some limitations (per-email and per-mailbox storage limits, a two-phase pipeline, etc.), and we started doing it because we had no other option, but it has proven shockingly resilient and convenient. Keep email in mind as a cheap and easy way to store noncritical state. We didn't start out planning to use it that way, and now we're really glad we do. Logging into Gmail feels better than digging through log files.

  • Build a dashboard UI. We have a simple dashboard with a grid of sources by day. Each box is colored red or green based on whether the load for that source on that day succeeded. You can go one step further and set up a monitor on this UI (mon.itor.us or equivalent) that raises an alarm when some error threshold is met.
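
To make the high-level rescue and the red/green grid a bit more concrete, here is a rough sketch of how the two could fit together. It is not our actual code: LoadLog, run, status, and alarming_sources are hypothetical names, the log lives in memory rather than in a table, and in a Rails app the block passed to run would be where ActiveRecord::Base.transaction goes.

```ruby
require "date"

# Hypothetical in-memory stand-in for a per-source, per-day load log.
class LoadLog
  Entry = Struct.new(:source, :day, :ok, :error)

  attr_reader :entries

  def initialize(failure_threshold: 3)
    @failure_threshold = failure_threshold
    @entries = [] # assumed to be appended in chronological order
  end

  # Wraps one load in a single high-level rescue and records the outcome.
  # Returns true on success and false on failure.
  def run(source, day: Date.today)
    yield
    @entries << Entry.new(source, day, true, nil)
    true
  rescue StandardError => e
    @entries << Entry.new(source, day, false, "#{e.class}: #{e.message}")
    false
  end

  # Number of failures recorded since the last success for this source.
  def consecutive_failures(source)
    @entries.select { |e| e.source == source }
            .reverse
            .take_while { |e| !e.ok }
            .size
  end

  # Sources whose current failure streak has reached the threshold.
  def alarming_sources
    @entries.map(&:source).uniq.select do |source|
      consecutive_failures(source) >= @failure_threshold
    end
  end

  # Dashboard cell color: :green, :red, or :none if nothing ran that day.
  def status(source, day)
    cells = @entries.select { |e| e.source == source && e.day == day }
    return :none if cells.empty?

    cells.last.ok ? :green : :red
  end
end

log = LoadLog.new(failure_threshold: 2)
day = Date.new(2024, 1, 1)

log.run("source_a", day: day)     { :loaded }
log.run("source_b", day: day)     { raise "layout changed" }
log.run("source_b", day: day + 1) { raise "layout changed" }

puts log.status("source_a", day)
puts log.status("source_b", day)
puts log.alarming_sources.inspect
```

The same streak count could also drive the "turn a source off after so many failures" idea from the question, and the stored error string is what you would email or page yourself with.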

