How to scrape data using Ruby which is generated by a JavaScript function?

I am trying to scrape the data URL link for the latest date, which is the first row of the table, from this page. It seems like the content of the table is generated by a JavaScript function.

I tried using Nokogiri to get it, but Nokogiri cannot execute JavaScript. Then I tried to extract only the script part with Nokogiri:

require 'open-uri'
require 'nokogiri'

url = "http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data"
doc = Nokogiri::HTML(URI.open(url))
js = doc.css("script").text
puts js

In the output I found the table that I wanted, with the class name sgxTableGrid. The problem is that the JavaScript function gives no clue about the data URL link, and everything is generated dynamically.

Does someone know a better way of approaching this problem?

Looking through the HTML for that page, the table is generated from JSON received as the result of a JavaScript request.

You can figure out what's going on by tracing backwards through the source code of the page. Here's some of what you'll need if you want to retrieve the JSON outside of their JavaScript; however, there'll still be work needed to actually do something with it:

  1. Starting with this code:

     require 'open-uri'
     require 'nokogiri'

     doc = Nokogiri::HTML(URI.open('http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data'))
     # Collect the text of every <script> tag, then print the ones that mention the table's class.
     scripts = doc.css('script').map(&:text)
     puts scripts.select { |s| s['sgxTableGrid'] }

    Look at the text output in an editor. Search for sgxTableGrid. You'll see a line like:

     var tableHeader = "<table width='100%' class='sgxTableGrid'>"

    Look down a little farther and you'll see:

     var totalRows = data.items.length - 1;

    data comes from the parameter to the function being called, so that's where we start.

  2. Take a unique part of the containing function's name, loadGridns_, and search for it. Each time you find it, look for the parameter data, then look to see where data is defined. If it's passed into that method, then search to see what calls it. Repeat that process until you find that the variable isn't passed into the function; at that point you'll know you're at the method that creates it.

  3. I found myself in a function that starts with loadGridDatans, where it's part of a block that does an xhrPost call to retrieve a URL. That URL is the target you're after, so grab the name of the containing function, and loop through the calls where the URL is passed in, like you did in the above step.

  4. That search ended up on a line that looks like:

     var url = viewByDailyns_7_2AA4H0C090FIE0I1OH2JFH20K1_...

  5. At that point you can start reconstructing the URL you need. Open a JavaScript debugger, like Firebug, and put a breakpoint on that line. Reload the page and JavaScript should stop executing at that line. Single-step, or set breakpoints, and watch the url variable be built up until it's in its final form. At that point you have something you can use with OpenURI, which should retrieve the JSON you want.
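
Once the debugger shows the final value of url, the last step could be sketched roughly as follows. This is only a sketch under assumptions, not the site's documented API: json_url is a placeholder for whatever URL you recovered, the "items" key is inferred from the data.items reference in their script, the response is assumed to be a plain JSON object, and the idea that the newest row comes first in items is unverified. Since the page uses xhrPost, a plain GET through OpenURI may not be accepted; if it isn't, a POST with Net::HTTP would be the thing to try.

```ruby
require 'json'
require 'open-uri'

# Download the raw response body from the URL recovered in the debugger.
# `json_url` is a placeholder; the real value has to come from step 5.
def fetch_grid_json(json_url)
  URI.open(json_url, &:read)
end

# Parse the body and return the array stored under "items" (the key is
# inferred from `data.items.length` in the page's script), or [] if the
# key is missing. Assumes the top-level JSON value is an object.
def grid_items(json_text)
  data = JSON.parse(json_text)
  data.fetch('items', [])
end

# Return the first item, assuming (unverified) that the newest date is first.
def latest_item(json_text)
  grid_items(json_text).first
end
```

Usage would be something like `puts latest_item(fetch_grid_json(json_url))`, after which you'd pick out whichever field of that item holds the data link.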

Note that their function names might be generated dynamically; I didn't check, so trying to use the full name of the function might fail.

They might also be serializing the datetime stamp, or using a serialized session key, to make the function names unique or more opaque; there are a number of reasons for doing that.
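
If the names do change between requests, matching on the stable prefix is safer than hard-coding the full name. Here is a minimal sketch, assuming the prefixes seen above (such as loadGridns_) stay constant and that the generated suffix consists only of word characters; I haven't verified either assumption. The helper names are made up for illustration. The second helper lists the lines that mention a name, which is handy for the backwards tracing described in the steps.

```ruby
# Return every distinct identifier in `script_text` that starts with `prefix`.
# Assumes the generated suffix is made of word characters (letters, digits, _).
def identifiers_with_prefix(script_text, prefix)
  script_text.scan(/\b#{Regexp.escape(prefix)}\w*/).uniq
end

# Return [line_number, stripped_line] pairs (1-based) for every line of
# `script_text` that mentions `name`, to help trace definitions and callers.
def lines_mentioning(script_text, name)
  script_text.each_line.with_index(1)
             .select { |line, _| line.include?(name) }
             .map { |line, number| [number, line.strip] }
end
```

You would pass in the script text printed in step 1, for example `identifiers_with_prefix(scripts.join("\n"), 'loadGridns_')`, and then feed each name it returns to `lines_mentioning`.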

Even though it's a pain to take this stuff apart, it's also a good lesson in how dynamic pages work.
