Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text


puts "#{name} - #{market_cap} - #{ebit}"

The script grabs three values from Yahoo Finance. The problem is that the ebit XPath returns nil. I got the XPath by copying and pasting it from the Chrome developer tools.

The page I'm trying to get the value from is http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992, in the Total Current Assets row.

Any help would be appreciated, especially if there is a way to get this value with CSS selectors.

Nokogiri supports the :contains pseudo-class:

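require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
# Find the label, step to the adjacent value cell, keep only digits and commas.
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')

puts ebit
# >> 483,992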

I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, then finally stripping the white-space using gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.

Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.

This seems to work for me:

doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992

Or as a string:

doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"
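
The two clean-up chains above can be compared without fetching the page. This is a small sketch, assuming the cell text looks like the hypothetical string below (a leading newline and trailing non-breaking spaces); the real cell contents may differ.

```ruby
# Hypothetical cell text: a leading newline plus trailing non-breaking
# spaces (U+00A0), similar to what a scraped table cell might contain.
raw = "\n483,992\u00A0\u00A0"

# String#to_i skips leading whitespace and stops at the first non-digit,
# so the trailing non-breaking spaces are ignored once commas are removed.
as_integer = raw.tr(",", "").to_i

# String#strip does not remove U+00A0, so it is removed explicitly.
as_string = raw.strip.gsub(/\u00A0/, "")

p as_integer
p as_string
```

The integer form is convenient for arithmetic, while the string form keeps the thousands separator for display.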
