I am trying to scrape a Google search page to learn scraping, using code like this:
doc = Nokogiri::HTML(URI.open("https://www.google.com/search?q=cardiovascular+diesese")) # needs require 'nokogiri' and require 'open-uri'
I want to get the result statistics text from every search page, but I can't find that content in the parsed HTML. Inspecting the page in the browser shows it's in a <div id="result-stats">. I tried this to find it:
doc.at_css('[id="result-stats"]').text
Your use of CSS is awkward. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="result-stats">foo</div>
</body>
</html>
EOT
doc.at_css('[id="result-stats"]').text # => "foo"
doc.at('#result-stats').text # => "foo"
CSS uses # for id, so '[id="result-stats"]' is unnecessarily verbose.
Nokogiri is smart enough to know whether a selector is CSS or XPath when it looks at it. In many years of using it I've only fooled it once and was forced to use the CSS- or XPath-specific versions of the generic search or at methods. By using the generic methods you can change the selector between CSS and XPath without bothering with the method being called. "Using 'at', 'search' and their siblings" talks about this.
In addition, just for fun, Nokogiri should have all the jQuery extensions to CSS, as those were on the v2.0 roadmap for Nokogiri.
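You need to use Selenium WebDriver to get dynamic content. Nokogiri only parses the HTML it is given and does not run JavaScript, so on its own it cannot see content the page adds dynamically.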
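require 'nokogiri'
require 'selenium-webdriver'

# Drive a real browser so the page's JavaScript runs before parsing.
driver = Selenium::WebDriver.for :firefox
driver.get "https://www.google.com/search?q=cardiovascular+diesese"
doc = Nokogiri::HTML(driver.page_source)
driver.quit # close the browser once the rendered source is captured
doc.at_css('[id="result-stats"]').text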