How to parse a Google search page to get result statistics and AdWords count using Nokogiri

Question

I am trying to scrape a Google search page to learn scraping, using code like this:

doc = Nokogiri::HTML(open("https://www.google.com/search?q=cardiovascular+diesese"))

I want to get the result statistics text shown on every search results page, but I can't find that content in the parsed HTML. Inspecting the page in the browser shows it is in a <div id="result-stats">. I tried this to find it:

doc.at_css('[id="result-stats"]').text

Answer 1

Your use of CSS is awkward. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="result-stats">foo</div>
  </body>
</html>
EOT

doc.at_css('[id="result-stats"]').text # => "foo"
doc.at('#result-stats').text # => "foo"

CSS uses # for id, so '[id="result-stats"]' is unnecessarily verbose.

Nokogiri is smart enough to tell whether a selector is CSS or XPath by looking at it; in many years of using it I've only fooled it once and was forced to use the CSS- or XPath-specific versions of the generic search or at methods. By using the generic methods you can switch the selector between CSS and XPath without changing the method being called. "Using 'at', 'search' and their siblings" talks about this.

In addition, just for fun, Nokogiri should have all the jQuery extensions to CSS, as those were on the v2.0 roadmap for Nokogiri.

Answer 2

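You need to use a browser driver such as Selenium WebDriver to get dynamic content. Nokogiri only parses the HTML it is given and does not execute JavaScript, so anything the page adds with scripts is missing from the parsed document.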

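require 'nokogiri'
require 'selenium-webdriver'

# Drive a real browser so the page's JavaScript runs before parsing
driver = Selenium::WebDriver.for :firefox
driver.get "https://www.google.com/search?q=cardiovascular+diesese"
doc = Nokogiri::HTML(driver.page_source)
driver.quit # close the browser once the HTML has been captured

doc.at_css('[id="result-stats"]').text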

2 answers

solution1
2 2020-03-20 20:20:09

solution2
1 ACCEPTED 2020-03-20 13:52:31
