How to parse a Google search page to get result statistics and AdWords count using Nokogiri

I am trying to scrape a Google search page to learn scraping, using code like this:

doc = Nokogiri::HTML(open("https://www.google.com/search?q=cardiovascular+diesese"))

I want to get the result statistics text in every search page:

[image: result statistics]

but I can't find where that content is in the parsed HTML. Inspecting the page in the browser shows that it's in a <div id="result-stats">. I tried this to find it:

doc.at_css('[id="result-stats"]').text

Your use of CSS is awkward. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="result-stats">foo</div>
  </body>
</html>
EOT

doc.at_css('[id="result-stats"]').text # => "foo"
doc.at('#result-stats').text # => "foo"

CSS uses # for id, so '[id="result-stats"]' is unnecessarily verbose.

Nokogiri is smart enough to tell whether a selector is CSS or XPath by looking at it. In many years of using it I've only fooled it once, and was then forced to use the CSS- or XPath-specific versions of the generic search or at methods. By using the generic methods you can switch the selector between CSS and XPath without changing the method being called. "Using 'at', 'search' and their siblings" talks about this.

In addition, just for fun, Nokogiri should have all the jQuery extensions to CSS, as those were on the v2.0 roadmap for Nokogiri.

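You need to use Selenium WebDriver to get dynamically generated content. Nokogiri only parses the HTML it is given and does not execute JavaScript, so on its own it cannot see content that a script adds to the page.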

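require 'nokogiri'
require 'selenium-webdriver'

# Let a real browser load the page and run its JavaScript.
driver = Selenium::WebDriver.for :firefox
driver.get "https://www.google.com/search?q=cardiovascular+diesese"

# Parse the rendered DOM rather than the raw HTTP response.
doc = Nokogiri::HTML(driver.page_source)
doc.at_css('#result-stats').text
driver.quit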
