
Using jsoup for scraping Google ads

A few months back I used jsoup to scrape all the Google search results, not including ads. Now the job is the exact opposite: I need to get all of the ads from the Google results. The thing is, I can't find them in my document.

The problem is surely a wrong tag...

Elements elements = doc.select("div[class=*What do i need to put here?*]");
for (Element link : elements) {
    position++;

    Elements tempTitles = link.select("h3[]");
    Element tempSmtng = link.select("a").first();
    .............

This code is taken from that last job. It used to say class=g and worked great, but now it seems the class for the ads just doesn't work. Any suggestions as to which tag I'm looking for?

You should be able to figure this out on your own pretty easily. Just use a browser with developer tools, such as Chrome, and use Inspect Element on the ads. You should see which CSS classes are applied to them.

Details about using Chrome's Inspect Element are here: https://developers.google.com/web/tools/chrome-devtools/iterate/inspect-styles/?hl=en

jsoup uses CSS selectors to find elements. You can read up on how to use them here: http://css.maxdesign.com.au/selectutorial/
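To illustrate how a CSS selector drives the lookup, here is a minimal sketch that runs against a hard-coded HTML string rather than a live Google page. The class names `ad-block` and `highlighted`, the URLs, and the helper `extractAdLinks` are all made up for the example; the real class names have to come from Inspect Element. It also shows one thing worth checking: `div[class=...]` compares the whole class attribute, so it stops matching as soon as an element carries more than one class, while `div.name` matches any element that has that class among others.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AdSelectorSketch {

    // Hypothetical markup. "ad-block", "highlighted" and "g" are placeholders here,
    // not the class names Google actually serves.
    static final String SAMPLE_HTML =
          "<div class=\"ad-block highlighted\"><h3>Sponsored result</h3>"
        + "<a href=\"https://example.com/ad\">Ad link</a></div>"
        + "<div class=\"g\"><h3>Organic result</h3>"
        + "<a href=\"https://example.com/organic\">Organic link</a></div>";

    // Returns the href of the first link inside every element matched by adSelector.
    static List<String> extractAdLinks(String html, String adSelector) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element ad : doc.select(adSelector)) {
            Element link = ad.select("a[href]").first();
            if (link != null) {
                hrefs.add(link.attr("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Class selector: matches even though the div has two classes.
        System.out.println(extractAdLinks(SAMPLE_HTML, "div.ad-block"));
        // Attribute selector: compares the full attribute value, so nothing matches.
        System.out.println(extractAdLinks(SAMPLE_HTML, "div[class=ad-block]"));
    }
}
```

With this sample input, the first call prints `[https://example.com/ad]` and the second prints `[]`.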

You'll be much better off learning the underlying concepts so that you understand how your code works: web scrapers are inherently brittle, because the website provider can change its output whenever it wants.
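One way to cope with that brittleness, sketched here under assumptions, is to keep the selectors in a list and fail loudly when none of them matches, so a silent "zero ads found" is not mistaken for a real result. The helper `selectFirstMatching` and the class names `old-ad` and `new-ad` are hypothetical and only serve the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.Arrays;
import java.util.List;

public class SelectorFallbackSketch {

    // Tries each candidate selector in order and returns the first non-empty match.
    // Throws if nothing matches, which usually means the page markup has changed.
    static Elements selectFirstMatching(Document doc, List<String> candidateSelectors) {
        for (String selector : candidateSelectors) {
            Elements found = doc.select(selector);
            if (!found.isEmpty()) {
                return found;
            }
        }
        throw new IllegalStateException(
            "No candidate selector matched, the markup may have changed: " + candidateSelectors);
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a page whose ad class was renamed.
        Document doc = Jsoup.parse("<div class=\"new-ad\"><a href=\"https://example.com/ad\">Ad</a></div>");
        Elements ads = selectFirstMatching(doc, Arrays.asList("div.old-ad", "div.new-ad"));
        System.out.println(ads.size());
    }
}
```

When the site changes its class names again, only the candidate list needs updating, and the exception tells you it is time to look at Inspect Element again.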

