Getting all HTML from requests.get()

I just started web scraping with Python and hit a wall. I am using the requests library to get the HTML code from a website, for example the Google search results page "https://www.google.com/?gws_rd=ssl#q=ball".

When I hit F12 and check the HTML, it looks different from what I get with:

import requests

site = requests.get("https://www.google.com/?gws_rd=ssl#q=ball")
print(site.text)

With requests.get, the text is much shorter and not all of the information is there (it does start with !doctype, however). Because of that, I am unable to work with this HTML.

Can you tell me where the mistake is?


This is actually an exercise from the book "Automate the Boring Stuff with Python". The task is to search for an item on Google and then find the first few results using HTML locators. I cannot do it because when I use requests.get(), I cannot see any link elements in the HTML code.

The HTML you see in the browser's developer tools is what the browser is currently working with, including any changes made by JavaScript. The data you get from Requests is the page before any JavaScript has run on it. (Note that Requests doesn't execute JavaScript, so you cannot get a JavaScript-processed page using Requests alone.)

If you're specifically looking to scrape Google Search, use a URL like https://www.google.com/search?q=test. This particular URL is for Google's non-JavaScript site. Keep in mind that Google (like most other sites) doesn't appreciate scraping, so you may run into other issues when doing so.
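To make the URL difference concrete, here is a minimal sketch using only the standard library. It shows that everything after `#` in the URL from the question is a fragment, which the browser keeps to itself and does not send to the server, and it builds the `/search?q=...` form suggested above. The helper name `build_search_url` is made up for illustration.

```python
from urllib.parse import urlencode, urlsplit

# The URL from the question: the search term sits after '#'.
original_url = "https://www.google.com/?gws_rd=ssl#q=ball"
parts = urlsplit(original_url)

print(parts.query)     # gws_rd=ssl  (sent to the server)
print(parts.fragment)  # q=ball      (kept by the browser, not sent)


def build_search_url(term):
    """Return a /search URL that carries the term in the query string.

    Hypothetical helper, not part of Requests or any Google API.
    """
    return "https://www.google.com/search?" + urlencode({"q": term})


print(build_search_url("ball"))  # https://www.google.com/search?q=ball
```

So with the original URL, the server likely never sees the search term at all; in the browser, JavaScript reads it from the fragment and loads the results afterwards.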

Some HTML elements are generated by JavaScript.

Use your browser's "view page source" option to see the original code. It should be similar to the response text from Requests.
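One way to check this in code is to parse the raw HTML and list what it actually contains. The sketch below uses the standard library's html.parser on a made-up page (the `sample_html` string and the `LinkCollector` class are illustrative, not taken from the question). The link written in the markup is found, while the one a script would add later is not, because nothing here executes JavaScript.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Made-up page: one link in the markup, one that only a browser would create.
sample_html = """
<!doctype html>
<html><body>
  <a href="/static-link">In the page source</a>
  <div id="results"></div>
  <script>
    var link = document.createElement("a");
    link.href = "/added-by-script";
    document.getElementById("results").appendChild(link);
  </script>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # ['/static-link']
```

Feeding `site.text` from the question into the same collector would show which links exist before any script runs.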

This is probably because no User-Agent is passed in the request headers. When none is specified, the requests library defaults to python-requests, so Google recognizes the request as coming from a bot or script. It then blocks the request (or handles it differently), and you receive different HTML (possibly some sort of error page) with different CSS selectors. Check what your User-Agent is.

Pass a User-Agent header:

import requests

headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

# Replace 'URL' with the address of the page you want to fetch.
response = requests.get('URL', headers=headers)
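To see what Requests sends by default, and to confirm that a custom header is actually applied, the request can be prepared without being sent. This is only a sketch: `requests.utils.default_user_agent()` and `Session.prepare_request()` are existing Requests calls, while the shortened User-Agent string and the search URL are placeholders chosen for illustration.

```python
import requests

# What Requests announces itself as when no User-Agent is given.
print(requests.utils.default_user_agent())  # starts with "python-requests/"

# Placeholder value; any browser-like string can be substituted.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

session = requests.Session()
request = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "ball"},
    headers=headers,
)

# prepare_request() merges the session defaults with our headers.
# Nothing is sent over the network at this point.
prepared = session.prepare_request(request)

print(prepared.url)                    # https://www.google.com/search?q=ball
print(prepared.headers["User-Agent"])  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

To actually send it, `session.send(prepared, timeout=10)` returns the response. Checking `response.status_code` before parsing can help spot a block page.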

Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to spend time bypassing Google's blocks or figuring out why certain things don't work as they should. Instead, you can focus on the data you want to extract from the structured JSON. Check out the playground.

Disclaimer: I work for SerpApi.
