
Why can't I see the same page that I requested?

I've been learning Python and trying out web scraping. I managed to scrape the Google results page for a normal Google search, though the page I got back was a degraded version and I don't know why. I tried the same for Google Images, and it is degraded as well. It doesn't look the same as it does in the browser.

Here's my code.

from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for : ")
params = {"tbm": "isch", "source": "hp", "q": search}
r = requests.get("https://www.google.com/search", params=params)
print("URL :", r.url)
print("Status : ", r.status_code, "\n\n")

# Save the response so it can be opened in a browser and compared.
with open("ImageResult.html", "w", encoding="utf-8") as f:
    f.write(r.text)

For example, I search for "Goku". Google Images returns this page.

When I click on the first image, a popup opens. Or, if I Ctrl+click it, I reach this page.

On this page I can see that the actual image's URL can be accessed through either the current URL or the link on the "View Image" button. But the issue is that I can't reach this page/popup in the version of the page I get when I request it from Python.

UPDATE: I'm sharing the page I am getting.

This depends on a lot of factors, such as the user agent string, cookies, and Google experiments. Google is known for serving the same content in different ways to different users. On search, Google loads different pages based on site speed and user agent. Google also randomly runs experiments on search page design and the like before rolling them out publicly, as a form of dynamic A/B testing.

Google organic results have very little JavaScript, and you can still parse data from the <script> tags.
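As a rough illustration of pulling data out of <script> tags, here is a minimal sketch that uses only the standard library's html.parser (BeautifulSoup's soup.find_all("script") would serve the same purpose). The sample HTML, the `data` variable inside it, and the names `ScriptCollector` and `extract_scripts` are made up for the example; a real Google response is laid out differently and changes over time.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text while inside <script>.
        if self._in_script:
            self.scripts[-1] += data


def extract_scripts(html):
    """Return a list with the body of each <script> tag in the HTML."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Made-up sample HTML, NOT a real Google response.
sample_html = """
<html><body>
<div>result</div>
<script>var data = {"title": "Goku"};</script>
<script>console.log("hi");</script>
</body></html>
"""

scripts = extract_scripts(sample_html)

# Assumption: the data of interest is assigned as a JSON object literal.
match = re.search(r"var data = (\{.*?\});", scripts[0])
parsed = json.loads(match.group(1)) if match else None
print(len(scripts), parsed)
```

In practice you would feed it `r.text` from the request instead of `sample_html`, then adapt the regular expression to whatever the inline script actually contains.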

Besides that, the most common reason you don't see the same results as in your browser is that no user-agent is passed in the request headers. When no user-agent is specified, the requests library defaults to python-requests (followed by its version number), so Google understands that it's a bot/script. It then blocks the request (or whatever it does) and you receive different HTML with different CSS selectors. Check what your user-agent is.
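One quick way to see what requests sends when you don't set anything is to ask the library itself. This is a minimal sketch that assumes the requests package is installed; the version suffix in the output depends on your installation, so none is shown here.

```python
import requests

# The User-Agent string requests uses when none is supplied.
default_ua = requests.utils.default_user_agent()
print("Default User-Agent:", default_ua)

# The full set of default headers attached to a plain requests call.
default_headers = requests.utils.default_headers()
for name, value in default_headers.items():
    print(f"{name}: {value}")
```

If the first line printed starts with python-requests, that is the value Google sees for a request made without custom headers.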

Pass a user-agent:

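import requests

# Browser-like User-Agent, sent instead of the default python-requests one.
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

# 'URL' is a placeholder for the page you want to fetch.
requests.get('URL', headers=headers)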
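To connect this back to the code in the question, the sketch below builds the same Google Images request with the user-agent attached and inspects it without sending anything. It assumes the requests package is installed; `build_image_search_request` is a hypothetical helper name, and whether Google then returns the browser-like page still depends on the other factors mentioned above.

```python
import requests

SEARCH_URL = "https://www.google.com/search"

# Same browser-like User-Agent string as in the snippet above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    )
}


def build_image_search_request(query):
    """Prepare (but do not send) an image search request for the query."""
    params = {"tbm": "isch", "source": "hp", "q": query}
    request = requests.Request("GET", SEARCH_URL, params=params, headers=HEADERS)
    return request.prepare()


prepared = build_image_search_request("Goku")
print("URL:", prepared.url)
print("User-Agent:", prepared.headers["User-Agent"])

# To actually send it (this performs a network call):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     print(response.status_code)
```

Preparing the request first makes it easy to confirm exactly which URL and headers will go out before comparing the saved HTML with what the browser shows.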

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to spend time trying to bypass blocks from Google or figuring out why certain things don't work as they should, and you don't have to maintain the parser over time.

A very simple example to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "how to create minecraft server",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result["link"], sep="\n")

----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
# other results
'''

Disclaimer: I work for SerpApi.

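Notice: Technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.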

 