Why am I getting different data in browser developer tools vs BeautifulSoup / Postman?

I want to scrape data from this web page.

I want to get all the blog entries under the results element (<div class="results">).

The browser's developer tools show 10 snippets under that element.

But with BeautifulSoup I get only:

<div class="results">
</div>

Postman returns the same thing.

This is what I am doing:

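import requests
from bs4 import BeautifulSoup

s = requests.Session()
topicuri = "..."  # page URL (missing from the original post)
r = s.get(topicuri)
soup = BeautifulSoup(r.text, 'html.parser')
pages = soup.find('div', {'class': 'results'})
print(pages)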

The website uses JavaScript to display the snippets. BeautifulSoup does not execute JavaScript, while the browser does. You will probably want to drive a Chromium-based browser from Python in order to scrape JavaScript-rendered content.
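To make the difference concrete, here is a minimal sketch using only the standard library's html.parser (the same parser backend the question passes to BeautifulSoup). STATIC_HTML is a made-up stand-in for what requests receives, not the real page, and SnippetCounter and count_snippets are hypothetical helper names. It shows that a parser sees the script as text and never runs it, so the results container stays empty.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the HTML that requests receives: the results
# container is empty and a script is expected to fill it in later.
STATIC_HTML = """
<html><body>
  <div class="results"></div>
  <script>
    // In a browser, code like this would insert the snippets.
    document.querySelector(".results").innerHTML =
      '<div class="snippet"><h2>Example title</h2></div>';
  </script>
</body></html>
"""


class SnippetCounter(HTMLParser):
    """Counts <div class="snippet"> elements and <script> tags."""

    def __init__(self):
        super().__init__()
        self.snippets = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "snippet" in classes:
            self.snippets += 1
        elif tag == "script":
            self.scripts += 1


def count_snippets(html):
    """Return (snippet div count, script tag count) for an HTML string."""
    counter = SnippetCounter()
    counter.feed(html)
    counter.close()
    return counter.snippets, counter.scripts


snippets, scripts = count_snippets(STATIC_HTML)
print(f"snippets in static HTML: {snippets}, script tags: {scripts}")
```

If a check like this reports zero snippets but at least one script tag for the real response, the content is most likely injected client-side.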

You can also get the data from the JSON response of the API call that the page makes.

import json
import requests

# Request body and headers sent with the POST
body = "vodafone"
headers = {'content-type': 'application/json'}

api_url = "https://search.donanimhaber.com/api/search/portal/?q=vodafone&p=3&devicetype=browsermobile&order=date_desc&in=all&contenttype=all&wordtype=both&daterange=all"

# The endpoint answers with JSON, so no HTML parsing is needed
jsonData = requests.post(api_url, data=json.dumps(body), headers=headers).json()

# Print the category of each result
for item in jsonData['contents']:
    print(item['categoryName'])

Output:

Operatörler - Kurumsal Haberler
Operatörler - Kurumsal Haberler
Operatörler - Kurumsal Haberler
Mobil Aksesuarlar
Operatörler - Kurumsal Haberler
Kripto Para
Sinema ve Dizi
Mobil Oyunlar
Operatörler - Kurumsal Haberler
Operatörler - Kurumsal Haberler
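The query string in the API URL can also be built from a dict, which makes it easier to request other search terms or pages. The sketch below runs offline and makes two assumptions: that the p parameter is the page number (the answer does not say so), and that responses have the shape implied by the keys the answer reads ('contents' and 'categoryName'). build_search_url, category_names, and the sample payload are hypothetical names and data for illustration.

```python
from urllib.parse import urlencode

API_BASE = "https://search.donanimhaber.com/api/search/portal/"


def build_search_url(query, page=1):
    """Build the search API URL; treating `p` as the page number is an assumption."""
    params = {
        "q": query,
        "p": page,
        "devicetype": "browsermobile",
        "order": "date_desc",
        "in": "all",
        "contenttype": "all",
        "wordtype": "both",
        "daterange": "all",
    }
    return API_BASE + "?" + urlencode(params)


def category_names(payload):
    """Return the categoryName of each item, skipping items that lack it."""
    return [
        item["categoryName"]
        for item in payload.get("contents", [])
        if "categoryName" in item
    ]


# Hypothetical payload, shaped after the two keys the answer reads.
sample = {"contents": [{"categoryName": "Mobil Aksesuarlar"}, {"title": "no category"}]}

print(build_search_url("vodafone", page=3))
print(category_names(sample))
```

Using .get and a membership check means that a response without the expected keys produces an empty list rather than a KeyError.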

As mentioned, requests cannot render JavaScript, but there are two alternatives:

  • Use requests and send a POST request to the URL
  • Use selenium to get the rendered page_source, which contains the content you expect.

Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = 'https://search.donanimhaber.com/portal?q=vodafone&p=3&devicetype=browsermobile&order=date_desc&in=all&contenttype=all&wordtype=both&range=all'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
wait = WebDriverWait(driver, 10)
driver.get(url)

# Wait (up to 10 seconds) until JavaScript has inserted the snippets
wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[@class="results"]/div[@class="snippet"]')))

# Hand the rendered HTML to BeautifulSoup, then close the browser
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, "html.parser")

pages = soup.find_all('div', {'class': 'snippet'})

for p in pages:
    print(p.h2.text.strip())

Output:

Vodafone'dan dijital sağlık projelerine ücretsiz 5G desteği
Vodafone'un son 15 yılda Türkiye ekonomisine katkısı açıklandı
"Yarını Kodlayanlar" projesinde gençler afet sorunlarına çözümler üretti
Küresel akıllı saat pazarı yılın ilk çeyreğinde yüzde 35 büyüdü
Vodafone Türkiye'nin ilk çeyrek sonuçları açıklandı: Servis gelirlerinde yüzde 19 artış
Netflix'e yeni eklenen dizi ve filmleri takip edebileceğiniz site
Sony ve SinemaTV anlaştı! Spider-Man, Venom 2 ve daha fazlası TV'de ilk kez SinemaTV'de yayınlanacak
Vodafone ve Riot Games, Türkiye'nin ilk 5G Wild Rift turnuvasını duyurdu
Türkiye'de kaç kişi numara taşıma ile operatör değiştirdi?
Turkcell'in Ramazan'a özel Salla Kazan kampanyası başladı
