
Requests.get showing different HTML than Chrome's Developer Tools

I am working on a web scraping tool in Python (in a Jupyter notebook) that scrapes a few real estate pages and saves data such as price, address, etc.

It works fine for one of the pages I picked, but when I try to scrape this page: sreality.cz (sorry, the page is in Czech, but the actual content is not that important here) using requests.get(), I get this result:

 <!doctype html> <html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui"> <!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS ---> <title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title> <meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz"> <meta property="og:title" content="Sreality.cz • reality a nemovitosti z celé ČR"> <meta property="og:type" content="website"> <meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png"> <meta property="og:description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. 
Sreality.cz"> <meta property="og:url" content="https://www.sreality.cz/"> <meta ng-if="metaStatus.value" name="szn:status" content="{{metaStatus.value}}"> <meta http-equiv="imagetoolbar" content="no"> <link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico"> <link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3"> <link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3"> <link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3"> <link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3"> <link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3"> <link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3"> <link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3"> <link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3"> <link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3"> <link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png"> <link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png"> <link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png"> <link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png"> <link rel="manifest" href="/img/icons/android-chrome-manifest.json"> <meta name="msapplication-TileColor" content="#2b5797"> <meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png"> <meta name="msapplication-config" content="/img/icons/browserconfig.xml" /> <link rel="alternate" type="application/rss+xml" ng-href="{{ rss.url }}" ng-if="rss.url"> <link ng-repeat="lang in metaSeo.languages" rel="alternate" hreflang="{{lang.code}}" ng-href="{{lang.url}}"> <link rel="stylesheet" 
href="/css/all.css?2e96626"> <!-- Begin Inspectlet Embed Code --> <script type="text/javascript" id="inspectletjs"> window.__insp = window.__insp || []; __insp.push(['wid', 821249485]); __insp.push(["virtualPage"]); (function() { function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); }; setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp(); })(); </script> <!-- End Inspectlet Embed Code --> <!--[if lte IE 8]> <script> document.createElement('popover'); document.createElement('mortgage'); document.createElement('vendor'); document.createElement('hp-signpost'); document.createElement('category-switcher'); document.createElement('feedback'); document.createElement('bottom'); document.createElement('panorama'); document.createElement('panorama-prev'); document.createElement('sphere-viewer'); document.createElement('sphere-viewer-prev'); document.createElement('save-filter'); </script> <![endif]--> <!-- Statistiky --> <script src="https://h.imedia.cz/js/dot-small.js" type="text/javascript"></script> <script type="text/javascript"> (function() { try { // Při přesměrování na hashbang URL (IE8-9) ztrácíme referrer, // který je potřeba pro správné počítání statistik. 
if (window.sessionStorage) { // někdo může mít DOM storage zakázaný var l = document.createElement('a'); l.href = document.referrer; var referrerHostname = l.hostname; if (window.location.hostname != referrerHostname) { window.sessionStorage.setItem('referrer', l.href); } } // Starý android (< 4.0) v kombinaci s angularem špatně pracuje s hashem v URL. // Považuje ho za součást query případně path. // Na takových zařízech se budeme tvářit, že žádný hash nebyl. if (parseInt((/android (\\d+)/.exec(window.navigator.userAgent.toLowerCase()) || [])[1], 10) < 4) { var hrefWithoutHashbang = window.location.href.replace('/#!', ''); var hashIndex = hrefWithoutHashbang.indexOf('#'); if (hashIndex != -1) { window.location.replace(hrefWithoutHashbang.substring(0, hashIndex)); } } } catch (e) {} })(); </script> <!-- API mapy.cz --> <script type="text/javascript" src="https://api4.mapy.cz/loader.js"></script> <script type="text/javascript">Loader.load(null, {poi: true, pano: true})</script> <!-- Login reklama --> <script src="https://i.imedia.cz/js/im3.js" type="text/javascript"></script> <script src="https://1.im.cz/software/promo/promo-sbrowser.js"></script> <!-- Rozkopírování SID cookie --> <script src="https://h.imedia.cz/js/sid.js"></script> <!-- Login --> <script src="https://login.szn.cz/js/api/login.js"></script> <script> login.cfg({ serviceId: "sreality" }); </script> <!-- KONFIGURACE --> <script src="/js/conf/config.js?2e96626"></script> <script src="/js/advert.js"></script> <script src="/js/all.js?2e96626"></script> <script type="text/javascript"> if (window.DOT) { var dotCfg = { service: 'sreality' }; if (window.SrealityABTest && window.SrealityABTest.getVariant()) { dotCfg.abtest = window.SrealityABTest.getVariant(); } DOT.cfg(dotCfg); } </script> <noscript> <meta http-equiv="refresh" content="0;url=?_escaped_fragment_="/> </noscript> <meta name="fragment" content="!" 
ng-if="metaSeo.showMetaFragment" /> </head> <!--[if IE 8]> <body class="ie8"> <![endif]--> <!--[if IE 9]> <body class="notie8 ie9"> <![endif]--> <!--[if gt IE 9]><!--> <body class="notie8 notie9 lang-{{html.lang}}"> <!--<![endif]--> <div loading-line></div> <div page-layout> <div ng-view></div> </div> </body> </html> 

This differs from what I see when I inspect the page in Chrome's Developer Tools. Part of that code is here (the whole thing doesn't fit, and uploadtext isn't working for some reason):

 <!DOCTYPE html> <html lang="cs" ng-app="sreality" ng-controller="MainCtrl" class="ng-scope"><head><style type="text/css">@charset "UTF-8";[ng\\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide{display:none !important;}ng\\:form{display:block;}.ng-animate-block-transitions{transition:0s all!important;-webkit-transition:0s all!important;}.ng-hide-add-active,.ng-hide-remove{display:block!important;}</style> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui"> <!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS ---> <title ng:bind-template="Byty na prodej Brno-město, posledních 30 dní • Sreality.cz" class="ng-binding">Byty na prodej Brno-město, posledních 30 dní • Sreality.cz</title> <meta name="description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů."> <meta property="og:title" content="Byty na prodej Brno-město, posledních 30 dní"> <meta property="og:type" content="website"> <meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png"> <meta property="og:description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. 
Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů."> <meta property="og:url" content="https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic"> <!-- ngIf: metaStatus.value --><meta ng-if="metaStatus.value" name="szn:status" content="200" class="ng-scope"><!-- end ngIf: metaStatus.value --> <meta http-equiv="imagetoolbar" content="no"> <link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico"> <link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3"> <link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3"> <link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3"> <link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3"> <link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3"> <link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3"> <link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3"> <link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3"> <link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3"> <link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png"> <link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png"> <link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png"> <link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png"> <link rel="manifest" href="/img/icons/android-chrome-manifest.json"> <meta name="msapplication-TileColor" content="#2b5797"> <meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png"> <meta name="msapplication-config" content="/img/icons/browserconfig.xml"> <!-- ngIf: rss.url --><link rel="alternate" 
type="application/rss+xml" ng-href="/api/cs/v2/estates/rss?category_main_cb=1&amp;locality_district_id=72&amp;suggested_regionId=-1&amp;suggested_districtId=-1&amp;estate_age=31&amp;locality_region_id=14&amp;category_type_cb=1" ng-if="rss.url" class="ng-scope" href="/api/cs/v2/estates/rss?category_main_cb=1&amp;locality_district_id=72&amp;suggested_regionId=-1&amp;suggested_districtId=-1&amp;estate_age=31&amp;locality_region_id=14&amp;category_type_cb=1"><!-- end ngIf: rss.url --> 

I can see from the first HTML document, the one requests.get downloads, that the page runs some scripts (it is an AngularJS app, and placeholders such as {{ html.lang }} are still unfilled), which probably cause the HTML to differ.
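One way to confirm this suspicion programmatically is to check whether the downloaded HTML still contains raw template markers. The snippet below is only a minimal sketch: the helper name `looks_unrendered`, the two regular expressions, and the shortened sample strings (including the `/detail/1` link) are assumptions made for illustration, modeled on the two HTML dumps above.

```python
import re

# AngularJS interpolation placeholders such as {{ html.lang }} or {{metaSeo.title}}
PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")
# An ng-view container that has not been filled with content yet
EMPTY_VIEW = re.compile(r"<div\s+ng-view>\s*</div>")


def looks_unrendered(html: str) -> bool:
    """Return True if the HTML still looks like a raw AngularJS template."""
    return bool(PLACEHOLDER.search(html) or EMPTY_VIEW.search(html))


# Shortened stand-ins for the two documents shown in the question (illustrative only).
raw = '<html lang="{{ html.lang }}" ng-app="sreality"><body><div ng-view></div></body></html>'
rendered = (
    '<html lang="cs" ng-app="sreality" class="ng-scope"><body>'
    '<div ng-view class="ng-scope"><a class="title" href="/detail/1">Byt</a></div>'
    "</body></html>"
)

print(looks_unrendered(raw))       # True
print(looks_unrendered(rendered))  # False
```

If the check returns True for a response from requests.get, the data is most likely filled in later by JavaScript, so any plain HTTP client will see the same template.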

I already tried using urllib, but the resulting HTML document was the same.

Is there a way to download the HTML I see when I open the page in Chrome's Developer Tools, so I can scrape it?

If the data from that page is ultimately what you are after, you can get it very easily using selenium in combination with BeautifulSoup. The code below prints the links to all the apartments on the page.

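from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real Chrome instance, so the page's JavaScript actually runs.
driver = webdriver.Chrome()

driver.get("https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic")
# page_source reflects the page after the scripts have run, unlike requests.get().
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Each listing sits in a .text-wrap element; its .title anchor holds a relative href.
for listing in soup.select(".text-wrap"):
    link = "https://www.sreality.cz" + listing.select_one(".title").get("href")
    print(link)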
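Because the listings are rendered asynchronously, reading the page immediately after `driver.get()` can occasionally return HTML in which they are not present yet. A possible variant, sketched here under assumptions, waits explicitly for the first listing link before collecting the URLs. The function names `to_absolute` and `fetch_listing_links` and the 15-second timeout are made up for this example; the `.text-wrap` and `.title` selectors are taken from the code above and may change if the site is redesigned.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sreality.cz"
SEARCH_URL = BASE_URL + "/hledani/prodej/byty/brno?stari=mesic"


def to_absolute(hrefs, base=BASE_URL):
    """Turn hrefs into absolute URLs, dropping empty values and duplicates."""
    seen, result = set(), []
    for href in hrefs:
        if not href:
            continue
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def fetch_listing_links(url=SEARCH_URL, timeout=15):
    """Open the page in Chrome, wait for the listings to render, return their URLs."""
    # Imported here so that to_absolute() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one listing link is in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".text-wrap .title"))
        )
        anchors = driver.find_elements(By.CSS_SELECTOR, ".text-wrap .title")
        return to_absolute(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()  # always release the browser, even on errors
```

Calling `fetch_listing_links()` and printing each item should give the same kind of output as the loop above. The `try`/`finally` ensures the Chrome process is closed even when the wait times out.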


 