
Python Selenium WebDriver not showing all HTML

I am developing a web scraper in Python.

This is my code:

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.hapag-lloyd.com/en/home.html")

source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')

print(soup)

But the HTML returned is different from what I see in the browser (please check the last few lines):

<html><head>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="-1" http-equiv="Expires"/>
<meta content="no-cache" http-equiv="CacheControl"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="data:;base64,iVBORw0KGgo=" rel="shortcut icon"/>
<script>

(function(){
window["bobcmn"] = "111110101010102000000022000000052000000002a4b927ad200000096300000000300000000300000006/TSPD/300000008TSPD_101300000005https3000000b0081ecde62cab2000d65f90c7efd5185e314a8800e00a5aad11b1a439eb174c6c3f64d45284e14d9508dcf0830d0a2800346a2db5907272d4309ad725a7dc856ab98589c10724bd284477ca152744f4ac2102b44d72e2a1e9200000000200000000";

window.aIv=!!window.aIv;try{(function(){(function(){})();var sZ=78;try{var IZ,lZ,OZ=s(868)?0:1,zZ=s(999)?0:1,ss=s(445)?0:1,Ss=s(601)?0:1;for(var is=(s(421),0);is<lZ;++is)OZ+=s(211)?2:1,zZ+=s(768)?1:2,ss+=(s(54),2),Ss+=s(289)?2:3;IZ=OZ+zZ+ss+Ss;window.zz===IZ&&(window.zz=++IZ)}catch(Ls){window.zz=IZ}var Os=!0;function _(Z){var S=arguments.length,I=[],O=1;while(O<S)I[O-1]=arguments[O++]-Z;return String.fromCharCode.apply(String,I)}
function SS(Z){var S=30;!Z||document[J(S,148,135,145,135,128,135,138,135,146,151,113,146,127,146,131)]&&document[_(S,148,135,145,135,128,135,138,135,146,151,113,146,127,146,131)]!==l(68616527636,S)||(Os=!1);return Os}function l(Z,S){Z+=S;return Z.toString(36)}function J(Z){var S=arguments.length,I=[];for(var O=1;O<S;++O)I.push(arguments[O]-Z);return String.fromCharCode.apply(String,I)}function _S(){}SS(window[_S[_(sZ,188,175,187,179)]]===_S);SS(typeof ie9rgb4!==l(1242178186121,sZ));
SS(RegExp("\x3c")[l(1372127,sZ)](function(){return"\x3c"})&!RegExp(l(42811,sZ))[l(1372127,sZ)](function(){return"'x3'+'d';"}));
var IS=window[J(sZ,175,194,194,175,177,182,147,196,179,188,194)]||RegExp(J(sZ,187,189,176,183,202,175,188,178,192,189,183,178),l(-60,sZ))[l(1372127,sZ)](window["\x6e\x61vi\x67a\x74\x6f\x72"]["\x75\x73e\x72A\x67\x65\x6et"]),jS=+new Date+(s(267)?375283:6E5),JS,Z_,s_,S_=window[_(sZ,193,179,194,162,183,187,179,189,195,194)],__=IS?s(890)?18994:3E4:s(725)?3775:6E3;
document[J(sZ,175,178,178,147,196,179,188,194,154,183,193,194,179,188,179,192)]&&document[J(sZ,175,178,178,147,196,179,188,194,154,183,193,194,179,188,179,192)](J(sZ,196,183,193,183,176,183,186,183,194,199,177,182,175,188,181,179),function(Z){var S=88;document[J(S,206,193,203,193,186,193,196,193,204,209,171,204,185,204,189)]&&(document[_(S,206,193,203,193,186,193,196,193,204,209,171,204,185,204,189)]===_(S,192,193,188,188,189,198)&&Z[J(S,193,203,172,202,205,203,204,189,188)]?s_=!0:document[J(S,206,
193,203,193,186,193,196,193,204,209,171,204,185,204,189)]===l(68616527578,S)&&(JS=+new Date,s_=!1,i_()))});function i_(){if(!document[_(47,160,164,148,161,168,130,148,155,148,146,163,158,161)])return!0;var Z=+new Date;if(Z>jS&&(s(988)?840535:6E5)>Z-JS)return SS(!1);var S=SS(Z_&&!s_&&JS+__<Z);JS=Z;Z_||(Z_=!0,S_(function(){Z_=!1},s(891)?0:1));return S}i_();var I_=[s(915)?10661718:17795081,s(30)?27611931586:2147483647,s(748)?1636390818:1558153217];
function L_(Z){var S=43;Z=typeof Z===l(1743045633,S)?Z:Z[_(S,159,154,126,159,157,148,153,146)](s(837)?37:36);var I=window[Z];if(!I[_(S,159,154,126,159,157,148,153,146)])return;var O=""+I;window[Z]=function(Z,S){Z_=!1;return I(Z,S)};window[Z][J(S,159,154,126,159,157,148,153,146)]=function(){return O}}for(var O_=(s(493),0);O_<I_[l(1294399127,sZ)];++O_)L_(I_[O_]);SS(!1!==window[_(sZ,175,151,196)]);window.LZ={zs:"084e4452c4017800c5def6fe02b0086dc53ff9519b1bcb514d1f4dd874776393bcfec37f99ebfc4795da47aec5f492a8a4131f92a5e26fecd10807e6bd8ba79b77bb1692ddac2154a98808ca5559f35a278cf21dd71a1e61c4579303187e42dc179ae0846f6078a996bb6f824e2238fc7b431f54a421fcf7145bd4fcc3d9b982"};
function Zi(Z){var S=+new Date,I;!document[_(63,176,180,164,177,184,146,164,171,164,162,179,174,177,128,171,171)]||S>jS&&(s(968)?421041:6E5)>S-JS?I=SS(!1):(I=SS(Z_&&!s_&&JS+__<S),JS=S,Z_||(Z_=!0,S_(function(){Z_=!1},s(688)?0:1)));return!(arguments[Z]^I)}function s(Z){return 265>Z}
(function(){var Z=/(\A([0-9a-f]{1,4}:){1,6}(:[0-9a-f]{1,4}){1,1}\Z)|(\A(([0-9a-f]{1,4}:){1,7}|:):\Z)|(\A:(:[0-9a-f]{1,4}){1,7}\Z)/ig,S=document.getElementsByTagName("head")[0],I=[];S&&(S=S.innerHTML.slice(0,1E3));while(S=Z.exec(""))I.push(S)})();})();}catch(x){}finally{ie9rgb4=void(0);};function ie9rgb4(a,b){return a>>b>>0};

})();

</script>
<script src="/TSPD/081ecde62cab200082f75af3905bec19af31f4aaf7bd4079c3ac5a62a6fb4096cfcec166097ddde7?type=7" type="text/javascript"></script>
<noscript>Please enable JavaScript to view the page content.<br/>Your support ID is: 17324345507588527622.</noscript>
</head><body>
<form action="" enctype="multipart/form-data" method="post"><input name="_pd" type="hidden" value=""/></form></body></html>

It reports "Please enable JavaScript to view the page content.
Your support ID is: 17324345507588527622.".

I checked a few similar questions asked by other people. According to them, using Chrome should have solved this problem.

I also tried to get the HTML with requests-html, but the program just keeps running and never returns anything.

It's a limitation of the page_source method. See this answer: https://stackoverflow.com/a/64897405/1387701

See also the description in the source code:

Description copied from interface: WebDriver

Get the source of the last loaded page. If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server. Think of it as an artist's impression.
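One possible workaround, given that page_source may not reflect what JavaScript does after loading, is to read the live DOM with execute_script and keep polling until the "Please enable JavaScript" placeholder is gone. The sketch below is only an illustration under stated assumptions: wait_for_rendered_html and fetch_rendered_html are hypothetical helper names, the marker string is taken from the output in the question, and it is an assumption that the real page does not contain that string. If the site refuses headless browsers outright, waiting will not help and the helper will simply time out.

```python
import time

# Marker copied from the <noscript> text in the question's output.
# Assumption: the real page does not contain this string.
CHALLENGE_MARKER = "Please enable JavaScript to view the page content."


def wait_for_rendered_html(driver, marker=CHALLENGE_MARKER, timeout=30.0, poll=0.5):
    """Poll the live DOM until `marker` is gone, then return the HTML.

    `driver` only needs an execute_script() method, as a Selenium
    WebDriver has. Raises TimeoutError if the marker is still present
    once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Ask the browser for the current DOM instead of using page_source.
        html = driver.execute_script("return document.documentElement.outerHTML;")
        if marker not in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("challenge page still present after %.1f s" % timeout)
        time.sleep(poll)


def fetch_rendered_html(url):
    """Hypothetical helper: open `url` in headless Chrome and wait for the page."""
    # Imported here so the polling helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return wait_for_rendered_html(driver)
    finally:
        driver.quit()
```

With the code from the question, this would be called as fetch_rendered_html("https://www.hapag-lloyd.com/en/home.html"), and the returned string can be passed to BeautifulSoup as before.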
