
Python: how to get the (decoded) HTML source code

I am trying to get the source code of a webpage (one listing the current foreign exchange rates) in Python 2.7.13. Normally that is no problem with requests.get(url, headers), etc. In this case I can download the page, but some parts seem to be encoded (base64?).

However, when I visit the page in a browser and view the source, the correct (decoded) code is shown. The question is: how can I get the decoded page source? The URL is: https://www.isbank.com.tr/en/foreign-exchange-rates

Part of the code I use is:

import requests

url = "https://www.isbank.com.tr/en/foreign-exchange-rates"
resp = requests.get(url)
out = resp.text

The response contains text in Turkish saying that the request was rejected because of "unusual traffic detected from your device". It seems that the site checks the User-Agent header to keep simple scripts from crawling it. You can get past the check by sending a plausible, browser-like User-Agent header:

import requests

url = 'https://www.isbank.com.tr/en/foreign-exchange-rates'
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
resp = requests.get(url, headers={'User-Agent': ua})
out = resp.text
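
If several requests are made, the same header can be attached once to a requests.Session instead of being passed on every call. The following is only a minimal sketch of that idea: the helper names build_session and fetch_html and the 10-second timeout are illustrative assumptions, not part of the original answer, and the site may apply other checks that this does not cover.

```python
import requests

URL = "https://www.isbank.com.tr/en/foreign-exchange-rates"
# User-Agent string taken from the answer above; whether the site keeps
# accepting it is an assumption.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_session(user_agent=USER_AGENT):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its decoded text.

    Raises requests.HTTPError if the server answers with a 4xx/5xx status.
    The timeout value is an arbitrary choice for this sketch.
    """
    if session is None:
        session = build_session()
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```

Calling fetch_html(URL) then returns the page text, in the same way as resp.text in the snippet above. Note that raise_for_status() only catches HTTP error codes; if the site serves its rejection page with a 200 status, the returned text still has to be inspected.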
