Why does this simple two-line Python web scraping code execute correctly in online Python interpreters but not on my PC?

import urllib2
hdr={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.8'}
print urllib2.urlopen(urllib2.Request('https://mobile.lowvig.ag/sports', headers=hdr)).read()

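When I run it from these two online interpreters, this two-line program prints the correct HTML: https://repl.it/languages/python and https://paiza.io/en/languages/python

However, when I run it from my home PC, it prints what looks like a Cloudflare warning page: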

 <!DOCTYPE html> <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]--> <!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]--> <!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]--> <head> <title>Attention Required! | Cloudflare</title> <meta name="captcha-bypass" id="captcha-bypass" /> <meta charset="UTF-8" /> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" /> <meta name="robots" content="noindex, nofollow" /> <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" /> <link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" /> <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]--> <style type="text/css">body{margin:0;padding:0}</style> <!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]--> <!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]--> </head> <body> <div id="cf-wrapper"> <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div> <div id="cf-error-details" class="cf-error-details-wrapper"> <div class="cf-wrapper cf-header cf-error-overview"> <h1 data-translate="challenge_headline">One more step</h1> <h2 class="cf-subheadline"><span data-translate="complete_sec_check">Please complete the security check to access</span> mobile.lowvig.ag</h2> </div><!-- /.header --> <div class="cf-section cf-highlight cf-captcha-container"> <div class="cf-wrapper"> <div class="cf-columns two"> <div class="cf-column"> <div class="cf-highlight-inverse cf-form-stacked"> <form 
class="challenge-form" id="challenge-form" action="/cdn-cgi/l/chk_captcha" method="get"> <input type="hidden" name="s" value="b3df19c7958863448f72e934c06ae7332861f030-1564882975-1800-AcBCDHplwdTaCzOhBkp56Ja0sk/FSnXB3lJxmJpKdOTH0MYNevcFL2u/8NelatBwLBq+AfsceRViMsHQs7gnTUCvyRKSpGh4IizRs3BPQflkFl9uaScZ4CoP1yZCKYWVWrFDkwELhwE6KPGUci0e6XT1ph465Mzcryl6xtId0S0U"></input> <script type="text/javascript" src="/cdn-cgi/scripts/cf.challenge.js" data-type="normal" data-ray="500cd664e845b615" async data-sitekey="6LfBixYUAAAAABhdHynFUIMA_sa4s-XsJvnjtgB0"></script> <div class="g-recaptcha"></div> <noscript id="cf-captcha-bookmark" class="cf-captcha-info"> <div><div style="width: 302px"> <div> <iframe src="https://www.google.com/recaptcha/api/fallback?k=6LfBixYUAAAAABhdHynFUIMA_sa4s-XsJvnjtgB0" frameborder="0" scrolling="no" style="width: 302px; height:422px; border-style: none;"></iframe> </div> <div style="width: 300px; border-style: none; bottom: 12px; left: 25px; margin: 0px; padding: 0px; right: 25px; background: #f9f9f9; border: 1px solid #c1c1c1; border-radius: 3px;"> <textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid #c1c1c1; margin: 10px 25px; padding: 0px; resize: none;"></textarea> <input type="submit" value="Submit"></input> </div> </div></div> </noscript> </form> </div> </div> <div class="cf-column"> <div class="cf-screenshot-container"> <span class="cf-no-screenshot"></span> </div> </div> </div><!-- /.columns --> </div> </div><!-- /.captcha-container --> <div class="cf-section cf-wrapper"> <div class="cf-columns two"> <div class="cf-column"> <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2> <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p> </div> <div class="cf-column"> <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the 
future?</h2> <p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p> <p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p> <p data-translate="resolve_captcha_privacy_pass">Another way to prevent getting this page in the future is to use Privacy Pass. Check out the browser extension in the <a href="https://chrome.google.com/webstore/detail/privacy-pass/ajhmfdgkijocedmfjonnpjfojldioehi">Chrome Store</a>.</p> </div> </div> </div><!-- /.section --> <div class="cf-error-footer cf-wrapper"> <p> <span class="cf-footer-item">Cloudflare Ray ID: <strong><snip></strong></span> <span class="cf-footer-separator">&bull;</span> <span class="cf-footer-item"><span>Your IP</span>: <snip></span> <span class="cf-footer-separator">&bull;</span> <span class="cf-footer-item"><span>Performance &amp; security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span> </p> </div><!-- /.error-footer --> </div><!-- /#cf-error-details --> </div><!-- /#cf-wrapper --> <script type="text/javascript"> window._cf_translation = {}; </script> </body> </html>

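I am trying to figure out how these environments access this page differently from my home PC, so that I can scrape this site successfully.

Update, in response to the two submitted answers: the following code routes the request through a proxy (I had to switch from urllib2 to requests) and still shows the same Cloudflare page, with the proxy IP correctly displayed as "Your IP" at the bottom: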

import requests
hdr={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.8'}
proxyDict={'https': 'https://user:pass@ip:port'}
print requests.get('https://mobile.lowvig.ag/sports', headers=hdr, proxies=proxyDict).content

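This makes me think that something specific to my environment, other than my IP, is triggering the problem. Also, I can view the site in a normal browser from my home IP without any problem.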
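
Since the proxy test suggests the IP address alone is not the trigger, one thing worth comparing between the home PC and the online interpreters is the Python build and the TLS library it links against. This is an assumption about where the difference might lie, not a confirmed cause. A minimal Python 3 sketch (the function name `describe_environment` is made up for illustration) that prints those details so the output can be compared across machines:

```python
import platform
import ssl
import sys


def describe_environment():
    """Collect local details that can differ from one machine to another."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        # The OpenSSL build that this Python uses for HTTPS connections.
        "openssl_version": ssl.OPENSSL_VERSION,
        # HAS_TLSv1_3 is missing on older builds, so fall back to False.
        "tls_1_3_available": getattr(ssl, "HAS_TLSv1_3", False),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

Running the same script on the home PC and in each online interpreter gives a side-by-side view of the parts of the environment that are not the IP address.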

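The owner of lowvig.ag has chosen to use Cloudflare to check your IP address, AS number, country, or user agent, probably to protect the site from bots.

If you believe you should be exempt from this check, you may need to contact the site owner to have your IP address whitelisted.

See https://support.cloudflare.com/hc/en-us/articles/203366080-Why-do-I-see-a-captcha-or-challenge-page-Attention-Required-trying-to-visit-a-site-protected-by-Cloudflare-as-a-site-visitor-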
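
Whatever triggers the check, a scraper can at least recognise the challenge page instead of treating it as real content. The sketch below is only a heuristic: the marker strings are copied from the challenge HTML shown in the question, and the names `CHALLENGE_MARKERS` and `looks_like_cloudflare_challenge` are hypothetical. Cloudflare may change this markup at any time.

```python
# Strings that appear in the challenge page quoted in the question.
CHALLENGE_MARKERS = (
    "Attention Required! | Cloudflare",
    'id="challenge-form"',
    "cf-captcha-container",
)


def looks_like_cloudflare_challenge(html):
    """Return True if the HTML contains any of the known challenge markers."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


sample = "<title>Attention Required! | Cloudflare</title>"
print(looks_like_cloudflare_challenge(sample))                   # True
print(looks_like_cloudflare_challenge("<title>Sports</title>"))  # False
```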

I tried your code (on Python 2.7) and it works on my PC. My IP is not blacklisted.

I wrote similar code for Python 3.7. It also worked.

import urllib.request

hdr = {
    'User-Agent': 'Mozilla/5.0', 
    'Accept-Language': 'en-US,en;q=0.8'
    }

req = urllib.request.Request(
    'https://mobile.lowvig.ag/sports', 
    headers=hdr
    )

with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))
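
When comparing a machine where this works with one where it does not, it can help to look at the status code and response headers as well as the body. The variant below is a sketch under assumptions: the helper name `fetch` and its `opener` parameter are made up for illustration, and whether the challenge page arrives with an error status is not something the question confirms. `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the helper catches that and still returns the body.

```python
import urllib.error
import urllib.request


def fetch(url, headers, opener=urllib.request.urlopen):
    """Return (status, headers, text) for url, including on HTTP error statuses."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            body = resp.read().decode('utf-8', 'replace')
            return resp.status, dict(resp.headers), body
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object: it carries a code,
        # headers and a readable body.
        body = err.read().decode('utf-8', 'replace')
        return err.code, dict(err.headers), body
```

Calling `fetch('https://mobile.lowvig.ag/sports', hdr)` with the `hdr` dictionary from the code above and printing the first two returned values on each machine shows whether the responses differ in more than the HTML.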

