Web scraping with urlopen in Python

Question

I want to get data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems that urlopen does not get the HTML code, and I don't understand why. My code looks like this:

import urllib.request

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS").read()
print(html)

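My code is correct: the same code gets me the HTML source of other web pages, but it does not seem to work for this address.

It prints: b''

Maybe another library is more appropriate? Why doesn't urlopen return the HTML code of the web page? Thanks!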

Answer 1

Personally, I write:

# Python 2.7

import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)
content = sock.read() 
sock.close()

print content

And if you speak French... welcome to stackoverflow.com!

Update 1

In fact, I now prefer the following code, because it is faster:

# Python 2.7

import httplib

conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

try:
    conn.request('GET', req)
except Exception:
    print 'echec de connexion'  # French for "connection failed"

content = conn.getresponse().read()

print content

Changing httplib to http.client in this code (and turning the print statements into print() calls) should be enough to adapt it to Python 3.
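
The port described above could look roughly like the following. It is a minimal sketch that has not been run against the live site; the helper name fetch and its port parameter are introduced here only for illustration and are not part of the original answer.

```python
# Python 3 sketch of the httplib code above, using http.client instead.
import http.client


def fetch(host, path, timeout=30, port=80):
    """Send GET http://host:port/path and return the raw body as bytes."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('GET', path)
        # read() returns bytes in Python 3, not str
        return conn.getresponse().read()
    finally:
        # Close the connection whether or not the request succeeded
        conn.close()
```

Calling fetch('www.boursorama.com', '/includes/cours/last_transactions.phtml?symbole=1xEURUS') would mirror the request made by the Python 2.7 code. Connection errors are left to propagate here rather than being caught and printed.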


I confirm that with both pieces of code I obtain the page source, in which I can see the data you are interested in:

        <td class="L20" width="33%" align="center">11:57:44</td>

        <td class="L20" width="33%" align="center">1.4486</td>

        <td class="L20" width="33%" align="center">0</td>

</tr>

                                        <tr>

        <td  width="33%" align="center">11:57:43</td>

        <td  width="33%" align="center">1.4486</td>

        <td  width="33%" align="center">0</td>

</tr>

Update 2

Adding the following snippet to the code above will let you extract the data I think you want:

for i,line in enumerate(content.splitlines(True)):
    print str(i)+' '+repr(line)

print '\n\n'


import re

regx = re.compile('\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')

print regx.findall(content)

Result (only the end)

.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '</script>\n'
104 '<script type="text/javascript">\n'
105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
107 '\t\t\t\tvar sas_formatids = "8968";\n'
108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
111 "\twindow.addEvent('domready', function(){\r\n"
112 'sas_move(1,8968);\t});\r\n'
113 '</script>\n'
114 '<script type="text/javascript">\n'
115 'var _gaq = _gaq || [];\n'
116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
120 "_gaq.push(['_trackPageLoadTime']);\n"
121 "_gaq.push(['_trackPageview']);\n"
122 '(function() {\n'
123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
126 '})();\n'
127 '</script>\n'
128 '</body>\n'
129 '</html>'



[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]

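I hope you don't intend to "play" at Forex trading: it is one of the best ways to lose money fast.

Update 3

Sorry! I forgot that you are using Python 3. So I think you have to define the regex like this: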

regx = re.compile(b'\t\t\t\t\t\t......)

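That is, put b before the string; otherwise you will get an error like the one in this question (in Python 3, a str pattern cannot be applied to bytes).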
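
To make the b prefix concrete, here is a small sketch of the same extraction with a bytes pattern. It is not the exact regex from Update 2: the leading tabs and the \r\n line endings are relaxed to \s*, which is an assumption about the page layout, and the names ROW, extract_rows and sample are made up for this illustration. The sample bytes are modelled on the rows shown earlier, not fetched from the site.

```python
import re

# Bytes pattern (rb'...'): read() returns bytes in Python 3, so the
# pattern has to be bytes too. Whitespace between cells is matched
# loosely with \s* (an assumption, see the note above).
ROW = re.compile(
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>'
)


def extract_rows(content):
    """Return a list of (time, price, volume) tuples decoded to str."""
    return [tuple(field.decode('ascii') for field in match)
            for match in ROW.findall(content)]


# Hypothetical sample modelled on the HTML shown in Update 1
sample = (
    b'\t\t\t\t\t\t<td class="L20" width="33%" align="center">11:57:44</td>\r\n'
    b'\t\t\t\t\t\t<td class="L20" width="33%" align="center">1.4486</td>\r\n'
    b'\t\t\t\t\t\t<td class="L20" width="33%" align="center">0</td>\r\n'
)
print(extract_rows(sample))  # [('11:57:44', '1.4486', '0')]
```

Decoding the captured fields is optional; without it, findall returns tuples of bytes objects.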

Answer 2

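I suspect that what is happening is that the server is sending compressed data without telling you that it is doing so. Python's standard HTTP library does not handle compressed responses for you.
I suggest using httplib2, which can handle compressed formats (and is generally much better than urllib).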

import httplib2
folder = httplib2.Http('.cache')
response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")

print(response) shows us the server's response:
{'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

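While this does not confirm that the data was compressed (we are now telling the server that we can handle compression, after all), it does lend some weight to the theory.

The actual content lives in, you guessed it, content. A brief look at it shows that this is working (I am only going to paste a little bit):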
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n\t"http://

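Edit: yes, this does create a folder named .cache; I have found it is always better to work with a cache folder in httplib2, and you can always delete the folder afterwards.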
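
If you would rather stay with the standard library, the compression theory in this answer can also be handled by hand. The following is only a sketch: decode_body is a hypothetical helper, and it assumes the server labels a compressed body with a Content-Encoding header of gzip or deflate, which is exactly what is in doubt here. The round trip at the end uses made-up data, not the real page.

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            return zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw  # no (known) encoding: return the bytes unchanged


# Round trip on made-up data
page = b'<html><body>1.4486</body></html>'
print(decode_body(gzip.compress(page), 'gzip') == page)  # True
```

With urllib.request, the two arguments would come from response.read() and response.headers.get('Content-Encoding').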

Answer 3

I have tested your URL with httplib2 and, in the terminal, with curl. Both work fine:

import httplib2

URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
h = httplib2.Http()
resp, content = h.request(URL, "GET")
print(content)

So to me, either there is a bug in urllib.request, or some really weird client-server interaction is going on.
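
One way to probe the client-server interaction with urllib.request itself is to control the request headers explicitly. The response shown in Answer 2 includes a Vary header naming Accept-Encoding and User-Agent, so the server may answer differently depending on them; whether that explains the empty body is an assumption, not something confirmed in this thread. The helper build_request and the 'Mozilla/5.0' value are illustrative only.

```python
import urllib.request

URL = ("http://www.boursorama.com/includes/cours/"
       "last_transactions.phtml?symbole=1xEURUS")


def build_request(url, user_agent='Mozilla/5.0'):
    """Build (but do not send) a GET request with explicit headers."""
    return urllib.request.Request(url, headers={
        'User-Agent': user_agent,          # replaces the default Python-urllib agent
        'Accept-Encoding': 'identity',     # ask for an uncompressed body
    })


req = build_request(URL)
# Inspect what would be sent before touching the network
print(req.header_items())
```

Passing req to urllib.request.urlopen and calling read() on the result would then send the request with these headers.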

3 answers

Answer 1: 4, accepted, 2011-08-23 08:55:59
Answer 2: 4, 2011-08-23 10:41:12
Answer 3: 2, 2011-08-23 08:49:21
