Trying to get proxies from a web page using a regular expression in Python

Question

import urllib.request
import re
page = urllib.request.urlopen("http://www.samair.ru/proxy/ip-address-01.htm").read()
re.findall('\d+\.\d+\.\d+\.\d+', page)

I don't understand why it says:

  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

Answer 1

import urllib
import re
page = urllib.urlopen("http://www.samair.ru/proxy/ip-address-01.htm").read()
print re.findall('\d+\.\d+\.\d+\.\d+', page)

works and gives me this result:

['056.249.66.50', '100.44.124.8', '103.31.250.115', ...

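Edit

This works on Python 2.7, where urllib.urlopen(...).read() returns a str, so a str pattern can be matched against it directly.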
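
For comparison, here is a minimal sketch of what happens on Python 3, runnable without network access. The `io.BytesIO` object and its contents are stand-ins I made up for the real response from `urllib.request.urlopen`. The only point is that `read()` hands back `bytes`, and a str pattern refuses to match against it.

```python
import io
import re

# Hypothetical stand-in for the HTTP response object; like the real one,
# its read() method returns bytes rather than str.
fake_response = io.BytesIO(b'proxy 103.31.250.115:80')
page = fake_response.read()
print(type(page))

try:
    # A str pattern applied to a bytes object raises TypeError on Python 3.
    re.findall(r'\d+\.\d+\.\d+\.\d+', page)
except TypeError as exc:
    print('TypeError:', exc)
```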

Answer 2

Reading from the file-like object returned by urllib.request.urlopen gives you a bytes object. You can decode it to a Unicode string and use a Unicode (str) regular expression:

>>> re.findall(r'\d+\.\d+\.\d+\.\d+', page.decode('utf-8'))
['056.249.66.50', '100.44.124.8', '103.31.250.115', '105.236.180.243', '105.236.21.213', '108.171.162.172', '109.207.61.143', '109.207.61.197', '109.207.61.202', '109.226.199.129', '109.232.112.109', '109.236.220.98', '110.196.42.33', '110.74.197.141', '110.77.183.64', '110.77.199.111', '110.77.200.248', '110.77.219.154', '110.77.219.2', '110.77.221.208']

...or use a bytes regular expression:

>>> re.findall(rb'\d+\.\d+\.\d+\.\d+', page)
[b'056.249.66.50', b'100.44.124.8', b'103.31.250.115', b'105.236.180.243', b'105.236.21.213', b'108.171.162.172', b'109.207.61.143', b'109.207.61.197', b'109.207.61.202', b'109.226.199.129', b'109.232.112.109', b'109.236.220.98', b'110.196.42.33', b'110.74.197.141', b'110.77.183.64', b'110.77.199.111', b'110.77.200.248', b'110.77.219.154', b'110.77.219.2', b'110.77.221.208']

Which one you pick depends on the data type you prefer to work with.
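
The two options can be sketched side by side as small helpers. The function names, the `sample_page` bytes, and the default `utf-8` encoding are my own assumptions for illustration; the real page may use a different charset. Raw-string patterns are used so the backslashes reach the regex engine untouched.

```python
import re

# Precompiled patterns: one for str input, one for bytes input.
IP_STR = re.compile(r'\d+\.\d+\.\d+\.\d+')
IP_BYTES = re.compile(rb'\d+\.\d+\.\d+\.\d+')


def find_ips_as_text(page, encoding='utf-8'):
    """Decode the bytes first, then match with the str pattern."""
    # errors='replace' keeps going if the page is not valid in this encoding.
    return IP_STR.findall(page.decode(encoding, errors='replace'))


def find_ips_as_bytes(page):
    """Match the raw bytes directly with the bytes pattern."""
    return IP_BYTES.findall(page)


# Hypothetical sample standing in for the downloaded page.
sample_page = b'<td>056.249.66.50:8080</td><td>100.44.124.8:3128</td>'

print(find_ips_as_text(sample_page))   # list of str
print(find_ips_as_bytes(sample_page))  # list of bytes
```

Decoding first is usually more convenient if the addresses are going to be printed or joined with other strings later. The bytes pattern skips the decoding step and sidesteps the question of which encoding the page uses.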


Answer 1: 1, accepted, 2013-04-27 22:14:51

Answer 2: 1, 2013-04-27 22:22:24
