Why won't urllib work with a local website?

Question

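I'm having a problem with urllib: I can't seem to scrape my own local website. I can get it to print out the entire contents of the page, but the regex (or something else) isn't working. The only output I get from my current code is []. So what am I doing wrong? I haven't used urllib in a while, so I've most likely missed something obvious.

Python file: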

import urllib
import re

htmlfile=urllib.urlopen('IP of server')
htmltext=htmlfile.read()
regex="<body>(.+?)</body>"
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

HTML file:

<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>

Thanks a bunch in advance!

Answer 1

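A few things are wrong here. You need to enable the dotall modifier so that the dot also matches newline sequences; without it, .+? cannot get past the line break that follows <body>, which is why findall returns []. The lines that compile the regex and call findall should read: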

regex = "<body>(.+?)</body>"
pattern = re.compile(regex, re.DOTALL)
price = pattern.findall(htmltext)

This can be simplified as follows. I would also recommend stripping the surrounding whitespace from the match result.

price = re.findall(r'(?s)<body>\s*(.+?)\s*</body>', htmltext)
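To see the difference side by side, here is a minimal sketch that runs the three patterns against the HTML from the question. It assumes Python 3 and that the page has already been read into a str (in Python 3, the bytes returned by urlopen().read() would need decoding first); the variable names are illustrative only.

```python
import re

# The HTML from the question, held as a str.
html = """<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>
"""

# Without DOTALL, "." stops at every newline, so the pattern cannot match.
without_flag = re.findall(r"<body>(.+?)</body>", html)

# With DOTALL, "." also matches "\n"; the indentation and newlines are captured too.
with_flag = re.findall(r"<body>(.+?)</body>", html, re.DOTALL)

# The \s* on each side keeps the surrounding whitespace out of the capture group.
trimmed = re.findall(r"(?s)<body>\s*(.+?)\s*</body>", html)

print(without_flag)
print(with_flag)
print(trimmed)
```

The first call reproduces the empty list from the question, the second returns the body text with its leading and trailing whitespace, and the third returns just the sentence.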

For future reference, use a parser such as BeautifulSoup to extract data instead of regular expressions.

Answer 2

Alternatively, and this is actually preferable to the regex-based approach, use an HTML parser.

Example (using BeautifulSoup):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <html>
...     <body>
...         This is a basic HTML file to try to get my python file to work...
...     </body>
... </html>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> print soup.body.get_text(strip=True)
This is a basic HTML file to try to get my python file to work...

Note how simple the code is, with no "regex magic".
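If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a rough illustration, assuming Python 3; BodyTextExtractor is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start collecting once the opening <body> tag is seen.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        # Stop collecting at the closing </body> tag.
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)

    def text(self):
        # Join the collected pieces and trim the surrounding whitespace.
        return "".join(self.chunks).strip()


data = """
<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>
"""

extractor = BodyTextExtractor()
extractor.feed(data)
extractor.close()
body_text = extractor.text()
print(body_text)
```

It takes more lines than the BeautifulSoup version, but newlines inside the body need no special handling because the parser hands over the text as-is.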

Answer 3

The dot . does not match newlines unless you set the dot-matches-all (s) modifier:

re.compile('<body>(.+?)</body>', re.DOTALL)
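The modifier can be spelled in more than one way in Python's re module. The short sketch below, with a made-up sample string, checks that re.DOTALL, its alias re.S, and the inline (?s) form give the same result.

```python
import re

# Made-up sample with newlines between the tags.
text = "<body>\nline one\nline two\n</body>"

# Three spellings of the same dot-matches-all behaviour.
with_dotall = re.compile("<body>(.+?)</body>", re.DOTALL).findall(text)
with_s = re.compile("<body>(.+?)</body>", re.S).findall(text)
with_inline = re.compile("(?s)<body>(.+?)</body>").findall(text)

print(with_dotall == with_s == with_inline)
print(with_dotall)
```

Which spelling to use is mostly a matter of taste; the inline form is handy when the pattern is passed straight to re.findall without a separate flags argument.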
