Why doesn't this code work? Python RE (regular expressions)

Question

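I'm trying to use RE with Python to extract all the products from a random page. However, I can't figure out why I'm not getting any matches. I'm fairly sure the problem is in the re.findall part of my code, specifically where I added ".*".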

    import urllib2, re
    response=urllib2.urlopen('http://www.tigerdirect.com/applications/Category/guidedSearch.asp?CatId=17&cm_sp=Masthead-_-Computers-_-Spot%2002')
    stuff=response.read()
    laptops= re.findall(r'<div class="product"> .* </div>',stuff)
    for laptop in laptops:
        print laptop

Answer 1

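.* is a greedy match: it consumes as much text as it can. You should use .*? instead, which is the non-greedy version and stops at the first point where the rest of the pattern can match.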
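A minimal sketch of the difference, assuming a made-up two-record string (the `html` sample below is hypothetical, not the real page). It is written to run under Python 3 as well as Python 2:

```python
import re

# Hypothetical sample markup, not taken from the real page.
html = '<div class="product">A</div><div class="product">B</div>'

# Greedy: .* runs to the LAST </div>, so both records end up in one match.
greedy = re.findall(r'<div class="product">.*</div>', html)

# Non-greedy: .*? stops at the FIRST </div> after each opening tag.
lazy = re.findall(r'<div class="product">.*?</div>', html)

print(len(greedy))  # 1
print(len(lazy))    # 2
```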

Here is an example of one of the records you are trying to match:

'<div class="product"><div class="productImage"><a href="../SearchTools/item-details.asp?EdpNo=4903661&amp;Sku=T71-156409" class="itemImage" title="Lenovo Z580 15.6&quot; Core i7 500GB HDD Notebook PC"><img src="http://images.highspeedbackbone.net/skuimages/medium/T71-156409_chiclet01xx_er.jpg" alt="It features a powerful 3rd generation Intel Core i7-3520M and a 4GB DDR3 RAM which deliver you a powerful computing performance." onerror="this.src=\'http://images.highspeedbackbone.net/SearchTools/no_image-sm.jpg\'" border="0" /></a></div><div class="productInfo"><h3 class="itemName"><a href="../SearchTools/item-details.asp?EdpNo=4903661&amp;Sku=T71-156409" title="Lenovo Z580 15.6&quot; Core i7 500GB HDD Notebook PC">Lenovo Z580 15.6" Core i7 500GB HDD Notebook PC</a></h3></div>'

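It looks like you also need to remove the spaces around the .*, because the markup has no whitespace between the tags: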

    >>> response=urllib2.urlopen('http://www.tigerdirect.com/applications/Category/guidedSearch.asp?CatId=17&cm_sp=Masthead-_-Computers-_-Spot%2002')
    >>> stuff=response.read()
    >>> laptops= re.findall(r'<div class="product">.*?</div>',stuff)
    >>> len(laptops)
    16
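To illustrate why the spaces mattered: unless a pattern is compiled with `re.VERBOSE`, a space in it is a literal character that must appear in the input. The sketch below assumes a shortened, hypothetical snippet shaped like the record above, with no whitespace between tags:

```python
import re

# Hypothetical sample: tags are adjacent, with no spaces between them.
html = '<div class="product"><div class="productImage"></div>'

# The original pattern requires a literal space after the opening tag
# and another before </div>, so nothing matches.
with_spaces = re.findall(r'<div class="product"> .* </div>', html)

# Without the spaces (and with the non-greedy quantifier) the record matches.
without_spaces = re.findall(r'<div class="product">.*?</div>', html)

print(with_spaces)          # []
print(len(without_spaces))  # 1
```

Note that the non-greedy match ends at the first `</div>`, which in nested markup like this closes the inner `productImage` div rather than the outer `product` div. The regex counts records correctly but does not capture each record in full.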

There are better ways to parse HTML, such as BeautifulSoup.
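As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup (a swap made only to keep the example dependency-free; BeautifulSoup would be shorter). The `ProductTitles` class and the sample string are hypothetical; the `h3` tag and its `itemName` class come from the record shown above.

```python
from html.parser import HTMLParser  # Python 3; the module is named HTMLParser in Python 2


class ProductTitles(HTMLParser):
    """Collect the title attribute of links inside <h3 class="itemName">."""

    def __init__(self):
        super().__init__()
        self.in_item_name = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("class") == "itemName":
            self.in_item_name = True
        elif tag == "a" and self.in_item_name and "title" in attrs:
            self.titles.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_item_name = False


# Hypothetical, shortened sample modelled on the record shown above.
html = (
    '<div class="product"><div class="productInfo">'
    '<h3 class="itemName"><a href="#" title="Lenovo Z580">Lenovo Z580</a></h3>'
    '</div></div>'
)

parser = ProductTitles()
parser.feed(html)
print(parser.titles)
```

Because the parser tracks tags as they open and close, nested `div` elements and whitespace between tags do not affect the result.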

Answer 1
2 | Accepted | 2013-06-20 03:11:04
