I am trying to use the re module in Python to extract all the products from a page. However, I am not getting any matches and I do not know why. I am positive that the problem is in the re.findall part of my code, specifically when I add the " .* ".
import urllib2, re
response=urllib2.urlopen('http://www.tigerdirect.com/applications/Category/guidedSearch.asp?CatId=17&cm_sp=Masthead-_-Computers-_-Spot%2002')
stuff=response.read()
laptops= re.findall(r'<div class="product"> .* </div>',stuff)
for laptop in laptops:
    print laptop
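.* is a greedy match: it consumes as much text as it can, so the match runs to the last </div> it can reach. You should probably use .*? which is the non-greedy version and stops at the first </div>.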
Here is an example of one of the records you are trying to match:
'<div class="product"><div class="productImage"><a href="../SearchTools/item-details.asp?EdpNo=4903661&Sku=T71-156409" class="itemImage" title="Lenovo Z580 15.6" Core i7 500GB HDD Notebook PC"><img src="http://images.highspeedbackbone.net/skuimages/medium/T71-156409_chiclet01xx_er.jpg" alt="It features a powerful 3rd generation Intel Core i7-3520M and a 4GB DDR3 RAM which deliver you a powerful computing performance." onerror="this.src=\'http://images.highspeedbackbone.net/SearchTools/no_image-sm.jpg\'" border="0" /></a></div><div class="productInfo"><h3 class="itemName"><a href="../SearchTools/item-details.asp?EdpNo=4903661&Sku=T71-156409" title="Lenovo Z580 15.6" Core i7 500GB HDD Notebook PC">Lenovo Z580 15.6" Core i7 500GB HDD Notebook PC</a></h3></div>'
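It looks like you need to remove the spaces around .* too. A space in the pattern only matches a literal space, and the markup has no space after <div class="product">.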
>>> response=urllib2.urlopen('http://www.tigerdirect.com/applications/Category/guidedSearch.asp?CatId=17&cm_sp=Masthead-_-Computers-_-Spot%2002')
>>> stuff=response.read()
>>> laptops= re.findall(r'<div class="product">.*?</div>',stuff)
>>> len(laptops)
16
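To see the effect of each change in isolation, the three patterns can be compared on a small inline string. This is only a sketch: the two-product snippet is made up for illustration (it is not taken from the page) and is kept on one line, since . does not match newlines unless re.DOTALL is passed.

```python
import re

# Hypothetical two-product snippet; like the real markup, it has no
# spaces between the tags.
html = ('<div class="product"><h3>Laptop A</h3></div>'
        '<div class="product"><h3>Laptop B</h3></div>')

# The spaces around .* are literal, so a space is required right after
# the opening tag and right before </div>. The markup has neither.
with_spaces = re.findall(r'<div class="product"> .* </div>', html)

# Greedy: .* runs to the last </div>, merging both products into one match.
greedy = re.findall(r'<div class="product">.*</div>', html)

# Non-greedy: .*? stops at the first </div> after each opening tag.
non_greedy = re.findall(r'<div class="product">.*?</div>', html)

print(len(with_spaces))
print(len(greedy))
print(len(non_greedy))
```

Only the non-greedy pattern without the spaces yields one match per product here.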
There are better ways of parsing HTML, such as BeautifulSoup.
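One reason to prefer a parser: when a product div contains nested divs, as in the record above, the non-greedy regex stops at the first inner </div> and returns only part of the record. A minimal sketch with BeautifulSoup follows, assuming the beautifulsoup4 package is installed. The snippet is a simplified, made-up stand-in for the page's markup, and looking up the name through h3 with class itemName is an assumption based on the single record shown.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical snippet with nested divs, shaped like the record above.
html = ('<div class="product">'
        '<div class="productImage"><img src="a.jpg" /></div>'
        '<div class="productInfo"><h3 class="itemName">'
        '<a href="#">Laptop A</a></h3></div>'
        '</div>'
        '<div class="product">'
        '<div class="productImage"><img src="b.jpg" /></div>'
        '<div class="productInfo"><h3 class="itemName">'
        '<a href="#">Laptop B</a></h3></div>'
        '</div>')

soup = BeautifulSoup(html, 'html.parser')

# class_ matches the CSS class token "product" exactly, so the nested
# productImage and productInfo divs are not returned as separate hits.
products = soup.find_all('div', class_='product')

# Each result is the whole product element, nested divs included.
names = [p.find('h3', class_='itemName').get_text() for p in products]
print(names)
```

With the real page, html would be the stuff string read from the response.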