
Trouble printing all items from a list in python

I'm trying to learn how to do web scraping, and the output isn't coming out in the format I hoped for. Here is the issue I'm running into:

import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i=0
while i<len(pagelist):
    url = "http://www.boostmobile.com/stores/?" +pagelist[i]+"&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>' 
    pattern = re.compile(regex)
    storeName = re.findall(pattern,htmltext)
    print "Store Name=", storeName[i]
    i+=1

This code produces this result: Store Name= Boost Mobile store by wireless depot, Store Name= Wal-Mart, ..... and so on for 10 different stores. I'm assuming this happens because

while i<len(pagelist):

only loops ten times,

so it only prints out ten of the stores instead of all the stores listed on all the pages.

When I change the second-to-last line to this:

print storeName

it will print out every store name listed on each page, but not in the format above. Instead it looks like this: 'Boost mobile store by wireless depot', 'boost mobile store by kob wireless', 'marietta check chashing services', ..... and so on for about another 120 entries. So how do I get it in the desired format of "Store Name = ...." rather than 'name', 'name', .....?

Do not parse HTML with regex. Use a specialized tool: an HTML parser.

Here's the solution using BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url))

    print "Page Number: %s" % page
    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print h2.text

It prints:

Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...

As you can see, we first find a table tag with the results class - this is where the store names actually are. Then, inside that table, we find all of the h2 tags. This is more robust than relying on the style attribute of a tag.
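The find/find_all pattern above can be illustrated with a small, self-contained sketch (Python 3, using an inline HTML snippet instead of a live page; the table structure here is an assumption modeled on the answer's description):

```python
from bs4 import BeautifulSoup

# Inline snippet mimicking the structure described above:
# store names live in h2 tags inside a table with class "results".
html = """
<table class="results">
  <tr><td><h2 style="float:left;">Boost Mobile Store by Wireless Depot</h2></td></tr>
  <tr><td><h2>Target</h2></td></tr>
</table>
<h2>Unrelated heading outside the table</h2>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find("table", class_="results")          # narrow the search to the table
names = [h2.text for h2 in results.find_all("h2")]      # only h2 tags inside that table
print(names)  # ['Boost Mobile Store by Wireless Depot', 'Target']
```

Note that the unrelated h2 outside the table is not picked up, which is exactly why scoping the search to the table is more robust than matching on a style attribute.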


You can also make use of SoupStrainer. It improves performance, since it parses only the part of the document that you specify:

from bs4 import SoupStrainer

required_part = SoupStrainer('table', class_="results")
for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url), parse_only=required_part)

    print "Page Number: %s" % page
    for h2 in soup.find_all('h2'):
        print h2.text

Here we are saying: "parse only the table tag with the class results, and give us all of the h2 tags inside it."
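A minimal sketch of the SoupStrainer idea (Python 3, inline HTML rather than a live page):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<div><h2>Header outside the table</h2></div>
<table class="results"><tr><td><h2>Wal-Mart</h2></td></tr></table>
"""

# Only the matching table is parsed; everything else is skipped entirely,
# so the resulting soup never even contains the outside h2.
only_results = SoupStrainer("table", class_="results")
soup = BeautifulSoup(html, "html.parser", parse_only=only_results)

names = [h2.text for h2 in soup.find_all("h2")]
print(names)  # ['Wal-Mart']
```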

Also, if you want to improve performance further, you can let BeautifulSoup use the lxml parser under the hood:

soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=required_part) 

Hope that helps.

storeName is a list, and you need to loop through it. Currently you index into it a single time per page, using the page number as the index, which was probably not your intent.

Here is a corrected version of your code, with the loop added.

import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i=0
while i<len(pagelist):
    url = "http://www.boostmobile.com/stores/?" +pagelist[i]+"&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>' 
    pattern = re.compile(regex)
    storeName = re.findall(pattern,htmltext)
    for sn in storeName:
        print "Store Name=", sn
    i+=1
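The difference between indexing once and looping can be shown in isolation (Python 3 sketch; the HTML string and store names here are made up for illustration):

```python
import re

html = ('<h2 style="float:left;">Store A</h2>'
        '<h2 style="float:left;">Store B</h2>')

# re.findall returns every match as a list, not a single string.
store_names = re.findall(r'<h2 style="float:left;">(.+?)</h2>', html)
print(store_names)  # ['Store A', 'Store B']

# Printing the list gives the 'name', 'name', ... output from the question;
# looping prints each entry in the desired "Store Name= ..." format.
for name in store_names:
    print("Store Name=", name)
```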
