
Trouble printing all items from a list in python

I'm trying to learn how to do web scraping, and it's not coming out in the format I would hope. Here is the issue I'm running into:

import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i=0
while i<len(pagelist):
    url = "http://www.boostmobile.com/stores/?" +pagelist[i]+"&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>' 
    pattern = re.compile(regex)
    storeName = re.findall(pattern,htmltext)
    print "Store Name=", storeName[i]
    i+=1

This code produces this result: Store Name = Boost Mobile store by wireless depot Store Name = Wal-Mart ..... and so on for 10 different stores. I'm assuming this happens because

while i<len(pagelist):

is only equal to ten,

so it only prints out ten of the stores instead of all the stores listed on all pages.

When I change the second to last line to this

print storeName

It will print out every store name listed on each page, but not in the format above; instead it comes out like this: 'Boost mobile store by wireless depot', 'boost mobile store by kob wireless', 'marietta check chashing services', ..... and so on for about another 120 entries. So how do I get it in the desired format of "Store Name = ...." rather than 'name', 'name', .....?

Do not parse HTML with regex. Use a specialized tool - an HTML parser.

Here's the solution using BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url))

    print "Page Number: %s" % page
    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print h2.text

It prints:

Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...

As you can see, first we find a table tag with the results class - this is where the store names actually are. Then, inside the table, we find all of the h2 tags. This is more robust than relying on the style attribute of a tag.
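
To see the difference, here is a minimal sketch (the HTML snippet below is made up for illustration) comparing a lookup keyed on the inline style attribute with a lookup keyed on the table's class:

from bs4 import BeautifulSoup

# Made-up markup resembling what the stores page returns
html = """
<table class="results">
  <tr><td><h2 style="float:left;">Boost Mobile Store by Wireless Depot</h2></td></tr>
  <tr><td><h2 style="float: left">Wal-Mart</h2></td></tr>
</table>
"""

soup = BeautifulSoup(html)

# Matching on the inline style needs an exact string match, so the
# second h2 (whose CSS is written slightly differently) is missed:
print len(soup.find_all('h2', style="float:left;"))    # 1

# Matching on the table's class and taking every h2 inside it is
# indifferent to cosmetic style changes:
results = soup.find('table', class_="results")
print len(results.find_all('h2'))                       # 2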


You can also make use of SoupStrainer. It would improve performance, since it would parse only the part of the document that you specify:

from bs4 import SoupStrainer

required_part = SoupStrainer('table', class_="results")
for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url), parse_only=required_part)

    print "Page Number: %s" % page
    for h2 in soup.find_all('h2'):
        print h2.text

Here we are saying: "parse only the table tag with the class results, and give us all of the h2 tags inside it."
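
To make the effect visible, here is a small sketch (again with made-up markup): the strained soup only ever contains the results table, so anything outside of it simply is not there:

from bs4 import BeautifulSoup, SoupStrainer

html = """
<div class="header">site navigation</div>
<table class="results">
  <tr><td><h2>Target</h2></td></tr>
</table>
<div class="footer">copyright</div>
"""

required_part = SoupStrainer('table', class_="results")
soup = BeautifulSoup(html, parse_only=required_part)

print soup.find('div')        # None - the divs were never parsed
for h2 in soup.find_all('h2'):
    print h2.text             # Target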

Also, if you want to improve performance, you can let BeautifulSoup use the lxml parser under the hood:

soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=required_part) 

Hope that helps.

storeName is a list, and you need to loop through it. Currently you are indexing into it a single time for each page, using the page number, which was probably not your intent.

Here is a corrected version of your code, with the loop added.

import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i=0
while i<len(pagelist):
    url = "http://www.boostmobile.com/stores/?" +pagelist[i]+"&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>' 
    pattern = re.compile(regex)
    storeName = re.findall(pattern,htmltext)
    # loop over every store name found on this page,
    # instead of printing a single entry per page
    for sn in storeName:
        print "Store Name=", sn
    i+=1
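
As a side note, the same fix reads a bit more naturally as a for loop over pagelist itself, which removes the manual index entirely. This is only a sketch that keeps the original urllib/regex approach, so the caveats about regex-based HTML parsing from the first answer still apply:

import urllib
import re

pagelist = ["page=%d" % n for n in range(1, 11)]   # same ten pages as before
pattern = re.compile('<h2 style="float:left;">(.+?)</h2>')

for page in pagelist:
    url = "http://www.boostmobile.com/stores/?" + page + "&zipcode=30008"
    htmltext = urllib.urlopen(url).read()
    # print every store name found on this page
    for sn in pattern.findall(htmltext):
        print "Store Name=", sn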
