
How do I follow links (or scrape multiple links) when web scraping with urllib2?

I am trying to scrape the URL 'http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730#p1' (purely for informational purposes), but I cannot figure out how to get to the next page. My current code is below; it just scrapes the first page over and over instead of moving on to the next one.

import urllib2
from bs4 import BeautifulSoup

page_num = 1

while True:
    # The page number is appended after '#', as it appears in the browser's address bar
    url = 'http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730#p' + str(page_num)
    html = urllib2.urlopen(url).read()
    market_page = BeautifulSoup(html, 'html.parser')

    # Each search result is one row div; print its name and price
    for row in market_page('div', {'class': 'market_listing_row market_recent_listing_row market_listing_searchresult'}):
        item_name = row.find_all('span', {'class': 'market_listing_item_name'})[0].get_text()
        price = row.find_all('span')[1].get_text()
        print item_name + ' costs ' + price

    # Advance once per page, not once per listing
    page_num += 1
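
One likely reason every iteration returns the same page: the part of a URL after `#` is a fragment, which the browser handles itself and which is not used by the server to select a page. The sketch below is only an illustration of that point. It uses the standard library's `urldefrag` on a shortened version of the URL above (the shortened `base` and the helper `request_url` are assumptions made for the example) and shows that the `#p1` and `#p2` variants reduce to the same request URL.

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2, alongside urllib2

# Shortened version of the search URL, for readability only
base = 'http://steamcommunity.com/market/search?q=&appid=730'


def request_url(url):
    """Return the URL with its fragment (everything from '#') removed."""
    return urldefrag(url)[0]


page1 = request_url(base + '#p1')
page2 = request_url(base + '#p2')

# Both "pages" point at the same resource once the fragment is dropped
same = (page1 == page2)
print(same)
```

If that is what is happening here, changing the number after `#p` cannot change the HTML that comes back.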

EDIT: Another problem with the page I'm trying to scrape is that its next-page links do not have any hrefs, so I tried looping over different URLs instead, but that still scrapes the first page repeatedly.

import urllib2
from bs4 import BeautifulSoup

pages = 90

# Page numbers in the URL start at 1 (#p1), so count from 1 to pages inclusive
for page in range(1, pages + 1):
    url = 'http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730#p' + str(page)
    html = urllib2.urlopen(url).read()
    market_page = BeautifulSoup(html, 'html.parser')

    # Each search result is one row div; print its name and price
    for row in market_page('div', {'class': 'market_listing_row market_recent_listing_row market_listing_searchresult'}):
        item_name = row.find_all('span', {'class': 'market_listing_item_name'})[0].get_text()
        price = row.find_all('span')[1].get_text()
        print item_name + ' costs ' + price
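
For a loop like this to reach different pages, the page position would presumably have to travel in the query string rather than after `#`. The following is a rough sketch of that idea only, not the site's actual interface: the parameter names `start` and `count` and the helper `build_page_urls` are hypothetical, and the real names would have to be read from the requests the page itself makes when a next-page link is clicked.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_page_urls(base_url, params, pages, page_size):
    """Build one URL per page with the offset in the query string.

    'start' and 'count' are hypothetical parameter names used for illustration.
    """
    urls = []
    for page in range(pages):
        query = list(params) + [('start', page * page_size),
                                ('count', page_size)]
        urls.append(base_url + '?' + urlencode(query))
    return urls


urls = build_page_urls(
    'http://steamcommunity.com/market/search',
    [('q', ''), ('appid', 730)],  # shortened filter list, for readability
    pages=3,
    page_size=10,
)
for url in urls:
    print(url)
```

Each generated URL differs in its query string, so each would be a distinct request; the existing `urllib2.urlopen` and `BeautifulSoup` steps could then be applied to every entry in `urls`.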
