
Scraping a paginated website: Scraping page 2 gives back page 1 results

I am using the get method of the requests library in Python to scrape information from a website that is organized into pages (i.e. paginated, with page numbers at the bottom).

Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian

I am able to extract the data that I need from the first page, but when I feed my code the URL for the second page, I get the same data as for the first page. After carefully analyzing my code, I am certain the issue is not my code logic but the way the second page URL is structured.

So my question is: how can I get my code to work as I want? I suspect it is a question of parameters, but I am not 100% sure. If it is indeed parameters that I need to pass to the request, I would appreciate some guidance on how to break them down. My page 2 link is attached below. Thanks.

Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q= 'selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'

Note: The pages are not really links per se.

It looks like the platform is ASP.NET and the pagination links are driven by JavaScript. I seriously doubt you will have an easy time with plain Python here, since BeautifulSoup is only an HTML parser/extractor and does not run scripts. If you really want to use this site, I would suggest looking into Selenium or even PhantomJS, since they behave like a full browser.

But in this particular case you are lucky, because there's a legacy website version which doesn't use modern bells and whistles :)

http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
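Since the page number in that legacy link is an ordinary query parameter, the URL for any page can be built programmatically. The following is only a sketch: the helper name `legacy_url` is made up for illustration, the parameter values are copied from the link above, and it assumes the legacy site still responds and treats `page` as the page number.

```python
from urllib.parse import urlencode

# Base address copied from the legacy link above.
LEGACY_BASE = "http://legacy.realfood.tesco.com/recipes/search.html"


def legacy_url(page):
    """Build the legacy search URL for a given page (hypothetical helper)."""
    params = {
        "st": "vegetarian",        # search term, as in the link above
        "cr": "False",             # copied as-is; its meaning is not known
        "page": page,              # assumed to be the page number
        "srt": "searchRelevance",  # sort order, as in the link above
    }
    return LEGACY_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for page in range(1, 4):
        print(legacy_url(page))
```

Each of these URLs could then be passed to requests.get or urllib.request.urlopen in the usual way.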

It looks like the pagination of this site is handled by the parameters that follow #!q= in the second URL you posted, i.e.:

https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'

That string is URL-encoded: %3D is = and %26 is &. It might be more readable like this:

q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
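The same decoding can be done with Python's standard library, which also makes it easy to see which part of the link is the real query string and which part sits after the #. This is a minimal sketch; the function name `decode_hashbang` is invented for illustration.

```python
from urllib.parse import urlsplit, unquote, parse_qs

PAGE2 = ("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
         "#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30"
         "%26DietaryOption%3DVegetarian'")


def decode_hashbang(url):
    """Return the parameters hidden in the '#!q=' part as a dict of lists."""
    fragment = urlsplit(url).fragment            # everything after '#'
    encoded = fragment.split("q=", 1)[1]         # drop the leading "!q="
    encoded = encoded.strip().strip("'")         # drop spaces and quotes
    return parse_qs(unquote(encoded))            # %3D -> '=', %26 -> '&'


if __name__ == "__main__":
    print(urlsplit(PAGE2).query)    # the only part sent as a query string
    print(decode_hashbang(PAGE2))
```

Note that the page number lives in the fragment (the part after #). HTTP clients do not send the fragment to the server, which would explain why fetching the page 2 link from a script returns the page 1 results.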

For example, if you wanted to pull back the fifth page of Vegetarian Recipes the URL would look like this:

https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q= 'selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'

You can keep incrementing the page number until you reach a page with no results.
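Generating those links for successive page numbers might look like the sketch below. The helper name `page_url` is hypothetical and the parameter values are taken from the links above. One caveat: the page number sits after the #, which a plain HTTP client does not send to the server, so these links are most likely only useful when loaded by something that runs the page's JavaScript, such as a browser driven by Selenium.

```python
from urllib.parse import quote

BASE = "https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"


def page_url(page, per_page=30):
    """Build the '#!q=' link for a given page number (hypothetical helper)."""
    inner = ("selectedobjecttype=RECIPES"
             "&page={}&perpage={}"
             "&DietaryOption=Vegetarian").format(page, per_page)
    # safe="" makes quote() encode '=' as %3D and '&' as %26 as well.
    return BASE + "#!q='" + quote(inner, safe="") + "'"


if __name__ == "__main__":
    for page in range(1, 6):
        print(page_url(page))
```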

What about this?

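from bs4 import BeautifulSoup
import urllib.request

base = "https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
# Page 2 link from the question, with the page number left as a placeholder.
# Caveat: the part after '#' is a fragment, which urllib does not send to the server.
hashbang = "#!q='selectedobjecttype%3DRECIPES%26page%3D{}%26perpage%3D30%26DietaryOption%3DVegetarian'"

for numb in range(1, 11):  # pages 1 to 10; ('1', '10') would loop over just two strings
    resp = urllib.request.urlopen(base + hashbang.format(numb))
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))

    # Print the target of every link on the page.
    for link in soup.find_all('a', href=True):
        print(link['href'])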

Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...
