
Scraping a paginated website: Scraping page 2 gives back page 1 results

I am using the get method of the requests library in Python to scrape information from a website which is organized into pages (i.e. paginated with numbers at the bottom).

Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian

I am able to extract the data I need from the first page, but when I feed my code the URL for the second page, I get the same data as the first page. After carefully analyzing my code, I am certain the issue is not my code logic but the way the second page's URL is structured.

So my question is: how can I get my code to work as I want? I suspect it is a question of parameters, but I am not 100% sure. If it is indeed parameters that I need to pass to the request, I would appreciate some guidance on how to break them down. My page 2 link is attached below. Thanks.

Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'

Note: The pages are not really links per se.

It looks like the platform is ASP.NET and the pagination links are driven by JS. I seriously doubt you will have it easy with Python alone, since beautifulsoup is just an HTML parser/extractor, so if you really want to scrape this site I would suggest looking into Selenium or even PhantomJS, since they fully replicate a browser.
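For instance, a minimal Selenium sketch (assuming Chrome and a matching chromedriver are installed; the fixed sleep is a crude stand-in for a proper wait):

from selenium import webdriver
import time

driver = webdriver.Chrome()
# The page number lives in the #! fragment, which only the in-page JS reads,
# so a real browser has to run that JS before the results appear.
driver.get("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
           "#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30"
           "%26DietaryOption%3DVegetarian'")
time.sleep(5)               # give the JS time to render page 2 of the results
html = driver.page_source   # the rendered DOM, ready for an HTML parser
driver.quit()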

But in this particular case you are lucky, because there's a legacy version of the website which doesn't use modern bells and whistles :)

http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
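Since the page number there is an ordinary query parameter, the requests library the question already uses can pass it directly; a minimal sketch:

import requests

params = {
    'st': 'vegetarian',
    'cr': 'False',
    'page': 3,               # a real query parameter, visible to the server
    'srt': 'searchRelevance',
}
resp = requests.get('http://legacy.realfood.tesco.com/recipes/search.html',
                    params=params)
print(resp.status_code, resp.url)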

It looks like the pagination of this site is handled by the query parameters passed in the second URL you posted, i.e.:

https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'

The query string is URL-encoded: %3D is = and %26 is &. It might be more readable like this:

q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
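You can check this decoding with the standard library, e.g.:

from urllib.parse import unquote, parse_qs

fragment = ("selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30"
            "%26DietaryOption%3DVegetarian")
decoded = unquote(fragment)
print(decoded)
# selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian
print(parse_qs(decoded))
# {'selectedobjecttype': ['RECIPES'], 'page': ['2'], 'perpage': ['30'],
#  'DietaryOption': ['Vegetarian']}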

For example, if you wanted to pull back the fifth page of vegetarian recipes, the URL would look like this:

https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'
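Going the other way, urllib.parse.quote can build the fragment for any page number, e.g.:

from urllib.parse import quote

page = 5
query = ("selectedobjecttype=RECIPES&page={}&perpage=30"
         "&DietaryOption=Vegetarian".format(page))
url = ("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
       "#!q='" + quote(query, safe='') + "'")
print(url)

Bear in mind that the #! fragment is interpreted client-side, so this URL paginates in a browser (or Selenium), but a plain requests/urllib GET never sends it to the server.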

You can keep incrementing the page number until you get a page with no results.
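A rough sketch of that loop, shown against the legacy endpoint since the fragment is client-side only; the "no results" check here is an assumption and should be matched to whatever the empty page actually contains:

import requests
from bs4 import BeautifulSoup

page = 1
while True:
    resp = requests.get('http://legacy.realfood.tesco.com/recipes/search.html',
                        params={'st': 'vegetarian', 'cr': 'False',
                                'page': page, 'srt': 'searchRelevance'})
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Hypothetical selector for recipe links; adjust it to the real markup.
    recipes = soup.select('a[href*="/recipes/"]')
    if not recipes:          # assumed stop condition: an empty results page
        break
    page += 1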

What about this?

from bs4 import BeautifulSoup
import urllib.request

# The page number must be part of the URL sent to the server; the legacy
# endpoint above accepts it as a plain query parameter.
for numb in range(1, 11):
    url = ("http://legacy.realfood.tesco.com/recipes/search.html"
           "?st=vegetarian&cr=False&page={}&srt=searchRelevance".format(numb))
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp, "html.parser",
                         from_encoding=resp.info().get_param("charset"))

    # Print every hyperlink found on the page.
    for link in soup.find_all("a", href=True):
        print(link["href"])

Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...
