
Scraping the next page: next page's URL staying on the same page

I start from this page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A and want to scrape all the next pages until it has reached the bottom.

For each letter A to Z, the next pages' URLs (as shown in the browser) are https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/<index>, where the index increments by 80 each time. For instance, the first next page is https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80. My first idea was to build the URL addresses based on this rule and fetch them with urllib. However, when I implement this in Python,

res = urllib.request.urlopen(url)
soup = BeautifulSoup(res, "lxml")

it seems that I always stay on the first page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/ .
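The URL scheme described above can be generated up front. A minimal sketch (the per-letter page count is unknown here, so a caller would still need a stop condition):

```python
# Sketch of the URL scheme: the first page for a letter has no index,
# and each following page appends an index that grows by 80.
BASE = "https://www.cnrtl.fr/portailindex/LEXI/TLFI"

def page_urls(letter, pages):
    """Yield the first `pages` URLs for one letter: /A, /A/80, /A/160, ..."""
    yield f"{BASE}/{letter}"
    for i in range(1, pages):
        yield f"{BASE}/{letter}/{i * 80}"

print(list(page_urls("A", 3)))
# -> ['https://www.cnrtl.fr/portailindex/LEXI/TLFI/A',
#     'https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80',
#     'https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/160']
```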

A second idea is to get the next page from the next-page button; an example of the next-page button is

<a href="/portailindex/LEXI/TLFI/B/480"><img src="/images/portail/right.gif" title="Page suivante" \
           border="0" width="32" height="32" alt="" /></a>

but all I get is again /portailindex/LEXI/TLFI/B/480, and when calling urllib.request on this, it does not advance to the next page.


So, why does https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80 work in the browser, while urllib.request brings me back to https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/ ?

Is there an elegant way to go from one page to the next here until it finishes nicely?

This seems to do it:

import re
import string
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

dictionary = []

# Pattern matched against each link's href to pick out the word entries;
# adjust it to the actual href format of the index pages.
regex = re.compile(r"/definition/")

def get_words_in_page(url):
    res = urllib.request.urlopen(url)
    soup = BeautifulSoup(res, "lxml")
    for w in soup.findAll("a", {"href": regex}):
        dictionary.append(w.string)

base_url = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/"

for l in string.ascii_uppercase:
    letter_url = base_url + l  # keep base_url unchanged between letters
    get_words_in_page(letter_url)
    next_index = 0
    while True:
        next_index += 80
        url = letter_url + "/" + str(next_index)
        try:
            # probe the page first; an HTTP error means we ran past the last index
            urllib.request.urlopen(url)
        except (ValueError, urllib.error.HTTPError):
            break
        get_words_in_page(url)
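One detail worth knowing about the stop condition: `urlopen` signals a 404 (for example, an index past the last page) with `urllib.error.HTTPError`, while `ValueError` is raised only for malformed URLs, so a robust loop should catch both. The `ValueError` case can be seen without any network access:

```python
import urllib.request

# urlopen rejects a string without a scheme before any request is made.
try:
    urllib.request.urlopen("not-a-url")
except ValueError as exc:
    print(type(exc).__name__)  # -> ValueError
```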

Not very sure what's going on, but something like the following worked well for me recently:

Python 3.10.2 on Windows 10. The following code is from the context of a larger program.

from bs4 import BeautifulSoup as Soup
from urllib import request

START = 1
END = 82

BASE_URL = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/*"

def pull(url: str) -> Soup:
    my_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

    my_request = request.Request(url, headers=my_headers)
    html_text = request.urlopen(my_request).read()

    return Soup(html_text, 'html.parser')

def main():
    for i in range(START, END + 1):
        print(f"\nStarting page {i}...")
        url = BASE_URL.replace("*", str(i))

        soup = pull(url)

Could be that you need headers?
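The point about headers can be checked without sending anything: a browser sends a `User-Agent`, while urllib's default is `Python-urllib/3.x`, which some sites treat differently. A `Request` object carries the custom header (built here without making a network call):

```python
from urllib import request

# Attach a browser-like User-Agent to the request object; nothing is sent yet.
req = request.Request(
    "https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80",
    headers={"User-Agent": "Mozilla/5.0"},
)
# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))  # -> Mozilla/5.0
print(req.full_url)
```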

Just iterate over the letter hrefs, and for each one use the href of the <a> that holds the arrow for the next page to iterate over all sub-pages.

In my opinion, this is more generic than the approach that counts up the index.

Example

from bs4 import BeautifulSoup
import requests
import time

baseUrl = 'https://www.cnrtl.fr'
response = requests.get('https://www.cnrtl.fr/portailindex/LEXI/TLFI/A')
soup = BeautifulSoup(response.content, 'html.parser')

data = []

for url in soup.select('table.letterHeader a'):

    while True:
        response = requests.get(baseUrl+url['href'])
        soup = BeautifulSoup(response.content, 'html.parser')

        data.extend([x.text for x in soup.select('table.hometab a')])

        if (a := soup.select_one('a:has(img[title="Page suivante"])')):
            url = a
        else:
            break

        time.sleep(2)

Output

['à', 'à-plat', 'abaissement', 'abas', 'a', 'a-raciste', 'abaisser', 'abasie', 'a b c', 'à-venir', 'abaisseur', 'abasourdir', 'à contre-lumière', 'aalénien', 'abajoue', 'abasourdissant', "à l'envers", 'aaronide', 'abalober', 'abasourdissement', 'à la bonne franquette', 'ab hoc et ab hac', 'abalone', 'abat', 'à muche-pot', 'ab intestat', 'abalourdir', 'abat-chauvée', 'à musse-pot', 'ab irato', 'abalourdissement', 'abat-faim', 'à pic', 'ab ovo', 'abandon', 'abat-feuille', 'à posteriori', 'aba', 'abandonnataire', 'abat-flanc', 'à priori', 'abaca', 'abandonné', 'abat-foin', 'à tire-larigot', 'abaddir', 'abandonnée', 'abat-joue', 'à vau', 'abadie', 'abandonnement', 'abat-jour', 'à vau-de-route', 'abadis', 'abandonnément', 'abat-relui', 'à vau-le-feu', 'abaissable', 'abandonner', 'abat-reluit', 'à-bas', 'abaissant', 'abandonneur', 'abat-son', 'à-compte', 'abaisse', 'abandonneuse', 'abat-vent', 'a-humain', 'abaissé', 'abaque', 'abat-voix', 'a-mi-la', 'abaisse-langue', 'abarticulaire', 'abatage', 'à-pic', 'abaissée', 'abarticulation', 'abâtardi', 'abâtardir', 'abbatial', 'abdominal', 'abécé', 'abâtardissement', 'abbatiale', 'abdominale', 'abécédaire', 'abatée', 'abbatiat', 'abdominien', 'abécédé', 'abatis', 'abbattre', 'abdominienne', 'abéchement', 'abatre', 'abbaye', 'abdomino-coraco-huméral', 'abécher', 'abattable', 'abbé', 'abdomino-coraco-humérale', 'abecquage', 'abattage', 'abbesse', 'abdomino-génital', 'abecquement', 'abattant', 'abbevillien', 'abdomino-génitale', 'abecquer', 'abattée', 'abbevillienne', 'abdomino-guttural', 'abecqueuse', 'abattement', 'abbevillois', 'abdomino-gutturale', 'abée', 'abatteur', 'abbevilloise', 'abdomino-huméral', 'abeillage', 'abatteuse', 'abcéder', 'abdomino-humérale', 'abeille', 'abattis', 'abcès', 'abdomino-périnéal', 'abeillé', 'abattoir', 'abdalas', 'abdomino-scrotal', 'abeiller', 'abattre', 'abdéritain', 'abdomino-thoracique', 'abeillier', 'abattu', 'abdéritaine', 'abdomino-utérotomie', 'abeillon', 'abattue', 'abdicataire', 
'abdominoscopie', 'abélien', 'abatture', 'abdication', 'abdominoscopique', 'abéquage', 'abax', 'abdiquer', 'abducteur', 'abéquer', 'abbadie', 'abdomen', 'abduction', 'abéqueuse', 'aber', 'abiétine', 'abjurer', 'aboi', 'aberrance', 'abiétiné', 'ablatif', 'aboiement', 'aberrant', 'abiétinée', 'ablation', 'aboilage', 'aberration', 'abiétique', 'ablativo', 'abolir', 'aberrer', 'abigaïl', 'able', 'abolissable', 'aberrographe', 'abigéat', 'ablégat', 'abolissement', 'aberroscope', 'abigotir', 'ablégation', 'abolitif', 'abessif', 'abîme', 'abléphare', 'abolition', 'abêtifier', 'abîmé', 'ablépharie', 'abolitionnisme', 'abêtir', 'abîmement', 'ablépharoplastique', 'abolitionniste', 'abêtissant', 'abîmer', 'ableret', 'aboma', 'abêtissement', 'abiogenèse', 'ablet', 'abominable', 'abêtissoir', 'abiose', 'ablette', 'abominablement', 'abhorrable', 'abiotique', 'ablier', 'abomination', 'abhorré', 'abject', 'abluant', 'abominer', 'abhorrer', 'abjectement', 'abluante', 'abondamment', 'abicher', 'abjection', 'abluer', 'abondance', 'abies', 'abjurateur', 'ablution', 'abondant', 'abiétacée', 'abjuration', 'ablutionner', 'abonder', 'abiétin', 'abjuratoire', 'abnégation', 'abonnable',...]
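The loop above keys pagination off the arrow image's `title="Page suivante"` attribute. The same detection can be sketched with only the standard library's `html.parser`, fed the snippet from the question:

```python
from html.parser import HTMLParser

# The next-page arrow markup from the question, used here as test input:
SNIPPET = ('<a href="/portailindex/LEXI/TLFI/B/480">'
           '<img src="/images/portail/right.gif" title="Page suivante" /></a>')

class NextPageFinder(HTMLParser):
    """Records the href of the <a> whose <img> child is titled 'Page suivante'."""
    def __init__(self):
        super().__init__()
        self._last_href = None
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._last_href = attrs.get("href")
        elif tag == "img" and attrs.get("title") == "Page suivante":
            self.next_href = self._last_href

finder = NextPageFinder()
finder.feed(SNIPPET)
print(finder.next_href)  # -> /portailindex/LEXI/TLFI/B/480
```

When `next_href` comes back as `None`, there is no next-page arrow and the crawl of that letter is done, which is exactly the break condition the BeautifulSoup version expresses with `select_one`.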
