
Scraping the next page: next page's URL staying on the same page

I start from this page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A and want to scrape all the next pages until it has reached the bottom.

For each letter A to Z, the next pages' URLs (as shown in the browser) are https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/<index>, where the index increments by 80 each time. For instance, the first next page is https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80. My first idea was to build the URL addresses based on this rule and fetch them with urllib. However, when I implement this in Python,

res = urllib.request.urlopen(url)
soup = BeautifulSoup(res, "lxml")

it seems that I always stay on the first page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/ .
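The URL scheme described above can be generated up front. A minimal sketch (the per-letter page count is unknown here, so a caller would still need a stop condition):

```python
# Sketch of the URL scheme: the first page for a letter has no index,
# and each following page appends an index that grows by 80.
BASE = "https://www.cnrtl.fr/portailindex/LEXI/TLFI"

def page_urls(letter, pages):
    """Yield the first `pages` URLs for one letter: /A, /A/80, /A/160, ..."""
    yield f"{BASE}/{letter}"
    for i in range(1, pages):
        yield f"{BASE}/{letter}/{i * 80}"

print(list(page_urls("A", 3)))
# -> ['https://www.cnrtl.fr/portailindex/LEXI/TLFI/A',
#     'https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80',
#     'https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/160']
```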

A second idea is to get the next page from the next-page button; an example of the next-page button is

<a href="/portailindex/LEXI/TLFI/B/480"><img src="/images/portail/right.gif" title="Page suivante" \
           border="0" width="32" height="32" alt="" /></a>

but all I get is again /portailindex/LEXI/TLFI/B/480, and when calling urllib.request on this, it does not advance to the next page.


So, why does https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80 work in the browser, while urllib.request brings me back to https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/ ?

Is there an elegant way to go from one page to the next here until it finishes nicely?

This seems to do it:

import re
import string
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

dictionary = []

# Pattern matched against each link's href to pick out the word entries;
# adjust it to the actual href format of the index pages.
regex = re.compile(r"/definition/")

def get_words_in_page(url):
    res = urllib.request.urlopen(url)
    soup = BeautifulSoup(res, "lxml")
    for w in soup.findAll("a", {"href": regex}):
        dictionary.append(w.string)

base_url = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/"

for l in string.ascii_uppercase:
    letter_url = base_url + l  # keep base_url unchanged between letters
    get_words_in_page(letter_url)
    next_index = 0
    while True:
        next_index += 80
        url = letter_url + "/" + str(next_index)
        try:
            # probe the page first; an HTTP error means we ran past the last index
            urllib.request.urlopen(url)
        except (ValueError, urllib.error.HTTPError):
            break
        get_words_in_page(url)
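One detail worth knowing about the stop condition: `urlopen` signals a 404 (for example, an index past the last page) with `urllib.error.HTTPError`, while `ValueError` is raised only for malformed URLs, so a robust loop should catch both. The `ValueError` case can be seen without any network access:

```python
import urllib.request

# urlopen rejects a string without a scheme before any request is made.
try:
    urllib.request.urlopen("not-a-url")
except ValueError as exc:
    print(type(exc).__name__)  # -> ValueError
```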

Not very sure what's going on, but something like the following worked well for me recently:

Python 3.10.2 on Windows 10. The following code is from the context of a larger program.

from bs4 import BeautifulSoup as Soup
from urllib import request

START = 1
END = 82

BASE_URL = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/*"

def pull(url: str) -> Soup:
    my_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

    my_request = request.Request(url, headers=my_headers)
    html_text = request.urlopen(my_request).read()

    return Soup(html_text, 'html.parser')

def main():
    for i in range(START, END + 1):
        print(f"\nStarting page {i}...")
        url = BASE_URL.replace("*", str(i))

        soup = pull(url)

Could be that you need headers?
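The point about headers can be checked without sending anything: a browser sends a `User-Agent`, while urllib's default is `Python-urllib/3.x`, which some sites treat differently. A `Request` object carries the custom header (built here without making a network call):

```python
from urllib import request

# Attach a browser-like User-Agent to the request object; nothing is sent yet.
req = request.Request(
    "https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80",
    headers={"User-Agent": "Mozilla/5.0"},
)
# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))  # -> Mozilla/5.0
print(req.full_url)
```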

Just iterate over the letter hrefs, and for each one use the href of the <a> that holds the arrow for the next page to iterate over all sub-pages.

In my opinion, this is more generic than the approach that counts up the index.

Example

from bs4 import BeautifulSoup
import requests
import time

baseUrl = 'https://www.cnrtl.fr'
response = requests.get('https://www.cnrtl.fr/portailindex/LEXI/TLFI/A')
soup = BeautifulSoup(response.content, 'html.parser')

data = []

for url in soup.select('table.letterHeader a'):

    while True:
        response = requests.get(baseUrl+url['href'])
        soup = BeautifulSoup(response.content, 'html.parser')

        data.extend([x.text for x in soup.select('table.hometab a')])

        if (a := soup.select_one('a:has(img[title="Page suivante"])')):
            url = a
        else:
            break

        time.sleep(2)

Output

['à', 'à-plat', 'abaissement', 'abas', 'a', 'a-raciste', 'abaisser', 'abasie', 'a b c', 'à-venir', 'abaisseur', 'abasourdir', 'à contre-lumière', 'aalénien', 'abajoue', 'abasourdissant', "à l'envers", 'aaronide', 'abalober', 'abasourdissement', 'à la bonne franquette', 'ab hoc et ab hac', 'abalone', 'abat', 'à muche-pot', 'ab intestat', 'abalourdir', 'abat-chauvée', 'à musse-pot', 'ab irato', 'abalourdissement', 'abat-faim', 'à pic', 'ab ovo', 'abandon', 'abat-feuille', 'à posteriori', 'aba', 'abandonnataire', 'abat-flanc', 'à priori', 'abaca', 'abandonné', 'abat-foin', 'à tire-larigot', 'abaddir', 'abandonnée', 'abat-joue', 'à vau', 'abadie', 'abandonnement', 'abat-jour', 'à vau-de-route', 'abadis', 'abandonnément', 'abat-relui', 'à vau-le-feu', 'abaissable', 'abandonner', 'abat-reluit', 'à-bas', 'abaissant', 'abandonneur', 'abat-son', 'à-compte', 'abaisse', 'abandonneuse', 'abat-vent', 'a-humain', 'abaissé', 'abaque', 'abat-voix', 'a-mi-la', 'abaisse-langue', 'abarticulaire', 'abatage', 'à-pic', 'abaissée', 'abarticulation', 'abâtardi', 'abâtardir', 'abbatial', 'abdominal', 'abécé', 'abâtardissement', 'abbatiale', 'abdominale', 'abécédaire', 'abatée', 'abbatiat', 'abdominien', 'abécédé', 'abatis', 'abbattre', 'abdominienne', 'abéchement', 'abatre', 'abbaye', 'abdomino-coraco-huméral', 'abécher', 'abattable', 'abbé', 'abdomino-coraco-humérale', 'abecquage', 'abattage', 'abbesse', 'abdomino-génital', 'abecquement', 'abattant', 'abbevillien', 'abdomino-génitale', 'abecquer', 'abattée', 'abbevillienne', 'abdomino-guttural', 'abecqueuse', 'abattement', 'abbevillois', 'abdomino-gutturale', 'abée', 'abatteur', 'abbevilloise', 'abdomino-huméral', 'abeillage', 'abatteuse', 'abcéder', 'abdomino-humérale', 'abeille', 'abattis', 'abcès', 'abdomino-périnéal', 'abeillé', 'abattoir', 'abdalas', 'abdomino-scrotal', 'abeiller', 'abattre', 'abdéritain', 'abdomino-thoracique', 'abeillier', 'abattu', 'abdéritaine', 'abdomino-utérotomie', 'abeillon', 'abattue', 'abdicataire', 
'abdominoscopie', 'abélien', 'abatture', 'abdication', 'abdominoscopique', 'abéquage', 'abax', 'abdiquer', 'abducteur', 'abéquer', 'abbadie', 'abdomen', 'abduction', 'abéqueuse', 'aber', 'abiétine', 'abjurer', 'aboi', 'aberrance', 'abiétiné', 'ablatif', 'aboiement', 'aberrant', 'abiétinée', 'ablation', 'aboilage', 'aberration', 'abiétique', 'ablativo', 'abolir', 'aberrer', 'abigaïl', 'able', 'abolissable', 'aberrographe', 'abigéat', 'ablégat', 'abolissement', 'aberroscope', 'abigotir', 'ablégation', 'abolitif', 'abessif', 'abîme', 'abléphare', 'abolition', 'abêtifier', 'abîmé', 'ablépharie', 'abolitionnisme', 'abêtir', 'abîmement', 'ablépharoplastique', 'abolitionniste', 'abêtissant', 'abîmer', 'ableret', 'aboma', 'abêtissement', 'abiogenèse', 'ablet', 'abominable', 'abêtissoir', 'abiose', 'ablette', 'abominablement', 'abhorrable', 'abiotique', 'ablier', 'abomination', 'abhorré', 'abject', 'abluant', 'abominer', 'abhorrer', 'abjectement', 'abluante', 'abondamment', 'abicher', 'abjection', 'abluer', 'abondance', 'abies', 'abjurateur', 'ablution', 'abondant', 'abiétacée', 'abjuration', 'ablutionner', 'abonder', 'abiétin', 'abjuratoire', 'abnégation', 'abonnable',...]
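The loop above keys pagination off the arrow image's `title="Page suivante"` attribute. The same detection can be sketched with only the standard library's `html.parser`, fed the snippet from the question:

```python
from html.parser import HTMLParser

# The next-page arrow markup from the question, used here as test input:
SNIPPET = ('<a href="/portailindex/LEXI/TLFI/B/480">'
           '<img src="/images/portail/right.gif" title="Page suivante" /></a>')

class NextPageFinder(HTMLParser):
    """Records the href of the <a> whose <img> child is titled 'Page suivante'."""
    def __init__(self):
        super().__init__()
        self._last_href = None
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._last_href = attrs.get("href")
        elif tag == "img" and attrs.get("title") == "Page suivante":
            self.next_href = self._last_href

finder = NextPageFinder()
finder.feed(SNIPPET)
print(finder.next_href)  # -> /portailindex/LEXI/TLFI/B/480
```

When `next_href` comes back as `None`, there is no next-page arrow and the crawl of that letter is done, which is exactly the break condition the BeautifulSoup version expresses with `select_one`.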
