
Python web scraping: go to the next page

The code just prints the same email addresses again and again and doesn't go to the next page. Does anybody see the error in my code?

import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links: list):
    # visit each school page and yield the 'title' attribute of the last
    # <a class="my_modal_open"> element, which holds the email address
    for i in range(len(_links)):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class': 'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20').text, 'html.parser')

    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    print(list(get_emails(results)))

    next_page = d.find('div', {'class': 'paging'}, 'weiter')

    if next_page:
        d = next_page.get('href')
        start += 20
    else:
        break

When you press the "weiter" (next page) button, the URL ending changes from "...start=20" to "...start=40". It goes in steps of 20 because there are 20 results per page.

The problem is with the URL you are requesting. The same URL is requested every time because you never update it with the start value you are calculating. Try changing the URL like this:

'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={}'.format(start)
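
For example, a minimal, self-contained sketch of requesting a page with the formatted URL (start = 20 is the initial value taken from the question's code):

import requests
from bs4 import BeautifulSoup as soup

start = 20
# build the URL from the current value of start instead of hard-coding "start=20"
url = 'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={}'.format(start)
d = soup(requests.get(url).text, 'html.parser')
print(url)  # as start grows by 20, a different page is requested each time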

Assuming next_page returns anything, the problem is that you're trying to do the same thing in two ways at once, but neither is done properly:

1.) You're trying to point d to the next page, yet at the beginning of the loop you reassign d to the starting page again.

2.) You're incrementing start by 20 for the next page, but you never reference start anywhere else in your code.

Thus, you have two ways to tackle this:

1.) Move the d assignment outside of the loop, and remove the start variable altogether:

# start = 20
# You don't need start because it's not being used at all

# move the initial d assignment outside the loop
d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20').text, 'html.parser')
while True:
    # rest of your code

    if next_page:
        d = next_page.get('href')
        # start += 20
        # Again, you don't need start any more.
    else:
        break
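
Note that next_page.get('href') returns a URL string, not parsed HTML, so with this approach the top of the loop has to fetch and parse whatever URL is currently being followed. A rough, self-contained sketch of how the link-following version could look (locating the "weiter" anchor inside the paging block is an assumption about the site's markup):

import requests
from bs4 import BeautifulSoup as soup

url = 'http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20'

while True:
    # fetch and parse whichever page url currently points to
    d = soup(requests.get(url).text, 'html.parser')

    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    print(results)  # or: print(list(get_emails(results)))

    # assumed markup: the paging block contains an <a> labelled "weiter"
    paging = d.find('div', {'class': 'paging'})
    next_link = paging.find('a', string='weiter') if paging else None

    if next_link and next_link.get('href'):
        url = next_link['href']  # may need urllib.parse.urljoin if the href is relative
    else:
        break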

2.) There's no need to reassign d; just reference start in your URL at the beginning of the loop and remove the d assignment in the if next_page: block:

start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')

    # rest of your code

    if next_page:
        # d = next_page.get('href')
        # this d assignment is redundant as it will get reassigned in the loop. start is your key.
        start += 20
    else:
        break
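
Putting option 2 together, a sketch with an explicit stopping condition (the check for a "weiter" anchor inside the paging block is an assumption about the site's markup; a bare if next_page: on the div itself may stay truthy even on the last page, so the loop would never end). get_emails is the generator from the question:

import requests
from bs4 import BeautifulSoup as soup

start = 20
base_url = 'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'

while True:
    d = soup(requests.get(base_url.format(page_id=start)).text, 'html.parser')

    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    print(list(get_emails(results)))

    # assumed markup: a "weiter" link only exists while there is a next page
    paging = d.find('div', {'class': 'paging'})
    if paging and paging.find('a', string='weiter'):
        start += 20
    else:
        break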
