
Python web scraping: go to the next page

The code just prints the same email addresses again and again and doesn't go to the next page. Does anybody see the error in my code?

import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links:list):
    for i in range(len(_links)):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class':'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

start=20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20').text, 'html.parser')

    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    print(list(get_emails(results)))

    next_page=d.find('div', {'class': 'paging'}, 'weiter')

    if next_page:
        d=next_page.get('href')
        start+=20
    else:
        break

When you press the "weiter" (next page) button, the end of the URL changes from "...start=20" to "...start=40". It goes up in steps of 20 because there are 20 results per page.

The problem is with the URL you are requesting. The same URL is requested every time because the start value you increment is never used to build the URL. Try building the URL like this:

'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={}'.format(start)
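To make the effect of that change concrete, here is a minimal sketch that only builds the page URLs and makes no requests. The names BASE_URL, page_url and page_urls are hypothetical helpers, not part of the question's code, and the default of three pages is arbitrary.

```python
# Hypothetical helpers: they only format URLs, nothing is downloaded.
BASE_URL = 'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={}'


def page_url(start):
    """Return the listing URL for the given result offset."""
    return BASE_URL.format(start)


def page_urls(first=20, step=20, pages=3):
    """Yield the URLs of `pages` consecutive listing pages."""
    for n in range(pages):
        yield page_url(first + n * step)


for url in page_urls():
    print(url)
```

Each printed URL ends in a different start value (20, 40, 60), which is what the loop in the question needs to request.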

Assuming next_page actually finds something, the problem is that you're trying to advance to the next page in two different ways at once, and neither is done properly:

1.) You're trying to point d to the next page, and yet at the beginning of the loop you reassign d to the starting page again.

2.) You're incrementing start by 20 for the next page, but you never reference start anywhere else in your code.

Thus, you have two ways to tackle this:

1.) Move the d assignment outside of the loop, and remove the start variable altogether:

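# start=20
# You don't need start because it's not being used at all

# Move the initial d assignment outside the loop
d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20').text, 'html.parser')
while True:
    # rest of your code

    if next_page:
        # The href is only a URL string, so fetch and parse it before d is used again.
        # Assumption: next_page is the "weiter" link itself and its href is an absolute URL.
        d = soup(requests.get(next_page.get('href')).text, 'html.parser')
        # start+=20 is no longer needed.
    else:
        break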

2.) No need to reassign d at the end of the loop; just reference start in your URL at the beginning of the loop and remove the d assignment in the if next_page block:

start=20
while True:
    # Build the URL from start so that every iteration requests a new page
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')

    # rest of your code

    if next_page:
        # d=next_page.get('href')
        # This d assignment is redundant because d is reassigned at the top of the loop. start is your key.
        start+=20
    else:
        break
