Scraping data from an unknown number of pages using Beautiful Soup
I want to parse some info from a website whose data is spread across several pages.
The problem is that I don't know how many pages there are. There might be 2, but there might also be 4, or even just one page.
How can I loop over the pages when I don't know how many there will be?
I do, however, know the URL pattern, which looks something like the code below.
Also, the page names are not plain numbers: it's 'pe2' for page 2, 'pe4' for page 3, and so on, so I can't just loop over range(number).
This is dummy code for the loop I am trying to fix.
import requests
from bs4 import BeautifulSoup

pages = ['', 'pe2', 'pe4', 'pe6', 'pe8']
for i in pages:
    url = "http://www.website.com/somecode/dummy?page={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # rest of the scraping code
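Since the list of suffixes is open-ended, one option is to generate them lazily instead of hard-coding a list. Below is a minimal sketch; `page_suffixes` is a hypothetical helper name, not something from the original post:

```python
from itertools import count, islice

def page_suffixes():
    """Yield page-name suffixes in order: '', 'pe2', 'pe4', 'pe6', ..."""
    yield ''  # the first page has no suffix
    for n in count(2, 2):  # 2, 4, 6, ... forever
        yield 'pe{}'.format(n)

# Build the first four page URLs from the generator
for s in islice(page_suffixes(), 4):
    print("http://www.website.com/somecode/dummy?page={}".format(s))
```

A loop over this generator can then `break` as soon as a request fails or a page comes back empty, with no fixed page count needed.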
You can use a while loop that stops running when it encounters an exception.
Code:
import requests
from bs4 import BeautifulSoup
from time import sleep

i = 0
while True:
    try:
        if i == 0:
            url = "http://www.website.com/somecode/dummy?page=pe"
        else:
            url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
        r = requests.get(url)
        # raise an exception on HTTP errors (e.g. 404), so the loop can stop;
        # requests does not raise on error status codes by itself
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')
        # print page url
        print(url)
        # rest of the scraping code
        # don't overload the website
        sleep(2)
        # increase page number
        i += 2
    except requests.RequestException:
        break
Output:
http://www.website.com/somecode/dummy?page=pe
http://www.website.com/somecode/dummy?page=pe2
http://www.website.com/somecode/dummy?page=pe4
http://www.website.com/somecode/dummy?page=pe6
http://www.website.com/somecode/dummy?page=pe8
...
... and so on, until it hits an exception.
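Note that some sites return a normal 200 page with no results past the last page, in which case no exception is ever raised. A content-based stop condition can cover that case; the sketch below uses an assumed `class="result"` marker, which you would replace with whatever markup the real site uses for its items:

```python
def is_empty_page(html, marker='class="result"'):
    """Heuristic: treat a page with none of the result markup as the last page.

    The marker string is an assumption about the site's HTML -- adjust it
    after inspecting an actual results page.
    """
    return marker not in html

# Example with dummy HTML, no network access needed:
print(is_empty_page('<div class="result">item</div>'))  # False: page has results
print(is_empty_page('<div>No more items</div>'))        # True: last page reached
```

In the while loop above, you would call `is_empty_page(r.text)` after each request and `break` when it returns True.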