How can I write two For Loops when I do webscraping with Python?
I want to write code that scrapes multiple web pages.
The problem is that two numbers vary in the URLs:
000/BBSDD0002/93976?page=1&
000/BBSDD0002/93975?page=1&
000/BBSDD0002/93970?page=1&
000/BBSDD0002/93964?page=1&
000/BBSDD0002/93950?page=1&
000/BBSDD0002/93946?page=1&
000/BBSDD0002/93945?page=1&
000/BBSDD0002/93930?page=2&
000/BBSDD0002/93925?page=2&
...
000/BBSDD0002/39045?page=536&
As you can see, the page number and the document number change at the same time.
import requests
import re
from bs4 import BeautifulSoup
from itertools import product

page = range(1, 6)
document = range(39045, 93976)

for i, j in product(page, document):
    print("Page Number:", i)
    url = "https://000.com/BBSDD0002/{}?page={}&".format(i, j)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "lxml")
    list1 = soup.find_all("td", attrs={"class": "sbj"})
    for li in list1:
        print(li.get_text())
This is what I have written so far, but it only loops over the page numbers, so it gives me nothing.
Is there a way to create a loop over both the page number and the document number?
Not sure what your goal is, but you could do it like this:
page = range(1, 6)
entry_id = 39045

for p in page:
    for i in range(0, 10):
        print(f'https://000.com/BBSDD0002/{entry_id}?page={p}')
        entry_id = entry_id + 1
Which results in:
https://000.com/BBSDD0002/39045?page=1
https://000.com/BBSDD0002/39046?page=1
https://000.com/BBSDD0002/39047?page=1
https://000.com/BBSDD0002/39048?page=1
https://000.com/BBSDD0002/39049?page=1
https://000.com/BBSDD0002/39050?page=1
https://000.com/BBSDD0002/39051?page=1
https://000.com/BBSDD0002/39052?page=1
https://000.com/BBSDD0002/39053?page=1
https://000.com/BBSDD0002/39054?page=1
https://000.com/BBSDD0002/39055?page=2
https://000.com/BBSDD0002/39056?page=2
https://000.com/BBSDD0002/39057?page=2
https://000.com/BBSDD0002/39058?page=2
https://000.com/BBSDD0002/39059?page=2
https://000.com/BBSDD0002/39060?page=2
https://000.com/BBSDD0002/39061?page=2
https://000.com/BBSDD0002/39062?page=2
https://000.com/BBSDD0002/39063?page=2
https://000.com/BBSDD0002/39064?page=2
https://000.com/BBSDD0002/39065?page=3
https://000.com/BBSDD0002/39066?page=3
https://000.com/BBSDD0002/39067?page=3
...
If you are trying to scrape comments, why not iterate over the pages and collect their URLs instead? That would also keep you from building invalid URLs for deleted comments, as happens in your example.
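A minimal sketch of that idea: parse each listing page and collect the document links it actually contains, rather than guessing document numbers. The `td.sbj` selector is taken from your own code; the sample HTML below is a made-up stand-in for one listing page, so adapt both to the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page HTML; a real page would be fetched with
# requests.get(f"https://000.com/BBSDD0002?page={p}") for each page p.
listing_html = """
<table>
  <tr><td class="sbj"><a href="/BBSDD0002/93976?page=1&">Post A</a></td></tr>
  <tr><td class="sbj"><a href="/BBSDD0002/93975?page=1&">Post B</a></td></tr>
</table>
"""

def collect_doc_urls(html):
    """Return the href of every link inside a td.sbj title cell."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for cell in soup.find_all("td", attrs={"class": "sbj"}):
        link = cell.find("a")
        if link and link.get("href"):
            urls.append(link["href"])
    return urls

print(collect_doc_urls(listing_html))
```

Only URLs that actually exist on the listing page end up in the result, so deleted or skipped document numbers (93951-93963 in your sample, for instance) are never requested.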