
How can I write two for loops when web scraping with Python?

I want to write code that scrapes multiple web pages.

The problem is that two numbers vary in the page URLs.

000/BBSDD0002/93976?page=1&
000/BBSDD0002/93975?page=1&
000/BBSDD0002/93970?page=1&
000/BBSDD0002/93964?page=1&
000/BBSDD0002/93950?page=1&
000/BBSDD0002/93946?page=1&
000/BBSDD0002/93945?page=1&
000/BBSDD0002/93930?page=2&
000/BBSDD0002/93925?page=2&
...
000/BBSDD0002/39045?page=536&

As we can see here, the page number and the document number change at the same time.

import requests
import re
from bs4 import BeautifulSoup
from itertools import product

page = range(1, 6)
document = range(39045, 93976)



for i, j in product(page, document):
    print("Page Number:", i)
    url = "https://000.com/BBSDD0002/{}?page={}&".format(i,j)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text,"lxml")
    
    list1=soup.find_all("td", attrs = {"class":"sbj"})
    for li in list1:
        print(li.get_text())

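This is what I have written so far, but it only loops over the page number, so it does not give me anything.

Is there a way to loop over both the page number and the document number?

I'm not sure what your goal is, but you could do something like this: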

page = range(1, 6)
entry_id = 39045

for p in page:
    for i in range(0,10):
        print(f'https://000.com/BBSDD0002/{entry_id}?page={p}')
        entry_id = entry_id+1

Which results in:

https://000.com/BBSDD0002/39045?page=1
https://000.com/BBSDD0002/39046?page=1
https://000.com/BBSDD0002/39047?page=1
https://000.com/BBSDD0002/39048?page=1
https://000.com/BBSDD0002/39049?page=1
https://000.com/BBSDD0002/39050?page=1
https://000.com/BBSDD0002/39051?page=1
https://000.com/BBSDD0002/39052?page=1
https://000.com/BBSDD0002/39053?page=1
https://000.com/BBSDD0002/39054?page=1
https://000.com/BBSDD0002/39055?page=2
https://000.com/BBSDD0002/39056?page=2
https://000.com/BBSDD0002/39057?page=2
https://000.com/BBSDD0002/39058?page=2
https://000.com/BBSDD0002/39059?page=2
https://000.com/BBSDD0002/39060?page=2
https://000.com/BBSDD0002/39061?page=2
https://000.com/BBSDD0002/39062?page=2
https://000.com/BBSDD0002/39063?page=2
https://000.com/BBSDD0002/39064?page=2
https://000.com/BBSDD0002/39065?page=3
https://000.com/BBSDD0002/39066?page=3
https://000.com/BBSDD0002/39067?page=3
...
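
For comparison, here is a minimal sketch of the same pairing wrapped in a function, next to what `itertools.product` from the question produces. The name `build_urls` and the `per_page=10` default are assumptions for illustration only; the real site may list a different number of documents per page, and its document numbers are not consecutive. Note also that the question's `.format(i, j)` call puts the page number into the document slot of the URL.

```python
from itertools import product


def build_urls(start_id, pages, per_page=10):
    """Pair each page number with the next `per_page` consecutive document IDs.

    `per_page` is an assumed value, not something confirmed for the real site.
    """
    base = "https://000.com/BBSDD0002/{doc}?page={page}"
    urls = []
    doc = start_id
    for page in pages:
        for _ in range(per_page):
            urls.append(base.format(doc=doc, page=page))
            doc += 1
    return urls


urls = build_urls(39045, range(1, 6))
print(urls[0])    # first URL of page 1
print(urls[10])   # first URL of page 2
print(len(urls))  # 5 pages * 10 documents

# product() instead pairs every page with every document number,
# so 5 pages and 50 IDs give 250 combinations, most of them invalid.
all_pairs = list(product(range(1, 6), range(39045, 39095)))
print(len(all_pairs))
```

With nested loops each document number is used exactly once, whereas `product` requests every document under every page number.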

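If you are trying to scrape comments, why not iterate over the pages and collect their URLs? That would also keep you from building invalid URLs for deleted comments, as happens in your example.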
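
A rough sketch of that idea follows. It assumes several things that are not confirmed anywhere in this thread: that a listing page exists at a URL like `https://000.com/BBSDD0002?page={page}`, and that each `td` with class `sbj` contains a link to its document. The names `SubjectLinkParser`, `collect_document_urls`, and `fetch` are hypothetical. To keep the example runnable offline it uses the standard library's `html.parser` rather than BeautifulSoup, and made-up HTML rather than the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubjectLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <td class="sbj"> cells."""

    def __init__(self):
        super().__init__()
        self.in_subject_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "sbj" in (attrs.get("class") or "").split():
            self.in_subject_cell = True
        elif tag == "a" and self.in_subject_cell and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_subject_cell = False


def collect_document_urls(fetch, listing_url, pages):
    """Return absolute document URLs found on each listing page.

    `fetch` is any callable that takes a URL and returns the page HTML,
    for example: lambda u: requests.get(u, headers=headers).text
    """
    found = []
    for page in pages:
        page_url = listing_url.format(page=page)
        parser = SubjectLinkParser()
        parser.feed(fetch(page_url))
        found.extend(urljoin(page_url, href) for href in parser.links)
    return found


# Made-up listing pages standing in for the real site (assumed markup).
FAKE_PAGES = {
    "https://000.com/BBSDD0002?page=1": (
        '<table><tr><td class="sbj"><a href="/BBSDD0002/93976?page=1">A</a></td></tr>'
        '<tr><td class="sbj"><a href="/BBSDD0002/93975?page=1">B</a></td></tr></table>'
    ),
    "https://000.com/BBSDD0002?page=2": (
        '<table><tr><td class="sbj"><a href="/BBSDD0002/93930?page=2">C</a></td></tr></table>'
    ),
}

doc_urls = collect_document_urls(
    FAKE_PAGES.__getitem__,
    "https://000.com/BBSDD0002?page={page}",
    range(1, 3),
)
for url in doc_urls:
    print(url)
```

Because the document URLs come from the listing pages themselves, only documents that still exist get requested, and the page number and document number never have to be guessed together.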

