
Scraping Multiple Pages On A Website

I'm trying to scrape a list of all the coaching institutes at this URL: https://www.sulekha.com/entrance-exam-coaching/delhi

The following is my Python code:

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.sulekha.com/entrance-exam-coaching/delhi'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "lxml")


insti = page_soup.findAll("div", {"class": "list-title"})

filename = "entrance_institutes.csv"

f = open(filename, "w")
headers = "Institute \n"
f.write(headers)

for ins in insti:
    ins_name = ins.div.a["title"]

f.write(ins_name + "\n")

f.close()

This code runs fine. Attached is the image of the csv it generates. How should I go about scraping all the listings, one page after the other?

Thanks

Output csv

I'm not 100% sure what you mean. If you're asking how to fix the bug in your code, then you need to change your loop to:

for ins in insti:
    ins_name = ins.div.a["title"]
    f.write(ins_name + "\n")

As your code stands, you loop through everything but only write the last name, because the write is outside the loop.

However, if you're asking how to take that list and then scrape each of those results, that's more involved; for starters you'd need to save each listing's URL rather than its title. I'm going to leave the rest to you, because that sounds like homework.
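As for walking the listing pages themselves, that is usually just a loop over page URLs. Here's a minimal sketch; it assumes (this is an assumption, not something from the site) that further pages are reachable by appending a `?page=N` query parameter, and that an empty result list means you've run past the last page. Check the site's actual "next page" links before relying on either.

```python
# Pagination sketch. Assumptions: pages live at base_url + "?page=N",
# and an empty page means we are done. Verify against the real site.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def extract_titles(page_html):
    """Pull institute names out of one page of listing HTML."""
    # The question used the "lxml" parser; "html.parser" is the
    # stdlib fallback and works the same way here.
    page_soup = soup(page_html, "html.parser")
    titles = []
    for ins in page_soup.findAll("div", {"class": "list-title"}):
        link = ins.find("a")
        if link is not None and link.has_attr("title"):
            titles.append(link["title"])
    return titles

def scrape_all(base_url, max_pages=50):
    """Fetch page after page until one comes back empty."""
    all_titles = []
    for page in range(1, max_pages + 1):
        url = base_url if page == 1 else f"{base_url}?page={page}"
        client = uReq(url)
        titles = extract_titles(client.read())
        client.close()
        if not titles:
            break  # assumed stop condition: an empty page
        all_titles.extend(titles)
    return all_titles
```

You would then write `scrape_all('https://www.sulekha.com/entrance-exam-coaching/delhi')` out to your csv in one loop, instead of writing inside the fetch loop.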
