How to Crawl Multiple pages/cities from a website (BeautifulSoup, Requests, Python3)
I'm wondering how to crawl multiple different pages/cities from one website using BeautifulSoup/Requests without having to repeat my code over and over.

Here is my code right now:
    import json
    import urllib.request
    from bs4 import BeautifulSoup

    Region = "Marrakech"
    Spider = 20

    def trade_spider(max_pages):
        page = -1
        partner_ID = 2
        location_ID = 25
        already_printed = set()
        while page <= max_pages:
            page += 1
            response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))
            jsondata = json.loads(response.read().decode("utf-8"))
            # 'activities' is an HTML string embedded in the JSON response
            format = (jsondata['activities'])
            g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(g_data, "html.parser")
            hallo = soup.find_all("article", {"class": "activity-card"})
            for item in hallo:
                headers = item.find_all("h3", {"class": "activity-card"})
                for header in headers:
                    header_final = header.text.strip()
                    if header_final not in already_printed:
                        already_printed.add(header_final)
                        deeplinks = item.find_all("a", {"class": "activity"})
                        for t in set(t.get("href") for t in deeplinks):
                            deeplink_final = t
                            if deeplink_final not in already_printed:
                                already_printed.add(deeplink_final)
                                end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                                print(end_final)

    trade_spider(int(Spider))
My goal is to crawl multiple cities/regions from one particular website. Right now I can do this manually by repeating my code for each city and concatenating the results for each of these dataframes together, but that seems very unpythonic. I was wondering if anyone had a faster way or any advice?

I tried to add a second city to my Region variable, but it does not work:

    Region = "Marrakech","London"

Can anyone help me with that? Any feedback is appreciated.
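First, the reason the two-city attempt fails: the comma makes `Region` a tuple, and `str()` of a tuple embeds the tuple's repr (parentheses, quotes and all) into the query string rather than issuing one request per city:

```python
Region = "Marrakech", "London"  # the comma creates a tuple, not two separate searches
url = "http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=0"
# the query now contains "('Marrakech', 'London')" instead of a single city name
```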
Make Region a list:

    Region = ["Marrakech","London"]

Put your while loop inside a for loop over the regions, and reset page to -1 at the start of each region:

    for reg in Region:
        page = -1

Then replace Region with reg when building the request URL. The full version:
    import json
    import urllib.request
    from bs4 import BeautifulSoup

    Region = ["Marrakech", "London"]
    Spider = 20

    def trade_spider(max_pages):
        partner_ID = 2
        location_ID = 25
        already_printed = set()  # shared across regions, so duplicates are skipped globally
        for reg in Region:
            page = -1
            while page <= max_pages:
                page += 1
                response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) + "&page=" + str(page))
                jsondata = json.loads(response.read().decode("utf-8"))
                # 'activities' is an HTML string embedded in the JSON response
                format = (jsondata['activities'])
                g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
                soup = BeautifulSoup(g_data, "html.parser")
                hallo = soup.find_all("article", {"class": "activity-card"})
                for item in hallo:
                    headers = item.find_all("h3", {"class": "activity-card"})
                    for header in headers:
                        header_final = header.text.strip()
                        if header_final not in already_printed:
                            already_printed.add(header_final)
                            deeplinks = item.find_all("a", {"class": "activity"})
                            for t in set(t.get("href") for t in deeplinks):
                                deeplink_final = t
                                if deeplink_final not in already_printed:
                                    already_printed.add(deeplink_final)
                                    end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                                    print(end_final)

    trade_spider(int(Spider))