
Scraping a website and collecting all the hyperlinks using Python

I am writing a program that can extract information from any website, but it is not working.

For example, the website is naukri.com and we have to collect all the hyperlinks on a page:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Create an SSL context that skips certificate verification
isc = ssl.create_default_context()
isc.check_hostname = False
isc.verify_mode = ssl.CERT_NONE

# Fetch the page (the URL must be a single, unbroken string; avoid shadowing the built-in open)
url = 'https://www.naukri.com/job-listings-Python-Developer-Cloud-Analogy-Softech-Pvt-Ltd-Noida-Sector-63-Noida-1-to-2-years-250718003152src=jobsearchDesk&sid=15325422374871&xp=1&px=1&qp=python%20developer&srcPage=s'
html = urllib.request.urlopen(url, context=isc).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the href attribute of every <a> tag
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

I would use requests and bs4. I was able to get this to work, and I think it produces the desired outcome. Try this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.naukri.com/job-listings-Python-Developer-Cloud-Analogy-Softech-Pvt-Ltd-Noida-Sector-63-Noida-1-to-2-years-250718003152src=jobsearchDesk&sid=15325422374871&xp=1&px=1&qp=python%20developer&srcPage=s'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')

# Only match <a> tags that actually carry an href attribute
links = soup.find_all('a', href=True)

for each in links:
    print(each.get('href'))
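
As a side note, many of the collected href values may be relative paths rather than full URLs. A minimal sketch of how they could be resolved against the page address with urllib.parse.urljoin, assuming the same url and soup objects from the snippet above (this step is not part of the original answer):

from urllib.parse import urljoin

for each in soup.find_all('a', href=True):
    # urljoin leaves absolute URLs unchanged and resolves relative ones against the base URL
    print(urljoin(url, each['href']))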
