
How can I extract outgoing links from a website in Python?

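import urllib.request

from bs4 import BeautifulSoup


def parsehttp(url):
    # Download the page and parse it with the lxml parser.
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')

    # Print the href of every anchor tag on the page.
    for link in soup.find_all('a'):
        href = link.attrs.get("href")
        print(href)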

I would like to extract all outgoing links from a website. However, my current code returns both relative links and outgoing links, and I only want the outgoing ones. The difference is that outgoing links have the https portion in them while relative ones do not. I would also like to get the 'title' that comes with each link.

You can use a regular expression:

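import re

# Match only anchors whose href starts with http:// or https://
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    href = link.attrs.get("href")
    if href is not None:
        print(href)

Alternatively, skip every link whose href does not start with https://:

for link in soup.find_all('a'):
    # Default to "" so anchors without an href do not raise an error.
    href = link.attrs.get("href", "")
    if not href.startswith("https://"):
        continue

    print(href)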
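
Note that both snippets above treat any absolute URL as outgoing, so an absolute link that points back to the same site is printed as well. If "outgoing" should mean "goes to a different host", one possible approach is to resolve each href against the page URL and compare host names. This is only a sketch: the function name is_outgoing is made up for illustration, and it assumes that www.example.com and example.com count as different hosts.

```python
from urllib.parse import urljoin, urlparse


def is_outgoing(href, page_url):
    """Return True if href leads to a different host than page_url.

    Hypothetical helper, not part of BeautifulSoup or urllib.
    """
    if not href:
        return False
    # Resolve relative and protocol-relative hrefs against the page URL.
    target = urlparse(urljoin(page_url, href))
    # Ignore mailto:, javascript:, tel: and similar non-web schemes.
    if target.scheme not in ("http", "https"):
        return False
    # Plain host comparison: subdomains are treated as different hosts.
    return target.hostname != urlparse(page_url).hostname
```

Inside the loop from the question this could be used as: if is_outgoing(href, url): print(href)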

You can check whether the first five characters of href are "https" to identify outgoing links:

if href[0:5] == "https":
    # outgoing link
    print(href)
else:
    # relative link
    pass
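
The question also asks for the 'title' of each link, which the checks above do not cover. A minimal sketch, assuming "title" means the title attribute of the a tag: the function name outgoing_links_with_titles is hypothetical, and it only relies on each anchor having a .get() method, which BeautifulSoup tags provide.

```python
def outgoing_links_with_titles(anchors):
    """Collect (href, title) pairs for anchors with an absolute http(s) href.

    `anchors` is any iterable of objects with a dict-style .get() method,
    for example the result of soup.find_all('a'). Hypothetical helper.
    """
    results = []
    for a in anchors:
        # Anchors without an href are treated as having an empty one.
        href = a.get("href") or ""
        if href.startswith(("http://", "https://")):
            # title is None when the anchor has no title attribute.
            results.append((href, a.get("title")))
    return results
```

With the soup from the question, this would be called as outgoing_links_with_titles(soup.find_all('a')). If the visible link text is wanted instead of the title attribute, link.get_text() on the BeautifulSoup tag returns it.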

