How can I extract outgoing links from a website in Python?

Question

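import urllib.request

from bs4 import BeautifulSoup


def parsehttp(url):
    # Download the page and parse it
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')

    # Print the href of every anchor tag (relative and absolute alike)
    for link in soup.find_all('a'):
        href = link.attrs.get("href")
        print(href)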

I would like to extract all outgoing links from a website. The code I have right now returns both relative links and outgoing links, and I only want the outgoing ones. The difference is that outgoing links start with https, while relative links do not. I would also like to get the 'title' that comes with each link.

Answer 1

You can use a regular expression so that only links whose href starts with http:// or https:// are matched:

import re

for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    print(link.attrs.get("href"))
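
The question also asks for the title of each link, which the loop above does not collect. Here is a minimal sketch of one way to do that with the same regular-expression filter. The function name outgoing_links and the sample HTML are made up for illustration, and it assumes that "title" means the title attribute of the anchor, falling back to the link text when that attribute is missing. It uses the built-in "html.parser" backend so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Matches absolute links that start with http:// or https://
ABSOLUTE_URL = re.compile(r"^https?://")


def outgoing_links(html):
    """Return (href, title) pairs for every absolute link in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", href=ABSOLUTE_URL):
        # Assumption: "title" is the title attribute; fall back to the link text
        title = link.get("title") or link.get_text(strip=True)
        results.append((link["href"], title))
    return results


# Hypothetical sample page with one relative and two absolute links
sample = """
<a href="/about">About</a>
<a href="https://example.com/docs" title="Docs">Read the docs</a>
<a href="http://example.org/">Example</a>
"""

for href, title in outgoing_links(sample):
    print(href, title)
```

If the title you are after lives somewhere else in the markup, only the line that builds title needs to change.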

Answer 2

for link in soup.find_all('a'):
    href = link.attrs.get("href", "")
    # Skip relative links and anchors without an href
    if not href.startswith("https://"):
        continue

    print(href)

Answer 3

You can check whether the first 5 characters of href are "https" to identify an outgoing link:

if href is not None and href[0:5] == "https":
    print(href)  # outgoing link
else:
    pass  # relative link (or an anchor without an href)
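
A prefix check treats every absolute URL as outgoing, including absolute links back to the same site. If "outgoing" is meant as "points to a different host", a sketch along these lines may be closer to what is wanted. It uses urllib.parse from the standard library instead of a string prefix, and the helper name is_outgoing and the page_url argument are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def is_outgoing(href, page_url):
    """Return True if href points to a different host than page_url."""
    if not href:
        return False
    # Resolve relative and protocol-relative links against the page URL
    parsed = urlparse(urljoin(page_url, href))
    # Ignore non-web schemes such as mailto: or javascript:
    if parsed.scheme not in ("http", "https"):
        return False
    return parsed.netloc.lower() != urlparse(page_url).netloc.lower()


page = "https://example.com/articles/index.html"
for candidate in ["/about", "https://example.com/contact", "https://other.org/x"]:
    print(candidate, is_outgoing(candidate, page))
```

Note that this compares host names exactly, so www.example.com and example.com count as different hosts. Whether that is what you want depends on the site.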

3 answers

solution1
2 ACCEPTED 2021-02-10 10:11:36

solution2
1 2021-02-10 10:14:44

solution3
0 2021-02-10 10:07:33
