
Extract links from a site and follow one of them in Python 2.7

I want to write a piece of code that works as follows: you give it a URL, it counts the links on that page, follows one of them, counts the links on the new page, follows one of those, and so on.

I have a piece of code that opens a web page, searches for links and creates a list from them:

import urllib
from bs4 import BeautifulSoup
list_links = []
page = raw_input('enter an url')
url = urllib.urlopen(page).read()
html = BeautifulSoup(url, 'html.parser')
for link in html.find_all('a'):
    link = link.get('href')
    list_links.append(link)

Next, I want the user to decide which link to follow, so I have this:

link_number = len(list_links)
print 'enter a number between 0 and', (link_number)
number = raw_input('')

for number in number:
    if int(number) < 0 or int(number) > link_number:
        print "The End."
        break
    else:
        continue

url_2 = urllib.urlopen(list_links[int(number)]).read()

This is where my code crashes.

Ideally, I would like an endless process (unless the user stops it by entering an invalid number) like this: open the page -> count the links -> choose one -> follow that link and open the new page -> count the links...

Can anybody help me?

You can try using this (sorry if it's not exactly pretty, I wrote it in a bit of a hurry):

import requests, random
from bs4 import BeautifulSoup as BS
from time import sleep


def main(url):
    content = scraping_call(url)
    if not content:
        print "Couldn't get html..."
        return
    else:
        links_list = []
        soup  = BS(content, 'html5lib')
        for link in soup.findAll('a'):
            try:
                links_list.append(link['href'])
            except KeyError:
                continue

        # raw_input + int() avoids Python 2's input(), which evaluates whatever the user types
        chosen_link_index = int(raw_input("Enter a number between 0 and %d: " % (len(links_list) - 1)))
        if not 0 <= chosen_link_index < len(links_list):
            raise ValueError('Number must be between 0 and %d' % (len(links_list) - 1))
            # The script will crash here.
            # If you want the user to try again, you can
            # set up a number of attempts, like in scraping_call()
        else:
            # Lets the user stop the otherwise endless loop.
            next_step = raw_input('Continue or exit? (Y/N) ') or 'Y'
            # The default is 'Y', so to continue
            # just press Enter.
            if next_step.lower() == 'y':
                main(links_list[chosen_link_index])
            else:
                return



def scraping_call(url):
    attempt = 1
    while attempt < 6:
        try:
            page = requests.get(url)
            if page.status_code == 200:
                result = page.content
            else:
                result = ''
        except Exception as e:
            result = ''
            print 'Failed attempt (',attempt,'):', e
            attempt += 1
            sleep(random.randint(2,4))
            continue
        return result


if __name__ == '__main__':
    main('enter the starting URL here')
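
The comments in the code above suggest letting the user try again instead of crashing. A minimal sketch of that idea follows, written for Python 3 (in Python 2.7, raw_input would take the place of input). The names parse_index and ask_for_index are hypothetical helpers, not part of the code above. The sketch also avoids two problems in the question's loop: "for number in number" walks over the characters of the typed string, and the ">" comparison lets through an index equal to the list length.

```python
def parse_index(text, count):
    """Return text as a valid index into a list of `count` items, or None."""
    try:
        index = int(text.strip())
    except ValueError:
        return None  # not a whole number at all
    if 0 <= index < count:
        return index
    return None  # a number, but out of range


def ask_for_index(count, read_line=input, max_attempts=3):
    """Prompt up to max_attempts times; return a valid index or None."""
    if count == 0:
        return None  # nothing to choose from
    for _ in range(max_attempts):
        text = read_line("Enter a number between 0 and %d: " % (count - 1))
        index = parse_index(text, count)
        if index is not None:
            return index
        print("Invalid choice, try again.")
    return None  # caller can treat this as "stop following links"
```

The read_line parameter is only there so the prompt can be replaced in tests; in the script itself, ask_for_index(len(links_list)) would be enough, and a None result could end the loop.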

Some links on a page may be relative addresses, so they have to be resolved against the URL of the current page before they can be followed. This should do the trick. It works on Python 3.4.

from urllib.request import urlopen
from urllib.parse import urljoin, urlsplit
from bs4 import BeautifulSoup

addr = input('enter an initial url: ')

while True:
    html = BeautifulSoup(urlopen(addr).read(), 'html.parser')
    list_links = []
    num = 0
    for link in html.find_all('a', href=True):  # skip anchors that have no href
        url = link['href']
        if not urlsplit(url).netloc:
            url = urljoin(addr, url)
        if urlsplit(url).scheme in ['http', 'https']:
            print("%d : %s " % (num, str(url)))
            list_links.append(url)
            num += 1

    idx = int(input("enter an index between 0 and %d: " % (len(list_links) - 1)))
    if not 0 <= idx < len(list_links):
        raise ValueError('Number must be between 0 and %d' % (len(list_links) - 1))
    addr = list_links[idx]
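
To see what the relative-address handling does without fetching any page, here is a small offline sketch. resolve_links is a hypothetical helper and the example.com URLs are placeholders. It assumes that urljoin passes absolute URLs through as they are, so it calls urljoin on every href rather than checking netloc first.

```python
from urllib.parse import urljoin, urlsplit


def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, skipping the rest."""
    resolved = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        url = urljoin(base_url, href)  # relative hrefs are resolved against base_url
        if urlsplit(url).scheme in ('http', 'https'):
            resolved.append(url)
    return resolved


links = resolve_links(
    'http://example.com/docs/index.html',
    ['about.html', '/contact', 'https://other.example/page',
     'mailto:someone@example.com', None],
)
for number, url in enumerate(links):
    print("%d : %s" % (number, url))
```

The mailto: entry and the missing href are dropped, which is the same filtering the loop above applies before printing the numbered list.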


 