
Python - parsing html response

In my code, a user inputs a search term, and get_all_links parses the HTML response and extracts the links that start with 'http'. When req is replaced with a hard-coded URL such as:

content = urllib.request.urlopen("http://www.ox.ac.uk")

the program returns a list of properly formatted links. However, when req is passed in, no links are returned. I suspect this may be an encoding or formatting issue (see the decoding check after the code below).

Here is my code:

import urllib.request
def get_all_links(s): # function to get all the links
    d=0
    links=[] # getting all links into a list
    while d!=-1: # until d is -1, i.e. no more links left in the page
        d=s.find('<a href=',d) # find the next <a href=
        start=s.find('"',d) # start is the position of the opening quote
        end=s.find('"',start+1) # end is the position of the closing quote
        if d!=-1: # a link was found
            d+=1
            if(s[start+1]=='h'): # keep only links that start with http
                links.append(s[start+1:end]) # to link list
    return links # return list

def main():
    term = input('Enter a search term: ')
    url = 'http://www.google.com/search'
    value = {'q' : term}
    user_agent = 'Mozilla/5.0'
    headers = {'User-Agent' : user_agent}
    data = urllib.parse.urlencode(value)
    print(data)
    url = url + '?' + data
    print(url)
    req = urllib.request.Request(url, None, headers)

    content = urllib.request.urlopen(req)
    s = content.read()
    print(s)
    links = get_all_links(s.decode('utf-8'))

    for i in links: # print the returned list.
        print(i) 

main()
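
To rule out a decoding problem, here is a minimal sketch (my own addition, not part of the program above) that decodes the response with whatever charset the server declares instead of hard-coding UTF-8:

import urllib.request
import urllib.parse

# Build the same kind of request as above ('test' is just a placeholder query).
query = urllib.parse.urlencode({'q': 'test'})
req = urllib.request.Request('http://www.google.com/search?' + query,
                             None, {'User-Agent': 'Mozilla/5.0'})

with urllib.request.urlopen(req) as response:
    # Use the charset the server declares; fall back to UTF-8 if none is given.
    charset = response.headers.get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

print(html[:300])  # inspect the first few hundred characters of the decoded HTML

If the decoded HTML looks fine here, the problem is more likely in the string parsing than in the encoding.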

You should use an HTML parser, as suggested in the comments. A library like BeautifulSoup is perfect for this.

I have adapted your code to use BeautifulSoup:

import urllib.request
from bs4 import BeautifulSoup

def get_all_links(s):
    soup = BeautifulSoup(s, "html.parser")

    return soup.select("a[href^=\"http\"]") # Select all anchor tags whose href attribute starts with 'http'

def main():
    term = input('Enter a search term: ')
    url = 'http://www.google.com/search'
    value = {'q' : term}
    user_agent = 'Mozilla/5.0'
    headers = {'User-Agent' : user_agent}
    data = urllib.parse.urlencode(value)
    print(data)
    url = url + '?' + data
    print(url)
    req = urllib.request.Request(url, None, headers)

    content = urllib.request.urlopen(req)
    s = content.read()
    print(s)
    links = get_all_links(s.decode('utf-8'))

    for i in links: # print the returned list.
        print(i)

main()

It uses the select method of the BeautifulSoup library, which returns a list of the selected elements (in your case, anchor tags).
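
Note that select returns the whole anchor tags (bs4 Tag objects), not the URL strings. If you want just the links, as in your original get_all_links, you can read the href attribute of each tag. A small self-contained sketch (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

# Extract the href strings from the selected anchor tags.
html = '<a href="http://example.com">one</a> <a href="/relative">two</a>'
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select('a[href^="http"]')]
print(links)  # ['http://example.com']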

Using a library like BeautifulSoup not only makes this easier, it also lets you express much more complex selections. Imagine how you would have to change your hand-rolled code if you wanted to select all links whose href attribute contains the word "google" or "code".
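
For example, with CSS attribute selectors this stays a one-liner; *= means "attribute value contains" (the URLs below are made up for illustration):

from bs4 import BeautifulSoup

html = ('<a href="http://code.example.org">code</a> '
        '<a href="http://www.google.com/maps">maps</a> '
        '<a href="http://unrelated.example.org">other</a>')
soup = BeautifulSoup(html, "html.parser")

# a[href*="google"] matches anchors whose href contains "google";
# the comma groups it with the "code" case.
matches = soup.select('a[href*="google"], a[href*="code"]')
print([a["href"] for a in matches])
# ['http://code.example.org', 'http://www.google.com/maps']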

You can read the BeautifulSoup documentation here.
