简体   繁体   中英

How to extract information from a web page using python in json or xml format?

I need help in extracting information from a webpage. I give the URL and then I need to extract information like contact number, address, href, name of person etc. I am able to extract the page source completely for a provided URL with known tags. But I need a generic source code to extract this data from any URL. I used regex to extract emails for eg

import urllib
import re
#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
pattern=re.compile(regex)
print pattern
while i<len(urls):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1

This gives me empty list. Any help to extract all info as I said above will be highly appreciated. The idea is to give a URL and the extract all information like name, phone number, email, address etc in json or xml format. Thank you all in advance...!!

To start with you need to fix your regex. \\ needs to be escaped in python strings. Easy way to fix this is using a raw string r'' instead.

regex=r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b

Meanwhile I have managed to get it working, after some small modifications (beware that I am working with Python 3.4.2):

import urllib.request
import re
#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}'
pattern=re.compile(regex)
print(pattern)
while i<len(urls):
    htmlfile=urllib.request.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext.decode())
    print(titles)
    i+=1

The result is:

['townshipclerk@plainsboronj.com', 'acancro@plainsboronj.com',  ...]

Good luck

I think you're on the wrong track here: you have a HTML file, from where you try to extract information. You have started doing this by filtering on '@'-sign for finding e-mail addresses (hence your choice of working with regular expressions). However other things like names, phone numbers, ... are not recognisable using regular expressions, hence another approach might be useful. Under URL " https://docs.python.org/3/library/html.parser.html " there is some explanation on how to parse HTML files. In my opinion this will be a better approach for solving your needs.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM