
Text extraction from multiple websites

from bs4 import BeautifulSoup
import urllib2

list_open = open("weblist.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

for url in line_in_list:
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    print soup.get_text()

The code above extracts text from multiple websites listed in weblist.txt.

However, when the list contains a link that cannot be opened or parsed, the script raises an error and stops at that link without checking the rest. For example, if I have 10 links and the second one fails to open, the script exits there and never processes links 3 through 10. I want it to check every link in the list, from start to end, and extract text from all the links that can actually be opened and parsed.

Just wrap the body of the loop in a try/except statement, so a failing link is reported and skipped instead of stopping the whole run:

for url in line_in_list:
    try:
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        print soup.get_text()
    except Exception as e:
        # Report the bad link and continue with the next one
        print(e)
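For reference, the same skip-on-failure pattern in Python 3 (where urllib2 became urllib.request) can be sketched as below. The function name `extract_texts` and the injectable `fetch` parameter are illustrative choices, not part of the original post; the `fetch` hook just makes the error-isolation logic easy to test without network access, and the BeautifulSoup text-extraction step is folded into whatever `fetch` returns.

```python
import urllib.request


def extract_texts(urls, fetch=None):
    """Fetch each URL, skipping any that fail, and return {url: text}.

    `fetch` may be overridden (e.g. for testing); by default it reads
    the raw page body with urllib.request. Hypothetical helper for
    illustration, not from the original answer.
    """
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read().decode(
            "utf-8", errors="replace")
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            # A bad link is reported and skipped, never fatal.
            print("skipping %s: %s" % (url, exc))
    return results
```

Because the try/except sits inside the loop, one unreachable or unparsable link only costs that single iteration; every remaining URL is still visited.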
