
Download text from a URL in Python

I'm currently working on a school project whose goal is to analyze scam mails with the Natural Language Toolkit package. Basically, I want to compare scams from different years and try to find a trend: how has their structure changed over time? I found a scam database: http://www.419scam.org/emails/ I would like to download the content of the links with Python, but I am stuck. My code so far:

from BeautifulSoup import BeautifulSoup
import urllib2, re

html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')

links2 = soup.findAll(href=re.compile("index"))

print links2

So I can fetch the links, but I don't know yet how to download their content. Any ideas? Thanks a lot!

You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you need to open each of those links and load their contents into data structures that you can then use for your analysis.
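For instance, downloading the raw text of a single linked page might look like this (a minimal sketch; I'm assuming the hrefs on the index are relative, so they need to be joined with the base URL via urlparse.urljoin first):

from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

base = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(base).read()
soup = BeautifulSoup(html)

# grab the first link that actually has an href, just as a test
first_link = soup.findAll('a', href=True)[0]
full_url = urljoin(base, first_link['href'])  # resolves a relative href
page_text = urllib2.urlopen(full_url).read()  # the linked page's raw HTML
print page_text[:200]                         # peek at the first 200 bytes

Repeat that urlopen(...).read() call for every link you care about and you have the content to feed into your parser.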

This essentially amounts to a very simple web-crawler. If you can use other people's code, you may find something that fits by googling "python Web crawler." I've looked at a few of those, and they are straightforward enough, but may be overkill for this task. Most web-crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.

Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or give you a sense of how the web crawling is done:

from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2, re

BASE_URL = 'http://www.419scam.org/emails/'
emailContents = []

def analyze_emails():
    # this function and any sub-routines would analyze the emails after
    # they are loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # open, soup, and parse the page. The email itself looks like it's in
    # a "blockquote" tag, so that may be the starting place. From there
    # you'll need to create arrays and/or dictionaries of the emails'
    # contents to do your analysis on, e.g. emailContents

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # add your own code here to filter the list-page soup down to just
    # the links that point at actual email pages
    email_page_links = soup.findAll('a', href=True)
    for email_link in email_page_links:
        # urljoin resolves hrefs that are relative to the list page
        parse_email_page(urljoin(link, email_link['href']))

def main():
    html = urllib2.urlopen(BASE_URL).read()
    soup = BeautifulSoup(html)
    # '20' filters the links, since all the relevant ones seem to have a
    # 20XX year in them; it seemed to work
    links = soup.findAll(href=re.compile("20"))

    for link in links:
        parse_list_page(urljoin(BASE_URL, link['href']))

    analyze_emails()

if __name__ == "__main__":
    main()
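And here's one possible way to flesh out parse_email_page, as a sketch that drops straight into the skeleton above. It assumes the email body really does sit inside a blockquote tag, as the comments suggest, and that emailContents is the module-level list from the skeleton; findAll(text=True) is just one way to collect all the text nodes inside a tag:

def parse_email_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    blockquote = soup.find('blockquote')  # the email text appears to live here
    if blockquote is not None:
        # join every text node inside the blockquote into one string
        emailContents.append(''.join(blockquote.findAll(text=True)))

From there, analyze_emails() can loop over emailContents and hand each string to NLTK.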
