
Download text from a URL in Python

I'm currently working on a school project whose goal is to analyze scam emails with the Natural Language Toolkit package. Basically, I want to compare scams from different years and try to find a trend: how has their structure changed over time? I found a scam database, http://www.419scam.org/emails/ , and I would like to download the contents of the links with Python, but I'm stuck. My code so far:

from BeautifulSoup import BeautifulSoup
import urllib2, re

# Fetch and parse the index page
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')

# Keep only the links whose href contains "index"
links2 = soup.findAll(href=re.compile("index"))

print links2

So I can fetch the links, but I don't know yet how I can download their content. Any ideas? Thanks a lot!

You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you essentially need to open all of those links and load their contents into data structures that you can then use for your analysis.
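For example, here's a minimal sketch of downloading the page behind one of those links (assuming the hrefs on the index page are relative, so they need joining against the base URL; urlparse.urljoin handles that):

from BeautifulSoup import BeautifulSoup
import urllib2, urlparse

BASE_URL = 'http://www.419scam.org/emails/'

def fetch(href):
    # Join a (possibly relative) href against the base URL and download it
    url = urlparse.urljoin(BASE_URL, href)
    return urllib2.urlopen(url).read()

# 'index.htm' is just a placeholder href; use the ones you scraped
soup = BeautifulSoup(fetch('index.htm'))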

This essentially amounts to a very simple web crawler. If you can use other people's code, you may find something that fits by googling "python web crawler." I've looked at a few of those; they are straightforward enough, but may be overkill for this task. Most web crawlers use recursion to traverse the full tree of a given site, and it looks like something much simpler could suffice for your case.

Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or at least give you a sense of how the web crawling is done:

from BeautifulSoup import BeautifulSoup
import urllib2, urlparse, re

BASE_URL = 'http://www.419scam.org/emails/'
emailContents = []

def analyze_emails():
    # This function and any sub-routines would analyze the emails after
    # they are loaded into a data structure, e.g. emailContents.
    pass

def parse_email_page(link):
    print "opening " + link
    # Open, soup, and parse the page. It looks like the email itself is in
    # a "blockquote" tag, so that may be the starting place. From there
    # you'll need to create arrays and/or dictionaries of the emails'
    # contents to do your analysis on, e.g. emailContents.

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(urlparse.urljoin(BASE_URL, link)).read()
    soup = BeautifulSoup(html)
    # Filter the list-page soup down to the links that point at actual
    # email pages; findAll('a', href=True) grabs every link as a start.
    email_page_links = soup.findAll('a', href=True)
    for link in email_page_links:
        parse_email_page(link['href'])

def main():
    html = urllib2.urlopen(BASE_URL).read()
    soup = BeautifulSoup(html)
    # '20' filters the links, since all the relevant links seem to have a
    # 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))

    for link in links:
        parse_list_page(link['href'])

    analyze_emails()

if __name__ == "__main__":
    main()
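To flesh out parse_email_page, here's a hedged sketch that follows the comment above; it assumes each email body sits in a blockquote tag, which may not hold for every page on the site:

from BeautifulSoup import BeautifulSoup
import urllib2, urlparse

BASE_URL = 'http://www.419scam.org/emails/'
emailContents = []

def parse_email_page(link):
    print "opening " + link
    html = urllib2.urlopen(urlparse.urljoin(BASE_URL, link)).read()
    soup = BeautifulSoup(html)
    # Assumption: each email body is wrapped in a <blockquote> tag.
    for quote in soup.findAll('blockquote'):
        # findAll(text=True) collects all the text nodes under the tag
        emailContents.append(''.join(quote.findAll(text=True)))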
