Downloading text from a URL in Python

Question

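I am currently working on a school project whose goal is to analyze scam emails with the Natural Language Toolkit (NLTK) package. Basically, what I want to do is compare scams from different years and try to find a trend: how their structure changes over time. I found a scam database: http://www.419scam.org/emails/. I would like to download the content of the links with Python, but I am stuck. My code so far: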

from BeautifulSoup import BeautifulSoup
import urllib2, re

html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')

links2 = soup.findAll(href=re.compile("index"))

print links2

So I can get the links, but I don't yet know how to download their content. Any ideas? Thanks a lot!

Answer 1

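You've got a good start, but right now you are only retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you essentially need to open each of those links and load their contents into data structures that you can then use for your analysis.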
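
One detail worth sketching here: the hrefs on an index page are often relative, so they have to be resolved against the page URL before they can be opened. Below is a minimal Python 3 sketch using only the standard library rather than BeautifulSoup. The names `LinkCollector` and `extract_links` are made up for illustration, and the sample path `2012-01/index.htm` is an assumption, not something taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin leaves absolute URLs alone and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in the given HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup; the real index page may look different.
sample = '<a href="2012-01/index.htm">Jan</a> <a href="http://example.com/x">x</a>'
print(extract_links(sample, "http://www.419scam.org/emails/"))
```

Each URL returned this way can be passed straight to `urllib.request.urlopen` to download the page it points to.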

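This essentially amounts to a very simple web crawler. If you are able to use other people's code, you may find something that fits by searching for "python web crawler". I've looked at a few of those; they are straightforward enough, but may be overkill for this task. Most web crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.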
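
To make "something much simpler" concrete, here is a rough Python 3 sketch of a crawl that stops after a fixed number of levels instead of walking the whole site. It is not taken from any particular crawler library. `crawl`, `get_links`, and `fake_site` are hypothetical names, and `get_links` stands in for whatever code downloads a page and returns the URLs on it.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=2):
    """Visit pages breadth-first, following links at most max_depth levels deep.

    get_links is any callable that takes a URL and returns the URLs found on
    that page (hypothetical here; in practice it would download and parse).
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not look for further links below this level
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# A fake site standing in for index -> list pages -> email pages.
fake_site = {
    "index": ["2011", "2012"],
    "2011": ["mail-a", "index"],
    "2012": ["mail-b"],
    "mail-a": ["deeper"],
}
print(crawl("index", lambda url: fake_site.get(url, [])))
```

With `max_depth=2` the crawl covers the index, the list pages, and the email pages, and it never follows links found on the email pages themselves. The `seen` set keeps it from revisiting a page that links back to the index.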

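Given that I'm not familiar with BeautifulSoup, this basic structure will hopefully get you on the right path, or at least give you a sense of how the web crawling is done: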

from BeautifulSoup import BeautifulSoup
import urllib2, re
from urlparse import urljoin

emailContents = []

def analyze_emails():
    # This function and any sub-routines would analyze the emails after
    # they are loaded into a data structure, e.g. emailContents.
    pass

def parse_email_page(link):
    print "opening " + link
    # Open, soup, and parse the page.
    # Looks like the email itself is in a "blockquote" tag, so that may be
    # the starting place. From there you'll need to create arrays and/or
    # dictionaries of the emails' contents to do your analysis on,
    # e.g. emailContents.
    pass

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # Add your own code here to filter the list page soup and get all the
    # relevant links to actual email pages. Left empty as a placeholder.
    email_page_links = []
    for email_link in email_page_links:
        # hrefs may be relative, so resolve them against the list page URL
        parse_email_page(urljoin(link, email_link['href']))


def main():
    index_url = 'http://www.419scam.org/emails/'
    html = urllib2.urlopen(index_url).read()
    soup = BeautifulSoup(html)
    # I use '20' to filter links since all the relevant links seem to have
    # a 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))

    for link in links:
        # hrefs may be relative, so resolve them against the index URL
        parse_list_page(urljoin(index_url, link['href']))

    analyze_emails()

if __name__ == "__main__":
    main()
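
For the body of `parse_email_page`, one possible starting point is to pull out the text inside each "blockquote" tag. The following is only a sketch in Python 3 with the standard library's `html.parser`, under the assumption stated above that the email text sits in a blockquote. `BlockquoteText`, `extract_emails`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class BlockquoteText(HTMLParser):
    """Gather the text found inside each top-level <blockquote>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <blockquote> tags are currently open
        self.chunks = []    # text pieces of the blockquote being read
        self.emails = []    # one string per top-level blockquote

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace so the text is easier to tokenize.
                self.emails.append(" ".join("".join(self.chunks).split()))
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_emails(html):
    """Return a list with the text of every top-level blockquote."""
    parser = BlockquoteText()
    parser.feed(html)
    parser.close()
    return parser.emails


# Hypothetical page layout; the real email pages may be structured differently.
page = ("<html><body><h1>Scam</h1><blockquote>Dear friend,<br>\n"
        "I am a   prince.</blockquote></body></html>")
print(extract_emails(page))
```

The strings returned this way could be appended to `emailContents` and later handed to NLTK for tokenizing and analysis.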

Answer 1: 6 (accepted), 2012-06-07 17:26:14