Python: Scraping speech data from a website

Question

I want to get all of Mitt Romney's speeches from this link:

http://mittromneycentral.com/speeches/

I know how to use BeautifulSoup to get all the URLs from the link above:

import urllib2
import urlparse
from pprint import pprint
from BeautifulSoup import BeautifulSoup

def mywebcrawl(url):
    urls = []
    htmltext = urllib2.urlopen(url).read()
    soup = BeautifulSoup(htmltext)
    #print soup
    for tag in soup.findAll('a', href = True):
        #append url to top level link
        tag['href'] = urlparse.urljoin(url,tag['href'])
        urls.append(tag['href'])
    pprint(urls)

However, I cannot extract the speech from each URL (I want only the speech text, nothing irrelevant). I want to build a function that iterates through the list of URLs and extracts the speeches. I have tried soup.find_all('table') and soup.find_all('font'), but in most cases they fail to extract the entire speech.

Answer 1

Here's the strategy I used:

The speech is contained within <div class="entry-content">.
The speech text is in <p> tags that have no class attribute. The other <p> tags under the <div> do have a class attribute.
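The two observations above can also be sketched without any third-party library. The block below is only an illustration, not the code used in this answer: it swaps BeautifulSoup for Python 3's standard-library html.parser, and the names SpeechParser and extract_speech are hypothetical. It assumes the page layout described above (one <div class="entry-content"> whose class-less <p> tags hold the speech).

```python
from html.parser import HTMLParser


class SpeechParser(HTMLParser):
    """Collect text from class-less <p> tags inside <div class="entry-content">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_depth = 0        # > 0 while inside the entry-content <div>
        self.in_speech_p = False  # True while inside a class-less <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1   # nested <div> inside the target
            elif "entry-content" in (attrs.get("class") or "").split():
                self.div_depth = 1    # entered the target <div>
        elif tag == "p" and self.div_depth:
            # Only <p> tags without a class attribute belong to the speech.
            self.in_speech_p = "class" not in attrs
            if self.in_speech_p:
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_speech_p = False
        elif tag == "p":
            self.in_speech_p = False

    def handle_data(self, data):
        if self.in_speech_p:
            self.paragraphs[-1] += data


def extract_speech(html):
    """Return the speech paragraphs joined by newlines."""
    parser = SpeechParser()
    parser.feed(html)
    parser.close()
    return "\n".join(p.strip() for p in parser.paragraphs)
```

Paragraphs that carry a class attribute, and anything outside the entry-content <div>, are skipped, which is the same filter the strategy relies on.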

Here is the code for getting the list of speeches and for parsing the speech out of a speech's page:

from BeautifulSoup import BeautifulSoup as BS

def get_list_of_speeches(html):
    soup = BS(html)
    content_div = soup.findAll('div', {"class":"entry-content"})[0]
    speech_links = content_div.findAll('a')
    speeches = []
    for speech in speech_links:
        title = speech.text.encode('utf-8')
        link = speech['href']
        speeches.append( (title, link) )
    return speeches

# speeches.htm is http://mittromneycentral.com/speeches/
speech_html = open('speeches.htm').read()
speeches = get_list_of_speeches(speech_html)

def get_speech_text(html):
    soup = BS(html)
    content_div = soup.findAll('div', {"class":"entry-content"})[0]
    content = content_div.findAll('p', {"class":None})
    speech = ''
    for paragraph in content:
        speech += paragraph.text.encode('utf-8') + '\n'
    return speech


# file1.htm is http://mittromneycentral.com/speeches/2006-speeches/092206-values-voters-summit-2006
html = open('file1.htm').read()
print get_speech_text(html)
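The question also asks for a function that walks the list of URLs and extracts every speech. A minimal sketch of such a driver follows, written for Python 3 rather than the Python 2 used above. The names fetch and collect_speeches are hypothetical, the UTF-8 decoding is an assumption about the site, and the two parsing functions are passed in as arguments, so it presumes versions of get_list_of_speeches and get_speech_text that return text strings.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_speeches(index_url, list_speeches, extract_text, fetch_page=fetch):
    """Return {title: speech_text} for every link found on the index page.

    list_speeches(html) -> iterable of (title, link) pairs
    extract_text(html)  -> speech text
    """
    speeches = {}
    seen = set()
    for title, link in list_speeches(fetch_page(index_url)):
        url = urljoin(index_url, link)  # resolve relative links
        if url in seen:                 # skip links that repeat on the page
            continue
        seen.add(url)
        speeches[title] = extract_text(fetch_page(url))
    return speeches
```

Passing the page fetcher as a parameter keeps the loop testable against saved HTML files, in the same spirit as the speeches.htm and file1.htm examples above.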

Answer 1
0 2014-07-01 14:33:45
