Python: Scraping speech data from a website
Basically, I want to get all of Mitt Romney's speeches from this link:
http://mittromneycentral.com/speeches/
I know how to use BeautifulSoup to get all the URLs from the link above.
    import urllib2
    import urlparse
    from pprint import pprint

    from bs4 import BeautifulSoup

    def mywebcrawl(url):
        urls = []
        htmltext = urllib2.urlopen(url).read()
        soup = BeautifulSoup(htmltext)
        #print soup
        for tag in soup.findAll('a', href=True):
            # resolve each href against the top-level URL
            tag['href'] = urlparse.urljoin(url, tag['href'])
            urls.append(tag['href'])
        pprint(urls)
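For reference, the urljoin call is the step that turns relative href values into absolute URLs. Here is a minimal sketch of that behaviour, written for Python 3 (where the function lives in urllib.parse rather than urlparse). The absolutize helper, the de-duplication step, and the sample href values are assumptions made for illustration and are not taken from the site:

```python
from urllib.parse import urljoin

BASE = "http://mittromneycentral.com/speeches/"


def absolutize(base, hrefs):
    """Resolve each href against base, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical href values, chosen only to show the three common cases.
sample_hrefs = [
    "2006-speeches/example-speech",   # relative to the current page
    "/about/",                        # relative to the site root
    "http://example.com/elsewhere",   # already absolute, left unchanged
    "/about/",                        # duplicate, dropped
]

for url in absolutize(BASE, sample_hrefs):
    print(url)
```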
However, for each URL, I cannot extract the speech (note that I want only the speech text, with no irrelevant content). I want to build a function that will iterate through the list of URLs and extract the speeches. I have tried soup.find_all('table') and soup.find_all('font'), but I cannot get the desired results: most of the time they fail to extract the entire speech.
Here's the strategy I used: the content is inside <div class="entry-content">. Within that <div>, the speech text is in the <p> tags that do not have a class attribute. The other <p> tags under the <div> do have a class attribute. Here is the code for getting the list of speeches and parsing out the speech from a speech's page:
    from BeautifulSoup import BeautifulSoup as BS

    def get_list_of_speeches(html):
        soup = BS(html)
        # links are taken from the first entry-content div on the page
        content_div = soup.findAll('div', {"class": "entry-content"})[0]
        speech_links = content_div.findAll('a')
        speeches = []
        for speech in speech_links:
            title = speech.text.encode('utf-8')
            link = speech['href']
            speeches.append((title, link))
        return speeches

    # speeches.htm is a saved copy of http://mittromneycentral.com/speeches/
    speech_html = open('speeches.htm').read()
    speeches = get_list_of_speeches(speech_html)
    def get_speech_text(html):
        soup = BS(html)
        content_div = soup.findAll('div', {"class": "entry-content"})[0]
        # keep only the <p> tags that have no class attribute
        content = content_div.findAll('p', {"class": None})
        speech = ''
        for paragraph in content:
            speech += paragraph.text.encode('utf-8') + '\n'
        return speech

    # file1.htm is a saved copy of http://mittromneycentral.com/speeches/2006-speeches/092206-values-voters-summit-2006
    html = open('file1.htm').read()
    print get_speech_text(html)
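The code above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same rule (keep only the <p> tags without a class attribute inside the first <div class="entry-content">) in Python 3, here is a version built on the standard library's html.parser instead of BeautifulSoup, together with a loop over the (title, link) pairs. The names SpeechExtractor, extract_speech_text, collect_speeches and fake_fetch, the fetch parameter, and the sample HTML are all made up for illustration; the real pages may need extra handling that this sketch does not cover.

```python
from html.parser import HTMLParser


class SpeechExtractor(HTMLParser):
    """Collect text from class-less <p> tags inside the first entry-content div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_depth = 0         # 0 = outside the content div, 1+ = inside it
        self.done = False          # set once the first content div has closed
        self.in_paragraph = False  # True while inside a <p> with no class
        self.current = []          # text pieces of the paragraph being read
        self.paragraphs = []       # finished, non-empty paragraphs

    def _finish_paragraph(self):
        if self.in_paragraph:
            text = "".join(self.current).strip()
            if text:  # unlike the original code, empty paragraphs are skipped
                self.paragraphs.append(text)
        self.in_paragraph = False
        self.current = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif "entry-content" in (attrs.get("class") or "").split():
                self.div_depth = 1
        elif tag == "p" and self.div_depth > 0:
            self._finish_paragraph()  # a new <p> implicitly closes an open one
            self.in_paragraph = "class" not in attrs

    def handle_endtag(self, tag):
        if self.div_depth == 0:
            return
        if tag == "p":
            self._finish_paragraph()
        elif tag == "div":
            self.div_depth -= 1
            if self.div_depth == 0:
                self._finish_paragraph()
                self.done = True

    def handle_data(self, data):
        if self.in_paragraph:
            self.current.append(data)


def extract_speech_text(html):
    """Return the speech paragraphs of one page, joined by newlines."""
    parser = SpeechExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


def collect_speeches(speeches, fetch):
    """Map each title to its speech text.

    speeches: iterable of (title, link) pairs.
    fetch: callable that takes a URL and returns the page HTML as a str.
    """
    return {title: extract_speech_text(fetch(link)) for title, link in speeches}


# Made-up page used only to exercise the sketch offline.
SAMPLE_PAGE = """
<html><body>
<div class="sidebar"><p>Navigation</p></div>
<div class="entry-content">
  <p class="meta">Posted by admin</p>
  <p>Thank you all.</p>
  <div class="inner"><p>It is good to be <b>here</b>.</p></div>
  <p class="share">Share this</p>
</div>
<div class="footer"><p>Copyright</p></div>
</body></html>
"""


def fake_fetch(url):
    """Stand-in for a real download; always returns the sample page."""
    return SAMPLE_PAGE


print(extract_speech_text(SAMPLE_PAGE))
```

Passing fetch in as a parameter keeps the loop testable without network access. In real use it could be a small wrapper around urllib.request.urlopen that reads and decodes the response; the right encoding depends on the site and is not something this sketch checks.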