Removing stopwords from web pages in Python
I tried the program below and it works properly.
I want to remove stopwords from web pages. With FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom' it runs successfully, but when I change the URL it gives an error.
import os
import sys
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
#print "Fetched %s entries from '%s'" % (len(fp.entries[0])

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()

print ('Wrote output file to %s' % (f.name, ))
But when I change the URL, it gives an error:
FEED_URL = 'http://www.thehindu.com'
Error:
IndexError Traceback (most recent call last)
<ipython-input-1-b80b4061a360> in <module>()
14 fp = feedparser.parse(FEED_URL)
15
---> 16 print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
17 #print "Fetched %s entries from '%s'" % (len(fp.entries[0])
18
IndexError: list index out of range
Can anybody help me solve this problem?
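It looks like the feed URL you are using is not correct. http://www.thehindu.com is the site's home page rather than an RSS/Atom feed, so fp.entries is most likely empty and fp.entries[0] raises the IndexError.
Try: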
FEED_URL = 'http://www.thehindu.com/?service=rss'
For other feeds, see: http://www.thehindu.com/navigation/?type=rss
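To avoid the IndexError whatever URL is used, the script could check whether any entries came back before indexing into them. The following is only a minimal sketch, written for Python 3: count_entries and fetch_feed are hypothetical helper names that are not part of feedparser, and the SimpleNamespace objects are stand-ins for a parsed feed so the check can be shown without network access.

```python
from types import SimpleNamespace


def count_entries(parsed):
    """Return (number_of_entries, feed_title) for a feedparser-style result.

    Nothing here indexes into parsed.entries, so an empty result
    cannot raise IndexError.
    """
    entries = getattr(parsed, "entries", [])
    feed = getattr(parsed, "feed", None)
    title = getattr(feed, "title", "(untitled)")
    return len(entries), title


def fetch_feed(url):
    """Parse url with feedparser; raise a clear error if it has no entries."""
    # Third-party package, imported lazily so count_entries works without it.
    import feedparser

    parsed = feedparser.parse(url)
    n_entries, _title = count_entries(parsed)
    if n_entries == 0:
        raise ValueError("No entries found at %r; is it an RSS/Atom feed URL?" % url)
    return parsed


# Hypothetical stand-ins for parsed feeds, used only to demonstrate the check.
empty = SimpleNamespace(entries=[], feed=SimpleNamespace())
ok = SimpleNamespace(entries=[SimpleNamespace(title="Post")],
                     feed=SimpleNamespace(title="Radar"))

print(count_entries(empty))
print(count_entries(ok))
```

With a check like this, calling fetch_feed('http://www.thehindu.com') would be expected to fail with a readable message instead of "list index out of range", while a real feed URL would proceed as before. Note also that the original print statement reports len(fp.entries[0].title), which is the length of the first title; len(fp.entries) is what gives the number of entries.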