
Removing stopwords from web pages in Python

I tried the program below, and it works properly:

I want to remove stopwords from web pages. With FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom' it runs successfully, but when I change the URL it gives an error.

import os
import sys
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    return BeautifulStoneSoup(clean_html(html),
                convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
#print "Fetched %s entries from '%s'" % (len(fp.entries[0])

# Collect the title, cleaned content and link of every entry.
blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title, 'content'
                      : cleanHtml(e.content[0].value), 'link': e.links[0].href})

# Write all collected posts to a JSON file once the loop has finished.
out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()
print ('Wrote output file to %s' % (f.name, ))

But when I change the URL, it raises an error:

      FEED_URL = 'http://www.thehindu.com'

Error:

     IndexError                                Traceback (most recent call last)
     <ipython-input-1-b80b4061a360> in <module>()
     14 fp = feedparser.parse(FEED_URL)
     15 
     ---> 16 print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)
     17 #print "Fetched %s entries from '%s'" % (len(fp.entries[0])
     18 

     IndexError: list index out of range

Can anybody help me solve this problem?

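It looks like the URL you are using is not a feed URL. http://www.thehindu.com is the site's HTML home page rather than an RSS/Atom feed, so feedparser most likely finds no entries, and fp.entries[0] then raises the IndexError.

Try: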

FEED_URL = 'http://www.thehindu.com/?service=rss'

For other feeds, see http://www.thehindu.com/navigation/?type=rss
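
Once the feed parses, the stopword removal itself still has to be applied to the cleaned text, because the program in the question only fetches and saves the posts. Here is a rough sketch, assuming a tiny hand-picked stopword set. `STOPWORDS` and `remove_stopwords` are hypothetical names, and in practice you would probably pass in a fuller list such as `nltk.corpus.stopwords.words('english')`.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset(
    ["a", "an", "and", "the", "is", "in", "it", "of", "on", "to"])


def remove_stopwords(text, stopwords=STOPWORDS):
    """Return the lowercased words of `text`, in order, minus stopwords."""
    # Treat runs of letters and apostrophes as words; drop everything else.
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [w for w in words if w not in stopwords]
```

You could apply it to each post's cleaned content before appending the post to `blog_posts`.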
