How do I extract all article links from a BBC RSS feed using Python?
I tried this, but it doesn't seem to work. I only need the article links, in a list.
from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`
from bs4 import BeautifulSoup
html = urlopen("http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml")
bsObj = BeautifulSoup(html.read(), "html.parser")
for link in bsObj.find_all('a'):
    print(link.get('href'))
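Even though the feed renders as HTML when you open it in a browser, the server returns XML to Python. If you run print(html.read()), you will see that XML.
In this XML there are no <a> tags. The URLs are carried by <link> tags instead, which have no attributes and hold the URL as their text, so you need to change your code to reflect that: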
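from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`
from bs4 import BeautifulSoup
html = urlopen("http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml")
# Note: under "html.parser", some bs4 versions may treat <link> as a void HTML
# tag and leave link.text empty; BeautifulSoup(..., "xml") (needs lxml) parses
# the feed as XML instead.
bsObj = BeautifulSoup(html.read(), "html.parser")
# Each URL is the text content of a <link> element, not an href attribute
for link in bsObj.find_all('link'):
    print(link.text)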
# http://www.bbc.co.uk/news/
# http://www.bbc.co.uk/news/
# http://www.bbc.co.uk/news/entertainment-arts-41914725
# http://www.bbc.co.uk/news/entertainment-arts-41886207
# http://www.bbc.co.uk/news/entertainment-arts-41886475
# ...
# ...
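The point that each URL is the text of a <link> element can be checked offline with the standard library alone. The following is only a minimal sketch, not the answer's own method: it swaps BeautifulSoup for xml.etree.ElementTree, the SAMPLE_RSS string is a hypothetical, trimmed-down stand-in for the real feed, and extract_article_links is an illustrative helper name. It also keeps only the <link> elements inside <item>, so the channel-level link to http://www.bbc.co.uk/news/ is left out and the result is a plain list.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an RSS 2.0 feed (not the live BBC response).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <link>http://www.bbc.co.uk/news/</link>
    <item>
      <title>First story</title>
      <link>http://www.bbc.co.uk/news/entertainment-arts-41914725</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.bbc.co.uk/news/entertainment-arts-41886207</link>
    </item>
  </channel>
</rss>"""


def extract_article_links(rss_text):
    """Return the text of every <link> that sits directly inside an <item>."""
    root = ET.fromstring(rss_text)  # root is the <rss> element
    return [
        link.text.strip()
        for link in root.findall("channel/item/link")
        if link.text  # skip empty <link/> elements
    ]


links = extract_article_links(SAMPLE_RSS)
print(links)
```

To try this against the live feed, the bytes returned by urlopen(...).read() could presumably be passed to extract_article_links in place of SAMPLE_RSS, assuming the feed keeps this channel/item/link layout.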
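# Alternative: let the feedparser library parse the feed
import feedparser
url = 'http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml'
data = feedparser.parse(url)
# Loop over the entries themselves; len(data) counts the keys of the
# parsed result, not the number of articles.
for entry in data['entries']:
    print(entry['link'])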