
Web scraping an XML page in Python?

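I'm confused about how to scrape all the links from a given XML page, keeping only those that contain the string "mp3". The following code only returns empty brackets: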

# Import required modules 
from lxml import html 
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 
  
# Parsing the page 
# (We need to use page.content rather than  
# page.text because html.fromstring implicitly 
# expects bytes as input.) 
tree = html.fromstring(page.content)   
  
# Get element using XPath 
buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 
print(buyers)

Am I using @url wrong?

The links I'm looking for were shown in a screenshot in the original post.

Any help would be greatly appreciated!

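What's wrong?

The following XPath doesn't work. As you suspected, the problem is the use of @url together with text(): the test @url="mp3" matches only a url attribute that is exactly equal to "mp3", and /text() selects the element's text content, which an enclosure element doesn't have.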

//enclosure[@url="mp3"]/text()

Solution

The url attribute of //enclosure should contain "mp3", and the expression should then return the attribute itself with /@url.

Change this line:

buyers = tree.xpath('//enclosure[@url="mp3"]/text()')

to:

buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url')

Output

['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
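To see the difference between the expressions without hitting the network, here is a minimal sketch run against a small inline feed. The sample XML and its example.com URLs are made up for illustration, and it parses with lxml's etree (the XML parser) rather than lxml.html, on the assumption that the feed is well-formed XML.

```python
from lxml import etree

# Hypothetical miniature of a podcast RSS feed; the URLs are made up.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/audio/ep1.mp3?updated=1" length="0" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/video/ep2.mp4" length="0" type="video/mp4"/>
    </item>
  </channel>
</rss>
"""

tree = etree.fromstring(SAMPLE_FEED)

# Exact match: no url attribute is literally equal to "mp3", so nothing is selected.
exact = tree.xpath('//enclosure[@url="mp3"]/text()')

# Substring match, but text() asks for text nodes and <enclosure> has none.
no_text = tree.xpath('//enclosure[contains(@url, "mp3")]/text()')

# Substring match, returning the attribute value itself.
links = tree.xpath('//enclosure[contains(@url, "mp3")]/@url')

print(exact)
print(no_text)
print(links)
```

With the real feed, the same expressions could be applied to etree.fromstring(page.content). Note that contains() is a plain substring test, so it would also match a URL that merely has "mp3" somewhere in its path.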

This doesn't directly answer your question, but you could look at BeautifulSoup as an alternative (it can also optionally use lxml under the hood).

import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 

# Parse
soup = BeautifulSoup(page.text, 'lxml')

# Find: the MP3 links are in the url attribute of <enclosure> tags,
# not in the href attribute of <a> tags
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]
print(mp3)
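If installing a third-party parser is not an option, the same filter can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique and not what either answer uses: ElementTree's XPath subset has no contains(), so the substring test is done in Python instead. The helper name mp3_links and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET


def mp3_links(xml_bytes):
    """Return the url attribute of every <enclosure> whose url contains 'mp3'."""
    root = ET.fromstring(xml_bytes)
    return [
        enc.get("url")
        for enc in root.iter("enclosure")   # walks the whole tree, any depth
        if "mp3" in enc.get("url", "")     # tolerate enclosures without a url
    ]


# Hypothetical sample feed; the URLs are made up.
SAMPLE = b"""<rss version="2.0">
  <channel>
    <item><enclosure url="https://example.com/ep1.mp3" length="0" type="audio/mpeg"/></item>
    <item><enclosure url="https://example.com/ep2.mp4" length="0" type="video/mp4"/></item>
    <item><enclosure length="0" type="audio/mpeg"/></item>
  </channel>
</rss>
"""

print(mp3_links(SAMPLE))
```

With requests, the call would be mp3_links(page.content), passing the raw bytes so the parser can honor the encoding declared in the feed.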

