How to extract text from lxml.etree tags based on the value of sibling tags
My objective is to pull URLs from an XML document (linked) and put them in a list: https://www.valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml
I imported etree from lxml and created a list comprehension that pulls the text from all <instanceUrl> tags.
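from urllib.request import urlopen
from lxml import etree
def get_instance_urls():  # wrapper name is illustrative; the original shows only the body
    url = 'https://valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml'
    et = etree.fromstring(urlopen(url).read())
    return [_.find('instanceUrl').text for _ in et.find('filings')]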
Now, I want to restrict the list so that it only pulls the text from <instanceUrl> tags whose sibling <formType> is 10-K.
How can I achieve this?
You need an XPath expression and the xpath() method:
[url.text for url in et.xpath('//filing[formType = "10-K"]/instanceUrl')]
Here we select the filing nodes that have a formType child node with the text 10-K, then get the instanceUrl child of each.
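To see the predicate in action without hitting the network, here is a minimal sketch that runs the same XPath expression against a small inline document. The sample XML, the `results` root element name, and the example.com URLs are assumptions made up for illustration; only the filings / filing / formType / instanceUrl names come from the question.

```python
from lxml import etree

# Hypothetical sample that mimics the assumed <filings>/<filing> layout.
# The root element name and the URLs are invented for this sketch.
SAMPLE_XML = b"""
<results>
  <filings>
    <filing>
      <formType>10-K</formType>
      <instanceUrl>https://example.com/annual-2016.xml</instanceUrl>
    </filing>
    <filing>
      <formType>10-Q</formType>
      <instanceUrl>https://example.com/quarterly-2016.xml</instanceUrl>
    </filing>
    <filing>
      <formType>10-K</formType>
      <instanceUrl>https://example.com/annual-2015.xml</instanceUrl>
    </filing>
  </filings>
</results>
"""

root = etree.fromstring(SAMPLE_XML)

# The predicate [formType = "10-K"] keeps only <filing> elements that have a
# <formType> child whose text equals "10-K"; the trailing step then selects
# the <instanceUrl> child of each surviving <filing>.
ten_k_urls = [
    node.text
    for node in root.xpath('//filing[formType = "10-K"]/instanceUrl')
]

print(ten_k_urls)
```

The 10-Q filing is skipped, and the two 10-K URLs come back in document order.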
Note that the _ variable name is conventionally reserved for throw-away variables: variables that have to be defined but are not actually used (e.g. during unpacking). In your case, you actually use it, so a descriptive name would be clearer.
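As a rough illustration of that naming convention, the sketch below contrasts a value that is genuinely discarded with one that is actually read. The `filings` list of tuples is hypothetical data invented for this example, not the structure returned by lxml.

```python
# Hypothetical (form type, URL) pairs, invented for this sketch.
filings = [
    ("10-K", "https://example.com/annual.xml"),
    ("10-Q", "https://example.com/quarterly.xml"),
]

# `_` fits here: the form type must be unpacked but is never read.
urls = [url for _, url in filings]

# Here the loop variable is read inside the expression, so it gets a real name.
form_types = [filing[0] for filing in filings]

print(urls)
print(form_types)
```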