How to extract text from an lxml.etree tag based on the value of a sibling tag

Question

My objective is to pull URLs from an XML document (linked) and put them in a list: https://www.valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml

I imported etree from lxml and created a list comprehension that pulls the text from all <instanceUrl> tags.

from urllib.request import urlopen
from lxml import etree

url = 'https://valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml'
et = etree.fromstring(urlopen(url).read())
urls = [_.find('instanceUrl').text for _ in et.find('filings')]

Now I want to restrict the list so that it only pulls the text from <instanceUrl> tags whose sibling <formType> is 10K.

How can I achieve this?

Answer 1

You need an XPath expression and the xpath() method:

[url.text for url in et.xpath('//filing[formType = "10-K"]/instanceUrl')]

Here we select the filing nodes that have a formType child whose text is 10-K, then take the instanceUrl child of each.
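To try the expression without hitting the network, here is a minimal sketch against an inline document. The sample XML is an assumption: the root element name and the example.com URLs are invented, and only the filings/filing/formType/instanceUrl nesting is taken from the question. The helper name urls_for_form_type is hypothetical too. It passes the form type as an XPath variable, which lxml's xpath() accepts as a keyword argument.

```python
from lxml import etree

# Assumed sample data; only the element names come from the question.
SAMPLE_XML = """<results>
  <filings>
    <filing>
      <formType>10-K</formType>
      <instanceUrl>https://example.com/annual.xml</instanceUrl>
    </filing>
    <filing>
      <formType>10-Q</formType>
      <instanceUrl>https://example.com/quarterly.xml</instanceUrl>
    </filing>
  </filings>
</results>"""


def urls_for_form_type(root, form_type):
    """Return the instanceUrl text of every filing with the given formType."""
    # $ft is an XPath variable bound by the keyword argument below,
    # so the form type never has to be formatted into the expression.
    nodes = root.xpath('//filing[formType = $ft]/instanceUrl', ft=form_type)
    return [node.text for node in nodes]


root = etree.fromstring(SAMPLE_XML)
annual_urls = urls_for_form_type(root, "10-K")
```

With the real feed, root would instead come from etree.fromstring(urlopen(url).read()) as in the question.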

Note that the _ variable name is conventionally used for throw-away variables: variables that have to be defined but are never actually used (e.g. during unpacking). In your case you actually use it, so a descriptive name would be clearer.
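As an illustration of the naming point, the original comprehension could be written with a descriptive loop variable and a filter, without XPath. This is only a sketch: the sample XML is assumed (invented root element and example.com URLs), and the fallback import is there just so the snippet also runs where lxml is not installed, since find() and findtext() behave the same way in the standard library for this case.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library offers the same find()/findtext() calls used here.
    import xml.etree.ElementTree as etree

# Assumed sample data; only the element names come from the question.
SAMPLE_XML = """<results>
  <filings>
    <filing>
      <formType>10-K</formType>
      <instanceUrl>https://example.com/annual.xml</instanceUrl>
    </filing>
    <filing>
      <formType>10-Q</formType>
      <instanceUrl>https://example.com/quarterly.xml</instanceUrl>
    </filing>
  </filings>
</results>"""

root = etree.fromstring(SAMPLE_XML)

# "filing" says what each item is; "_" would suggest the value is ignored.
ten_k_urls = [
    filing.findtext('instanceUrl')
    for filing in root.find('filings')
    if filing.findtext('formType') == '10-K'
]
```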

Answer 1: 2 votes, accepted, 2017-01-18 23:00:37
