
How to extract text from lxml.etree tags based on value of sibling tags

My objective is to pull URLs from an XML document (linked here) and put them in a list: https://www.valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml

I imported etree from lxml and created a list comprehension that pulls the text from all <instanceUrl> tags.

from urllib.request import urlopen  # Python 3
from lxml import etree

url = 'https://valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml'
et = etree.fromstring(urlopen(url).read())
urls = [_.find('instanceUrl').text for _ in et.find('filings')]

Now I want to restrict the list so that it only pulls the text from <instanceUrl> tags whose sibling <formType> tag has the value 10-K.

How can I achieve this?

You need an XPath expression and the xpath() method:

[url.text for url in et.xpath('//filing[formType = "10-K"]/instanceUrl')]

The predicate [formType = "10-K"] keeps only the filing nodes that have a formType child whose text is 10-K; the trailing /instanceUrl step then selects the instanceUrl child of each matching filing.
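To try the filter without hitting the network, here is a minimal sketch against a made-up inline document. The sample XML only mirrors the layout implied by the question (filings > filing > formType / instanceUrl), and the example.com URLs are placeholders; the real feed may differ. The sketch uses findall() with the ElementPath subset instead of xpath(), so it also runs on the standard library's xml.etree.ElementTree when lxml is not installed.

```python
try:
    from lxml import etree  # preferred, as in the question
except ImportError:
    # Fallback: the standard library supports the same findall() predicate.
    import xml.etree.ElementTree as etree

# Hypothetical sample data; structure assumed from the question, not taken from the real feed.
SAMPLE = b"""<results>
  <filings>
    <filing><formType>10-K</formType><instanceUrl>http://example.com/a.xml</instanceUrl></filing>
    <filing><formType>10-Q</formType><instanceUrl>http://example.com/b.xml</instanceUrl></filing>
    <filing><formType>10-K</formType><instanceUrl>http://example.com/c.xml</instanceUrl></filing>
  </filings>
</results>"""

root = etree.fromstring(SAMPLE)

# Keep only <filing> elements with a <formType> child equal to "10-K",
# then take the <instanceUrl> child of each match.
ten_k_urls = [el.text for el in root.findall(".//filing[formType='10-K']/instanceUrl")]
print(ten_k_urls)
```

With lxml, root.xpath('//filing[formType = "10-K"]/instanceUrl') should select the same elements on this sample.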

Note that the _ variable name is conventionally reserved for throw-away variables: variables that have to be defined but are never actually used (e.g. during unpacking). In your case you do use it, so a descriptive name such as filing would be clearer.
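As a small illustration of that convention (the tuple below is invented for the example), _ marks a value that must be unpacked but is then ignored:

```python
# Hypothetical record: (form type, filing date, instance URL).
record = ("10-K", "2019-01-29", "http://example.com/a.xml")

# The date has to be unpacked but is never read, so it is bound to _.
form_type, _, instance_url = record
print(form_type, instance_url)
```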

