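Assigning XML children to variables in a single loop in Python
I'm still learning how to work with XML functions and the PubMed API. Currently I use XPath to get the text of child elements in the XML and collect it into lists, which I then combine into a dictionary. Here is the XML, and here is my code: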
from pprint import pprint as pp
import requests
from lxml import etree
article_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
response = requests.get(article_url)
tree = etree.fromstring(response.content)
ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
journal = [j.text.strip() for j in tree.xpath('//Article//Title')]
year = [y.text.strip() for y in tree.xpath('//PubmedData//History//PubMedPubDate[@PubStatus="medline"]//Year')]
month = [m.text.strip() for m in tree.xpath('//PubmedData//History//PubMedPubDate[@PubStatus="medline"]//Month')]
day = [d.text.strip() for d in tree.xpath('//PubmedData//History//PubMedPubDate[@PubStatus="medline"]//Day')]
result = {_id.text: {"journal": journal, "year": year, "month": month, "day":day} for _id, journal, year, month, day in zip(ids, journal, year, month, day)}
pp(result)
So the output is a dictionary:
{'29149862': {'day': '19',
              'journal': 'Italian journal of pediatrics',
              'month': '11',
              'year': '2017'},
 '29150897': {'day': '19',
              'journal': 'Respirology (Carlton, Vic.)',
              'month': '11',
              'year': '2017'}}
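However, I'm doing this on XML documents with 1000 nodes (i.e., each of "journal", "year", etc. will hold 1000+ items in its list).
I was wondering,
tl;dr: how can I optimize this process? Thanks in advance.
An optimized solution: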
import requests, pprint
from lxml import etree
article_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
response = requests.get(article_url)
tree = etree.fromstring(response.content)
ids_xpath = '//MedlineCitation/PMID[@Version=1]/text()'
article_xpath = '//Article//Title/text()'
ymd_xpath = '//PubmedData/History/PubMedPubDate[@PubStatus="medline"]/' \
'*[self::Year or self::Month or self::Day]/text()'
full_xpath = '|'.join((ids_xpath, article_xpath, ymd_xpath))
nodes = tree.xpath(full_xpath)
# Each article contributes five consecutive nodes: id, journal, year, month, day.
result = {nodes[i]: dict(zip(('journal', 'year', 'month', 'day'), nodes[i + 1:i + 5]))
          for i in range(0, len(nodes), 5)}
pprint.pprint(result)
Output:
{'29149862': {'day': '19',
              'journal': 'Italian journal of pediatrics',
              'month': '11',
              'year': '2017'},
 '29150897': {'day': '19',
              'journal': 'Respirology (Carlton, Vic.)',
              'month': '11',
              'year': '2017'}}
The key XPath expression extracts the required nodes and lays them out consecutively:
<id> | <journal> | <year> | <month> | <day>
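Regrouping a flat list like this only works if every article contributes exactly five text nodes. Here is a minimal sketch of that step in isolation. The `flat` list is a hypothetical hand-written stand-in for the result of `tree.xpath(...)`, and `regroup`, `FIELDS` and `GROUP` are illustrative names, not part of the answer above. The length check is an added precaution.

```python
FIELDS = ('journal', 'year', 'month', 'day')
GROUP = 1 + len(FIELDS)  # one id followed by four fields


def regroup(nodes):
    """Turn [id, journal, year, month, day, id, ...] into {id: {...}}."""
    # A missing node anywhere would shift every later record, so at least
    # refuse input whose length cannot be split into whole groups.
    if len(nodes) % GROUP:
        raise ValueError(f"expected a multiple of {GROUP} nodes, got {len(nodes)}")
    return {
        nodes[i]: dict(zip(FIELDS, nodes[i + 1:i + GROUP]))
        for i in range(0, len(nodes), GROUP)
    }


# Hypothetical sample data, not real PubMed output.
flat = ['111', 'Journal A', '2017', '11', '19',
        '222', 'Journal B', '2016', '3', '7']
result = regroup(flat)
```

The length check only catches some misalignments. If one article lacks a field and another has an extra node, the total can still be a multiple of five while the records are silently wrong, which is the main trade-off of the flat approach.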
Here is how I would do it:
import requests
from lxml import etree as ET

def extract_items(tree):
    # Visit each article once and resolve everything relative to it.
    for article in tree.xpath("/PubmedArticleSet/PubmedArticle"):
        item = {}
        citation = article.find('MedlineCitation')
        data = article.find('PubmedData')
        pmid = citation.findtext('./PMID[@Version = "1"]', default='')
        medline_date = data.find('./History/PubMedPubDate[@PubStatus="medline"]')
        item[pmid] = {
            'journal': citation.findtext('./Article/Journal/Title', default=''),
            'day': medline_date.findtext('Day', default=''),
            'month': medline_date.findtext('Month', default=''),
            'year': medline_date.findtext('Year', default=''),
        }
        yield item

article_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
response = requests.get(article_url)
tree = ET.fromstring(response.content)
for item in extract_items(tree):
    print(item)
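Note how every lookup inside the loop uses a relative XPath (starting with ./) and never needs the slow "descendant" shorthand //. I avoid querying the same path twice, and when I want a direct child of an element, I just use the child's name rather than a new path.
Result: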
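One thing to watch with per-article lookups is that `find` returns `None` when nothing matches, so calling `findtext` on a missing date element raises `AttributeError`. The sketch below shows the same relative-path style with a guard for that case. It makes two assumptions: the sample XML is hand-written rather than real PubMed output, and the standard library's `xml.etree.ElementTree` is swapped in for lxml so the example runs without third-party packages. The thread does not establish whether real records ever lack the medline date.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the second article deliberately has no medline date.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article><Journal><Title>Journal A</Title></Journal></Article>
    </MedlineCitation>
    <PubmedData>
      <History>
        <PubMedPubDate PubStatus="medline">
          <Year>2017</Year><Month>11</Month><Day>19</Day>
        </PubMedPubDate>
      </History>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><Journal><Title>Journal B</Title></Journal></Article>
    </MedlineCitation>
    <PubmedData><History /></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_items(root):
    """Yield {pmid: fields} per article, using only relative paths."""
    for article in root.findall('PubmedArticle'):
        citation = article.find('MedlineCitation')
        date = article.find('PubmedData/History/PubMedPubDate[@PubStatus="medline"]')
        fields = {'journal': citation.findtext('Article/Journal/Title', default='')}
        for name in ('Year', 'Month', 'Day'):
            # find() returned None if the date is absent; fall back to ''.
            fields[name.lower()] = '' if date is None else date.findtext(name, default='')
        yield {citation.findtext('PMID[@Version="1"]', default=''): fields}


rows = list(extract_items(ET.fromstring(SAMPLE)))
```

Because each article is handled independently, a missing element only affects that article's record and cannot shift values into a neighbouring one.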
{'29150897': {'journal': 'Respirology (Carlton, Vic.)', 'day': '19', 'month': '11', 'year': '2017'}}
{'29149862': {'journal': 'Italian journal of pediatrics', 'day': '19', 'month': '11', 'year': '2017'}}
I'm not a big fan of structuring the data this way. I would suggest:
[
    {'id': '29150897', 'journal': 'Respirology (Carlton, Vic.)', 'day': '19', 'month': '11', 'year': '2017'},
    {'id': '29149862', 'journal': 'Italian journal of pediatrics', 'day': '19', 'month': '11', 'year': '2017'}
]
because that turns out to be much easier to work with.
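Here is a short sketch of why a list of flat records can be convenient, using the two records from this thread. The variable names are illustrative. Sorting and filtering become one-liners, and an id-keyed lookup can still be derived when needed.

```python
from operator import itemgetter

records = [
    {'id': '29150897', 'journal': 'Respirology (Carlton, Vic.)', 'day': '19', 'month': '11', 'year': '2017'},
    {'id': '29149862', 'journal': 'Italian journal of pediatrics', 'day': '19', 'month': '11', 'year': '2017'},
]

# Sort by any field without unpacking nested dictionaries.
by_journal = sorted(records, key=itemgetter('journal'))

# Filter on a field value.
from_2017 = [r for r in records if r['year'] == '2017']

# An id-keyed index is one comprehension away if lookups are needed.
by_id = {r['id']: r for r in records}
```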