
Assigning XML children to variables in a single loop in Python

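I'm still learning how XML functions and the PubMed API work together. Currently I use XPath to get the text of child elements in the XML, collect it into lists, and build a dictionary from those lists. The XML is fetched from the URL in the code below; this is my code: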

from pprint import pprint as pp
import requests
from lxml import etree

article_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
response = requests.get(article_url)
tree = etree.fromstring(response.content)

ids = tree.xpath("//MedlineCitation/PMID[@Version='1']")
journal = [j.text.strip() for j in tree.xpath('//Article//Title')]
year = [y.text.strip() for y in tree.xpath('//PubmedData//History//PubMedPubDate[@PubStatus="medline"]//Year')]
month = [m.text.strip() for m in tree.xpath('//PubmedData//History//PubMedPubDate[@PubStatus="medline"]//Month')]
day = [d.text.strip() for d in tree.xpath('//PubmedData//History//PubMedPubDate[@PubStatus="medline"]//Day')]

result = {_id.text: {"journal": journal, "year": year, "month": month, "day":day} for _id, journal, year, month, day in zip(ids, journal, year, month, day)}
pp(result)

So the output is a dictionary:

{'29149862': {'day': '19',
              'journal': 'Italian journal of pediatrics',
              'month': '11',
              'year': '2017'},
 '29150897': {'day': '19',
              'journal': 'Respirology (Carlton, Vic.)',
              'month': '11',
              'year': '2017'}}

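However, I'm doing this on XML with 1000+ nodes (i.e., each of the "journal", "year", etc. lists will contain 1000+ items).

I'm wondering:

  1. whether calling x.text.strip() in four or more separate comprehensions causes unnecessary passes over the XML document, and
  2. how I can run a single loop that gets the four things I need and assigns them to lists?

tl;dr: How can I optimize this process? Thanks in advance.

Optimized solution: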

import requests, pprint
from lxml import etree

article_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
response = requests.get(article_url)
tree = etree.fromstring(response.content)

ids_xpath = '//MedlineCitation/PMID[@Version=1]/text()'
article_xpath = '//Article//Title/text()'
ymd_xpath = '//PubmedData/History/PubMedPubDate[@PubStatus="medline"]/' \
            '*[self::Year or self::Month or self::Day]/text()'
full_xpath = '|'.join((ids_xpath, article_xpath, ymd_xpath))
nodes = tree.xpath(full_xpath)

# Each record is 5 consecutive nodes: the id, then the four values that follow it.
result = {nodes[i]: dict(zip(('journal', 'year', 'month', 'day'), nodes[i + 1:i + 5]))
          for i in range(0, len(nodes), 5)}

pprint.pprint(result)

Output:

{'29149862': {'day': '19',
              'journal': 'Italian journal of pediatrics',
              'month': '11',
              'year': '2017'},
 '29150897': {'day': '19',
              'journal': 'Respirology (Carlton, Vic.)',
              'month': '11',
              'year': '2017'}}

The key XPath expression extracts the needed nodes and returns them as one flat, sequential list: <id> | <journal> | <year> | <month> | <day> <id> | <journal> | <year> | <month> | <day>
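The grouping step can be sketched on its own, without any network access. This is only an illustration: `group_records` is a hypothetical helper name, and the flat list is typed in by hand from the values shown in the question, in the order the union XPath is expected to return them. The sketch assumes every article contributes exactly five nodes; if one article lacks a field, every later record would shift, so the helper at least refuses lists whose length is not a multiple of five.

```python
def group_records(nodes, fields=("journal", "year", "month", "day")):
    """Turn a flat [id, journal, year, month, day, id, ...] list into a dict keyed by id."""
    step = len(fields) + 1  # one id plus the named fields
    if len(nodes) % step:
        raise ValueError("node list is not a whole number of records")
    return {
        nodes[i]: dict(zip(fields, nodes[i + 1:i + step]))
        for i in range(0, len(nodes), step)
    }


# Hand-written stand-in for the result of tree.xpath(full_xpath).
flat = [
    "29150897", "Respirology (Carlton, Vic.)", "2017", "11", "19",
    "29149862", "Italian journal of pediatrics", "2017", "11", "19",
]
records = group_records(flat)
print(records["29149862"]["journal"])
```

The length check only catches some misalignments; it cannot detect two articles whose missing fields happen to cancel out, which is one reason a per-article loop is more robust.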

Here is how I would do it:

from pprint import pprint as pp
import requests
from lxml import etree as ET

def extract_items(tree):
    for article in tree.xpath("/PubmedArticleSet/PubmedArticle"):
        item = {}

        citation = article.find('MedlineCitation')
        data = article.find('PubmedData')

        id = citation.findtext('./PMID[@Version = "1"]', default='')
        medline_date = data.find('./History/PubMedPubDate[@PubStatus="medline"]')

        item[id] = {
            'journal': citation.findtext('./Article/Journal/Title', default=''),
            'day': medline_date.findtext('Day', default=''),
            'month': medline_date.findtext('Month', default=''),
            'year': medline_date.findtext('Year', default=''),
        }
        yield item

article_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&tool=PMA&id=29150897,29149862"
response = requests.get(article_url)
tree = ET.fromstring(response.content)

for item in extract_items(tree):
    print(item)

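Note how every lookup inside the loop uses a relative XPath (starting with ./), so the slow "descendant" shorthand // is never needed. I avoid querying the same path twice, and when I want a direct child of an element, I just use the child's name rather than a new path.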
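The same per-article idea can be tried offline. The sketch below swaps lxml for the standard library's `xml.etree.ElementTree` and uses a hand-trimmed `SAMPLE` document that only mimics the PubMed layout; the second article has deliberately had its medline date removed to show one way of tolerating a missing element. The names `SAMPLE` and `part` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for the efetch response (not the real payload).
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29150897</PMID>
      <Article><Journal><Title>Respirology (Carlton, Vic.)</Title></Journal></Article>
    </MedlineCitation>
    <PubmedData>
      <History>
        <PubMedPubDate PubStatus="medline">
          <Year>2017</Year><Month>11</Month><Day>19</Day>
        </PubMedPubDate>
      </History>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">29149862</PMID>
      <Article><Journal><Title>Italian journal of pediatrics</Title></Journal></Article>
    </MedlineCitation>
    <PubmedData><History/></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_items(root):
    """Yield one {pmid: {...}} dict per article, visiting each article once."""
    for article in root.findall("PubmedArticle"):
        citation = article.find("MedlineCitation")
        date = article.find('PubmedData/History/PubMedPubDate[@PubStatus="medline"]')

        def part(name):
            # Fall back to '' when the article has no medline date at all.
            return date.findtext(name, default="") if date is not None else ""

        pmid = citation.findtext('PMID[@Version="1"]', default="")
        yield {pmid: {
            "journal": citation.findtext("Article/Journal/Title", default=""),
            "day": part("Day"),
            "month": part("Month"),
            "year": part("Year"),
        }}


items = list(extract_items(ET.fromstring(SAMPLE.strip())))
for item in items:
    print(item)
```

Because each article is handled as a unit, a missing field only affects that article's own record.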

Result:

{'29150897': {'journal': 'Respirology (Carlton, Vic.)', 'day': '19', 'month': '11', 'year': '2017'}}
{'29149862': {'journal': 'Italian journal of pediatrics', 'day': '19', 'month': '11', 'year': '2017'}}

I'm not a big fan of structuring the data this way. I would suggest:

[
    {'id': '29150897', 'journal': 'Respirology (Carlton, Vic.)', 'day': '19', 'month': '11', 'year': '2017'},
    {'id': '29149862', 'journal': 'Italian journal of pediatrics', 'day': '19', 'month': '11', 'year': '2017'}
]

because a flat list of records turns out to be much easier to work with.
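As a rough illustration of why the flat layout is convenient, the sketch below sorts, filters, and indexes the two records shown above using plain built-ins. The variable names are only for this example.

```python
records = [
    {'id': '29150897', 'journal': 'Respirology (Carlton, Vic.)', 'day': '19', 'month': '11', 'year': '2017'},
    {'id': '29149862', 'journal': 'Italian journal of pediatrics', 'day': '19', 'month': '11', 'year': '2017'},
]

# Sorting works on any field without unwrapping a nested dict.
by_journal = sorted(records, key=lambda r: r['journal'])

# Filtering is a simple comprehension.
ids_from_2017 = [r['id'] for r in records if r['year'] == '2017']

# An id-keyed index can still be built on demand when lookups are needed.
by_id = {r['id']: r for r in records}

print(by_journal[0]['id'])
print(by_id['29150897']['journal'])
```

With the `{id: {...}}` shape, each of these operations would first have to pull the id back out of the dictionary key.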
