Extract all attributes of an element from XML in Python

I have multiple XML files containing tweets in a format similar to the one below:

<tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet>

There is a problem with the way the files were created (the closing tag for img is missing), so I chose to close it as in the example above. I know that in HTML you can close it as

<img **something here** /> 

but I don't know if this holds for XML, as I didn't see it anywhere.
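On the XML side of this question: the empty-element form `<img ... />` is part of XML itself, so it should parse to the same element as `<img ...></img>`, while an unclosed `<img ...>` is rejected. A minimal sketch with xml.etree.ElementTree (the sample strings and variable names are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Two spellings of the same empty element; both are well-formed XML.
paired = ET.fromstring(
    '<textbrut>Some text <img src="source.png" alt="x"></img> #irony</textbrut>')
self_closed = ET.fromstring(
    '<textbrut>Some text <img src="source.png" alt="x" /> #irony</textbrut>')

img_a = paired.find('img')
img_b = self_closed.find('img')

# Same attributes, and the text that follows the element (its tail) is kept.
print(img_a.attrib == img_b.attrib)
print(img_a.tail == img_b.tail)

# An <img> that is never closed is not well-formed XML and fails to parse.
unclosed_rejected = False
try:
    ET.fromstring('<textbrut><img src="source.png"></textbrut>')
except ET.ParseError as err:
    unclosed_rejected = True
    print('unclosed img is rejected:', err)
```

So either way of closing the tag works for the parser; only the unclosed form needs repairing before the files can be read.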

I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes of the img element, and I can't seem to get them. Here is what I've tried so far:

top = []
txt = []
emj = []

for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')

    emoji = article.find('.img')
    everything = textbrut.attrib

    if topic is not None and textbrut is not None:
            top.append(topic.text)
            txt.append(textbrut.text)

            x = list(everything.items())
            emj.append(x)

Any help would be greatly appreciated.

Apparently, Element has some useful methods (such as Element.iter()) that iterate recursively over the whole subtree below it (its children, their children, ...). So here is the solution that worked for me:

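new = []  # one list of (name, value) pairs per <img> element

for emoji in root.iter('img'):
    # iter() walks the whole tree, so nested <img> elements are found too
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)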

For more details read here.
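One thing to watch with root.iter('img') is that it collects every img element in the file into one flat list, so the link between a tweet and its emoji is lost. A sketch that calls iter() on each tweet instead, assuming the tweets are direct children of the root as in the question's loop (the `<tweets>` wrapper, the sample data, and the `rows` name are invented here):

```python
import xml.etree.ElementTree as ET

xml = '''<tweets>
  <tweet idtweet="1">
    <topic>#irony</topic>
    <textbrut>Some text <img src="a.png" title="Laughing with tears"></img> #irony</textbrut>
  </tweet>
  <tweet idtweet="2">
    <topic>#sarcasm</topic>
    <textbrut>No emoji here</textbrut>
  </tweet>
</tweets>'''

root = ET.fromstring(xml)
rows = []
for tweet in root:
    # iter('img') searches the subtree of this tweet only
    emojis = [dict(img.attrib) for img in tweet.iter('img')]
    rows.append((tweet.get('idtweet'), tweet.findtext('topic'), emojis))

for row in rows:
    print(row)
```

A tweet without any img simply ends up with an empty list, which keeps the three pieces of data aligned.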

Below

import xml.etree.ElementTree as ET

xml = '''<t><tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet></t>'''

root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)

Output

[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
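The snippet above assumes every tweet contains a topic, a text and at least one img; find() returns None when an element is missing, so `.text` or `.attrib` would then raise AttributeError. A more defensive sketch, using findtext() with a default and findall() for any number of img elements (the helper name `tweet_to_dict` and the sample XML are hypothetical):

```python
import xml.etree.ElementTree as ET


def tweet_to_dict(tweet):
    """Collect topic, text and the attributes of every <img> under one tweet."""
    return {
        # findtext() returns the default instead of failing on a missing child
        'topic': tweet.findtext('topic', default=''),
        'text': tweet.findtext('text', default=''),
        # findall() returns an empty list when there is no <img> at all
        'img_attributes': [dict(img.attrib) for img in tweet.findall('.//img')],
    }


xml = '''<t>
<tweet idtweet="a"><topic>#irony</topic><text>One</text>
  <textbrut>One <img alt="x" src="1.png"/> <img alt="y" src="2.png"/></textbrut></tweet>
<tweet idtweet="b"><topic>#irony</topic><textbrut>Two</textbrut></tweet>
</t>'''

data = [tweet_to_dict(t) for t in ET.fromstring(xml).findall('.//tweet')]
print(data)
```

Here 'img_attributes' is a list of dicts rather than a single dict, which is a deliberate change from the answer so that tweets with zero or several emoji are handled the same way.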
