Extract all attributes of an element from XML in Python

Question

I have multiple XML files containing tweets in a format similar to the one below:

<tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet>

There is a problem with the way the files were created (the closing tag for img is missing), so I chose to close it as in the example above. I know that in HTML you can close it as

<img **something here** />

but I don't know if this holds for XML, as I haven't seen it anywhere.

I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes of img, and I can't seem to get them. Here is what I've tried so far:

top = []
txt = []
emj = []

for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')

    emoji = article.find('.img')
    everything = textbrut.attrib

    if topic is not None and textbrut is not None:
            top.append(topic.text)
            txt.append(textbrut.text)

            x = list(everything.items())
            emj.append(x)

Any help would be greatly appreciated.

Answer 1

Apparently, Element has some useful methods (such as Element.iter()) that iterate recursively over the whole subtree below it (its children, their children, ...). So here is the solution that worked for me:

new = []  # one list of (name, value) pairs per img element
for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)

For more details read here.
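To illustrate why the attempt in the question came back empty: find() with a bare tag name only looks at the direct children of the element it is called on, whereas iter() walks the whole subtree. The following is a minimal sketch, assuming a cut-down version of the sample tweet wrapped in a hypothetical `<tweets>` root (the wrapper and the variable names are not from the original files):

```python
import xml.etree.ElementTree as ET

# Cut-down sample; the <tweets> wrapper is an assumption for illustration.
sample = (
    "<tweets><tweet idtweet='1'>"
    "<topic>#irony</topic>"
    "<textbrut>Some text <img src='source.png' title='Laughing with tears'></img> #irony</textbrut>"
    "</tweet></tweets>"
)
root = ET.fromstring(sample)
tweet = root.find('tweet')

# find() with a bare tag name only searches direct children of <tweet>;
# <img> is a grandchild (it sits inside <textbrut>), so this is None.
direct = tweet.find('img')

# iter() walks the whole subtree, so it reaches the nested <img>.
nested = [img.attrib for img in tweet.iter('img')]

# <textbrut> itself carries no attributes, so its attrib dict is empty.
textbrut_attrs = tweet.find('textbrut').attrib

print(direct)
print(nested)
print(textbrut_attrs)
```

This also suggests why `textbrut.attrib` in the question yields nothing useful: the attributes belong to img, not to the textbrut element that contains it.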

Answer 2

See below:

import xml.etree.ElementTree as ET

xml = '''<t><tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet></t>'''

root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)

Output

[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
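Two details the snippet above does not cover may be worth a sketch. First, the self-closed form `<img ... />` is valid XML (an empty-element tag) and parses the same as `<img ...></img>`. Second, `tweet.find('.//img')` returns None for a tweet without an image, so calling `.attrib` on it would raise an AttributeError. The example below is only an illustration under those assumptions; the two sample tweets are made up, and falling back to an empty dict is one possible choice, not the only one:

```python
import xml.etree.ElementTree as ET

# Two hypothetical tweets: one with a self-closed <img/>, one with no image.
sample = """<t>
<tweet idtweet='a'><topic>#irony</topic>
  <textbrut>Hi <img src="source.png" alt="x"/> #irony</textbrut>
  <text>Hi #irony</text></tweet>
<tweet idtweet='b'><topic>#other</topic>
  <textbrut>No image here</textbrut>
  <text>No image here</text></tweet>
</t>"""

root = ET.fromstring(sample)
data = []
for tweet in root.findall('.//tweet'):
    img = tweet.find('.//img')  # None when the tweet has no <img>
    data.append({
        'topic': tweet.findtext('topic'),
        'text': tweet.findtext('text'),
        # Copy into a plain dict; fall back to {} when there is no image.
        'img_attributes': dict(img.attrib) if img is not None else {},
    })
print(data)
```

If a tweet can contain several images, iterating with `tweet.iter('img')` instead of `find` would collect all of them.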

2 answers

Answer 1
1 2019-10-21 12:00:26

Answer 2
0 2019-10-21 13:55:23
