Extract all attributes of an element from XML in Python
I have multiple XML files containing tweets in a format similar to the one below:
<tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet>
There is a problem with the way the files were created (the closing tag for img is missing), so I chose to close it explicitly, as in the example above. I know that in HTML you can close it as
<img **something here** />
but I don't know whether this holds for XML, as I haven't seen it anywhere.
I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes of img, and I can't seem to get them. Here is what I've tried so far:
top = []
txt = []
emj = []
for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')
    emoji = article.find('.img')
    everything = textbrut.attrib
    if topic is not None and textbrut is not None:
        top.append(topic.text)
        txt.append(textbrut.text)
        x = list(everything.items())
        emj.append(x)
Any help would be greatly appreciated.
Apparently, Element has some useful methods (such as Element.iter()) that iterate recursively over the whole subtree below it (its children, their children, ...). So here is the solution that worked for me:
new = []  # one list of (attribute, value) pairs per <img> element
for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)
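To keep each tweet's image attributes together with its topic and text, the same iter() call can be made on each tweet element instead of on the root. The following is only a sketch under some assumptions: the tweets sit under a single wrapper element (a hypothetical `<tweets>` root here), a tweet may contain zero or more img elements, and the helper name `extract_tweets` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a wrapper root with two tweets, one without an <img>.
SAMPLE = """<tweets>
<tweet idtweet='1'>
<topic>#irony</topic>
<textbrut> Some text here <img class="Emoji" src="source.png" alt="x" title="Laughing with tears"></img> #irony </textbrut>
</tweet>
<tweet idtweet='2'>
<topic>#sarcasm</topic>
<textbrut>No emoji here</textbrut>
</tweet>
</tweets>"""


def extract_tweets(root):
    """Return one dict per <tweet> with its topic, raw text and img attributes."""
    rows = []
    for tweet in root.iter('tweet'):
        textbrut = tweet.find('textbrut')
        rows.append({
            # findtext returns None when the child element is missing
            'topic': tweet.findtext('topic'),
            # itertext also yields the text that follows the <img> element
            'textbrut': ''.join(textbrut.itertext()) if textbrut is not None else None,
            # one dict of attributes per <img> found anywhere below this tweet
            'images': [dict(img.attrib) for img in tweet.iter('img')],
        })
    return rows


rows = extract_tweets(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Note that textbrut.text on its own stops at the first child element, so the "#irony" after the img would be lost; joining itertext() keeps it. A tweet without an img simply gets an empty list rather than raising an error.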
Below is a complete example:
import xml.etree.ElementTree as ET
xml = '''<t><tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet></t>'''
root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)
Output:
[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
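On the question of how to close the img tag: XML does allow the empty-element form `<img ... />`, and ElementTree parses it the same way as an explicit `<img ...></img>` pair, while an img that is never closed is rejected. A short check, using a shortened, made-up version of the textbrut element above:

```python
import xml.etree.ElementTree as ET

explicit = ET.fromstring('<textbrut>hi <img src="source.png" alt="x"></img> #irony</textbrut>')
self_closed = ET.fromstring('<textbrut>hi <img src="source.png" alt="x" /> #irony</textbrut>')

img_a = explicit.find('img')
img_b = self_closed.find('img')

# Both forms yield the same attributes and the same trailing text.
same_attributes = img_a.attrib == img_b.attrib
same_tail = img_a.tail == img_b.tail

# An <img> that is never closed is not well-formed XML and fails to parse.
try:
    ET.fromstring('<textbrut>hi <img src="source.png"> #irony</textbrut>')
    unclosed_parses = True
except ET.ParseError:
    unclosed_parses = False

print(same_attributes, same_tail, unclosed_parses)
```

So either way of closing the tag should work when repairing the files; the self-closing form is just shorter.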