Split a large xml file into multiple based on tag in Python
I have a very large XML file which I need to split into several files based on a particular tag. The XML file is something like this:
<xml>
    <file id="13">
        <head>
            <talkid>2458</talkid>
            <transcription>
                <seekvideo id="645">So in college,</seekvideo>
                ...
            </transcription>
        </head>
        <content> *** This is the content I am trying to save *** </content>
    </file>
    <file>
        ...
    </file>
</xml>
I want to extract the content of each file and save it under a name based on the talkid.
Here is the code I have tried:
import xml.etree.ElementTree as ET
all_talks = 'path\\to\\big\\file'
context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')
But I get the following error:
AttributeError: 'NoneType' object has no attribute 'text'
You can use Beautiful Soup to parse XML.
It would look like this (I added a second talkid to the XML to demonstrate finding multiple tags):
xml_file = '''<xml>
<file id="13">
<head>
<talkid>2458</talkid>
<transcription>
<seekvideo id="645">So in college,</seekvideo>
...
</transcription>
<talkid>second talk id</talkid>
</head>
<content> *** This is the content I am trying to save *** </content>
</file>
<file>
...
</file>
</xml>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_file, "xml")
first_talk_id = soup.find('talkid').get_text()
talk_ids = soup.findAll('talkid')
print(first_talk_id)
# prints 2458
for talk in talk_ids:
    print(talk.get_text())
# prints
# 2458
# second talk id
NOTE: you will need to install a parser for bs4 to work with XML, for instance: pip install lxml
Try doing it this way. The issue is that talkid is a child of the head tag, not of the file tag, so elem.find('talkid') on the file element returns None.
import xml.etree.ElementTree as ET
all_talks = 'file.xml'
context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb') as f:  # 'wt' or just 'w' if you want to write text instead of bytes
            f.write(content.encode())  # in which case you would remove the .encode()
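To see why the original lookup fails, here is a minimal sketch on a cut-down, hypothetical copy of the XML from the question (the `sample` string and the variable names are mine, not part of the real file). A bare tag name passed to `find()` only matches direct children, while a path such as `head/talkid` or `.//talkid` reaches the nested element:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown in the question.
sample = """<xml>
<file id="13">
<head>
<talkid>2458</talkid>
</head>
<content>some text</content>
</file>
</xml>"""

root = ET.fromstring(sample)
file_elem = root.find('file')

# A bare tag name only matches direct children of <file>, so this is None.
direct = file_elem.find('talkid')

# A path walks through <head>; './/talkid' searches all descendants.
via_path = file_elem.findtext('head/talkid')
via_descendants = file_elem.findtext('.//talkid')

print(direct, via_path, via_descendants)
# prints: None 2458 2458
```

`findtext()` also returns None instead of raising when the element is missing, which may be handy if some file entries have no talkid.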
If you're already using .iterparse(), it's more generic to rely just on events:
import xml.etree.ElementTree as ET
from pathlib import Path
all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))
for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file' and title and content:
            with open(all_talks.with_name(title + '.txt'), 'w') as f:
                f.write(content)
    elif element.tag == 'file':
        content = title = None
Upd. In a similar question @Leila asked how to write the text from all <seekvideo> tags to a file instead of <content>, so here is a solution:
import xml.etree.ElementTree as ET
from pathlib import Path
all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))
for event, element in context:
    if event == 'end':
        if element.tag == 'file' and title and parts:
            with open(all_talks.with_name(title + '.txt'), 'w') as f:
                f.write('\n'.join(parts))
        elif element.text:
            if element.tag == 'talkid':
                title = element.text
            elif element.tag == 'seekvideo':
                parts.append(element.text)
    elif element.tag == 'file':
        title = None
        parts = []
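Since the file in the question is very large, it may also be worth clearing each file element once it has been written, so that already-processed content does not pile up in memory. Below is a rough sketch of that idea, not a drop-in replacement for the code above: the function name `split_talks` and the `out_dir` parameter are hypothetical, and skipping entries without a talkid or content is my assumption about the desired behaviour.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_talks(xml_path, out_dir):
    """Write each <file>'s <content> text to <out_dir>/<talkid>.txt.

    Hypothetical helper: entries missing a talkid or content are skipped.
    """
    out_dir = Path(out_dir)
    written = []
    for event, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag != 'file':
            continue
        title = elem.findtext('head/talkid')
        content = elem.findtext('content')
        if title and content:
            target = out_dir / (title.strip() + '.txt')
            target.write_text(content, encoding='utf-8')
            written.append(target)
        # Drop the children of the finished <file> so their text is freed.
        elem.clear()
    return written
```

It could be called as `split_talks('file.xml', '.')`, which returns the list of paths it wrote.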