[英]Python: Remove everything before certain chars
I have several files on which I should work on. 我有几个应该处理的文件。 The files are xml-files, but before " < ?xml version="1.0"? > ", there are some debugging and status lines coming from the command line.
这些文件是xml文件,但是在“ <?xml version =” 1.0“?>”之前,有一些调试和状态行来自命令行。 Since I'd like to pare the file, these lines must be removed.
由于我想解析文件,因此必须删除这些行。 My question is: How is this possible?
我的问题是:这怎么可能? Preferably inplace, ie the filename stays the same.
最好就位,即文件名保持不变。
Thanks for any help. 谢谢你的帮助。
An inefficient solution would be to read the whole contents and find where this occurs: 一种低效的解决方案是读取全部内容并查找发生的位置:
fileName="yourfile.xml"
with open(fileName,'r+') as f:
contents=f.read()
contents=contents[contents.find("< ?xml version="1.0"? >"):]
f.seek(0)
f.write(contents)
f.truncate()
The file will now contain the original files contents from "< ?xml version="1.0"? >" onwards. 该文件现在将包含从“ <?xml version =“ 1.0”?>“开始的原始文件内容。
What about trimming the file headers as you read the file? 读取文件时修剪文件头该怎么办?
import xml.etree.ElementTree as et
with open("input.xml", "rb") as inf:
# find starting point
offset = 0
for line in inf:
if line.startswith('<?xml version="1.0"'):
break
else:
offset += len(line)
# read the xml file starting at that point
inf.seek(offset)
data = et.parse(inf)
(This assumes that the xml header starts on its own line, but works on my test file: (这假定xml标头以其自己的行开头,但适用于我的测试文件:
<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
<def>xy</def>
<def>hi</def>
</abc>
Since you say you have several files, using fileinput
might be better than open
. 既然您说您有几个文件,那么使用
fileinput
可能比open
更好。 You can then do something like: 然后,您可以执行以下操作:
import fileinput
import sys
prolog = '< ?xml version="1.0"? >'
reached_prolog = False
files = ['file1.xml', 'file2.xml'] # The paths of all your XML files
for line in fileinput.input(files, inplace=1):
# Decide how you want to remove the lines. Something like:
if line.startswith(prolog) and not reached_prolog:
continue
else:
reached_prolog = True
sys.stdout.write(line)
Reading the docs for fileinput
should make things clearer. 阅读文件输入
fileinput
应该使事情更清楚。
PS This is just a quick response; PS:这只是快速反应; I haven't ran/tested the code.
我还没有运行/测试代码。
A solution with regexp: 使用regexp的解决方案:
import re
import shutil
with open('myxml.xml') as ifile, open('tempfile.tmp', 'wb') as ofile:
for line in ifile:
matches = re.findall(r'< \?xml version="1\.0"\? >.+', line)
if matches:
ofile.write(matches[0])
ofile.writelines(ifile)
break
shutil.move('tempfile.tmp', 'myxml.xml')
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.