Python: Remove everything before certain characters

Question

I have several files to work on. The files are XML files, but before "<?xml version="1.0"?>" there are some debugging and status lines coming from the command line. Since I'd like to parse the files, these lines must be removed. My question is: how can this be done? Preferably in place, i.e. the filename stays the same.

Thanks for any help.

Answer 1

An inefficient solution would be to read the whole file into memory and find where the XML declaration occurs:

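fileName = "yourfile.xml"
marker = '<?xml version="1.0"?>'
with open(fileName, 'r+') as f:
    contents = f.read()
    start = contents.find(marker)
    if start != -1:  # find() returns -1 if the marker is missing; leave the file alone then
        # overwrite the file in place with everything from the marker onwards
        f.seek(0)
        f.write(contents[start:])
        f.truncate()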

The file will now contain the original file's contents from "<?xml version="1.0"?>" onwards.
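One pitfall with the find-and-slice approach is that find() returns -1 when the marker is absent, and slicing from -1 would keep only the last character of the file. Below is a minimal sketch that guards against this. The helper name strip_before_marker and the shorter default marker b"<?xml" are assumptions made for illustration, not part of the answer above; the file is handled as bytes so no text encoding has to be guessed.

```python
def strip_before_marker(path, marker=b"<?xml"):
    """Drop everything before the first occurrence of marker, in place.

    Returns True if the file was rewritten, False if the marker was missing.
    (Hypothetical helper; name and default marker are illustrative.)
    """
    with open(path, "r+b") as f:
        contents = f.read()
        start = contents.find(marker)
        if start == -1:
            return False  # marker not found: leave the file untouched
        f.seek(0)
        f.write(contents[start:])
        f.truncate()  # cut off the leftover tail of the old, longer content
        return True
```

Like the original, this still reads the whole file into memory, so it is only suitable for files that comfortably fit in RAM.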

Answer 2

What about trimming the file headers as you read the file?

import xml.etree.ElementTree as et

with open("input.xml", "rb") as inf:
    # find starting point
    offset = 0
    for line in inf:
        if line.startswith(b'<?xml version="1.0"'):  # file is opened in binary mode, so compare bytes
            break
        else:
            offset += len(line)

    # read the xml file starting at that point
    inf.seek(offset)
    data = et.parse(inf)

(This assumes that the XML header starts on its own line, but it works on my test file:)

<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
    <def>xy</def>
    <def>hi</def>
</abc>
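If the XML header cannot be assumed to start on its own line, a variant is to search the raw bytes for the declaration and hand the remainder to the parser. This is only a sketch: parse_after_prolog is a hypothetical helper name, the marker b"<?xml" is an assumption, and the whole file is read into memory.

```python
import xml.etree.ElementTree as et

def parse_after_prolog(path, marker=b"<?xml"):
    """Parse the XML that starts at the first occurrence of marker.

    Returns the root element; the file on disk is not modified.
    (Hypothetical helper; name and default marker are illustrative.)
    """
    with open(path, "rb") as inf:
        data = inf.read()
    start = data.find(marker)
    if start == -1:
        raise ValueError("no XML declaration found in %r" % path)
    # fromstring accepts bytes, so the parser handles the encoding itself
    return et.fromstring(data[start:])
```

Note that et.fromstring returns the root Element, whereas et.parse in the answer above returns an ElementTree.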

Answer 3

Since you say you have several files, using fileinput might be better than open. You can then do something like:

import fileinput
import sys

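prolog = '<?xml version="1.0"?>'
reached_prolog = False
files = ['file1.xml', 'file2.xml'] # The paths of all your XML files
for line in fileinput.input(files, inplace=True):
    # Decide how you want to remove the lines. Something like:
    if fileinput.isfirstline():
        reached_prolog = False  # a new file has started, so look for its prolog again
    if not reached_prolog and line.startswith(prolog):
        reached_prolog = True
    if reached_prolog:
        # with inplace=True, stdout is redirected into the file being processed
        sys.stdout.write(line)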

Reading the docs for fileinput should make things clearer.

P.S. This is just a quick response; I haven't run or tested the code.
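The fileinput idea can be wrapped in a small function, roughly as sketched below. The name strip_headers and the default prolog "<?xml" are assumptions for illustration. The sketch relies on two documented fileinput behaviours: with inplace=True, standard output is redirected into the file currently being read, and isfirstline() reports when a new file has started.

```python
import fileinput

def strip_headers(paths, prolog="<?xml"):
    """Rewrite each file in place, dropping everything before its prolog.

    (Hypothetical helper; name and default prolog are illustrative.)
    """
    if not paths:
        return  # an empty list would make fileinput fall back to stdin
    keeping = False
    with fileinput.input(files=paths, inplace=True) as fin:
        for line in fin:
            if fin.isfirstline():
                keeping = False      # new file: search for its prolog again
            if not keeping:
                pos = line.find(prolog)
                if pos == -1:
                    continue         # still inside the junk header, drop the line
                keeping = True
                line = line[pos:]    # trim junk that precedes the prolog on this line
            print(line, end="")      # goes into the current file, not the terminal
```

A file that contains no prolog at all ends up empty with this approach, so it may be worth passing backup=".bak" to fileinput.input while trying it out.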

Answer 4

A solution with regexp:

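import re
import shutil

# both files are opened in text mode so that str lines can be written back
with open('myxml.xml') as ifile, open('tempfile.tmp', 'w') as ofile:
    for line in ifile:
        match = re.search(r'<\?xml version="1\.0"\?>', line)
        if match:
            ofile.write(line[match.start():])  # drop anything before the prolog on this line
            ofile.writelines(ifile)            # copy the rest of the file unchanged
            break
# replace the original only after both files have been closed
shutil.move('tempfile.tmp', 'myxml.xml')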
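The temp-file approach has the advantage that the input never has to be held in memory. A possible variant, sketched below, creates the temporary file next to the original and only swaps it in when the prolog was actually found. The helper name rewrite_from_prolog and the marker b"<?xml" are assumptions for illustration; a plain bytes search stands in for the regular expression here.

```python
import os
import tempfile

def rewrite_from_prolog(path, marker=b"<?xml"):
    """Stream path into a temp file starting at marker, then swap it in.

    Returns True if the file was replaced, False if the marker was missing.
    (Hypothetical helper; name and default marker are illustrative.)
    """
    directory = os.path.dirname(os.path.abspath(path))
    found = False
    with open(path, "rb") as src, tempfile.NamedTemporaryFile(
        "wb", dir=directory, delete=False
    ) as tmp:
        for line in src:
            pos = line.find(marker)
            if pos != -1:
                found = True
                tmp.write(line[pos:])  # the prolog line, minus any leading junk
                tmp.writelines(src)    # remaining lines, copied as-is
                break
    # both files are closed here, which matters on Windows
    if found:
        os.replace(tmp.name, path)
    else:
        os.remove(tmp.name)            # no prolog: keep the original untouched
    return found
```

Creating the temporary file in the same directory keeps the final os.replace on one filesystem, where it is a rename rather than a copy.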

Answer 1: 2 2014-04-06 23:09:26
Answer 2: 0 2014-04-06 23:23:54
Answer 3: 0 2014-04-06 23:25:30
Answer 4: 0 2014-04-06 23:57:05
