简体   繁体   中英

Python: Remove everything before certain chars

I have several files on which I should work on. The files are xml-files, but before " < ?xml version="1.0"? > ", there are some debugging and status lines coming from the command line. Since I'd like to pare the file, these lines must be removed. My question is: How is this possible? Preferably inplace, ie the filename stays the same.

Thanks for any help.

An inefficient solution would be to read the whole contents and find where this occurs:

fileName="yourfile.xml"
with open(fileName,'r+') as f:
  contents=f.read()
  contents=contents[contents.find("< ?xml version="1.0"? >"):]
  f.seek(0)
  f.write(contents)
  f.truncate()

The file will now contain the original files contents from "< ?xml version="1.0"? >" onwards.

What about trimming the file headers as you read the file?

import xml.etree.ElementTree as et

with open("input.xml", "rb") as inf:
    # find starting point
    offset = 0
    for line in inf:
        if line.startswith('<?xml version="1.0"'):
            break
        else:
            offset += len(line)

    # read the xml file starting at that point
    inf.seek(offset)
    data = et.parse(inf)

(This assumes that the xml header starts on its own line, but works on my test file:

<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
    <def>xy</def>
    <def>hi</def>
</abc>

Since you say you have several files, using fileinput might be better than open . You can then do something like:

import fileinput
import sys

prolog = '< ?xml version="1.0"? >'
reached_prolog = False
files = ['file1.xml', 'file2.xml'] # The paths of all your XML files
for line in fileinput.input(files, inplace=1):
    # Decide how you want to remove the lines. Something like:
    if line.startswith(prolog) and not reached_prolog:
        continue
    else:
        reached_prolog = True
        sys.stdout.write(line)

Reading the docs for fileinput should make things clearer.

PS This is just a quick response; I haven't ran/tested the code.

A solution with regexp:

import re
import shutil

with open('myxml.xml') as ifile, open('tempfile.tmp', 'wb') as ofile:
    for line in ifile:
        matches = re.findall(r'< \?xml version="1\.0"\? >.+', line)
        if matches:
            ofile.write(matches[0])
            ofile.writelines(ifile)
            break
    shutil.move('tempfile.tmp', 'myxml.xml')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM