
For each line in a file, replace multiple-whitespace substring of variable length with line break

Using Python 2.7.1, I read in a file:

input = open(file, "rU")
tmp = input.readlines()

which looks like this:

>name     -----meoidoad
>longname -lksowkdkfg
>nm       --kdmknskoeoe---
>nmee     dowdbnufignwwwwcds--

That is, each line contains a short run of whitespace, and the length of that run varies from line to line.

I would like to write a script that edits my tmp object so that, when I write tmp to a file, the result is

>name
-----meoidoad
>longname
-lksowkdkfg
>nm
--kdmknskoeoe---
>nmee
dowdbnufignwwwwcds--

That is, I would like to break each line into two lines at that run of whitespace (and get rid of the spaces in the process).

The starting position of the string after the whitespaces is always the same within a file, but may vary among a large batch of files I am working with. So, I need a solution that does not rely on positions.

I've seen many similar questions on here, with many well-liked answers that use short regex scripts to do so, so it is possible I am duplicating a previous question. However, none of what I've seen so far has worked for me.

import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        # Replace each run of two or more whitespace characters with a newline.
        outfile.write(re.sub(r'\s\s+', '\n', line))
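Note that the pattern \s\s+ needs at least two whitespace characters, so a line whose gap is a single space, such as >longname -lksowkdkfg in the sample, would pass through unchanged. A minimal sketch of a variant that breaks at the first run of spaces or tabs of any length follows. The names GAP and break_at_gap are hypothetical, and the sketch assumes that no line starts with whitespace.

```python
import re

# One or more spaces or tabs; newlines are deliberately excluded.
GAP = re.compile(r'[ \t]+')

def break_at_gap(line):
    """Return `line` with its first run of spaces/tabs replaced by a newline."""
    # Strip the line ending (and any trailing blanks) first, then add it back.
    return GAP.sub('\n', line.rstrip(), count=1) + '\n'

# Two lines from the question's sample, including the single-space case.
sample = [
    ">name     -----meoidoad\n",
    ">longname -lksowkdkfg\n",
]
result = "".join(break_at_gap(line) for line in sample)
```

With real files, outfile.write(break_at_gap(line)) would take the place of the re.sub call in the loop above.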

If the file isn't huge (i.e. hundreds of MB), you can read it into memory in one go and do this concisely with split() and join():

with open(file, 'rU') as f, open(outfilename, 'w') as o:
    o.write('\n'.join(f.read().split()))
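The one-liner above reads the whole file with f.read(). For larger files, the same split() idea can probably be applied one line at a time instead. This is only a sketch: the generator name fields_as_lines is hypothetical, and it assumes every whitespace-separated field should end up on its own line.

```python
def fields_as_lines(lines):
    """Yield every whitespace-separated field of every line, newline-terminated."""
    for line in lines:
        # split() with no arguments splits on runs of whitespace of any length
        # and discards the line ending.
        for field in line.split():
            yield field + '\n'

# Two lines from the question's sample stand in for an open file object.
sample = [
    ">nm       --kdmknskoeoe---\n",
    ">nmee     dowdbnufignwwwwcds--\n",
]
output = list(fields_as_lines(sample))
```

With real files this would be used as o.writelines(fields_as_lines(f)), which keeps only one line in memory at a time and ends the output with a newline.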

I would also recommend against naming anything input, as that shadows the built-in function of the same name.
