
Parse a text file with columns aligned with white spaces

I am trying to parse a text file whose entries are aligned in columns using multiple spaces. The text looks like this:

Blah blah, blah               bao     123456     
hello, hello, hello           miao    299292929  

I have already checked that it is not tab-delimited. The entries are in fact aligned with multiple spaces.

Splitting the text into individual rows was not a problem. I also noticed that there are trailing spaces after the numeric sequence. So what I have now is:

["Blah blah, blah               bao     123456     ",   
 "hello, hello, hello           miao    299292929  "]

The desired output would be:

[["Blah blah, blah", "bao", "123456"],
 ["hello, hello, hello", "miao", "299292929"]]

You can use re.split() with \s{2,} as the delimiter pattern:

>>> import re
>>> l = ["Blah blah, blah               bao     123456     ",
...      "hello, hello, hello           miao    299292929  "]
>>> for item in l:
...     re.split(r'\s{2,}', item.strip())
... 
['Blah blah, blah', 'bao', '123456']
['hello, hello, hello', 'miao', '299292929']

\s{2,} matches two or more consecutive whitespace characters.
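
The same idea can be wrapped in a small reusable helper. This is only a minimal sketch: the names `COLUMN_GAP` and `split_columns` are hypothetical (they do not come from the answer above), and it assumes that no field contains two or more consecutive whitespace characters of its own.

```python
import re

# Two or more consecutive whitespace characters act as the column separator.
COLUMN_GAP = re.compile(r'\s{2,}')

def split_columns(lines):
    """Split each line on runs of 2+ whitespace characters (hypothetical helper)."""
    # strip() first, so the trailing padding does not produce an empty last field
    return [COLUMN_GAP.split(line.strip()) for line in lines]

rows = split_columns([
    "Blah blah, blah               bao     123456     ",
    "hello, hello, hello           miao    299292929  ",
])
print(rows)
```

Compiling the pattern once is mainly a readability choice here; `re.split` with the same pattern string would behave the same way.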

You can simply split by index. You could either hardcode the indexes, or detect them:

l=["Blah blah, blah               bao     123456     ",   
   "hello, hello, hello           miao    299292929  "]

def detect_column_indexes(list_of_lines):
    # A column starts at 0 and wherever a non-blank position follows
    # a position that is a space in every line.
    indexes = [0]
    # transitions[i] is True when position i is a space in all lines
    transitions = [col.count(' ') == len(list_of_lines) for col in zip(*list_of_lines)]
    last = False
    for i, x in enumerate(transitions):
        if not x and last:
            indexes.append(i)
        last = x
    # Closing index; slicing past the end of a string is harmless
    indexes.append(len(list_of_lines[0]) + 1)
    return indexes

def split_line_by_indexes(indexes, line):
    tokens = []
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # consecutive pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

indexes = detect_column_indexes(l)
parsed = [split_line_by_indexes(indexes, line) for line in l]
print(indexes)
print(parsed)

Output:

[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]

Obviously, trailing whitespace in a column cannot be told apart from padding, but you can preserve leading whitespace by using rstrip instead of strip.

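This method is not foolproof, but it is more robust than splitting on two consecutive whitespace characters. It still works when a long value leaves only a single space before the next column, as long as at least one position between the columns is a space in every line. It can still fail if a space inside a field happens to line up across all lines.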
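
To make the comparison concrete, here is a small sketch under stated assumptions: the sample data is invented for illustration, all lines are assumed to have the same length, and `column_starts` and `split_at` are hypothetical names for a compact re-implementation of the detection idea above. In the second line a value fills its whole column, so only a single space separates it from the next field.

```python
import re

# Invented sample data: in the second line only one space separates the fields.
lines = [
    "abcdefghij    bao  1",
    "fills the col miao 2",
]

def column_starts(lines):
    """Indexes where a non-blank column follows a column that is blank in every line."""
    blank = [all(ch == ' ' for ch in col) for col in zip(*lines)]
    starts = [0]
    for i in range(1, len(blank)):
        if blank[i - 1] and not blank[i]:
            starts.append(i)
    return starts

def split_at(starts, line):
    """Slice the line at the detected column starts, dropping right-hand padding."""
    ends = starts[1:] + [len(line)]
    return [line[a:b].rstrip() for a, b in zip(starts, ends)]

starts = column_starts(lines)
by_index = [split_at(starts, line) for line in lines]
by_regex = [re.split(r'\s{2,}', line.strip()) for line in lines]

print(by_index)  # three fields per line
print(by_regex)  # the second line is not split at all
```

With only two lines the detection is fragile as well: had the first value contained spaces at the same positions as "fills the col", those positions would have been mistaken for column boundaries.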

If you know the width of each field, it's easy. The first field is 30 characters wide, the second is 8 characters, and the last is 11 characters. So you can do something like this:

line = 'Blah blah, blah               bao     123456     '
parts = [line[:30].strip(), line[30:38].strip(), line[38:].strip()]
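
If the widths are kept in a list, the slice boundaries can be computed rather than written out by hand. The following is a sketch only; `split_fixed_width` is a hypothetical helper name, and the widths (30, 8, 11) are the ones stated above.

```python
from itertools import accumulate

def split_fixed_width(line, widths):
    """Slice `line` into consecutive fields of the given widths (hypothetical helper)."""
    ends = list(accumulate(widths))   # running totals, e.g. [30, 38, 49]
    starts = [0] + ends[:-1]          # e.g. [0, 30, 38]
    return [line[a:b].strip() for a, b in zip(starts, ends)]

line = 'Blah blah, blah               bao     123456     '
parts = split_fixed_width(line, (30, 8, 11))
print(parts)
```

Lines shorter than the total width are not an error here, since slicing past the end of a string simply yields shorter or empty fields.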

Use the re module:

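import re
# l is the list of lines from the question; strip() drops the trailing
# padding, so no empty string is left at the end of each result
l1 = re.split('  +', l[0].strip())
l2 = re.split('  +', l[1].strip())
print([l1, l2])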


 