Parse a text file with columns aligned with white spaces

Question

I am trying to parse a text file whose entries are aligned in columns using multiple spaces. The text looks like this:

Blah blah, blah               bao     123456     
hello, hello, hello           miao    299292929

I have already checked that it is not tab-delimited. The entries are in fact aligned with multiple spaces.

Splitting the text into single rows was not a problem. I also noticed that there are trailing spaces after the numerical sequence, so what I have now is:

["Blah blah, blah               bao     123456     ",   
 "hello, hello, hello           miao    299292929  "]

The desired output would be:

[["Blah blah, blah", "bao", "123456"],
 ["hello, hello, hello", "miao", "299292929"]]

Answer 1

You can use re.split() with \s{2,} as the delimiter pattern:

>>> import re
>>> l = ["Blah blah, blah               bao     123456     ",
...      "hello, hello, hello           miao    299292929  "]
>>> for item in l:
...     re.split(r'\s{2,}', item.strip())
... 
['Blah blah, blah', 'bao', '123456']
['hello, hello, hello', 'miao', '299292929']

\s{2,} matches two or more consecutive whitespace characters.
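As a rough sketch of how this could be applied to a whole file rather than a hand-built list: the function name parse_aligned and the compiled pattern COLUMN_GAP are made up for illustration, and io.StringIO stands in for an open file. It assumes that no field contains two or more consecutive whitespace characters.

```python
import io
import re

# Two or more consecutive whitespace characters act as the column separator.
COLUMN_GAP = re.compile(r"\s{2,}")

def parse_aligned(lines):
    """Split each non-blank line on runs of 2+ whitespace characters."""
    rows = []
    for line in lines:
        stripped = line.strip()  # drops the trailing padding and the newline
        if stripped:
            rows.append(COLUMN_GAP.split(stripped))
    return rows

# io.StringIO is a stand-in for a real file object (an assumption of this sketch).
sample = io.StringIO(
    "Blah blah, blah               bao     123456     \n"
    "hello, hello, hello           miao    299292929  \n"
)
rows = parse_aligned(sample)
print(rows)
```

With a real file, the same function could be called as parse_aligned(f) inside a with open(...) block, since iterating over a file object yields its lines.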

Answer 2

You can simply split by index. You could either hardcode the indexes, or detect them:

l=["Blah blah, blah               bao     123456     ",   
   "hello, hello, hello           miao    299292929  "]

def detect_column_indexes(list_of_lines):
    indexes = [0]
    # True for every character position that holds a space in all lines
    transitions = [col.count(' ') == len(list_of_lines) for col in zip(*list_of_lines)]
    last = False
    for i, x in enumerate(transitions):
        # a column starts where an all-space position is followed by text
        if not x and last:
            indexes.append(i)
        last = x
    # end marker; slicing past the end of a line is harmless
    indexes.append(len(list_of_lines[0]) + 1)
    return indexes

def split_line_by_indexes(indexes, line):
    tokens = []
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # consecutive pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

indexes = detect_column_indexes(l)
parsed = [split_line_by_indexes(indexes, line) for line in l]
print(indexes)
print(parsed)

output:

[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]

Obviously, trailing whitespace in a field cannot be told apart from the padding between columns, but leading whitespace in a field is preserved because the code uses rstrip instead of strip.

This method is not foolproof, but it is more robust than splitting on two consecutive whitespace characters.
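To illustrate the robustness point, here is a small sketch with a made-up row whose first field fills its column, leaving only one space before the next field. The helper make_row and the sample text are hypothetical; the column starts 0, 30 and 38 are the ones detected above.

```python
import re

def make_row(first, second, third):
    # Hypothetical helper: pad fields to the column widths used in the question.
    return first.ljust(30) + second.ljust(8) + third

lines = [
    make_row("a value that fills the column", "bao", "123456"),  # one space before "bao"
    make_row("hello, hello, hello", "miao", "299292929"),
]

# Splitting on 2+ whitespace characters merges the first two fields of row one.
by_regex = [re.split(r"\s{2,}", line.strip()) for line in lines]

# Slicing at the column starts keeps the fields apart.
starts = [0, 30, 38]
ends = starts[1:] + [None]  # None means "to the end of the line"
by_index = [[line[a:b].strip() for a, b in zip(starts, ends)] for line in lines]

print(by_regex[0])
print(by_index[0])
```

The regex result for the first row has two fields, while the index-based result has three.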

Answer 3

If you know the width of each field, it's easy. The first field is 30 characters wide, the second is 8 characters, and the last is 11 characters. So you can do something like this:

line = 'Blah blah, blah               bao     123456     '
parts = [line[:30].strip(), line[30:38].strip(), line[38:].strip()]
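The same idea can be generalized so the slice boundaries are computed from the widths instead of being typed by hand. This is only a sketch: the function name split_fixed_width is invented here, and the widths 30, 8 and 11 are the ones stated above.

```python
from itertools import accumulate

def split_fixed_width(line, widths):
    """Slice `line` into fields of the given widths and strip the padding."""
    ends = list(accumulate(widths))  # running totals: [30, 38, 49]
    starts = [0] + ends[:-1]         # each field starts where the previous one ends
    return [line[a:b].strip() for a, b in zip(starts, ends)]

# Build the sample line from its fields so the padding is exact.
line = "Blah blah, blah".ljust(30) + "bao".ljust(8) + "123456".ljust(11)
parts = split_fixed_width(line, [30, 8, 11])
print(parts)
```

Deriving the boundaries from the widths avoids off-by-one mistakes between the end of one slice and the start of the next.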

Answer 4

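Use the re module:

import re
# l is the list of lines from the question; strip each line first so the
# trailing spaces do not produce an empty last element
l1 = re.split('  +', l[0].strip())
l2 = re.split('  +', l[1].strip())
print([l1, l2])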

4 answers

solution1
6 ACCEPTED 2014-05-14 10:33:02

solution2
1 2014-05-14 10:29:59

solution3
1 2014-05-14 10:31:42

solution4
1 2014-05-14 10:36:27
