I am trying to parse a text file whose entries are aligned in columns using multiple spaces. The text looks like this:
Blah blah, blah               bao     123456
hello, hello, hello           miao    299292929
I have already checked that it is not tab-delimited. The entries are in fact aligned with multiple spaces.
Splitting the text into individual rows was not a problem. I also noticed that there are trailing spaces after the numerical sequence, so what I have now is:
["Blah blah, blah               bao     123456     ",
 "hello, hello, hello           miao    299292929  "]
The desired output would be:
[["Blah blah, blah", "bao", "123456"],
["hello, hello, hello", "miao", "299292929"]]
You can use re.split() with \s{2,} as the delimiter pattern:
>>> import re
>>> l = ["Blah blah, blah               bao     123456     ",
...      "hello, hello, hello           miao    299292929  "]
>>> for item in l:
...     re.split(r'\s{2,}', item.strip())
...
['Blah blah, blah', 'bao', '123456']
['hello, hello, hello', 'miao', '299292929']
\s{2,} matches two or more consecutive whitespace characters.
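If you want the whole list at once, the same idea can be wrapped in a small function. This is only a sketch: the name `split_columns` is made up here, and the sample rows are built with `ljust` so that the padding is explicit rather than typed by hand.

```python
import re

def split_columns(lines):
    """Split each line on runs of two or more whitespace characters."""
    # strip() removes the trailing padding first, so no empty last field appears
    return [re.split(r"\s{2,}", line.strip()) for line in lines]

# Sample rows padded to the column widths 30, 8 and 11 (assumed layout)
rows = [
    "Blah blah, blah".ljust(30) + "bao".ljust(8) + "123456".ljust(11),
    "hello, hello, hello".ljust(30) + "miao".ljust(8) + "299292929".ljust(11),
]

print(split_columns(rows))
# [['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]
```

Note that this relies on every column gap being at least two characters wide, and on no field containing two consecutive spaces itself.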
You can simply split by index. You could either hardcode the indexes, or detect them:
l = ["Blah blah, blah               bao     123456     ",
     "hello, hello, hello           miao    299292929  "]

def detect_column_indexes(list_of_lines):
    indexes = [0]
    # True for every character position that is a space in all lines
    transitions = [col.count(' ') == len(list_of_lines) for col in zip(*list_of_lines)]
    last = False
    for i, x in enumerate(transitions):
        # a column starts where an all-space position is followed by a non-space one
        if not x and last:
            indexes.append(i)
        last = x
    indexes.append(len(list_of_lines[0]) + 1)
    return indexes

def split_line_by_indexes(indexes, line):
    tokens = []
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

indexes = detect_column_indexes(l)
parsed = [split_line_by_indexes(indexes, line) for line in l]
print(indexes)
print(parsed)
output:
[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]
Obviously, trailing whitespace that belongs to a field cannot be told apart from the column padding, but leading whitespace inside a field can be preserved by using rstrip instead of strip.
This method is not foolproof, but it is more robust than looking for two consecutive whitespace characters.
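To illustrate the robustness point, here is a small Python 3 sketch of a case where a field fills its column, so only one space separates it from the next field. The helper names `detect_column_starts` and `split_at` and the "meowing" row are invented for this example; it is a compact variant of the column-detection idea, not the exact code from the answer.

```python
import re

def detect_column_starts(lines):
    """Return offsets where a column begins: a non-blank position after an all-blank one."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]  # equalise lengths before zipping
    starts = [0]
    prev_blank = False
    for i, chars in enumerate(zip(*padded)):
        blank = all(c == " " for c in chars)
        if prev_blank and not blank:
            starts.append(i)
        prev_blank = blank
    return starts

def split_at(line, starts):
    """Slice a line at the given start offsets and drop the right-hand padding."""
    bounds = starts + [None]
    return [line[a:b].rstrip() for a, b in zip(bounds[:-1], bounds[1:])]

rows = [
    "Blah blah, blah".ljust(30) + "bao".ljust(8) + "123456",
    # "meowing" is 7 characters wide, so only ONE space precedes the number
    "hello, hello, hello".ljust(30) + "meowing".ljust(8) + "299292929",
]

by_regex = [re.split(r"\s{2,}", r.strip()) for r in rows]
starts = detect_column_starts(rows)
by_index = [split_at(r, starts) for r in rows]

print(by_regex[1])  # ['hello, hello, hello', 'meowing 299292929']
print(by_index[1])  # ['hello, hello, hello', 'meowing', '299292929']
```

The index-based version can still be fooled, for example when every row happens to have a space at the same position inside a field.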
If you know the width of each field, it's easy. The first field is 30 characters wide, second one is 8 characters and the last one is 11 characters. So you can do something like this:
line = 'Blah blah, blah               bao     123456     '
parts = [line[:30].strip(), line[30:38].strip(), line[38:].strip()]
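If there are more than a few fields, the slice boundaries can be computed from the widths instead of being written out by hand. A minimal sketch, assuming the widths 30, 8 and 11 given above; `split_fixed` is a hypothetical helper name.

```python
from itertools import accumulate

def split_fixed(line, widths):
    """Slice `line` into fields of the given widths and strip the padding."""
    ends = list(accumulate(widths))   # [30, 38, 49] for widths [30, 8, 11]
    starts = [0] + ends[:-1]          # [0, 30, 38]
    return [line[a:b].strip() for a, b in zip(starts, ends)]

# Row padded to the assumed column widths
line = "Blah blah, blah".ljust(30) + "bao".ljust(8) + "123456".ljust(11)
print(split_fixed(line, [30, 8, 11]))
# ['Blah blah, blah', 'bao', '123456']
```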
Use the re module:
import re
# strip the trailing spaces first, then split on runs of two or more spaces
l1 = re.split(' {2,}', l[0].strip())
l2 = re.split(' {2,}', l[1].strip())
print([l1, l2])