Parse a text file with columns aligned with white spaces

I am trying to parse a text file whose entries are aligned in columns using multiple spaces. The text looks like this:

Blah blah, blah               bao     123456     
hello, hello, hello           miao    299292929  

I have already checked that it is not tab-delimited. The entries are in fact aligned with multiple spaces.

Splitting the text into single rows was not a problem, and I also noticed that there are trailing spaces after the numerical sequence. So what I have now is:

["Blah blah, blah               bao     123456     ",   
 "hello, hello, hello           miao    299292929  "]

The desired output would be:

[["Blah blah, blah", "bao", "123456"],
 ["hello, hello, hello", "miao", "299292929"]]

You can use re.split() with \s{2,} as the delimiter pattern:

>>> import re
>>> l = ["Blah blah, blah               bao     123456     ",
...      "hello, hello, hello           miao    299292929  "]
>>> for item in l:
...     re.split(r'\s{2,}', item.strip())
... 
['Blah blah, blah', 'bao', '123456']
['hello, hello, hello', 'miao', '299292929']

\s{2,} matches 2 or more consecutive whitespace characters.
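To apply the same idea to a whole file rather than a list typed into the interpreter, something like the following could work. This is only a sketch: parse_rows and COLUMN_GAP are made-up names, and io.StringIO stands in for an open text file.

```python
import io
import re

# Hypothetical names, not part of the answer above.
# The pattern matches a run of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")

def parse_rows(lines):
    """Split each non-blank line into fields on runs of 2+ whitespace."""
    rows = []
    for line in lines:
        stripped = line.strip()  # drops the trailing spaces and the newline
        if stripped:             # skip blank lines
            rows.append(COLUMN_GAP.split(stripped))
    return rows

# io.StringIO stands in for an open file; open(path) would iterate the same way.
sample = io.StringIO(
    "Blah blah, blah               bao     123456     \n"
    "hello, hello, hello           miao    299292929  \n"
)
rows = parse_rows(sample)
print(rows)
```

Note that this relies on every pair of columns being separated by at least two spaces, and on no field containing two consecutive spaces itself.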

You can simply split by index. You could either hardcode the indexes or detect them:

l = ["Blah blah, blah               bao     123456     ",
     "hello, hello, hello           miao    299292929  "]

def detect_column_indexes(list_of_lines):
    # A column starts wherever a non-blank character position follows a
    # position that holds a space in every line.
    indexes = [0]
    transitions = [col.count(' ') == len(list_of_lines) for col in zip(*list_of_lines)]
    last = False
    for i, x in enumerate(transitions):
        if not x and last:
            indexes.append(i)
        last = x
    # Closing index; slicing past the end of a line is harmless.
    indexes.append(len(list_of_lines[0]) + 1)
    return indexes

def split_line_by_indexes(indexes, line):
    tokens = []
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # consecutive pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

indexes = detect_column_indexes(l)
parsed = [split_line_by_indexes(indexes, line) for line in l]
print(indexes)
print(parsed)

Output:

[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]

Obviously, trailing whitespace inside a column cannot be told apart from the padding between columns, but leading whitespace is preserved because the code uses rstrip instead of strip.

This method is not foolproof, but it is more robust than detecting two consecutive whitespace characters.
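One case where the difference shows up is a value that fills its column, leaving only a single space before the next field. The sketch below uses made-up sample rows (built with str.format so the column starts 0, 16 and 21 are known) and a compact, hypothetical reimplementation of the column detection; it assumes all rows have the same length.

```python
import re

# Hypothetical sample rows with field widths 16, 5 and 2.
rows = [
    "{:<16}{:<5}{:<2}".format("Springfield", "IL", "9"),
    "{:<16}{:<5}{:<2}".format("San Luis Obispo", "CA", "47"),
]

def column_starts(lines):
    """Indexes where a non-blank position follows an all-space position."""
    # zip(*lines) walks the rows character by character (equal lengths assumed).
    blank = [all(ch == " " for ch in col) for col in zip(*lines)]
    starts = [0]
    for i in range(1, len(blank)):
        if blank[i - 1] and not blank[i]:
            starts.append(i)
    return starts

def split_at(starts, line):
    """Slice a line at the detected column starts."""
    ends = starts[1:] + [len(line)]
    return [line[a:b].rstrip() for a, b in zip(starts, ends)]

starts = column_starts(rows)
by_index = [split_at(starts, row) for row in rows]
by_regex = [re.split(r"\s{2,}", row.strip()) for row in rows]

print(starts)    # [0, 16, 21]
print(by_index)  # three fields in both rows
print(by_regex)  # second row comes back as ['San Luis Obispo CA', '47']
```

The index-based split still fails if a space happens to line up at the same position inside a field in every row, which is why it is not foolproof either.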

If you know the width of each field, it's easy. The first field is 30 characters wide, the second is 8 characters and the last is 11 characters. So you can do something like this:

line = 'Blah blah, blah               bao     123456     '
parts = [line[:30].strip(), line[30:38].strip(), line[38:].strip()]
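If the widths live in a list, the slice boundaries can be computed instead of typed by hand. A minimal sketch, assuming the widths 30, 8 and 11 stated above; split_fixed_width and WIDTHS are made-up names, and the sample line is rebuilt with str.format so its padding is exact.

```python
from itertools import accumulate

# Field widths as stated above; the names here are hypothetical.
WIDTHS = [30, 8, 11]

def split_fixed_width(line, widths):
    """Cut a line into fields of the given widths and strip the padding."""
    ends = list(accumulate(widths))  # running totals: 30, 38, 49
    starts = [0] + ends[:-1]         # 0, 30, 38
    return [line[a:b].strip() for a, b in zip(starts, ends)]

# Rebuild the sample line with exact padding instead of typing the spaces.
line = "{:<30}{:<8}{:<11}".format("Blah blah, blah", "bao", "123456")
parts = split_fixed_width(line, WIDTHS)
print(parts)  # ['Blah blah, blah', 'bao', '123456']
```

Deriving each start from the previous end avoids the off-by-one mistakes that creep into hand-written slices.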

Use the re module:

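import re

# l is the list of raw lines from the question.
# Strip each line first so the trailing spaces do not yield an empty last
# field; list.remove('') returns None, so it cannot be used inside print.
l1 = re.split('  +', l[0].strip())
l2 = re.split('  +', l[1].strip())
print([l1, l2])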
