Parsing a text file with columns aligned with whitespace

Question

I am trying to parse a text file in which the entries are aligned into columns using multiple spaces. The text looks like this:

Blah blah, blah               bao     123456     
hello, hello, hello           miao    299292929

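I have checked that it is not tab-delimited. The entries really are aligned with multiple spaces.

Splitting the text into individual lines is not a problem, and I have also noticed that there are trailing spaces after the number sequences. So what I have now is: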

["Blah blah, blah               bao     123456     ",   
 "hello, hello, hello           miao    299292929  "]

The desired output is:

[["Blah blah, blah", "bao", "123456"],
 ["hello, hello, hello", "miao", "299292929"]]

Answer 1

You can use re.split() with \s{2,} as the delimiter pattern:

>>> import re
>>> l = ["Blah blah, blah               bao     123456     ",
...      "hello, hello, hello           miao    299292929  "]
>>> for item in l:
...     re.split(r'\s{2,}', item.strip())
... 
['Blah blah, blah', 'bao', '123456']
['hello, hello, hello', 'miao', '299292929']

\s{2,} matches 2 or more consecutive whitespace characters.
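
The same idea can be wrapped in a small helper so it can be applied to every line of a file. This is only a sketch: the name split_aligned_columns is made up for illustration, and it assumes that no field contains two or more consecutive whitespace characters.

```python
import re

# Hypothetical helper; assumes fields never contain 2+ consecutive whitespace characters.
COLUMN_GAP = re.compile(r'\s{2,}')

def split_aligned_columns(lines):
    """Split each non-blank line on runs of two or more whitespace characters."""
    rows = []
    for line in lines:
        stripped = line.strip()  # drop the trailing padding after the last column
        if stripped:             # skip blank lines
            rows.append(COLUMN_GAP.split(stripped))
    return rows

lines = ["Blah blah, blah               bao     123456     ",
         "hello, hello, hello           miao    299292929  "]
rows = split_aligned_columns(lines)
print(rows)
```

For a real file, the list of lines could come from iterating over the open file object instead of a hard-coded list.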

Answer 2

You can simply split by index. You can either hard-code the indexes or detect them:

l=["Blah blah, blah               bao     123456     ",   
   "hello, hello, hello           miao    299292929  "]

def detect_column_indexes( list_of_lines ):
    indexes=[0]
    transitions= [col.count(' ')==len(list_of_lines) for col in zip(*list_of_lines)]
    last=False
    for i, x in enumerate(transitions):
        if not x and last:
            indexes.append(i)
        last=x
    indexes.append( len(list_of_lines[0])+1 )
    return indexes

def split_line_by_indexes( indexes, line ):
    tokens=[]
    for i1,i2 in zip(indexes[:-1], indexes[1:]): #pairs
        tokens.append( line[i1:i2].rstrip() )
    return tokens

indexes= detect_column_indexes( l )
parsed= [split_line_by_indexes(indexes, line) for line in l] 
print(indexes)
print(parsed)

Output:

[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]

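Obviously, it is not possible to tell trailing whitespace apart from padding in each column, but using rstrip instead of strip lets you detect leading whitespace.

This method is not foolproof, but it is more robust than detecting two consecutive spaces.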
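
To illustrate the robustness claim, here is a hedged comparison on a made-up variant of the sample data in which the first field itself contains two consecutive spaces. The functions are a compact restatement of the index-based idea rather than the exact code above (the last index is the line length instead of length + 1), and make_line is a hypothetical helper used only to build correctly padded test lines.

```python
import re

def detect_column_indexes(lines):
    # A column starts wherever a non-blank column follows a column
    # that is a space in every line.
    all_space = [set(col) == {' '} for col in zip(*lines)]
    indexes = [0]
    for i in range(1, len(all_space)):
        if all_space[i - 1] and not all_space[i]:
            indexes.append(i)
    indexes.append(max(len(line) for line in lines))
    return indexes

def split_line_by_indexes(indexes, line):
    return [line[a:b].rstrip() for a, b in zip(indexes[:-1], indexes[1:])]

def make_line(fields, widths=(30, 8, 11)):
    # Hypothetical helper: pad each field to its column width.
    return "".join(field.ljust(width) for field, width in zip(fields, widths))

lines = [make_line(("Blah  blah, blah", "bao", "123456")),  # note the double space
         make_line(("hello, hello, hello", "miao", "299292929"))]

by_regex = [re.split(r'\s{2,}', line.strip()) for line in lines]
indexes = detect_column_indexes(lines)
by_index = [split_line_by_indexes(indexes, line) for line in lines]

print(by_regex[0])  # the first field is broken into two pieces
print(by_index[0])  # the first field stays intact
```

The index-based version survives here only because the other line has non-space characters at the same positions; if every line had spaces in the same columns inside a field, it would be split as well.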

Answer 3

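If you know the width of each field, it is easy. The first field is 30 characters wide, the second is 8 characters, and the last is 11 characters. So you can do something like this: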

line = 'Blah blah, blah               bao     123456     '
parts = [line[:30].strip(), line[30:38].strip(), line[38:].strip()]
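
The slice boundaries can also be derived from the widths instead of being typed by hand, which avoids off-by-one mistakes. A minimal sketch, assuming the widths 30, 8 and 11 from this answer; split_fixed_width is a name invented for the example.

```python
from itertools import accumulate

def split_fixed_width(line, widths):
    """Slice a line into fields of the given widths and strip the padding."""
    ends = list(accumulate(widths))  # running totals: 30, 38, 49
    starts = [0] + ends[:-1]         # each field starts where the previous one ends
    return [line[start:end].strip() for start, end in zip(starts, ends)]

WIDTHS = (30, 8, 11)  # field widths taken from the answer text
line = "Blah blah, blah".ljust(30) + "bao".ljust(8) + "123456".ljust(11)
parts = split_fixed_width(line, WIDTHS)
print(parts)
```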

Answer 4

Use the re module:

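import re
# l is the list of lines from the question.
# Strip each line first so the trailing spaces do not produce an empty last field
# (list.remove() returns None, so it cannot be used inside the print call).
l1 = re.split('  +', l[0].strip())
l2 = re.split('  +', l[1].strip())
print([l1, l2])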

Solution 1
6, accepted, 2014-05-14 10:33:02

Solution 2
1, 2014-05-14 10:29:59

Solution 3
1, 2014-05-14 10:31:42

Solution 4
1, 2014-05-14 10:36:27
