
Split lines with multiple words in Python

I have a (very ugly) txt output from an SQL query, which is performed by an external system that I can't change. Here is an example of the output:

FruitName      Owner             OwnerPhone
=============  ================= ============
Red Apple      Sr Lorem Ipsum    123123
Yellow Banana  Ms Dolor sir Amet 456456

As you can see, the FruitName and Owner columns may each consist of several words, and there is no fixed pattern to how many words a column contains. If I use line.split() to turn each line into a list, Python splits on every run of whitespace and the result looks like this:

['Red', 'Apple', 'Sr', 'Lorem', 'Ipsum', '123123']
['Yellow', 'Banana', 'Ms', 'Dolor', 'sir', 'Amet', '456456']

The question is, how can I split it properly into output like this:

['Red Apple', 'Sr Lorem Ipsum', '123123']
['Yellow Banana', 'Ms Dolor sir Amet', '456456']

I'm a newbie in Python and I don't know whether such a thing is possible. Any help will be very much appreciated. Thanks!

You can use the ==== dividers to your advantage: find the start and end index of each ==== run, which represents one column, and take the corresponding slice from every line:

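def get_divider_indices(line):
  # Return a (start, end) index pair for each run of '=' in the divider line.
  i, j = 0, line.index(' ')
  indices = []
  while i != -1:
    indices.append((i, j))
    i = line.find('=', j)  # start of the next column, or -1 if none is left
    j = line.find(' ', i)  # end of that column
    if j == -1: j = len(line)  # the last column runs to the end of the line
  return indices

with open('data.txt', 'r') as f:
  lines = f.readlines()

dividers = get_divider_indices(lines[1])
rows = []
for line in lines[2:]:
  # Slice every data line at the column boundaries and trim the padding.
  rows.append([line[s:e].strip() for s, e in dividers])

print(rows)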

Output

[['Red Apple', 'Sr Lorem Ipsum', '123123'], ['Yellow Banana', 'Ms Dolor sir Amet', '456456']]

Note that str.find() returns the index of a substring within a string, or -1 if it is not found (I use it above to locate each = and each space in the divider line).
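As a small illustration of that note, the sketch below runs str.find() on a divider line shaped like the one in the question. The variable names are only for this example, and the divider is rebuilt from the column widths (13, 17 and 12) so the spacing is explicit.

```python
# Divider line shaped like the one in the question:
# 13 '=' + two spaces + 17 '=' + one space + 12 '='
divider = '=' * 13 + '  ' + '=' * 17 + ' ' + '=' * 12

first_space = divider.find(' ')               # end of the first column
second_col = divider.find('=', first_space)   # search starts at first_space
missing = divider.find('#')                   # no '#' present, so -1

print(first_space, second_col, missing)  # prints: 13 15 -1
```

Unlike str.index(), which raises ValueError when nothing matches, str.find() signals "not found" with -1, which is what lets the loop above terminate.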

The columns have fixed widths, so you can slice each line at those widths:

data = '''FruitName      Owner             OwnerPhone
=============  ================= ============
Red Apple      Sr Lorem Ipsum    123123
Yellow Banana  Ms Dolor sir Amet 456456'''

lines = data.split('\n')

for line in lines[2:]:
    fruit = line[:13].strip()
    owner = line[13:32].strip()
    phone = line[32:].strip()
    print([fruit, owner, phone])

A more general solution would use the second line (the one with ===) to calculate the column widths and use them for slicing.
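One possible way to derive the slices from the divider line, instead of hard-coding them, is sketched below. It uses re.finditer, which is my choice for this illustration rather than something from the answer, and the function name split_fixed_width is made up. It assumes that no value is wider than the = run above it.

```python
import re

data = '''FruitName      Owner             OwnerPhone
=============  ================= ============
Red Apple      Sr Lorem Ipsum    123123
Yellow Banana  Ms Dolor sir Amet 456456'''


def split_fixed_width(text):
    """Split the rows of a fixed-width table using the '=' divider on line 2."""
    lines = text.splitlines()
    # Each run of '=' marks one column; span() gives its (start, end) offsets.
    spans = [m.span() for m in re.finditer(r'=+', lines[1])]
    # Slice every non-empty data line at those offsets and trim the padding.
    return [[line[start:end].strip() for start, end in spans]
            for line in lines[2:] if line.strip()]


rows = split_fixed_width(data)
print(rows)
```

Because the offsets come from the divider itself, the same function keeps working if the external system changes the column widths, as long as it still prints the divider line.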

As others have suggested, you can use the length of each divider to work out where each column begins and ends. The following example illustrates this:

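rows = []
with open('data.txt', 'r') as f:
    lines = f.readlines()

divider_line = lines[1]
dividers = divider_line.split()

# Locate each run of '=' in the divider line. The gaps between columns
# are not always one space wide, so search for each run instead of
# assuming a fixed gap.
spans = []
pos = 0
for d in dividers:
    start = divider_line.index(d, pos)
    pos = start + len(d)
    spans.append((start, pos))

for line in lines[2:]:
    rows.append([line[start:end].strip() for start, end in spans])
print(rows)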

Output

[['Red Apple', 'Sr Lorem Ipsum', '123123'], ['Yellow Banana', 'Ms Dolor sir Amet', '456456']]


You can also check whether the columns are separated by tabs, i.e. '\t'. If so, you can simply split each line with line.split('\t'), which would be much simpler.
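A minimal sketch of that check follows. The helper name split_row and the tab-separated sample row are hypothetical, since the output shown in the question appears to be padded with spaces rather than tabs.

```python
def split_row(line):
    """Split on tabs when the line contains any; otherwise return None."""
    line = line.rstrip('\n')
    if '\t' in line:
        return [field.strip() for field in line.split('\t')]
    return None  # no tabs: fall back to slicing by column position


sample = 'Red Apple\tSr Lorem Ipsum\t123123\n'  # hypothetical tab-separated row
print(split_row(sample))
print(split_row('Red Apple      Sr Lorem Ipsum    123123'))
```

Returning None for lines without tabs makes it easy to fall back to one of the slicing approaches above.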
