Split lines with multiple words in Python
I have a (very ugly) txt output from an SQL query that is performed by an external system I can't change. Here is an example of the output:
FruitName     Owner             OwnerPhone
============= ================= ============
Red Apple     Sr Lorem Ipsum    123123
Yellow Banana Ms Dolor sir Amet 456456
As you can see, the FruitName column and the Owner column may consist of several words, and there is no fixed pattern in how many words these columns can contain. If I use line.split() to make an array from each line in Python, it splits on every run of whitespace and the arrays end up like this:
['Red', 'Apple', 'Sr', 'Lorem', 'Ipsum', '123123']
['Yellow', 'Banana', 'Ms', 'Dolor', 'sir', 'Amet', '456456']
The question is, how can I split it properly into output like this:
['Red Apple', 'Sr Lorem Ipsum', '123123']
['Yellow Banana', 'Ms Dolor sir Amet', '456456']
I'm a newbie in Python and I don't know whether such a thing is possible. Any help will be very much appreciated. Thanks!
You can use the ==== dividers to your advantage: each run of ==== represents a column, so you can slice every line at the start and end indices of those runs:
def get_divider_indices(line):
    # Return a (start, end) index pair for each run of '=' in the divider line.
    i, j = 0, line.index(' ')
    indices = []
    while i != -1:
        indices.append((i, j))
        i = line.find('=', j)
        j = line.find(' ', i)
        if j == -1: j = len(line)
    return indices

with open('data.txt', 'r') as f:
    lines = f.readlines()

dividers = get_divider_indices(lines[1])
rows = []
for line in lines[2:]:
    rows.append([line[s:e].strip() for s, e in dividers])
print(rows)
Output:
[['Red Apple', 'Sr Lorem Ipsum', '123123'], ['Yellow Banana', 'Ms Dolor sir Amet', '456456']]
Note that you can use str.find() to get the index of a character in a string (which I use above to get the index of an = or a space in the divider line).
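As a small illustration of that note, here is a minimal sketch of how str.find() behaves on a divider line with the same widths as the one in the question. The variable names are only illustrative and are not part of the answer's code.

```python
# A divider line with the same widths as the one in the question.
divider = "=" * 13 + " " + "=" * 17 + " " + "=" * 12

# Index of the first space, i.e. the end of the first column.
first_space = divider.find(" ")

# The optional second argument tells find() where to start searching.
next_equals = divider.find("=", first_space)

# find() returns -1 when nothing matches (str.index() would raise ValueError).
missing = divider.find("#")

print(first_space, next_equals, missing)
```

The -1 return value is what lets the while loop in get_divider_indices() stop once no further = is found.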
The columns have fixed widths, so you can use that and slice the lines:
data = '''FruitName     Owner             OwnerPhone
============= ================= ============
Red Apple     Sr Lorem Ipsum    123123
Yellow Banana Ms Dolor sir Amet 456456'''

lines = data.split('\n')

for line in lines[2:]:
    fruit = line[:13].strip()
    owner = line[13:32].strip()
    phone = line[32:].strip()
    print([fruit, owner, phone])
A more complex solution would use the second line (the one with ===) to calculate the column widths and use them in slicing.
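One possible way to sketch that idea is to locate each run of = with re.finditer() and reuse the match spans as slice boundaries. The helper names column_spans and split_fixed_width are made up for this illustration, and the sample lines are rebuilt with ljust() on the assumption that the real file is padded to the divider widths.

```python
import re

def column_spans(divider):
    """Return (start, end) index pairs, one per run of '=' in the divider line."""
    return [m.span() for m in re.finditer(r"=+", divider)]

def split_fixed_width(lines):
    """Slice every data line using the spans taken from the second line."""
    spans = column_spans(lines[1])
    return [[line[s:e].strip() for s, e in spans] for line in lines[2:]]

# Sample data rebuilt with the same column widths as in the question.
lines = [
    "FruitName".ljust(14) + "Owner".ljust(18) + "OwnerPhone",
    "=" * 13 + " " + "=" * 17 + " " + "=" * 12,
    "Red Apple".ljust(14) + "Sr Lorem Ipsum".ljust(18) + "123123",
    "Yellow Banana".ljust(14) + "Ms Dolor sir Amet".ljust(18) + "456456",
]

rows = split_fixed_width(lines)
print(rows)
```

With this approach nothing is hard-coded, so the same code should keep working if the external system changes the column widths. A value wider than its divider would be cut off, though, so it only fits output that is truly fixed-width.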
As suggested by others, you can use the length of each divider to calculate the width of the columns. The following example illustrates just that:
rows = list()
with open('data.txt', 'r') as f:
    lines = f.readlines()

# Each run of '=' in the second line gives the width of one column.
dividers = lines[1].split()
for line in lines[2:]:
    row = []
    prvLength = 0
    for d in dividers:
        start = prvLength
        # Column width plus the single space that separates columns.
        length = len(d) + 1
        row.append(line[start:start+length].strip())
        prvLength += length
    rows.append(row)
print(rows)
Output:
[['Red Apple', 'Sr Lorem Ipsum', '123123'], ['Yellow Banana', 'Ms Dolor sir Amet', '456456']]
If the columns are actually separated by TABS, i.e. '\t', you can just split each line in lines using line.split('\t'), which would be much simpler.
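A minimal sketch of the tab-based approach follows. It assumes a hypothetical tab-separated version of the data with no divider line; the question does not confirm that the real file looks like this. Printing repr(line) for one line of the real file is a quick way to check whether it contains '\t' at all.

```python
# Hypothetical tab-separated version of the data; the real file may differ.
lines = [
    "FruitName\tOwner\tOwnerPhone\n",
    "Red Apple\tSr Lorem Ipsum\t123123\n",
    "Yellow Banana\tMs Dolor sir Amet\t456456\n",
]

# Skip the header, drop the trailing newline, then split on tabs only,
# so the spaces inside a value are left alone.
rows = [line.rstrip("\n").split("\t") for line in lines[1:]]
print(rows)
```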