
Python - Reading multiple lines into list

OK guys/gals, stuck again on something simple.
I have a text file which has multiple lines per entry; the data is in the following format:

firstword word word word
wordx word word word interesting1 word word word word
wordy word word word
wordz word word word interesting2 word word word lastword

This sequence repeats a hundred or so times. All other words are the same apart from interesting1 and interesting2, and there are no blank lines. The interesting2 is pertinent to interesting1 but not to anything else, and I want to link the two interesting items together, discarding the rest, such as:

interesting1 = interesting2
interesting1 = interesting2
interesting1 = interesting2
etc., 1 line per sequence.

Each line begins with a different word.
My attempt was to read the file and do an "if wordx in line" statement to identify the first interesting line, slice out the value, find the second line ("if wordz in line"), slice out the value and concatenate the second with the first.
It's clumsy though: I had to use global variables, temp variables etc., and I'm sure there must be a way of identifying the range between firstword and lastword and placing that into a single list, then slicing both values out together.

Any suggestions gratefully acknowledged, thanks for your time.

from itertools import izip, tee, islice

i1, i2 = tee(open("foo.txt"))

for line2, line4 in izip(islice(i1,1, None, 4), islice(i2, 3, None, 4)) :
    print line2.split(" ")[4], "=", line4.split(" ")[4]
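
For readers on Python 3, where izip is gone and print is a function, the same pairing idea might look like the sketch below. The helper name pair_interesting, its parameters and the sample text are made up for illustration. It assumes every entry is exactly four lines long and that the interesting value is the fifth whitespace-separated word on lines 2 and 4.

```python
from itertools import islice

# Made-up sample data: two entries of four lines each.
SAMPLE = """\
firstword word word word
wordx word word word alpha word word word word
wordy word word word
wordz word word word beta word word word lastword
firstword word word word
wordx word word word gamma word word word word
wordy word word word
wordz word word word delta word word word lastword
"""


def pair_interesting(lines, block_size=4, first_row=1, second_row=3, column=4):
    """Yield (interesting1, interesting2) for each block of block_size lines.

    Assumption: the layout is perfectly regular, with no blank or stray lines.
    """
    lines = list(lines)  # materialise once so both slices see every line
    firsts = islice(lines, first_row, None, block_size)
    seconds = islice(lines, second_row, None, block_size)
    for line2, line4 in zip(firsts, seconds):
        # split() with no argument tolerates repeated spaces and tabs
        yield line2.split()[column], line4.split()[column]


pairs = list(pair_interesting(SAMPLE.splitlines()))
for left, right in pairs:
    print(left, "=", right)
```

With a real file, passing the open file object (inside a with block) in place of SAMPLE.splitlines() should work the same way, since the helper only needs an iterable of lines.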

In that case, make a regexp that matches the repeating text and has groups for the interesting bits. Then you should be able to use findall to find all cases of interesting1 and interesting2.

Like so:

import re

text = open("foo.txt").read()
RE = re.compile('firstword.*?wordx word word word (.*?) word.*?wordz word word word (.*?) word', re.DOTALL)
print RE.findall(text)

Although, as mentioned in the comments, the islice is definitely a neater solution.
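
A Python 3 variation on the regexp idea, offered only as a sketch. The group names first and second, the helper extract_pairs and the sample text are illustrative and not part of the answer above. It assumes wordx and wordz each start a line and that exactly three words sit between them and the interesting value.

```python
import re

# Made-up sample data: two entries of four lines each.
SAMPLE = """\
firstword word word word
wordx word word word alpha word word word word
wordy word word word
wordz word word word beta word word word lastword
firstword word word word
wordx word word word gamma word word word word
wordy word word word
wordz word word word delta word word word lastword
"""

# MULTILINE lets ^ anchor at each line start; DOTALL lets .*? cross newlines.
PATTERN = re.compile(
    r"^wordx(?:\s+\S+){3}\s+(?P<first>\S+)"     # 5th word of the wordx line
    r".*?"                                      # skip ahead lazily
    r"^wordz(?:\s+\S+){3}\s+(?P<second>\S+)",   # 5th word of the wordz line
    re.DOTALL | re.MULTILINE,
)


def extract_pairs(text):
    """Return a list of (interesting1, interesting2) tuples found in text."""
    return [(m.group("first"), m.group("second")) for m in PATTERN.finditer(text)]


pairs = extract_pairs(SAMPLE)
for left, right in pairs:
    print(left, "=", right)
```

Matching on the line-leading keywords rather than on the literal filler words keeps the pattern short, at the cost of checking less of the layout.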

I've thrown in a bagful of assertions to check the regularity of your data layout.

C:\SO>type words.py

# sample pseudo-file contents
guff = """\
firstword word word word
wordx word word word interesting1-1 word word word word
wordy word word word
wordz word word word interesting2-1 word word word lastword

miscellaneous rubbish

firstword word word word
wordx word word word interesting1-2 word word word word
wordy word word word
wordz word word word interesting2-2 word word word lastword
firstword word word word
wordx word word word interesting1-3 word word word word
wordy word word word
wordz word word word interesting2-3 word word word lastword

"""

# change the RHS of each of these to reflect reality
FIRSTWORD = 'firstword'
WORDX = 'wordx'
WORDY = 'wordy'
WORDZ = 'wordz'
LASTWORD = 'lastword'

from StringIO import StringIO
f = StringIO(guff)

while True:
    a = f.readline()
    if not a: break # end of file
    a = a.split()
    if not a: continue # empty line
    if a[0] != FIRSTWORD: continue # skip extraneous matter
    assert len(a) == 4
    b = f.readline().split(); assert len(b) == 9
    c = f.readline().split(); assert len(c) == 4
    d = f.readline().split(); assert len(d) == 9
    assert a[0] == FIRSTWORD
    assert b[0] == WORDX
    assert c[0] == WORDY
    assert d[0] == WORDZ
    assert d[-1] == LASTWORD
    print b[4], d[4]

C:\SO>\python26\python words.py
interesting1-1 interesting2-1
interesting1-2 interesting2-2
interesting1-3 interesting2-3

C:\SO>
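
The question's own idea, collecting everything from firstword to lastword into a single list and then picking both values out of it, could be sketched in Python 3 roughly as follows. The names iter_blocks and interesting_pairs are hypothetical, and the marker words and column index are the placeholders used in the question, to be changed to reflect the real data.

```python
# Made-up sample data, including a blank line and a stray line between entries.
SAMPLE = """\
firstword word word word
wordx word word word interesting1-1 word word word word
wordy word word word
wordz word word word interesting2-1 word word word lastword

miscellaneous rubbish
firstword word word word
wordx word word word interesting1-2 word word word word
wordy word word word
wordz word word word interesting2-2 word word word lastword
"""


def iter_blocks(lines, first="firstword", last="lastword"):
    """Yield one list of split lines per entry, from `first` through `last`."""
    block = None
    for line in lines:
        words = line.split()
        if not words:
            continue                # skip empty lines
        if words[0] == first:
            block = []              # a new entry starts here
        if block is not None:
            block.append(words)
            if words[-1] == last:   # entry complete
                yield block
                block = None        # ignore anything until the next `first`


def interesting_pairs(lines, column=4):
    """Yield (interesting1, interesting2), looked up by each line's first word."""
    for block in iter_blocks(lines):
        by_key = {words[0]: words for words in block}
        yield by_key["wordx"][column], by_key["wordz"][column]


pairs = list(interesting_pairs(SAMPLE.splitlines()))
for left, right in pairs:
    print(left, "=", right)
```

Because each entry is held in its own list, no global or temporary state has to survive between entries, which is the clumsiness the question describes.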
