
Python: How to loop through blocks of lines

How do I go through blocks of lines separated by an empty line? The file looks like the following:

ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25

I want to loop through the blocks and grab the fields Name, Family name and Age in a list of 3 columns:

Y X 20
F H 23
Y S 13
Z M 25

Here's another way, using itertools.groupby. The function groupby iterates through the lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns either True or False (called the key), and itertools.groupby then groups all the consecutive lines that yielded the same True or False result.

This is a very convenient way to collect lines into groups.
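To make the grouping concrete, here is a minimal sketch on an in-memory list rather than a file. The names lines, is_separator and grouped are made up for this illustration and are not part of the answer's code.

```python
import itertools

# Hypothetical sample: two short blocks separated by one empty line.
lines = ["ID: 1\n", "Name: X\n", "\n", "ID: 2\n", "Name: H\n"]

def is_separator(line):
    # Key function: True for the empty separator line, False for data lines.
    return line == "\n"

# Each (key, group) pair covers one run of consecutive lines with the same key.
grouped = [(key, list(group)) for key, group in itertools.groupby(lines, is_separator)]

for key, group in grouped:
    print(key, group)
```

The runs with key False are the data blocks; the runs with key True contain only separator lines and can be skipped.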

import itertools

def isa_group_separator(line):
    return line=='\n'

with open('data_file') as f:
    for key,group in itertools.groupby(f,isa_group_separator):
        # print(key,list(group))  # uncomment to see what itertools.groupby does.
        if not key:
            data={}
            for item in group:
                field,value=item.split(':')
                value=value.strip()
                data[field]=value
            print('{FamilyN} {Name} {Age}'.format(**data))

# Y X 20
# F H 23
# Y S 13
# Z M 25

import re

# subject holds the whole file as a single string, e.g. open('data_file').read()
result = re.findall(
    r"""(?mx)           # multiline, verbose regex
    ^ID:.*\s*           # Match ID: and anything else on that line
    Name:\s*(.*)\s*     # Match name, capture all characters on this line
    FamilyN:\s*(.*)\s*  # etc. for family name
    Age:\s*(.*)$        # and age""",
    subject)

Result will then be

[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

which can be trivially changed into whatever string representation you want.
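As a small sketch of that last step, assuming the tuples come back in (name, family name, age) order as shown above, the requested three-column output could be produced like this. The variable rows is made up for this illustration.

```python
# The list returned by re.findall in the answer above.
result = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

# Reorder each tuple to "FamilyN Name Age", the layout asked for in the question.
rows = ["{} {} {}".format(family, name, age) for name, family, age in result]

print("\n".join(rows))
```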

Use a generator.

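# start_pattern(line) is a function you supply; it returns True for a line
# that marks the boundary between blocks (here, an empty line).
def blocks( iterable ):
    accumulator= []
    for line in iterable:
        if start_pattern( line ):
            if accumulator:
                yield accumulator
                accumulator= []
        # elif other significant patterns
        else:
            accumulator.append( line )
    # yield the final block, which has no boundary line after it
    if accumulator:
        yield accumulator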
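One way this generator might be used, as a self-contained sketch. The answer does not define start_pattern, so treating a blank line as the boundary is an assumption here, and sample and rows are names made up for the example.

```python
def start_pattern(line):
    # Assumption: a blank (whitespace-only) line separates two blocks.
    return not line.strip()

def blocks(iterable):
    accumulator = []
    for line in iterable:
        if start_pattern(line):
            if accumulator:
                yield accumulator
                accumulator = []
        else:
            accumulator.append(line)
    if accumulator:
        yield accumulator

# Hypothetical input; an open file object could be passed in the same way.
sample = ["ID: 1\n", "Name: X\n", "FamilyN: Y\n", "Age: 20\n", "\n",
          "ID: 2\n", "Name: H\n", "FamilyN: F\n", "Age: 23\n"]

rows = []
for block in blocks(sample):
    # Turn "Key: value" lines into a dict, then format the wanted columns.
    data = dict(line.strip().split(": ", 1) for line in block)
    rows.append("{FamilyN} {Name} {Age}".format(**data))

print("\n".join(rows))
```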

If the file is not huge, you can read the whole file with:

content = open(filename).read()

then you can split content into blocks using:

blocks = content.split('\n\n')

Now you can write a function to parse a block of text. I would use split('\n') to get the lines of a block and split(':') to get each key and value, possibly with str.strip() or some help from regular expressions.

Without checking whether each block has the required data, the code can look like this:

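with open('data.txt', 'r') as f:
    content = f.read()
# strip() drops a trailing newline that would otherwise leave an empty
# line in the last block and break the unpacking below
for block in content.strip().split('\n\n'):
    person = {}
    for l in block.split('\n'):
        k, v = l.split(': ', 1)
        person[k] = v
    print('%s %s %s' % (person['FamilyN'], person['Name'], person['Age']))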
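If you do want the check, a minimal sketch could look like the following. The helper parse_block, the REQUIRED tuple and the sample content string are made up for this illustration, and skipping incomplete blocks is only one possible policy.

```python
# Fields that must be present for a block to be printed (an assumption).
REQUIRED = ("FamilyN", "Name", "Age")

def parse_block(block):
    """Return a dict for one block, or None if a required field is missing."""
    person = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon at all
            person[key.strip()] = value.strip()
    if all(field in person for field in REQUIRED):
        return person
    return None

# Hypothetical content: the second block lacks FamilyN and Age.
content = ("ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\n"
           "ID: 2\nName: H\n\n"
           "ID: 3\nName: S\nFamilyN: Y\nAge: 13\n")

people = [p for p in map(parse_block, content.split("\n\n")) if p is not None]
rows = ["%s %s %s" % (p["FamilyN"], p["Name"], p["Age"]) for p in people]
print("\n".join(rows))
```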

If your file is too large to read into memory all at once, you can still use a regular-expression-based solution by using a memory-mapped file, with the mmap module:

import sys
import re
import os
import mmap

# mmap exposes the file as bytes, so the pattern must be a bytes pattern
block_expr = re.compile(rb'ID:.*?\nAge: \d+', re.DOTALL)

filepath = sys.argv[1]
with open(filepath, 'rb') as fp:
    contents = mmap.mmap(fp.fileno(), os.stat(filepath).st_size, access=mmap.ACCESS_READ)

    for block_match in block_expr.finditer(contents):
        print(block_match.group().decode())

The mmap trick provides a "pretend string" that lets regular expressions work on the file without having to read it all into one large string. And the finditer() method of the regular expression object yields matches one at a time, without creating a list of all matches at once (which findall() does).

I do think this solution is overkill for this use case, however (still, it's a nice trick to know...)

import itertools

# Assuming input in file input.txt
data = open('input.txt').readlines()

records = (lines for valid, lines in itertools.groupby(data, lambda l : l != '\n') if valid)    
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records]

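# Alternatively, build output lazily as a generator instead of the list above
# (records can only be consumed once, so use one form or the other):
# output = (tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records)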

# output = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]    
#You can iterate and change the order of elements in the way you want    
# [(elem[1], elem[0], elem[2]) for elem in output] as required in your output

This answer isn't necessarily better than what's already been posted, but it might be useful as an illustration of how I approach problems like this, especially if you're not used to working with Python's interactive interpreter.

I started out knowing two things about this problem. First, I'm going to use itertools.groupby to group the input into lists of data lines, one list for each individual data record. Second, I want to represent those records as dictionaries so that I can easily format the output.

One other thing this shows is how using generators makes it easy to break a problem like this down into small parts.

>>> # first let's create some useful test data and put it into something 
>>> # we can easily iterate over:
>>> data = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25"""
>>> data = data.split("\n")
>>> # now we need a key function for itertools.groupby.
>>> # the key we'll be grouping by is, essentially, whether or not
>>> # the line is empty.
>>> # this will make groupby return groups whose key is True if we
>>> # care about them.
>>> def is_data(line):
        return True if line.strip() else False

>>> # make sure this really works
>>> "\n".join([line for line in data if is_data(line)])
'ID: 1\nName: X\nFamilyN: Y\nAge: 20\nID: 2\nName: H\nFamilyN: F\nAge: 23\nID: 3\nName: S\nFamilyN: Y\nAge: 13\nID: 4\nName: M\nFamilyN: Z\nAge: 25'

>>> # does groupby return what we expect?
>>> import itertools
>>> [list(value) for (key, value) in itertools.groupby(data, is_data) if key]
[['ID: 1', 'Name: X', 'FamilyN: Y', 'Age: 20'], ['ID: 2', 'Name: H', 'FamilyN: F', 'Age: 23'], ['ID: 3', 'Name: S', 'FamilyN: Y', 'Age: 13'], ['ID: 4', 'Name: M', 'FamilyN: Z', 'Age: 25']]
>>> # what we really want is for each item in the group to be a tuple
>>> # that's a key/value pair, so that we can easily create a dictionary
>>> # from each item.
>>> def make_key_value_pair(item):
        items = item.split(":")
        return (items[0].strip(), items[1].strip())

>>> make_key_value_pair("a: b")
('a', 'b')
>>> # let's test this:
>>> dict(make_key_value_pair(item) for item in ["a:1", "b:2", "c:3"])
{'a': '1', 'b': '2', 'c': '3'}
>>> # we could conceivably do all this in one line of code, but this 
>>> # will be much more readable as a function:
>>> def get_data_as_dicts(data):
        for (key, value) in itertools.groupby(data, is_data):
            if key:
                yield dict(make_key_value_pair(item) for item in value)

>>> list(get_data_as_dicts(data))
[{'ID': '1', 'Name': 'X', 'FamilyN': 'Y', 'Age': '20'}, {'ID': '2', 'Name': 'H', 'FamilyN': 'F', 'Age': '23'}, {'ID': '3', 'Name': 'S', 'FamilyN': 'Y', 'Age': '13'}, {'ID': '4', 'Name': 'M', 'FamilyN': 'Z', 'Age': '25'}]
>>> # now for an old trick:  using a list of column names to drive the output.
>>> columns = ["Name", "FamilyN", "Age"]
>>> print("\n".join(" ".join(d[c] for c in columns) for d in get_data_as_dicts(data)))
X Y 20
H F 23
S Y 13
M Z 25
>>> # okay, let's package this all into one function that takes a filename
>>> def get_formatted_data(filename):
        with open(filename, "r") as f:
            columns = ["Name", "FamilyN", "Age"]
            for d in get_data_as_dicts(f):
                yield " ".join(d[c] for c in columns)

>>> print("\n".join(get_formatted_data("c:\\temp\\test_data.txt")))
X Y 20
H F 23
S Y 13
M Z 25

Use a dict, namedtuple, or custom class to store each attribute as you read it, then append the object to a list when you reach an empty line or EOF.
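A minimal sketch of that idea with a namedtuple, assuming every block contains exactly the four fields shown in the question (Person(**fields) raises TypeError otherwise). The names Person, read_people and sample are made up for this illustration.

```python
from collections import namedtuple

# One record per block; field names match the keys in the file.
Person = namedtuple("Person", ["ID", "Name", "FamilyN", "Age"])

def read_people(lines):
    """Collect one Person per block of 'Key: value' lines."""
    people = []
    fields = {}
    for line in lines:
        line = line.strip()
        if line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        elif fields:
            # An empty line ends the current block.
            people.append(Person(**fields))
            fields = {}
    if fields:
        # EOF reached without a trailing empty line.
        people.append(Person(**fields))
    return people

# Hypothetical input; an open file object could be passed in the same way.
sample = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23""".splitlines()

people = read_people(sample)
for p in people:
    print(p.FamilyN, p.Name, p.Age)
```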

Simple solution:

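# content is the whole file as one string; strip() keeps a trailing newline
# from adding an empty line to the last record, which would get it skipped
result = []
for record in content.strip().split('\n\n'):
    try:
        _id, name, familyn, age = map(lambda rec: rec.split(' ', 1)[1], record.split('\n'))
    except (ValueError, IndexError):
        pass  # skip malformed records
    else:
        result.append((familyn, name, age))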

Along with the half-dozen other solutions I already see here, I'm a bit surprised that no one has been so simple-minded (that is, generator-, regex-, map-, and read-free) as to propose, for example,

fp = open(fn)  # fn is the path to the data file
def get_one_value():
    line = fp.readline()
    if not line:
        return None  # end of file
    parts = line.split(':')
    if 2 != len(parts):
        return ''  # e.g. the empty separator line
    return parts[1].strip()

# The result is supposed to be a list.
result = []
while True:
    # We don't care about the ID.
    if get_one_value() is None:
        break
    name = get_one_value()
    familyn = get_one_value()
    age = get_one_value()
    result.append((name, familyn, age))
    # We don't care about the block separator.
    if get_one_value() is None:
        break

for item in result:
    print(item)

Re-format to taste.
