Filtering with numpy.genfromtxt

Question

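I have a file from which I only need to read certain values into an array. The file is divided into sections by lines that specify a TIMESTEP value. I need the section of data that follows the highest TIMESTEP in the file.

The files will contain 200,000 lines, but for any given file I won't know which line the section I need begins on, nor what the largest TIMESTEP value will be.

I'm assuming that if I could find the line number of the largest TIMESTEP, I could import starting at that line. All of these TIMESTEP lines begin with a space character. Any ideas on how to proceed would be helpful.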

Sample file

 headerline 1 to skip
 headerline 2 to skip
 headerline 3 to skip
 TIMESTEP =    0.00000000    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
 TIMESTEP =   0.119999997    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
3,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
 TIMESTEP =    3.00000000    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0

Basic code

import numpy as np

with open('myfile.txt') as f_in:
  data = np.genfromtxt(f_in, skip_header=3, comments=" ")

Answer 1

You can use filter() to pick out exactly the lines you want while using genfromtxt(), because genfromtxt accepts generators.

import numpy as np

with open('myfile.txt', 'rb') as f_in:
    # Header and TIMESTEP lines all start with a space; keep only the data rows.
    lines = filter(lambda x: not x.startswith(b' '), f_in)
    data = np.genfromtxt(lines, delimiter=',')

In your case you then don't need skip_header, since the header lines also start with a space and are filtered out.
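Note that the filter above keeps the data rows of every block, not only the one after the highest TIMESTEP. A minimal sketch of the two-pass idea from the question (find the line of the largest TIMESTEP, then read from there) might look like the following. The names find_max_timestep, load_max_block and open_file are hypothetical helpers, SAMPLE is made-up data in the same layout as the sample file, and the sketch assumes that every non-data line starts with a space.

```python
import io
from itertools import islice, takewhile

import numpy as np

# Made-up data in the same layout as the sample file (illustration only).
SAMPLE = """\
 headerline 1 to skip
 headerline 2 to skip
 TIMESTEP =    0.00000000
0, 1.0, 1.0
1, 1.0, 1.0
 TIMESTEP =    3.00000000
0, 3.0, 3.0
1, 3.0, 3.0
2, 3.0, 3.0
 TIMESTEP =   0.119999997
0, 2.0, 2.0
"""


def find_max_timestep(lines):
    """Return (line_index, value) of the TIMESTEP line with the largest value."""
    best_index, best_value = None, float("-inf")
    for index, line in enumerate(lines):
        if "TIMESTEP" in line:
            value = float(line.split("=")[1])
            if value > best_value:
                best_index, best_value = index, value
    return best_index, best_value


def load_max_block(open_file):
    """Load the block after the largest TIMESTEP.

    open_file is a hypothetical zero-argument callable that returns a fresh
    file-like object each time, e.g. lambda: open('myfile.txt').
    """
    # First pass: locate the TIMESTEP line with the largest value.
    with open_file() as f:
        start, value = find_max_timestep(f)
    if start is None:
        raise ValueError("no TIMESTEP line found")
    # Second pass: skip to the line after it and stop at the next line that
    # starts with a space (assumed to be the next TIMESTEP line).
    with open_file() as f:
        rows = islice(f, start + 1, None)
        block = takewhile(lambda line: not line.startswith(" "), rows)
        data = np.genfromtxt(block, delimiter=",")
    return value, data


value, data = load_max_block(lambda: io.StringIO(SAMPLE))
print(value)
print(data)
```

For a real file you would call load_max_block(lambda: open('myfile.txt')). The file is read twice, but only one block is ever handed to genfromtxt.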

Answer 2

You can use a custom iterator.

Here is a working example:

from numpy import genfromtxt

class Iter(object):
    ' a custom iterator which returns a timestep and corresponding data '

    def __init__(self, fd):
        self.__fd = fd
        self.__timestep = None
        self.__next_timestep = None
        self.__finish = False
        for _ in self.to_next_timestep(): pass # skip header

    def to_next_timestep(self):
        ' iterate until next timestep '
        for line in self.__fd:
            if 'TIMESTEP' in line:
                self.__timestep = self.__next_timestep
                self.__next_timestep = float(line.split('=')[1])
                return
            yield line
        self.__timestep = self.__next_timestep
        self.__finish = True

    def __iter__(self): return self

    def __next__(self):  # Python 3 iterator protocol
        if self.__finish:
            raise StopIteration
        data = genfromtxt(self.to_next_timestep(), delimiter=',')
        return self.__timestep, data

with open('myfile.txt') as fd:
    blocks = Iter(fd)
    for timestep, data in blocks:
        print(timestep, data)  # data can be selected upon highest timestep
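To make the "select upon highest timestep" step concrete, here is a rough sketch of the same idea written as a generator function and combined with max(). It is not the class above: iter_blocks is a hypothetical name, SAMPLE is made-up data in the sample file's layout, and lines before the first TIMESTEP are assumed to be header lines.

```python
import io

import numpy as np

# Made-up data in the same layout as the sample file (illustration only).
SAMPLE = """\
 headerline 1 to skip
 TIMESTEP =    0.00000000
0, 1.0, 1.0
1, 1.0, 1.0
 TIMESTEP =    3.00000000
0, 5.0, 5.0
1, 5.0, 5.0
 TIMESTEP =   0.119999997
0, 2.0, 2.0
1, 2.0, 2.0
"""


def iter_blocks(lines):
    """Yield (timestep, array) for each TIMESTEP block in an iterable of lines."""
    timestep, block = None, []
    for line in lines:
        if "TIMESTEP" in line:
            # A new block starts: emit the previous one if it has data.
            if timestep is not None and block:
                yield timestep, np.genfromtxt(block, delimiter=",")
            timestep, block = float(line.split("=")[1]), []
        elif timestep is not None:
            # Lines before the first TIMESTEP (the header) are ignored.
            block.append(line)
    if timestep is not None and block:
        yield timestep, np.genfromtxt(block, delimiter=",")


# Compare on the timestep only, so the arrays themselves are never compared.
best_timestep, best_data = max(iter_blocks(io.StringIO(SAMPLE)),
                               key=lambda pair: pair[0])
print(best_timestep)
print(best_data)
```

This parses every block before picking one, which is simple but does more work than strictly needed on a large file.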

Answer 3

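Here is a solution that uses regular Python file reading and applies genfromtxt to a list of lines. For illustration I'm parsing every block, but it can easily be modified to skip blocks that don't meet your timestep criterion.

I first wrote this with StringIO, which is used in many of the genfromtxt doc examples, but all genfromtxt needs is an iterable, so a list of lines works fine.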

import numpy as np
filename = 'stack26008436.txt'

def parse(tstep, block):
    print(tstep)
    print(np.genfromtxt(block, delimiter=','))

with open(filename) as f:
    block = []
    for line in f:
        if 'TIMESTEP' in line:
            if block:
                parse(tstep, block)
            block = []
            tstep = float(line.strip().split('=')[1])
        else:
            if 'header' not in line:
                block.append(line)
    parse(tstep, block)

From your sample this produces:

0901:~/mypy$ python2.7 stack26008436.py
0.0
[[ 0.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 ...
 [ 3.  1.  1.  1.  1.  1.  1.]]
3.0
[[ 0.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 2.  1.  1.  1.  1.  1.  1.]]
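One possible way to do the skipping mentioned above is to buffer lines only while reading a block whose TIMESTEP beats the best one seen so far, and to call genfromtxt once at the end. This is a sketch under assumptions: max_timestep_block is a hypothetical name, SAMPLE is made-up data in the sample file's layout, and anything before the first TIMESTEP line is treated as header.

```python
import io

import numpy as np

# Made-up data in the same layout as the sample file (illustration only).
SAMPLE = """\
 headerline 1 to skip
 headerline 2 to skip
 TIMESTEP =    0.00000000
0, 1.0, 1.0
1, 1.0, 1.0
 TIMESTEP =   0.119999997
0, 2.0, 2.0
1, 2.0, 2.0
 TIMESTEP =    3.00000000
0, 7.0, 7.0
1, 7.0, 7.0
2, 7.0, 7.0
"""


def max_timestep_block(lines):
    """Return (timestep, array) for the block with the largest TIMESTEP."""
    best_tstep = float("-inf")
    best_block = []
    keep = False  # True while reading the block that is currently the best
    for line in lines:
        if "TIMESTEP" in line:
            tstep = float(line.strip().split("=")[1])
            keep = tstep > best_tstep
            if keep:
                # New best: drop the previously buffered lines.
                best_tstep, best_block = tstep, []
        elif keep:
            best_block.append(line)
    if not best_block:
        raise ValueError("no TIMESTEP block with data found")
    # Only the winning block is ever parsed.
    return best_tstep, np.genfromtxt(best_block, delimiter=",")


tstep, data = max_timestep_block(io.StringIO(SAMPLE))
print(tstep)
print(data)
```

With a real file, pass the open file object instead of the StringIO. The file is read once and at most one block of lines is held in memory at a time.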

3 answers

Answer 1: score 2, 2014-09-24 10:06:27
Answer 2: score 1, accepted, 2014-09-24 12:38:23
Answer 3: score 0, 2014-09-24 16:08:38
