Pandas: ignore all lines following a specific string when reading a file into a DataFrame

Question

I have a file that I want to read into a pandas DataFrame. It can be summarized as follows:

[Header]
Some_info = some_info
[Data]
Col1    Col2
0.532   Point
0.234   Point
0.123   Point
1.455   Square
14.64   Square
[Other data]
Other1  Other2
Test1   PASS
Test2   FAIL

My goal is to read only the portion of text between [Data] and [Other data], whose length varies from file to file. The header always has the same length, so the skiprows argument of pandas.read_csv can be used. However, skipfooter needs the number of lines to skip, which can change between files.

What would be the best solution here? I would like to avoid altering the file externally unless there's no other solution.

Answer 1

NumPy's genfromtxt can take a generator as input (rather than a file directly); the generator can simply stop yielding as soon as it hits your footer. The resulting structured array can then be converted to a pandas DataFrame. It's not ideal, but it didn't look like pandas' read_csv could take the generator directly.

import numpy as np
import pandas as pd

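def skip_variable_footer(infile):
    # Yield lines until the footer marker is reached, then stop.
    for line in infile:
        if line.startswith('[Other data]'):
            # A bare return ends the generator cleanly; raising StopIteration
            # inside a generator is a RuntimeError since Python 3.7 (PEP 479).
            return
        yield line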


with open(filename, 'r') as infile:
    data = np.genfromtxt(skip_variable_footer(infile), delimiter=',', names=True, dtype=None)

df = pd.DataFrame(data)
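
As a variation on the same idea, read_csv does accept file-like objects, so the selected lines can be buffered in an io.StringIO and parsed by pandas in a single pass. The following is only a minimal sketch under stated assumptions: the helper names lines_between and read_data_section are hypothetical, SAMPLE is the example from the question, and the columns are assumed to be whitespace-separated as they appear there.

```python
import io
import itertools

import pandas as pd

# Sample content copied from the question (assumed whitespace-separated).
SAMPLE = """\
[Header]
Some_info = some_info
[Data]
Col1    Col2
0.532   Point
0.234   Point
0.123   Point
1.455   Square
14.64   Square
[Other data]
Other1  Other2
Test1   PASS
Test2   FAIL
"""


def lines_between(lines, start="[Data]", stop="[Other data]"):
    """Yield the lines strictly between the start and stop markers."""
    # Discard everything before the start marker, then the marker itself.
    body = itertools.dropwhile(lambda ln: not ln.startswith(start), lines)
    next(body, None)
    # Stop as soon as the footer marker appears.
    yield from itertools.takewhile(lambda ln: not ln.startswith(stop), body)


def read_data_section(handle):
    """Parse only the [Data] section of an open text file (hypothetical helper)."""
    # Buffer the selected lines in memory so read_csv sees a file-like object.
    buffer = io.StringIO("".join(lines_between(handle)))
    return pd.read_csv(buffer, sep=r"\s+")


df = read_data_section(io.StringIO(SAMPLE))
print(df)
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by the handle from open(filename). The trade-off is that the data section is held in memory as text before parsing, which is fine for small files but may matter for very large ones.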

Answer 2

This method has to run over the file twice: once to count the footer lines and once to parse the data.

import itertools as it

import pandas as pd

def get_footer(file_):
    # First pass: count the lines from the '[Other data]' marker to the end of the file.
    with open(file_) as f:
        g = it.dropwhile(lambda x: x != '[Other data]\n', f)
        footer_len = sum(1 for _ in g)
    return footer_len

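footer_len = get_footer('file.txt')
# Second pass: the '…' stands for the other read_csv arguments, elided in the original answer.
# Note that skipfooter is only supported by the Python parser engine (engine='python').
df = pd.read_csv('file.txt', … skipfooter=footer_len)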
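
For illustration, here is a self-contained sketch of the two-pass approach applied to the sample from the question. The arguments it passes are assumptions rather than the elided ones from the answer: skiprows=3 matches the three lines up to and including [Data] in the sample, sep=r"\s+" assumes whitespace-separated columns, and count_footer_lines is a hypothetical helper name. engine="python" is set explicitly because the C parser does not support skipfooter.

```python
import itertools
import os
import tempfile

import pandas as pd

# Sample content copied from the question (assumed whitespace-separated).
SAMPLE = """\
[Header]
Some_info = some_info
[Data]
Col1    Col2
0.532   Point
0.234   Point
0.123   Point
1.455   Square
14.64   Square
[Other data]
Other1  Other2
Test1   PASS
Test2   FAIL
"""


def count_footer_lines(path, marker="[Other data]"):
    """First pass: count lines from the footer marker to the end of the file."""
    with open(path) as f:
        footer = itertools.dropwhile(lambda ln: not ln.startswith(marker), f)
        return sum(1 for _ in footer)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write(SAMPLE)

    footer_len = count_footer_lines(path)
    # Second pass: skip the fixed-length header and the counted footer.
    df2 = pd.read_csv(
        path,
        sep=r"\s+",
        skiprows=3,
        skipfooter=footer_len,
        engine="python",
    )

print(df2)
```

The footer count includes the [Other data] marker line itself, so the marker is dropped along with everything after it.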

Answer 1
4 2013-10-04 17:00:40

Answer 2
2 Accepted 2013-10-02 15:42:52