Pandas read_table with multiple column definitions

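I have code that generates text data, appending diagnostic output to a single text file over the course of a run. Depending on how I set things up, different measurements are taken, and each run starts with a header line describing its columns. The output looks something like this: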

# time diagnostic_1, diagnostic_2
0.3 0.25376334 0.07494259
1.7 0.3407481 0.03018158
2.2 0.45349798 0.85539953
3.4 0.22368132 0.52276335
4.8 0.17906047 0.40659944
# time diagnostic_1, diagnostic_3
3.4 0.65968555 0.67085918
4.8 0.2122165 0.80855038
5.1 0.96943873 0.41903639
6.8 0.16242912 0.91949807
7.0 0.68513815 0.22881037
8.8 0.83304083 0.02394251
9.2 0.01699944 0.58386401
# time diagnostic_2, diagnostic_3
8 0.79595325  0.8913367 
9 0.46277533  0.47859048
10 0.30773957  0.64765873
11 0.19077614  0.39109832
12 0.0020474  0.44365015

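Is there a way to make pandas.read_table return after reading a specified string, rather than after a specified number of lines? My current workaround is a first pass with grep to find the split positions, followed by numpy.loadtxt to load the arrays: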

from subprocess import check_output
import numpy as np
import pandas as pd
from itertools import cycle

fname = 'foo'
headerrows = [int(s.split(b':')[0])
              for s in check_output(['grep', '-on', '^#', fname]).split()]
# -1 to the range, because the header row is read separately
limiters = [range(a, b-1) for a, b in zip(headerrows[:-1], headerrows[1:])]
limiters += [cycle([True, ]), ]

nameses = [['t', 'diagnostic_1', 'diagnostic_2'],
           ['t', 'diagnostic_1', 'diagnostic_3'],
           ['t', 'diagnostic_2', 'diagnostic_3']]
dat = []
with open(fname, 'r') as fobj:
    for names, limit in zip(nameses, limiters):
        line = fobj.readline()
        dat.append(pd.DataFrame(np.loadtxt((s for i, s in zip(limit, fobj))),
                                columns=names))
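
As an alternative to the grep pass, the file can be split on the header lines in pure Python. The following is a minimal sketch, not a built-in pandas feature: `read_blocks` and `_to_frame` are hypothetical helper names, and it assumes every header line starts with `#` and separates its column names with spaces and/or commas. The column names are taken from the headers themselves, so the first column is called `time` here rather than `t`.

```python
import io
import re

import pandas as pd


def _to_frame(names, rows):
    # Parse the whitespace-separated rows collected under one header.
    return pd.read_csv(io.StringIO('\n'.join(rows)),
                       sep=r'\s+', header=None, names=names)


def read_blocks(lines):
    """Return one DataFrame per '#' header found in an iterable of lines."""
    frames, names, rows = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith('#'):
            # A new header closes the previous block, if there was one.
            if names is not None and rows:
                frames.append(_to_frame(names, rows))
            names = re.split(r'[,\s]+', line.lstrip('#').strip())
            rows = []
        elif line:
            rows.append(line)
    # Close the final block, which has no header after it.
    if names is not None and rows:
        frames.append(_to_frame(names, rows))
    return frames


# Small in-memory sample in the same layout as the output above.
sample = """\
# time diagnostic_1, diagnostic_2
0.3 0.25376334 0.07494259
1.7 0.3407481 0.03018158
# time diagnostic_1, diagnostic_3
3.4 0.65968555 0.67085918
"""
frames = read_blocks(io.StringIO(sample))
```

For a real file, an open file object can be passed directly, for example `with open(fname) as fobj: frames = read_blocks(fobj)`.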

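Here is the full script that builds a DataFrame with the information I want. The monkey business with updating and dropping columns is necessary to keep the compound index. retval.merge(dset, how='outer') gives the same columns, but with an integer index.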

from subprocess import check_output
import numpy as np
import pandas as pd
from itertools import cycle

fname = 'foo'
headerrows = [int(s.split(b':')[0])
              for s in check_output(['grep', '-on', '^#', fname]).split()]
# subtract one because the header row is read separately
limiters = [range(a, b-1) for a, b in zip(headerrows[:-1], headerrows[1:])]
limiters += [cycle([True, ]), ]

nameses = [['t', 'diagnostic_1', 'diagnostic_2'],
           ['t', 'diagnostic_1', 'diagnostic_3'],
           ['t', 'diagnostic_2', 'diagnostic_3']]

with open(fname, 'r') as fobj:
    for names, limit in zip(nameses, limiters):
        line = fobj.readline()
        dset = pd.DataFrame(np.loadtxt((line for i, line in zip(limit, fobj))),
                            columns=names)
        dset.set_index('t', inplace=True)
        # if the return value already exists, merge in the new dataset
        try:
            retval = retval.merge(dset, how='outer',
                                  left_index=True, right_index=True,
                                  suffixes=('', '_'))
            for col in (c for c in retval.columns if not c.endswith('_')):
                upd = ''.join((col, '_'))
                try:
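                    # assign back rather than updating a temporary Series in place;
                    # new values win where present, old values are kept otherwise
                    retval[col] = retval[upd].combine_first(retval[col])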
                    retval.drop(upd, axis=1, inplace=True)
                except KeyError:
                    pass
        except NameError:
            retval = dset
print(retval)
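
The merge, update and drop sequence above might also be expressed with `DataFrame.combine_first`, which takes the union of both indexes and columns and fills gaps in the calling frame from the other one. This is only a sketch under the assumption that values from later runs should overwrite earlier ones at the same time; `combine_on_time` is a hypothetical helper name and the two small frames are made-up sample data.

```python
import pandas as pd


def combine_on_time(frames, index='t'):
    """Combine per-run frames on a shared time index; later runs take priority."""
    result = None
    for frame in frames:
        frame = frame.set_index(index)
        # frame.combine_first(result) keeps frame's non-null values and
        # falls back to the accumulated result everywhere else.
        result = frame if result is None else frame.combine_first(result)
    return result


# Made-up sample runs that overlap at t = 4.8.
run_a = pd.DataFrame({'t': [3.4, 4.8],
                      'diagnostic_1': [0.2, 0.1],
                      'diagnostic_2': [0.5, 0.4]})
run_b = pd.DataFrame({'t': [4.8, 5.1],
                      'diagnostic_1': [0.9, 0.7],
                      'diagnostic_3': [0.8, 0.4]})
merged = combine_on_time([run_a, run_b])
```

The result stays indexed by `t`, so no suffix columns need to be cleaned up afterwards.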
