Pandas read_table with multiple column definitions
I have a code that generates text data, where diagnostic output is appended to a single text file over the course of the run. Depending on how I set it up, different measurements are taken, with an associated header row at the start of each run. The output resembles this:
# time diagnostic_1, diagnostic_2
0.3 0.25376334 0.07494259
1.7 0.3407481 0.03018158
2.2 0.45349798 0.85539953
3.4 0.22368132 0.52276335
4.8 0.17906047 0.40659944
# time diagnostic_1, diagnostic_3
3.4 0.65968555 0.67085918
4.8 0.2122165 0.80855038
5.1 0.96943873 0.41903639
6.8 0.16242912 0.91949807
7.0 0.68513815 0.22881037
8.8 0.83304083 0.02394251
9.2 0.01699944 0.58386401
# time diagnostic_2, diagnostic_3
8 0.79595325 0.8913367
9 0.46277533 0.47859048
10 0.30773957 0.64765873
11 0.19077614 0.39109832
12 0.0020474 0.44365015
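(For anyone who wants to reproduce this, the sample above can be written out verbatim; the filename foo below matches what the scripts assume.)

```python
# Reconstruct the sample file above so the scripts below can be run as-is.
sample = """\
# time diagnostic_1, diagnostic_2
0.3 0.25376334 0.07494259
1.7 0.3407481 0.03018158
2.2 0.45349798 0.85539953
3.4 0.22368132 0.52276335
4.8 0.17906047 0.40659944
# time diagnostic_1, diagnostic_3
3.4 0.65968555 0.67085918
4.8 0.2122165 0.80855038
5.1 0.96943873 0.41903639
6.8 0.16242912 0.91949807
7.0 0.68513815 0.22881037
8.8 0.83304083 0.02394251
9.2 0.01699944 0.58386401
# time diagnostic_2, diagnostic_3
8 0.79595325 0.8913367
9 0.46277533 0.47859048
10 0.30773957 0.64765873
11 0.19077614 0.39109832
12 0.0020474 0.44365015
"""
with open('foo', 'w') as fobj:
    fobj.write(sample)
```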
Is there a way to have pandas.read_table return after reading a specified string rather than after a specified number of lines? The workaround I have right now is to do a first pass with grep to find where the splits are, and then load the arrays using numpy.loadtxt:
from subprocess import check_output
import numpy as np
import pandas as pd
from itertools import cycle

fname = 'foo'
headerrows = [int(s.split(b':')[0])
              for s in check_output(['grep', '-on', '^#', fname]).split()]
# -1 to the range, because the header row is read separately
limiters = [range(a, b - 1) for a, b in zip(headerrows[:-1], headerrows[1:])]
limiters += [cycle([True, ]), ]
nameses = [['t', 'diagnostic_1', 'diagnostic_2'],
           ['t', 'diagnostic_1', 'diagnostic_3'],
           ['t', 'diagnostic_2', 'diagnostic_3']]
dat = []
with open(fname, 'r') as fobj:
    for names, limit in zip(nameses, limiters):
        line = fobj.readline()
        dat.append(pd.DataFrame(np.loadtxt((s for i, s in zip(limit, fobj))),
                                columns=names))
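As a side note, the grep pass could also be done in pure Python, which drops the subprocess dependency. This is only a sketch, assuming header lines always start with '#' and carry the column names; the inline lines list stands in for open(fname).readlines():

```python
import io
import pandas as pd

# Inline copy of the first two blocks of the sample file, for illustration;
# in practice these lines would come from open(fname).readlines().
lines = """\
# time diagnostic_1, diagnostic_2
0.3 0.25376334 0.07494259
1.7 0.3407481 0.03018158
# time diagnostic_1, diagnostic_3
3.4 0.65968555 0.67085918
4.8 0.2122165 0.80855038
5.1 0.96943873 0.41903639
""".splitlines(keepends=True)

# Lines starting with '#' mark the start of each block.
starts = [i for i, line in enumerate(lines) if line.startswith('#')]
blocks = []
for a, b in zip(starts, starts[1:] + [len(lines)]):
    # Column names come from the header line itself.
    names = lines[a].lstrip('# ').replace(',', ' ').split()
    body = ''.join(lines[a + 1:b])
    blocks.append(pd.read_csv(io.StringIO(body), sep=r'\s+', names=names))
```

Each entry of blocks is then one run's DataFrame, with columns taken from its own header.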
Here is the full script that spits out a dataframe with the information I want. The monkey business with updating and dropping columns is necessary to keep the composite index; a plain retval.merge(dset, how='outer') gives the same columns, but an integer index.
from subprocess import check_output
import numpy as np
import pandas as pd
from itertools import cycle

fname = 'foo'
headerrows = [int(s.split(b':')[0])
              for s in check_output(['grep', '-on', '^#', fname]).split()]
# subtract one from the range, because the header row is read separately
limiters = [range(a, b - 1) for a, b in zip(headerrows[:-1], headerrows[1:])]
limiters += [cycle([True, ]), ]
nameses = [['t', 'diagnostic_1', 'diagnostic_2'],
           ['t', 'diagnostic_1', 'diagnostic_3'],
           ['t', 'diagnostic_2', 'diagnostic_3']]
with open(fname, 'r') as fobj:
    for names, limit in zip(nameses, limiters):
        line = fobj.readline()
        dset = pd.DataFrame(np.loadtxt((line for i, line in zip(limit, fobj))),
                            columns=names)
        dset.set_index('t', inplace=True)
        # if the return value already exists, merge in the new dataset
        try:
            retval = retval.merge(dset, how='outer',
                                  left_index=True, right_index=True,
                                  suffixes=('', '_'))
            for col in (c for c in retval.columns if not c.endswith('_')):
                upd = ''.join((col, '_'))
                try:
                    retval[col].update(retval[upd])
                    retval.drop(upd, axis=1, inplace=True)
                except KeyError:
                    pass
        except NameError:
            retval = dset
print(retval)
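For what it's worth, the update/drop loop could be sketched more compactly with DataFrame.combine_first, which also keeps the union of the time index and lets the later block win on overlapping values. The toy frames below are hypothetical stand-ins for two of the per-block dsets above:

```python
import pandas as pd

# Hypothetical stand-ins for two per-block frames, each indexed by time 't'.
d1 = pd.DataFrame({'t': [3.4, 4.8], 'diagnostic_1': [0.22, 0.17]}).set_index('t')
d2 = pd.DataFrame({'t': [4.8, 5.1], 'diagnostic_1': [0.21, 0.96]}).set_index('t')

# d2.combine_first(d1) aligns on the union of the indices and, where both
# frames have a value (t=4.8 here), prefers d2 -- mirroring the update()
# loop above, which lets later blocks overwrite earlier ones.
retval = d2.combine_first(d1)
```

When the two frames have different columns (diagnostic_2 vs diagnostic_3), combine_first takes the union of the columns and fills the gaps with NaN, matching the outer merge's behavior.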