How to fill pandas dataframes in a loop?
I am trying to build subsets of a larger dataframe by searching for a string in the column headers.
import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
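I am trying to search the cdf dataframe columns for wellname and put the columns that contain that wellname into a new dataframe named after the well.
I am able to build my new sub-dataframes, but the dataframe is empty, with shape (0, 0), while cdf is (21973, 91).
well_cols is also populated correctly as a list.
These are some of the cdf column headers. Each column has 20k rows of data.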
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
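I want to create a new dataframe for each well, i.e. one dataframe with all the columns and data whose column names contain N1, another for N2, and so on.
The new dataframes are populated correctly while inside the loop, but they disappear once the loop exits.
Output of print(well):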
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC, this should be enough:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
If you want to fill something in a loop, a dictionary is usually the way to go. In this case, well_dict['N1'] gives you the first dataframe, and so on.
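To see that the sub-dataframes survive the loop, here is a minimal sketch on a small made-up frame. The column values and the two-well list are assumptions for illustration only; the real cdf comes from data.csv.

```python
import pandas as pd

# Hypothetical stand-in for cdf: two wells, two columns each.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
    "N2_ST_Stm_Rate": [7.0, 8.0],
})

wells = ["N1", "N2"]
well_dict = {}
for well in wells:
    # Collect the columns whose name contains this well identifier.
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]

# Every sub-dataframe is still reachable after the loop has finished.
for name, sub in well_dict.items():
    print(name, sub.shape, list(sub.columns))
```

Each key maps to its own sub-dataframe, so nothing is overwritten from one iteration to the next.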
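Reassigning the loop variable while iterating does not change the elements of the list; it only rebinds the name. That is, based on your example, this is what it is doing: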
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
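But at no point do you change the list or store the new dataframes anywhere (although when the iteration ends you will still have the last dataframe stored in well).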
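The same behaviour can be reproduced without pandas. This is a small sketch with plain strings (the list `names` is made up for the illustration): assigning to the loop variable leaves the list alone, while assigning through an index changes it.

```python
names = ["N1", "N2", "N3"]

for name in names:
    # Rebinds the local name only; the list still holds the original strings.
    name = name.lower()

unchanged = list(names)  # snapshot of the list after the first loop
last = name              # the loop variable keeps its final value

for ix, name in enumerate(names):
    # Assigning through the index does replace the list element.
    names[ix] = name.lower()

print(unchanged)
print(last)
print(names)
```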
IMO, storing the dataframes in a dict seems easier to work with:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
However, if you really want them in a list, you could do the following:
df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
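One thing to keep in mind: `well in col` is a substring test, so 'N1' would also match a column of a well called 'N10'. That does not matter for N1 to N9 as in the question, but here is a hedged sketch of a stricter prefix match, assuming every column name starts with the well identifier followed by an underscore. The 'N10' column is hypothetical.

```python
import pandas as pd

# Made-up frame; the N10 column is hypothetical and only shows the pitfall.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N10_LT_Stm_Rate": [5.0, 6.0],
})
wells = ["N1", "N10"]

# Substring match: 'N1' also picks up the N10 column.
loose = [col for col in cdf.columns if "N1" in col]

# Prefix match on "<well>_" keeps the wells apart.
well_dfs = {
    well: cdf.loc[:, cdf.columns.str.startswith(well + "_")]
    for well in wells
}

print(loose)
print({well: list(sub.columns) for well, sub in well_dfs.items()})
```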
Another way to approach this is to use pd.MultiIndex and groupby.
You can construct a MultiIndex made up of the well identifier and the variable name. If you have df:
N1_a N1_b N2_a N2_b
1 2 2 3 4
2 7 8 9 10
You can use df.columns.str.split('_', expand=True) to split each column name into the well identifier and the corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
N1 N2
a b a b
0 2 2 3 4
1 7 8 9 10
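With the longer column names from the question (e.g. N1_Bubble_Tube_Pressure) a plain split on '_' would produce a different number of parts per column. A possible workaround, sketched here on a made-up frame, is to split only on the first underscore with n=1, so that every name yields exactly a well identifier and a variable name.

```python
import pandas as pd

# Made-up frame using column names in the style of the question.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_Bubble_Tube_Pressure": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
})

# n=1 splits on the first underscore only: ('N1', 'LT_Stm_Rate'), ...
cols = cdf.columns.str.split("_", n=1, expand=True)

multi = cdf.copy()
multi.columns = cols

# Selecting on the first level returns that well's columns.
n1 = multi["N1"]
print(n1)
```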
Then you can transpose the dataframe and group by level 0 of the MultiIndex.
grouped = df.T.groupby(level=0)
To get back a list of the untransposed sub-dataframes, you can use:
wells = [group.T for _, group in grouped]
wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
The last step is not strictly necessary, since the data is already accessible from the groupby object grouped.
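As a rough sketch of what that access could look like, get_group pulls a single well out of the groupby object, and iterating over it gives (name, group) pairs that can be collected into a dict. The frame is rebuilt by hand here so the snippet runs on its own.

```python
import pandas as pd

# Same toy data as above, with the MultiIndex columns built directly.
df = pd.DataFrame(
    [[2, 2, 3, 4], [7, 8, 9, 10]],
    columns=pd.MultiIndex.from_product([["N1", "N2"], ["a", "b"]]),
)

grouped = df.T.groupby(level=0)

# One well at a time, transposed back to the original orientation.
n2 = grouped.get_group("N2").T

# Or all wells at once, keyed by well name.
by_well = {name: group.T for name, group in grouped}

print(n2)
print(sorted(by_well))
```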
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
# Rows have one more field than the header, so the first field becomes the index
df = pd.read_csv(StringIO(data))
# Parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
# Group by well name
grouped = df.T.groupby(level=0)
# Build list of sub-dataframes
wells = [group.T for _, group in grouped]
Use contains, selecting columns with .loc:
df.loc[:, df.columns.str.contains('|'.join(wells))]
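A short sketch of the contains approach on a made-up frame. The boolean mask has one entry per column, so it has to go in the column position of .loc; DataFrame.filter with a regex should give the same selection. The frame and the two-well list are assumptions for the example.

```python
import pandas as pd

# Made-up frame; only N1 and N2 are requested below.
df = pd.DataFrame({
    "N1_a": [2, 7],
    "N1_b": [2, 8],
    "N2_a": [3, 9],
    "N3_a": [4, 10],
})
wells = ["N1", "N2"]

# Regex 'N1|N2' matched against each column name.
mask = df.columns.str.contains("|".join(wells))
subset = df.loc[:, mask]

# Equivalent selection using DataFrame.filter.
same = df.filter(regex="|".join(wells))

print(list(subset.columns))
```

Note that this returns a single dataframe holding the columns of all listed wells, not one dataframe per well.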