How to fill pandas dataframes in a loop?
I am trying to build subsets of a larger dataframe by searching for a string in the column headers.
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]
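I am trying to search the columns of the cdf dataframe for wellname and put the columns that contain that wellname into a new dataframe named after the wellname.
I am able to build my new sub-dataframes, but they come out empty, with shape (0, 0), while cdf is (21973, 91).
well_cols is also populated correctly as a list.
These are some of the cdf column headers. Each column has 20k rows of data.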
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
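I want to create a new dataframe for each well, i.e. one new dataframe with all the columns and data whose column name contains N1, another for N2, and so on.
The new dataframes are populated correctly inside the loop, but disappear once the loop exits.
Output of print(well):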
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC, this should be enough:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict={}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]
If you want to populate something in a loop, a dictionary usually works. In this case, well_dict['N1'] gives you the first dataframe, and so on.
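To make the dictionary approach concrete, here is a minimal sketch on a tiny made-up frame. The column values and the extra N10 well are hypothetical, added only to show one pitfall: a plain substring test such as `'N1' in col` would also match `N10_...`, so the sketch matches on the prefix plus underscore instead. With wells N1 to N9 only, as in the question, the substring test gives the same result.

```python
import pandas as pd

# Hypothetical stand-in for cdf; the real frame has ~20k rows and 91 columns.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
    "N10_LT_Stm_Rate": [7.0, 8.0],  # assumed extra well, for illustration only
})
wells = ["N1", "N2", "N10"]

# Matching on "N1_" rather than "N1" keeps N10 columns out of the N1 frame.
well_dict = {
    well: cdf[[col for col in cdf.columns if col.startswith(well + "_")]]
    for well in wells
}

# Each sub-dataframe stays reachable by name after the loop has finished.
for name, frame in well_dict.items():
    print(name, frame.shape, list(frame.columns))
```

Looping over `well_dict.items()` is also a convenient way to run the same processing on every well.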
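Reassigning the loop variable while iterating does not change the elements of the list. That is, based on your example, this is what it is doing: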
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point do you change the array or store the new dataframes (although at the end of the iteration the last dataframe is still stored in well).
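The same behaviour can be reproduced without pandas. This is only an illustrative sketch with made-up names (`names`, `stored`): assigning to the loop variable rebinds a local name, while assigning through an index actually stores the result.

```python
names = ["N1", "N2", "N3"]

for well in names:
    # Rebinds the name `well` only; the list element is not touched.
    well = well.lower()

after_rebinding = list(names)  # snapshot: still the original strings
last_value = well              # only the last iteration's value survives

# Assigning through the index (or a dict key) does store each result.
stored = list(names)
for ix, name in enumerate(stored):
    stored[ix] = name.lower()

print(after_rebinding)
print(last_value)
print(stored)
```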
IMO, storing the dataframes in a dict would be easier to work with:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]
However, if you really want them in a list, you could do the following:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]
One way to approach this is to use pd.MultiIndex and groupby.
You can construct a MultiIndex made up of the well identifier and the variable name. If you have df:
N1_a N1_b N2_a N2_b
1 2 2 3 4
2 7 8 9 10
You can use df.columns.str.split('_', expand=True) to split each column name into the well identifier and its variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
Which returns:
N1 N2
a b a b
0 2 2 3 4
1 7 8 9 10
Then you can transpose the dataframe and group by level 0 of the MultiIndex.
grouped = df.T.groupby(level=0)
To get back a list of untransposed sub-dataframes, you can use:
wells = [group.T for _, group in grouped]
wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
This last step is rather unnecessary, since the data can be accessed directly from the groupby object grouped.
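As a rough sketch of what accessing the groupby object directly could look like, the example below rebuilds the small sample frame and pulls one well out with `get_group`. The names `n1` and `n2` are only for illustration. Selecting on the first column level (`df['N2']`) is shown as a plain-indexing alternative that skips the groupby entirely.

```python
import pandas as pd

# Small sample frame matching the example values above.
df = pd.DataFrame(
    [[2, 2, 3, 4], [7, 8, 9, 10]],
    columns=["N1_a", "N1_b", "N2_a", "N2_b"],
)
# Split "N1_a" into ("N1", "a") so the well name becomes column level 0.
df.columns = df.columns.str.split("_", expand=True)

grouped = df.T.groupby(level=0)

# Fetch one well from the groupby object and transpose it back.
n1 = grouped.get_group("N1").T
print(n1)

# Alternative without groupby: index the first column level directly.
n2 = df["N2"]
print(n2)
```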
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
df = pd.read_csv(StringIO(data))
# Parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)
# Group by well name
grouped = df.T.groupby(level=0)
# Build list of sub-dataframes
wells = [group.T for _, group in grouped]
Using contains (the boolean mask is passed through .loc so that it selects columns rather than rows):
df.loc[:, df.columns.str.contains('|'.join(wells))]
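A small sketch of the contains idea on a made-up frame follows. The frame, the `Other` column and the names `all_wells` and `per_well` are assumptions for illustration. Joining all well names with `|` selects the columns of every well at once. To get one frame per well, one option is `DataFrame.filter` with an anchored regex.

```python
import pandas as pd

# Hypothetical frame: three well columns plus one unrelated column.
cdf = pd.DataFrame({
    "N1_a": [1, 2],
    "N1_b": [3, 4],
    "N2_a": [5, 6],
    "Other": [7, 8],
})
wells = ["N1", "N2"]

# Boolean mask over the column labels; .loc[:, mask] selects columns.
mask = cdf.columns.str.contains("|".join(wells))
all_wells = cdf.loc[:, mask]

# One frame per well: filter() matches the regex against column labels.
per_well = {w: cdf.filter(regex=f"^{w}_") for w in wells}

print(list(all_wells.columns))
print({w: list(f.columns) for w, f in per_well.items()})
```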