
How to fill pandas dataframes in a loop?

I am trying to build subsets of a larger dataframe by searching for a string in the column headers.

import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]

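I am trying to search the columns of the cdf dataframe for wellname and put the columns that contain that wellname into a new dataframe named after the well.

I am able to build my new sub-dataframes, but they come out empty, with shape (0, 0), while cdf is (21973, 91).

well_cols is also populated correctly as a list.

These are some of the cdf column headers. Each column has 20k rows of data.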

Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
       'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
       'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
       'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
       'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
       'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
       'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
       'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
       'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
       'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',

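I want to create a new dataframe for each well, i.e. one new dataframe with all the columns and data whose column names contain N1, another for N2, and so on.

The new dataframes are populated correctly inside the loop but disappear once the loop ends. Output of print(well):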

[27884 rows x 10 columns]
       N9_Inj_Casing_Gas_Valve  ...  N9_Inj_Casing_Gas_Pressure
0                    74.375000  ...                 2485.602364
1                    74.520833  ...                 2485.346000
2                    74.437500  ...                 2485.341091

IIUC, this should be enough:

import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}
for well in wells:
    # Columns whose name contains this well identifier
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]

If you want to populate something in a loop, a dictionary is usually a good fit. In this case, well_dict['N1'] gives you the first dataframe, and so on.
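As a minimal sketch of how the dictionary can be used afterwards: the small frame below stands in for cdf, and its column names and values are made up for illustration. The prefix match with startswith is an assumption that goes beyond the answer; it would keep a hypothetical N10_... column from being matched by N1.

```python
import pandas as pd

# Small stand-in for cdf; column names and values are made up for illustration.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
})

wells = ["N1", "N2"]

# Match on the "N1_" prefix rather than a bare substring, so that a
# hypothetical "N10_..." column would not be picked up for well "N1".
well_dict = {
    well: cdf[[col for col in cdf.columns if col.startswith(well + "_")]]
    for well in wells
}

# Each sub-dataframe is reachable by well name after the loop has finished.
for name, frame in well_dict.items():
    print(name, frame.shape)
```

Looking a frame up by key, such as well_dict["N2"], avoids having to create one variable per well.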

Rebinding the loop variable does not modify the list you are iterating over. That is, based on your example, this is what is happening:

# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...

But at no point do you change the list or store the new dataframes (although at the end of the iteration the last dataframe is still stored in well).
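The same behaviour can be seen without pandas. This is a small sketch using plain strings; lower() is only a stand-in for any operation that produces a new object.

```python
wells = ["N1", "N2", "N3"]

for well in wells:
    # Rebinds the local name `well`; the list element itself is untouched.
    well = well.lower()

print(wells)  # the list still holds the original strings
print(well)   # only the value from the last iteration survives the loop
```

To keep every result, it has to be written somewhere explicitly, for example into a dict or back into the list by index.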

IMO, storing the dataframes in a dict would be easier to work with:

import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]

However, if you really want them in the list, you could do the following:

import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]

One way to approach the problem is to use pd.MultiIndex and groupby.

You can construct a MultiIndex made up of the well identifier and the variable name. Say you have df:

   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10

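You can use df.columns.str.split('_', expand=True) to split each column name into the well identifier and the corresponding variable name (i.e. a or b).

df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

Which returns: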

  N1    N2    
   a  b  a   b
0  2  2  3   4
1  7  8  9  10

Then you can transpose the dataframe and group by level 0 of the MultiIndex.

grouped = df.T.groupby(level=0)

To get a list of the sub-dataframes transposed back to their original orientation, you can use:

wells = [group.T for _, group in grouped]

wells[0] is:

  N1   
   a  b
0  2  2
1  7  8

wells[1] is:

  N2    
   a   b
0  3   4
1  9  10

The last step is rather unnecessary, since the data can be accessed directly from the groupby object, grouped.
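A short sketch of reading from grouped directly, assuming the same toy values as in the output above. The MultiIndex is built by hand here only to keep the snippet self-contained, and wells_by_name is a hypothetical name.

```python
import pandas as pd

# Same toy values as above, already carrying the (well, variable) MultiIndex.
columns = pd.MultiIndex.from_tuples(
    [("N1", "a"), ("N1", "b"), ("N2", "a"), ("N2", "b")]
)
df = pd.DataFrame([[2, 2, 3, 4], [7, 8, 9, 10]], columns=columns)

grouped = df.T.groupby(level=0)

# Pull one well straight from the groupby object and transpose it back.
n1 = grouped.get_group("N1").T

# Or keep every well under its name instead of in a positional list.
wells_by_name = {name: group.T for name, group in grouped}
```

Keying the result by well name means a frame does not have to be looked up by its position in a list.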

All together:

import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

df = pd.read_csv(StringIO(data))

# Parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# Group by well name
grouped = df.T.groupby(level=0)

# Build list of sub-dataframes
wells = [group.T for _, group in grouped]

Use str.contains to build a boolean mask over the column names, then select those columns with .loc:

df.loc[:, df.columns.str.contains('|'.join(wells))]
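A mask built from the joined pattern keeps the columns of every listed well in a single frame. If one frame per well is wanted instead, a possible sketch uses DataFrame.filter with a regex per well. The toy frame and the name per_well are assumptions for illustration, not part of the answer.

```python
import pandas as pd

# Toy frame; column names and values are made up for illustration.
df = pd.DataFrame({
    "DateTime": ["2019-01-01", "2019-01-02"],
    "N1_a": [2, 7],
    "N1_b": [2, 8],
    "N2_a": [3, 9],
})
wells = ["N1", "N2"]

# One sub-dataframe per well: keep columns whose name starts with "<well>_".
per_well = {well: df.filter(regex=f"^{well}_") for well in wells}

print(per_well["N1"].columns.tolist())
```

Anchoring the pattern with ^ and the trailing underscore keeps the DateTime column out and avoids partial matches on longer well names.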
