
How to fill pandas dataframes in a loop?

I am trying to build a subset of dataframes from a larger dataframe by searching for a string in the column headings.

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]

I want to search the cdf column names for each well name and put the matching columns into a new dataframe named after that well.

I am able to build my new sub dataframes but the dataframes come up empty of size (0, 0) while cdf is (21973, 91).

well_cols also populates correctly as a list.

These are some of cdf column headings. Each column has 20k rows of data.

Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
       'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
       'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
       'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
       'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
       'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
       'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
       'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
       'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
       'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',

I want to create a new dataframe for every well, i.e. one dataframe holding all columns (and their data) whose name contains N1, another for N2, etc.

The new dataframes populate correctly inside the loop but disappear once the loop finishes. Here is part of the output of print(well):

[27884 rows x 10 columns]
       N9_Inj_Casing_Gas_Valve  ...  N9_Inj_Casing_Gas_Pressure
0                    74.375000  ...                 2485.602364
1                    74.520833  ...                 2485.346000
2                    74.437500  ...                 2485.341091

IIUC this should be enough:

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict={}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]

A dictionary is usually the way to go when you need to create a variable number of named objects in a loop. Here, well_dict['N1'] returns the first dataframe, well_dict['N2'] the second, and so on.
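As a minimal sketch of the dictionary approach on toy data (the three-column cdf below is made up for illustration, and matching on the "N1_" prefix rather than the bare substring is an assumption that guards against a hypothetical "N10_..." column):

```python
import pandas as pd

# Toy stand-in for cdf; column names follow the question, values are invented.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
})

wells = ["N1", "N2"]

# One sub-dataframe per well, keyed by the well name.
# startswith(well + "_") keeps "N1" from also matching a name like "N10_...".
well_dict = {
    well: cdf.loc[:, cdf.columns.str.startswith(well + "_")]
    for well in wells
}

print(well_dict["N1"].shape)           # (2, 2)
print(list(well_dict["N2"].columns))   # ['N2_LT_Stm_Rate']
```

Each value is an ordinary dataframe, so well_dict["N1"].describe() and similar calls work as usual.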

Rebinding the loop variable does not change the list you are iterating over. Here is what happens in your example:

# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...

At no point did you change the list or store the new dataframes anywhere (although the last dataframe is still bound to well after the loop finishes).
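The same behaviour can be seen without pandas. In this small sketch, lowercasing the string stands in for "build a dataframe":

```python
wells = ["N1", "N2", "N3"]

for well in wells:
    # Rebinding the loop variable only changes what the name `well` refers to;
    # the list element it came from is left untouched.
    well = well.lower()

print(wells)  # ['N1', 'N2', 'N3']  (the list is unchanged)
print(well)   # n3                  (only the last value survives the loop)
```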

IMO, it seems like storing the dataframes in a dict would be easier to use:

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]

However, if you really want it in a list, you could do something like:

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]

One way to approach the problem is to use pd.MultiIndex and groupby.

You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:

   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10

You can use df.columns.str.split('_', expand=True) to parse each column name into the well identifier and the corresponding variable name (i.e. a or b).

df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

Which returns:

  N1    N2    
   a  b  a   b
0  2  2  3   4
1  7  8  9  10

Then you can transpose the data frame and group by level 0 of the MultiIndex.

grouped = df.T.groupby(level=0)

To return a list of untransposed sub-data frames you can use:

wells = [group.T for _, group in grouped]

where wells[0] is:

  N1   
   a  b
0  2  2
1  7  8

and wells[1] is:

  N2    
   a   b
0  3   4
1  9  10

The last step is not strictly necessary, because the data can also be accessed directly from the grouped object.
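If looking wells up by name is more convenient than by list position, one option is to collect the groups into a dict. This is a sketch on the same toy data, with the MultiIndex written out by hand; well_dfs is a name introduced here for illustration:

```python
import pandas as pd

# Same toy frame as above, with the two-level column index built explicitly.
df = pd.DataFrame(
    [[2, 2, 3, 4], [7, 8, 9, 10]],
    columns=pd.MultiIndex.from_tuples(
        [("N1", "a"), ("N1", "b"), ("N2", "a"), ("N2", "b")]
    ),
)

grouped = df.T.groupby(level=0)

# Map each well name to its (untransposed) sub-dataframe.
well_dfs = {name: group.T for name, group in grouped}
print(well_dfs["N2"])

# Selecting on the top column level gives the same values without any grouping;
# the result keeps only the inner level (a, b) as its columns.
print(df["N1"])
```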

All together:

import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

df = pd.read_csv(StringIO(data)) 

# Parse Column names to add well name to multiindex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# Group by well name
grouped = df.T.groupby(level=0)

# Build list of sub-dataframes
wells = [group.T for _, group in grouped]
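The column names in the question contain several underscores (e.g. N1_Inj_Casing_Gas_Valve), so a plain split on '_' would produce more than two levels. A possible adjustment, sketched on three of the question's column names, is to limit the split to the first underscore with n=1:

```python
import pandas as pd

# Three column names taken from the question's output.
cols = pd.Index([
    "N1_LT_Stm_Rate",
    "N1_Bubble_Tube_Pressure",
    "N2_LT_Stm_Rate",
])

# n=1 splits only at the first underscore: (well, rest of the name).
multi = cols.str.split("_", n=1, expand=True)

print(multi.nlevels)
print(multi.tolist())
```

The resulting two-level index can then be used as the columns of the dataframe in the same way as in the example above.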

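Use str.contains to build a boolean mask over the column names, then select the matching columns with .loc (wells must be the list of well-name strings here):

df.loc[:, df.columns.str.contains('|'.join(wells))]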
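A short sketch of pattern-based column selection on toy data follows. Note that a boolean mask built from the column names has to be applied along the column axis, which is what .loc[:, mask] does. The per-well variant with DataFrame.filter and an anchored regex is an addition for illustration, not part of the original answer:

```python
import pandas as pd

# Toy stand-in for cdf; values are invented.
cdf = pd.DataFrame({
    "N1_a": [2, 7],
    "N1_b": [2, 8],
    "N2_a": [3, 9],
    "N2_b": [4, 10],
})
wells = ["N1", "N2"]

# Mask over column names: True where the name contains any of the well names.
mask = cdf.columns.str.contains("|".join(wells))
all_wells = cdf.loc[:, mask]

# Per-well selection: filter() matches the regex against column names;
# the ^ anchor and trailing underscore keep "N1" from matching e.g. "N10_...".
n1 = cdf.filter(regex=r"^N1_")

print(all_wells.shape)
print(list(n1.columns))
```

Joining every well name into one pattern selects the columns of all wells at once, so for one dataframe per well the pattern has to be applied well by well.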
