
How to fill pandas dataframes in a loop?

I am trying to build a set of sub-dataframes from a larger dataframe by searching for a string in the column headings.

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]

I am trying to search for the wellname in the cdf dataframe columns and put the columns that contain that wellname into a new dataframe named after the wellname.

I am able to build my new sub-dataframes, but they come up empty with shape (0, 0), while cdf is (21973, 91).

well_cols also populates correctly as a list.

These are some of the cdf column headings. Each column has 20k rows of data.

Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
       'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
       'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
       'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
       'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
       'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
       'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
       'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
       'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
       'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',

I want to create a new dataframe for each well, i.e. one dataframe with all columns (and their data) whose name contains N1, another for N2, etc.

The new dataframes populate correctly inside the loop but disappear when the loop ends... a bit of the output of print(well):

[27884 rows x 10 columns]
       N9_Inj_Casing_Gas_Valve  ...  N9_Inj_Casing_Gas_Pressure
0                    74.375000  ...                 2485.602364
1                    74.520833  ...                 2485.346000
2                    74.437500  ...                 2485.341091

IIUC this should be enough:

import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}
for well in wells:
    # columns whose name contains this well's identifier
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]

A dictionary is usually the way to go when you want to build up a collection of named objects in a loop. In this case, well_dict['N1'] gives you the first dataframe, and so on.
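As a quick check of the dictionary approach, here is a minimal sketch on a small made-up frame (the column names and values are illustrative, not taken from data.csv). It uses `startswith(well + "_")` instead of the substring test above; that is an assumption about the naming scheme, chosen so that `N1` would not also match a hypothetical `N10`.

```python
import pandas as pd

# Small stand-in for cdf; names and values are invented for illustration.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
})

wells = ["N1", "N2"]

# One sub-dataframe per well, keyed by well name.
# Assumes every column is named "<well>_<measurement>".
well_dict = {
    well: cdf.loc[:, cdf.columns.str.startswith(well + "_")]
    for well in wells
}

print(well_dict["N1"].shape)
print(list(well_dict["N2"].columns))
```

Each value in `well_dict` is an ordinary dataframe, so looping over `well_dict.items()` gives the well name and its data together.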

Reassigning the loop variable inside the loop does not modify the list you are iterating over; it only rebinds the name well to a different object. That is, here's what your example is doing:

# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...

But at no point did you change the list, or store the new dataframes for that matter (although the last dataframe is still bound to well after the loop finishes).
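The same behaviour can be seen without pandas. The sketch below uses plain strings (the `.lower()` call is only a placeholder for "build something new from the element") to show that rebinding the loop variable leaves the list alone, while writing back through an index persists.

```python
wells = ["N1", "N2", "N3"]

for well in wells:
    # Rebinds the local name `well`; the list element is untouched.
    well = well.lower()

after_rebinding = list(wells)  # snapshot: still the original strings
last_value = well              # only the final iteration's value survives

# Writing back through an index (or into a dict) is what persists.
for ix, well in enumerate(wells):
    wells[ix] = well.lower()

print(after_rebinding)
print(last_value)
print(wells)
```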

IMO, storing the dataframes in a dict would be easier to use:

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]

However, if you really want them in a list, you could do something like:

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
    well_cols = [col for col in cdf.columns if well in col]
    wells[ix] = cdf[well_cols]

One way to approach the problem is to use pd.MultiIndex and groupby.

You can construct a MultiIndex composed of the well identifier and the variable name. If you have df:

   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10

You can use df.columns.str.split('_', expand=True) to split each column name into the well identifier and the corresponding variable name (i.e. a or b).

df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

Which returns:

  N1    N2    
   a  b  a   b
0  2  2  3   4
1  7  8  9  10

Then you can transpose the dataframe and group by level 0 of the MultiIndex.

grouped = df.T.groupby(level=0)

To return a list of untransposed sub-dataframes you can use:

wells = [group.T for _, group in grouped]

where wells[0] is:

  N1   
   a  b
0  2  2
1  7  8

and wells[1] is:

  N2    
   a   b
0  3   4
1  9  10

The last step is not strictly necessary, because the data can be accessed directly from the grouped object.
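As a sketch of what accessing the data from the grouped object could look like, the snippet below rebuilds the small example frame by hand (values copied from the tables above) and pulls out one well with `get_group`. It also shows that, once the columns are a MultiIndex, indexing with the top-level key selects a well directly.

```python
import pandas as pd

# Rebuild the small example frame; values match the tables above.
df = pd.DataFrame(
    [[2, 2, 3, 4], [7, 8, 9, 10]],
    columns=["N1_a", "N1_b", "N2_a", "N2_b"],
)
df.columns = df.columns.str.split("_", expand=True)

grouped = df.T.groupby(level=0)

# Pull one well's block out of the groupby and transpose it back.
n1 = grouped.get_group("N1").T
print(n1)

# With a MultiIndex on the columns, plain indexing also selects a well;
# the well level is dropped, leaving columns 'a' and 'b'.
n1_direct = df["N1"]
print(n1_direct)
```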

All together:

import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

df = pd.read_csv(StringIO(data)) 

# Parse Column names to add well name to multiindex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# Group by well name
grouped = df.T.groupby(level=0)

# Build list of sub-dataframes
wells = [group.T for _, group in grouped]

Use contains:

df.loc[:, df.columns.str.contains('|'.join(wells))]
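A minimal sketch of column selection with `str.contains`, on made-up data (the frame and its values are illustrative only). A boolean mask built from the column labels has to be applied along the column axis, for example with `.loc[:, mask]`; the joined pattern selects the columns of all wells at once, and the same idea applied per well gives one sub-dataframe each.

```python
import pandas as pd

# Invented frame with two well columns and two unrelated columns.
df = pd.DataFrame({
    "DateTime": ["2019-01-01", "2019-01-02"],
    "N1_a": [2, 7],
    "N2_a": [3, 9],
    "Other": [0, 0],
})
wells = ["N1", "N2"]

# Boolean mask over column labels: True where any well name occurs.
mask = df.columns.str.contains("|".join(wells))

# Apply the mask along the column axis.
all_wells = df.loc[:, mask]

# The same idea per well gives one sub-dataframe each.
per_well = {w: df.loc[:, df.columns.str.contains(w)] for w in wells}

print(list(all_wells.columns))
print(list(per_well["N1"].columns))
```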
