Operations on a MultiIndex DataFrame in pandas

I need to process geographic and statistical data from a large CSV file. It contains geographical administrative and geostatistical data. Municipality, location, geostatistical basic division and block make up the hierarchical index levels.

I have to create a new column ['data2']: for every element, find the max value of the data within its geo index, and divide each block value by that max. This should work for each index level, and only where the index level value is different from 0, because an index level value of 0 marks other types of info not used in the calculation.

                       data1  data2
mun  loc  geo  block
1    0    0    0       20     20
1    1    0    0       10     10
1    1    1    0       10     10   
1    1    1    1       3      3/4
1    1    1    2       4      4/4
1    1    2    0       30     30   
1    1    2    1       1      1/3
1    1    2    2       3      3/3
1    2    1    1       10     10/12
1    2    1    2       12     12/12
2    1    1    1       123    123/123
2    1    1    2       7      7/123
2    1    2    1       6      6/6
2    1    2    2       1      1/6
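
To make the sample easier to experiment with, the `data1` column above could be rebuilt roughly as follows. This is only a sketch: the real data comes from a CSV, and the `rows` list and the variable name `df` are assumptions made for illustration.

```python
import pandas as pd

# Sample rows copied from the table above: (mun, loc, geo, block, data1).
rows = [
    (1, 0, 0, 0, 20),
    (1, 1, 0, 0, 10),
    (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3),
    (1, 1, 1, 2, 4),
    (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1),
    (1, 1, 2, 2, 3),
    (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12),
    (2, 1, 1, 1, 123),
    (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6),
    (2, 1, 2, 2, 1),
]

# Move the four hierarchy columns into a MultiIndex, leaving data1 as the only column.
df = pd.DataFrame(rows, columns=['mun', 'loc', 'geo', 'block', 'data1'])
df = df.set_index(['mun', 'loc', 'geo', 'block'])
print(df)
```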

Any ideas? I have tried for loops, converting the index levels to columns with reset_index() and iterating over column and row values, but the computation takes forever, and I don't think that is the right way to do this kind of operation.

Also, what if I want to get masks like the following, so I can run my calculations at every level?

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     False       
1    1    1    0     True          
1    1    1    1     False        
1    1    1    2     False        
1    1    2    0     True          
1    1    2    1     False        
1    1    2    2     False        

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     True       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
1    2    0    0     True
1    2    2    0     False          
1    2    2    1     False        

mun  loc  geo  block
1    0    0    0     True       
1    1    0    0     False       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
2    0    0    0     True
2    1    1    0     False          
2    1    2    1     False   

You can first create a mask from the MultiIndex: compare each level with 0 and use any to check whether at least one level is 0:

mask = (pd.DataFrame(df.index.values.tolist(), index=df.index) == 0).any(axis=1)
print (mask)
mun  loc  geo  block
1    0    0    0         True
     1    0    0         True
          1    0         True
               1        False
               2        False
          2    0         True
               1        False
               2        False
     2    1    1        False
               2        False
2    1    1    1        False
               2        False
          2    1        False
               2        False
dtype: bool
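
The per-level masks asked about in the question can probably be built along the same lines. The following is a sketch, not part of the original answer: it assumes the sample data shown in the question, uses `MultiIndex.to_frame()` to get one column per index level, and the names `geo_mask`, `loc_mask` and `mun_mask` are made up for illustration.

```python
import pandas as pd

# Sample rows from the question: (mun, loc, geo, block, data1).
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10), (1, 1, 1, 1, 3),
    (1, 1, 1, 2, 4), (1, 1, 2, 0, 30), (1, 1, 2, 1, 1), (1, 1, 2, 2, 3),
    (1, 2, 1, 1, 10), (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
df = (pd.DataFrame(rows, columns=['mun', 'loc', 'geo', 'block', 'data1'])
        .set_index(['mun', 'loc', 'geo', 'block']))

# One boolean column per index level, True where that level equals 0.
is_zero = df.index.to_frame().eq(0)

# Same result as the mask above: at least one level is 0.
any_zero = is_zero.any(axis=1)

# Geo-level summary rows: geo is set, block is 0.
geo_mask = ~is_zero['geo'] & is_zero['block']
# Loc-level summary rows: loc is set, geo and block are 0.
loc_mask = ~is_zero['loc'] & is_zero['geo'] & is_zero['block']
# Mun-level summary rows: loc, geo and block are all 0.
mun_mask = is_zero['loc'] & is_zero['geo'] & is_zero['block']

print(geo_mask)
```

Each mask keeps the full MultiIndex of `df`, so it can be passed straight to `df.loc[...]` to select the rows for one level.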

Then get the max values with groupby on the first, second and third index levels, but first filter with boolean indexing so that only rows where mask is not True are used:

# max of data1 per (mun, loc, geo) group, using block-level rows only
df1 = df.loc[~mask, 'data1'].groupby(level=['mun','loc','geo']).max()
print (df1)
mun  loc  geo
1    1    1        4
          2        3
     2    1       12
2    1    1      123
          2        6

Then reindex df1 by df.index with its last level removed by reset_index, mask the values that should not change using mask (the last level must be removed here too), and fillna with 1, because dividing by 1 returns the same value.

df1 = (df1.reindex(df.reset_index(level=3, drop=True).index)
          .mask(mask.reset_index(level=3, drop=True))
          .fillna(1))
print (df1)
mun  loc  geo
1    0    0        1.0
     1    0        1.0
          1        1.0
          1        4.0
          1        4.0
          2        1.0
          2        3.0
          2        3.0
     2    1       12.0
          1       12.0
2    1    1      123.0
          1      123.0
          2        6.0
          2        6.0
Name: data1, dtype: float64

Finally, divide with div:

print (df['data1'].div(df1.values,axis=0))
mun  loc  geo  block
1    0    0    0        20.000000
     1    0    0        10.000000
          1    0        10.000000
               1         0.750000
               2         1.000000
          2    0        30.000000
               1         0.333333
               2         1.000000
     2    1    1         0.833333
               2         1.000000
2    1    1    1         1.000000
               2         0.056911
          2    1         1.000000
               2         0.166667
dtype: float64
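
As an alternative to the reindex / mask / fillna steps, the same numbers can probably be obtained with `groupby(...).transform('max')`, which returns one group max per row. This is a different technique from the one above, shown only as a sketch: it assumes the sample data from the question, and the names `has_zero`, `blocks` and `ratio` are made up for illustration.

```python
import pandas as pd

# Sample rows from the question: (mun, loc, geo, block, data1).
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10), (1, 1, 1, 1, 3),
    (1, 1, 1, 2, 4), (1, 1, 2, 0, 30), (1, 1, 2, 1, 1), (1, 1, 2, 2, 3),
    (1, 2, 1, 1, 10), (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
df = (pd.DataFrame(rows, columns=['mun', 'loc', 'geo', 'block', 'data1'])
        .set_index(['mun', 'loc', 'geo', 'block']))

# Rows with a 0 in any index level are summary rows and are left unchanged.
has_zero = df.index.to_frame().eq(0).any(axis=1)
blocks = df.loc[~has_zero, 'data1']

# transform('max') broadcasts each (mun, loc, geo) max back onto its block rows.
ratio = blocks / blocks.groupby(level=['mun', 'loc', 'geo']).transform('max')

# Keep data1 on summary rows, use the ratio on block rows.
df['data2'] = df['data1'].where(has_zero, ratio.reindex(df.index))
print(df)
```

Because everything stays aligned on the full MultiIndex, there is no need to drop the last level or to pass `.values` to `div`.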
