Operations in multi index dataframe pandas
I need to process geographic and statistical data from a large CSV file. It contains geographical, administrative and geostatistical data. Municipality, location, geostatistical basic division and block constitute the hierarchical index levels.
I have to create a new column ['data2'] that holds, for every element, its data1 value divided by the maximum data1 value within the same geo index. This has to be done for each index level, and only where the index level value is different from 0, because a 0 index level value accounts for other types of information that are not used in the calculation.
                      data1  data2
mun  loc  geo  block
1    0    0    0         20  20
1    1    0    0         10  10
1    1    1    0         10  10
1    1    1    1          3  3/4
1    1    1    2          4  4/4
1    1    2    0         30  30
1    1    2    1          1  1/3
1    1    2    2          3  3/3
1    2    1    1         10  10/12
1    2    1    2         12  12/12
2    1    1    1        123  123/123
2    1    1    2          7  7/123
2    1    2    1          6  6/6
2    1    2    2          1  1/6
Any ideas? I have tried for loops, converting the indexes to columns with reset_index() and iterating over column and row values, but the computation takes forever and I think that is not the correct way to do this kind of operation.
Also, what if I want to get masks like the ones below, so that I can run my calculations at every level?
mun loc geo block
1 0 0 0 False
1 1 0 0 False
1 1 1 0 True
1 1 1 1 False
1 1 1 2 False
1 1 2 0 True
1 1 2 1 False
1 1 2 2 False
mun loc geo block
1 0 0 0 False
1 1 0 0 True
1 1 1 0 False
1 1 1 1 False
1 1 1 2 False
1 2 0 0 True
1 2 2 0 False
1 2 2 1 False
mun loc geo block
1 0 0 0 True
1 1 0 0 False
1 1 1 0 False
1 1 1 1 False
1 1 1 2 False
2 0 0 0 True
2 1 1 0 False
2 1 2 1 False
You can first create a mask from the MultiIndex: compare it with 0 and use any to check whether a row has at least one True (at least one 0):
mask = (pd.DataFrame(df.index.values.tolist(), index=df.index) == 0).any(axis=1)
print (mask)
mun  loc  geo  block
1    0    0    0         True
     1    0    0         True
          1    0         True
               1        False
               2        False
          2    0         True
               1        False
               2        False
     2    1    1        False
               2        False
2    1    1    1        False
               2        False
          2    1        False
               2        False
dtype: bool
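The mask step can be reproduced on the sample rows from the question. The sketch below is only an illustration: it builds the same mask with df.index.to_frame() instead of the list conversion used above, and it derives the three per-level masks from the follow-up question under the assumption that a summary row for a level is one where that level is non-zero and the next level is 0. The names levels, geo_rows, loc_rows and mun_rows are made up for this example.

```python
import pandas as pd

# Sample rows copied from the question: (mun, loc, geo, block, data1)
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3), (1, 1, 1, 2, 4), (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1), (1, 1, 2, 2, 3), (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
cols = ["mun", "loc", "geo", "block", "data1"]
df = pd.DataFrame(rows, columns=cols).set_index(cols[:4])

# One column per index level, indexed by the MultiIndex itself
levels = df.index.to_frame()

# True where at least one index level equals 0
mask = levels.eq(0).any(axis=1)

# Per-level masks (assumed rule: level is non-zero, next level is 0)
geo_rows = levels["geo"].ne(0) & levels["block"].eq(0)
loc_rows = levels["loc"].ne(0) & levels["geo"].eq(0)
mun_rows = levels["mun"].ne(0) & levels["loc"].eq(0)

print(mask)
print(geo_rows[geo_rows].index.tolist())
```

Each per-level mask is a boolean Series aligned with df, so it can be passed to df.loc in the same way as mask.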
Then get the max values with groupby per first, second and third index level, but first filter by boolean indexing so that only the rows where mask is not True are used:
# df.ix was removed in pandas 1.0; df.loc accepts the same boolean mask
df1 = df.loc[~mask, 'data1'].groupby(level=['mun','loc','geo']).max()
print (df1)
mun  loc  geo
1    1    1        4
          2        3
     2    1       12
2    1    1      123
          2        6
Name: data1, dtype: int64
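A self-contained version of the filter and groupby step, again on the sample rows from the question, might look as follows. It is a sketch rather than the answer's exact code: the mask is rebuilt with index.to_frame(), and group_max is a name chosen for this example.

```python
import pandas as pd

# Sample rows copied from the question: (mun, loc, geo, block, data1)
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3), (1, 1, 1, 2, 4), (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1), (1, 1, 2, 2, 3), (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
cols = ["mun", "loc", "geo", "block", "data1"]
df = pd.DataFrame(rows, columns=cols).set_index(cols[:4])

# True where at least one index level equals 0
mask = df.index.to_frame().eq(0).any(axis=1)

# Keep only rows without a 0 level, then take the max per (mun, loc, geo)
group_max = df.loc[~mask, "data1"].groupby(level=["mun", "loc", "geo"]).max()
print(group_max)
```

Filtering before grouping keeps the summary rows (block 0) out of the maximum, which is why the group (1, 1, 1) yields 4 and not 10.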
Then reindex df1 by df.index with the last level of the MultiIndex removed by reset_index, mask the values that must not change by using mask (removing its last level is also necessary) and fillna with 1, because dividing by 1 returns the same value.
# parentheses are needed so the chained call can continue on the next line
df1 = (df1.reindex(df.reset_index(level=3, drop=True).index)
          .mask(mask.reset_index(level=3, drop=True)).fillna(1))
print (df1)
mun  loc  geo
1    0    0        1.0
     1    0        1.0
          1        1.0
          1        4.0
          1        4.0
          2        1.0
          2        3.0
          2        3.0
     2    1       12.0
          1       12.0
2    1    1      123.0
          1      123.0
          2        6.0
          2        6.0
Name: data1, dtype: float64
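The reindex, mask and fillna chain can be tried in isolation with the sketch below. It differs slightly from the code above: it uses df.index.droplevel("block") instead of reset_index(level=3, drop=True), and it passes the mask as a plain NumPy array so that no index alignment is involved. The names short_index, group_max and divisor are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question: (mun, loc, geo, block, data1)
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3), (1, 1, 1, 2, 4), (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1), (1, 1, 2, 2, 3), (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
cols = ["mun", "loc", "geo", "block", "data1"]
df = pd.DataFrame(rows, columns=cols).set_index(cols[:4])

mask = df.index.to_frame().eq(0).any(axis=1)
group_max = df.loc[~mask, "data1"].groupby(level=["mun", "loc", "geo"]).max()

# Index of df without the block level: one (possibly repeated) key per row
short_index = df.index.droplevel("block")

divisor = (
    group_max.reindex(short_index)  # repeat each group max for every row
    .mask(mask.to_numpy())          # blank out rows that contain a 0 level
    .fillna(1)                      # dividing by 1 leaves those rows unchanged
)
print(divisor)
```

The mask call is still needed after filtering, because a summary row such as (1, 1, 1, 0) shares its (mun, loc, geo) key with real blocks and would otherwise pick up their maximum.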
Finally, divide data1 by the values of df1:
print (df['data1'].div(df1.values,axis=0))
mun  loc  geo  block
1    0    0    0        20.000000
     1    0    0        10.000000
          1    0        10.000000
               1         0.750000
               2         1.000000
          2    0        30.000000
               1         0.333333
               2         1.000000
     2    1    1         0.833333
               2         1.000000
2    1    1    1         1.000000
               2         0.056911
          2    1         1.000000
               2         0.166667
dtype: float64
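To store the result as the data2 column requested in the question, one possible alternative to the reindex step is groupby(...).transform('max'), which returns the group maximum already aligned with the original rows. This is a swapped-in technique, not the method used above, and the names valid and group_max are made up for the sketch.

```python
import pandas as pd

# Sample rows copied from the question: (mun, loc, geo, block, data1)
rows = [
    (1, 0, 0, 0, 20), (1, 1, 0, 0, 10), (1, 1, 1, 0, 10),
    (1, 1, 1, 1, 3), (1, 1, 1, 2, 4), (1, 1, 2, 0, 30),
    (1, 1, 2, 1, 1), (1, 1, 2, 2, 3), (1, 2, 1, 1, 10),
    (1, 2, 1, 2, 12), (2, 1, 1, 1, 123), (2, 1, 1, 2, 7),
    (2, 1, 2, 1, 6), (2, 1, 2, 2, 1),
]
cols = ["mun", "loc", "geo", "block", "data1"]
df = pd.DataFrame(rows, columns=cols).set_index(cols[:4])

# True where at least one index level equals 0
mask = df.index.to_frame().eq(0).any(axis=1)

# Hide rows with a 0 level so they do not take part in the maximum
valid = df["data1"].where(~mask)

# Group maximum broadcast back to every row of df
group_max = valid.groupby(level=["mun", "loc", "geo"]).transform("max")

# Rows with a 0 level are divided by 1, so they keep their data1 value
df["data2"] = df["data1"] / group_max.where(~mask, 1)
print(df)
```

Because transform keeps the full four-level index, the division aligns on labels and no .values conversion is required.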