python pandas filter function regex behavior on MultiIndex dataframe
I have a dataframe df that looks like this (see the Appendix for code to generate it):
fy                                                              2018           2019
tag                                                uom
Assets                                             USD  3.753190e+11   3.385160e+11
AssetsCurrent                                      USD  1.286450e+11   1.628190e+11
AssetsNoncurrent                                   USD  2.466740e+11   1.756970e+11
DeferredTaxAssetsDeferredCostSharing               USD  6.670000e+08            NaN
DeferredTaxAssetsDeferredIncome                    USD  1.521000e+09   1.141000e+09
DeferredTaxAssetsGoodwillAndIntangibleAssets       USD           NaN   1.143300e+10
DeferredTaxAssetsLiabilitiesNet                    USD  5.834000e+09   5.834000e+09
DeferredTaxAssetsNet                               USD  8.974000e+09   6.610000e+09
DeferredTaxAssetsOther                             USD  8.340000e+08   7.970000e+08
DeferredTaxAssetsPropertyPlantAndEquipment         USD  1.230000e+09   1.370000e+08
DeferredTaxAssetsTaxDeferredExpenseCompensation... USD  7.030000e+08   5.130000e+08
DeferredTaxAssetsTaxDeferredExpenseReservesAndA... USD  4.019000e+09   3.151000e+09
DeferredTaxAssetsUnrealizedLossesOnAvailablefor... USD  0.000000e+00   8.710000e+08
DerivativeAssetsReductionforMasterNettingArrang... USD  1.400000e+09   2.100000e+09
IncreaseDecreaseInOtherOperatingAssets             USD -1.055000e+09   5.318000e+09
NoncurrentAssets                                   USD  3.378300e+10   4.130400e+10
OtherAssetsCurrent                                 USD  1.208700e+10   1.208700e+10
OtherAssetsNoncurrent                              USD  2.228300e+10   2.228300e+10
This is a MultiIndex pivot table with index levels tag and uom. My goal is to filter rows by the tag level using a regex and the filter function. For example:
df.filter(regex="^Assets$", axis="index")
Ideally this would return only the row:
fy                  2018          2019
tag    uom
Assets USD  3.753190e+11  3.385160e+11
However, when I do so, it outputs an empty dataframe:
Empty DataFrame
Columns: [2018, 2019]
Index: []
I'm able to circumvent this problem by using:
df.index.get_level_values("tag").str.contains("^Assets$")
or, as a function:
search = lambda df, regex, index_name: df.loc[df.index.get_level_values(index_name).str.contains(regex)]
But this is far less satisfying to me. Am I missing something about the pandas filter function and how its regex input works? It does not behave as expected, and my guess is that this is because I have two index levels, tag and uom, so the regex fails on the uom level when I use "^Assets$" as my regex. This is supported by the regex "^Assets$|USD", which returns the entire dataframe because all rows have uom=USD; it shows that the filter function takes both levels into account. If this is the case, how do I restrict the filter function to the tag level on a MultiIndex dataframe?
Appendix:
import pandas as pd
import numpy as np
levels = ['Assets',
'AssetsCurrent',
'AssetsNoncurrent',
'DeferredTaxAssetsDeferredCostSharing',
'DeferredTaxAssetsDeferredIncome',
'DeferredTaxAssetsGoodwillAndIntangibleAssets',
'DeferredTaxAssetsLiabilitiesNet',
'DeferredTaxAssetsNet',
'DeferredTaxAssetsOther',
'DeferredTaxAssetsPropertyPlantAndEquipment',
'DeferredTaxAssetsTaxDeferredExpenseCompensationAndBenefitsShareBasedCompensationCost',
'DeferredTaxAssetsTaxDeferredExpenseReservesAndAccruals',
'DeferredTaxAssetsUnrealizedLossesOnAvailableforSaleSecuritiesGross',
'DerivativeAssetsReductionforMasterNettingArrangements',
'IncreaseDecreaseInOtherOperatingAssets',
'NoncurrentAssets',
'OtherAssetsCurrent',
'OtherAssetsNoncurrent']
codes = ['USD' for i in range(len(levels))]
index = pd.MultiIndex.from_arrays([levels, codes], names=['tag', 'uom'])
columns = pd.Index([2018, 2019], dtype='int64', name='fy')  # pd.Int64Index was removed in pandas 2.0
values = [[3.75319e+11, 3.38516e+11],
[1.28645e+11, 1.62819e+11],
[2.46674e+11, 1.75697e+11],
[6.67000e+08, np.nan],
[1.52100e+09, 1.14100e+09],
[np.nan, 1.14330e+10],
[5.83400e+09, 5.83400e+09],
[8.97400e+09, 6.61000e+09],
[8.34000e+08, 7.97000e+08],
[1.23000e+09, 1.37000e+08],
[7.03000e+08, 5.13000e+08],
[4.01900e+09, 3.15100e+09],
[0.00000e+00, 8.71000e+08],
[1.40000e+09, 2.10000e+09],
[-1.05500e+09, 5.31800e+09],
[3.37830e+10, 4.13040e+10],
[1.20870e+10, 1.20870e+10],
[2.22830e+10, 2.22830e+10]]
df = pd.DataFrame(values, columns=columns, index=index)
The implementation of the regex part of the filter function is short and easy to adapt to a multi-index scenario where you only want to apply the regex to one level of the multi-index. I know this is not a direct answer to what you asked, because you're right: as implemented, the filter function does not handle a multi-index level by level.
I ended up here with the same problem and thought it might be useful to others to post the code I have used, adapted from the pandas original:
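import re  # the standard-library module is enough here

def filter_multi(df, index_level_name, regex, axis=0):
    # Keep the rows (axis=0) or columns (axis=1) whose label on the
    # named level matches the regex (re.search semantics).
    def f(x):
        return matcher.search(str(x)) is not None

    matcher = re.compile(regex)
    values = df.axes[axis].get_level_values(index_level_name).map(f)
    return df.loc(axis=axis)[values]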
Using the code in your Appendix:
print(df)
print(filter_multi(df, index_level_name='tag', regex='^Assets$', axis=0))
print(filter_multi(df, index_level_name='fy', regex='^2019$', axis=1))
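For anyone wondering why the plain filter call comes back empty: as far as I can tell from the pandas source, filter converts each axis label to a string before searching it, and on a MultiIndex each label is a tuple. The sketch below uses a small stand-in frame (not the full data from the question) to show what the regex is actually matched against. The last pattern is only there to illustrate the point; relying on the string form of a tuple is fragile and not something I would recommend.

```python
import pandas as pd

# Small stand-in for the question's frame (two 'tag' rows, one 'uom' value).
index = pd.MultiIndex.from_arrays(
    [["Assets", "AssetsCurrent"], ["USD", "USD"]], names=["tag", "uom"]
)
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=index, columns=[2018, 2019])

# Each MultiIndex label is a tuple; this is its string form.
first_label = str(df.index[0])
print(first_label)

# "^Assets$" cannot match the tuple's string form, so no row is kept.
empty = df.filter(regex="^Assets$", axis=0)
print(empty)

# A pattern written against the tuple's string form keeps exactly one row.
one_row = df.filter(regex=r"^\('Assets',", axis=0)
print(one_row)
```

This is also consistent with "^Assets$|USD" returning every row in the question: the USD alternative is found somewhere inside every tuple string.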
If you want to select a single value from the first level of a multi-index, you can just use loc:
df.loc[['Assets']]
which gives:
fy 2018 2019
tag uom
Assets USD 3.753190e+11 3.385160e+11
If filter must be used for your real problem, you can reset the unused level of the index and set it back after filtering:
df.reset_index(level='uom').filter(regex='^Assets$', axis=0).set_index('uom', append=True)
Another option is to first remove uom from your index, apply filter (which is then applied to the only remaining index level, tag), and add uom back to your index, as in
df.reset_index('uom').filter(regex="^Assets$", axis=0).set_index('uom', append=True)
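The reset/filter/set_index chain above can be wrapped in a small helper so that it also works when there are more than two index levels. This is only a sketch: filter_level is a hypothetical name (not a pandas function), and it assumes a MultiIndex whose level names are unique and do not clash with any column label.

```python
import pandas as pd

def filter_level(df, level, regex):
    """Hypothetical helper: run DataFrame.filter(regex=...) on one row-index level.

    Assumes a MultiIndex with unique level names that differ from the column labels.
    """
    names = list(df.index.names)
    others = [name for name in names if name != level]
    kept = (
        df.reset_index(level=others)       # leave only `level` in the index
          .filter(regex=regex, axis=0)     # the regex now sees plain labels
          .set_index(others, append=True)  # put the other levels back
    )
    return kept.reorder_levels(names)      # restore the original level order

# Small stand-in for the question's frame.
index = pd.MultiIndex.from_arrays(
    [["Assets", "AssetsCurrent", "OtherAssetsCurrent"], ["USD", "USD", "USD"]],
    names=["tag", "uom"],
)
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], index=index, columns=[2018, 2019]
)

result = filter_level(df, "tag", "^Assets$")
by_uom = filter_level(df, "uom", "USD")
print(result)
```

The reorder_levels call only matters when the filtered level is not the first one, since set_index(..., append=True) always puts the restored levels after it.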