Filter MultiIndex with Query Strings

I have a fairly large DataFrame, say 600 indexes, and want to use filter criteria to produce a reduced version of the DataFrame containing only the entries where the criteria are true. From the research I've done, filtering works well when you're applying expressions to the data and already know the index you're operating on. What I want to do, however, is apply the filtering criteria to the index itself. See the example below.

MultiIndex labels are in bold; MultiIndex level names are in italics.

(image: example DataFrame)

I'd like to apply the criteria something along these lines:

df = df[MultiIndex.query('base == 115 & Al.isin(stn)')]

Then maybe do something like this:

df = df.transpose()[MultiIndex.query('Fault.isin(cont)')].transpose()

To result in:

(image: expected result)

I think fundamentally I'm trying to produce a boolean list to mask the MultiIndex. If there is a quick way to apply a pandas query to a 2D list, that would be acceptable. As of now, one option seems to be to take the MultiIndex, convert it to a DataFrame, and then apply whatever filtering I want to get the True/False array. I'm concerned that this will be slow, though.

As you noticed, indexes aren't great for querying with filter expressions. There's df.filter(), but it doesn't really seem to work well on a MultiIndex.
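When the goal is just a boolean mask, one can be built per level without any query string. The following is only a minimal sketch, not something taken from the original post: it assumes the rows carry levels named season and cont, the columns carry levels named stn and base, and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; level names and values are assumptions.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 0.6], [0.7, 0.5, 0.4], [1.1, 1.0, 0.3]],
    index=index,
    columns=columns,
)

# One boolean array per axis, built from individual level values.
row_mask = df.index.get_level_values("cont").str.contains("Fault")
col_mask = (df.columns.get_level_values("base") == 115) & (
    df.columns.get_level_values("stn").str.contains("Al")
)

# .loc accepts a boolean mask on each axis.
masked = df.loc[row_mask, col_mask]
print(masked)
```

This avoids converting the whole MultiIndex to a DataFrame, at the cost of writing each condition in plain pandas rather than in query syntax.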

You can still filter the MultiIndex values as an iterable of Python tuples, and then use .loc to access the filtered results.

This works:

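# Keep the row labels whose 'cont' part contains "Fault".
rows = [(season, cont)
        for (season, cont) in df.index
        if 'Fault' in cont]
# Keep the column labels with base 115 whose 'stn' part contains "Al".
cols = [(stn, base)
        for (stn, base) in df.columns
        if base == 115 and 'Al' in stn]
# Select the matching rows and columns by label.
df.loc[rows, cols]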
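To try the snippet above end to end, here is a self-contained sketch. The sample frame is hypothetical: the level names (season, cont, stn, base) and the values are assumptions chosen for illustration, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample data; level names and values are assumptions.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 0.6], [0.7, 0.5, 0.4], [1.1, 1.0, 0.3]],
    index=index,
    columns=columns,
)

# Iterating over a MultiIndex yields one tuple per label.
rows = [(season, cont) for (season, cont) in df.index if "Fault" in cont]
cols = [
    (stn, base)
    for (stn, base) in df.columns
    if base == 115 and "Al" in stn
]

# Lists of tuples select the exact labels on each axis.
subset = df.loc[rows, cols]
print(subset)
```

Because the conditions are ordinary Python expressions, any test that works on the tuple elements can be used here, not only those that query syntax supports.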

If what you're after is using the nifty df.query() syntax to slice your data, then you're better off "unpivoting" your DataFrame, turning all index levels and column labels into regular columns.

You can create an "unpivoted" DataFrame with:

df_unpivot = df.stack(level=[0, 1]).rename('value').reset_index()

This produces a DataFrame that looks like this:

   season   cont    stn  base  value
0  Summer  Fault  Alpha   115    1.0
1  Summer  Fault   Beta   115    0.8
2  Summer  Fault  Gamma   230    0.7
3  Summer   Trip  Alpha   115    1.2
4  Summer   Trip   Beta   115    0.9
...

You can then query it with:

df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    'base == 115'
)

Which produces:

   season   cont    stn  base  value
0  Summer  Fault  Alpha   115    1.0
6  Winter  Fault  Alpha   115    0.7

These are the two values you were expecting.
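The unpivot-and-query steps can be put together in one runnable sketch. The sample frame is hypothetical (level names and values are assumptions for illustration), and engine="python" is passed only as a precaution, since string-method calls inside query() may not be handled by the numexpr engine on every pandas version.

```python
import pandas as pd

# Hypothetical sample data; level names and values are assumptions.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 0.6], [0.7, 0.5, 0.4], [1.1, 1.0, 0.3]],
    index=index,
    columns=columns,
)

# Move both column levels into the index, name the resulting Series,
# then turn every index level into a regular column.
df_unpivot = df.stack(level=[0, 1]).rename("value").reset_index()

# All four former levels are now ordinary columns that query() can see.
result = df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    "base == 115",
    engine="python",
)
print(result)
```

The trade-off is that the result is in long format; if the original wide layout is needed afterwards, it has to be pivoted back.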
