How to conditionally slice a DataFrame in pandas

Question

Consider a pandas DataFrame constructed as follows:

import pandas
df = pandas.DataFrame({'a':['one','two','three']})

I can then locate the specific row of the DataFrame that contains, say, 'two':

df[df.a == 'two']

However, so far the only way I have found to subset the DataFrame up to this row is:

df[:df[df.a == 'two'].index[0]]

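But this is ugly, so:

Is there a more appropriate way to accomplish this subsetting?

Specifically, I am interested in how to slice a DataFrame between the row indices at which a given column matches an arbitrary text string ('two' in this case). For this particular case it would be equivalent to df[:2]. In general, though, being able to locate the index for the start and/or end of a slice based on column values seems like a reasonable thing to want?

One last example, in case it helps; I would like to be able to do something like this: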

df[df.a == 'one' : df.a == 'three']

to get a slice containing rows 1 and 2 of the DataFrame, equivalent to df[0:3].

Answer 1

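You want to identify the indices of a particular start value and stop value, and get the matching rows plus all the rows in between. One way is to find the indices and build a range, but you have already said you don't like that approach. Here is a general solution using boolean logic that should work for you.

First, let's make a more interesting example: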

import pandas as pd
df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})

Suppose start = "two" and stop = "four". That is, you want to get the following output DataFrame:

       a
1    two
2  three
3   four

We can find the indices of the boundary rows with:

df["a"].isin({start, stop})
#0    False
#1     True
#2    False
#3     True
#4    False
#Name: a, dtype: bool

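If the value at index 2 were also True, we would be done, because we could use this output directly as a mask. So let's find a way to create the mask we need.

First, we can use cummax() together with the boolean XOR operator (^):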

(df["a"]==start).cummax() ^ (df["a"]==stop).cummax()
#0    False
#1     True
#2     True
#3    False
#4    False
#Name: a, dtype: bool
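
To see why the XOR works, it can help to look at the two cumulative maxima separately. The following is a small illustrative sketch; the names seen_start, seen_stop and between are introduced here for explanation only and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"a": ["one", "two", "three", "four", "five"]})
start, stop = "two", "four"

# True from the first row matching `start` onward
seen_start = (df["a"] == start).cummax()
# True from the first row matching `stop` onward
seen_stop = (df["a"] == stop).cummax()

# XOR is True only where exactly one of the two flags is set,
# i.e. after start has been seen but before stop has been seen
between = seen_start ^ seen_stop

print(seen_start.tolist())  # [False, True, True, True, True]
print(seen_stop.tolist())   # [False, False, False, True, True]
print(between.tolist())     # [False, True, True, False, False]
```

Because seen_stop is already True on the stop row itself, the XOR turns that row off, which is why the stop row is missing from the result.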

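This is almost what we want, except that the index of the stop value is missing. So let's bitwise OR (|) in the stop condition:

(df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
#0    False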
#1     True
#2     True
#3     True
#4    False
#Name: a, dtype: bool

This gives the result we want. So create a mask and index the DataFrame with it:

mask = (df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
print(df[mask])
#       a
#1    two
#2  three
#3   four

We can extend these findings into a function that also supports indexing up to a given row, or from a given row to the end:

def get_rows(df, col, start, stop):
    if start is None:
        mask = ~((df[col] == stop).cummax() ^ (df[col] == stop))
    else:
        mask = (df[col]==start).cummax() ^ (df[col]==stop).cummax() | (df[col]==stop)
    return df[mask]

# get rows between "two" and "four" inclusive
print(get_rows(df=df, col="a", start="two", stop="four"))
#       a
#1    two
#2  three
#3   four

# get rows from "two" until the end
print(get_rows(df=df, col="a", start="two", stop=None))
#       a
#1    two
#2  three
#3   four
#4   five

# get rows up to "two"
print(get_rows(df=df, col="a", start=None, stop="two"))
#     a
#0  one
#1  two
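
One property of the mask-based approach worth noting is that it works on row order only, so it should not matter what the index labels are. A minimal sketch, assuming a hypothetical DataFrame df2 with string labels (not from the original question):

```python
import pandas as pd

# Hypothetical data: same column as above, but labelled v..z instead of 0..4
df2 = pd.DataFrame(
    {"a": ["one", "two", "three", "four", "five"]},
    index=["v", "w", "x", "y", "z"],
)
start, stop = "two", "four"

# Same mask expression as above; it never looks at the index labels
mask = (df2["a"] == start).cummax() ^ (df2["a"] == stop).cummax() | (df2["a"] == stop)
result = df2[mask]
print(result)
#       a
#w    two
#x  three
#y   four
```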

Update:

For completeness, here is the index-based solution.

def get_rows_indexing(df, col, start, stop):
    min_ind = min(df.index[df[col]==start].tolist() or [0])
    max_ind = max(df.index[df[col]==stop].tolist() or [len(df)])
    return df[min_ind:max_ind+1]

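This function does essentially the same thing as the other version, but it may be easier to understand. It is also more robust, because the other version relies on None not being a value in the desired column.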
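
Note that get_rows_indexing slices with index labels and adds 1 to one of them, so it appears to assume the default integer index 0..n-1, where labels and positions coincide. If that assumption does not hold, a position-based variant is one option. The function name get_rows_positional and its defaults are hypothetical; this is a sketch, not part of the original answer.

```python
import numpy as np
import pandas as pd

def get_rows_positional(df, col, start=None, stop=None):
    """Hypothetical variant: slice by row position rather than index label."""
    # 0-based positions of the rows matching each boundary value
    start_pos = np.flatnonzero((df[col] == start).to_numpy())
    stop_pos = np.flatnonzero((df[col] == stop).to_numpy())
    # Fall back to the first / last row when a boundary has no match
    first = start_pos.min() if start_pos.size else 0
    last = stop_pos.max() if stop_pos.size else len(df) - 1
    # iloc slices exclude the end position, hence the + 1
    return df.iloc[first:last + 1]

# Hypothetical example data with a non-integer index
df2 = pd.DataFrame(
    {"a": ["one", "two", "three", "four", "five"]},
    index=["v", "w", "x", "y", "z"],
)
print(get_rows_positional(df2, "a", start="two", stop="four"))
```

Like the original, it uses the first match for start and the last match for stop when a value occurs more than once.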

Answer 2

If you temporarily use column 'a' as the index, then the locate method (loc) does exactly what you are asking for.

df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})
start = 'two'
stop = 'four'
df = df.set_index('a').loc[start:stop].reset_index()
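
The same idea can be wrapped in a small helper that also handles open-ended slices. The name slice_between is hypothetical; this is a sketch that assumes the values in the column are unique, since label slicing on an unsorted index with duplicate labels can raise a KeyError.

```python
import pandas as pd

def slice_between(df, col, start=None, stop=None):
    """Hypothetical helper wrapping the set_index / loc idea."""
    # Label slices with .loc include both endpoints;
    # passing None leaves that end of the slice open
    return df.set_index(col).loc[start:stop].reset_index()

df = pd.DataFrame({"a": ["one", "two", "three", "four", "five"]})

print(slice_between(df, "a", "two", "four"))
print(slice_between(df, "a", stop="two"))
```

Because of the final reset_index(), the rows in the result are renumbered from 0 and the original index is not preserved.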

Answer 1
1 Accepted 2018-04-26 20:01:03

Answer 2
1 2019-03-02 18:02:04
