pandas dataframe - find longest consecutive rows with a certain condition
Given a pandas DataFrame named "df" as follows:
A
2015-05-01 True
2015-05-02 True
2015-05-03 False
2015-05-04 False
2015-05-05 False
2015-05-06 False
2015-05-07 True
2015-05-08 False
2015-05-09 False
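I want to return a slice containing the longest run of consecutive rows where column 'A' reads 'False'. Can this be done?
You can use cumsum to detect changes in column A, since booleans can be summed in Python.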
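As a small illustration of what cumsum produces on a boolean column, here is a sketch using the same nine values as the question. The variable names a and labels are only for this example.

```python
import pandas as pd

a = pd.Series([True, True, False, False, False, False, True, False, False])

# True counts as 1 and False as 0, so the running total
# only increases on rows where the value is True.
labels = a.cumsum()
print(labels.tolist())
# [1, 2, 2, 2, 2, 2, 3, 3, 3]
```

Note that the label goes up on every True row, so a run of False rows shares its label with the True row just before it.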
# Test data
import pandas as pd
df = pd.DataFrame([True, True, False, False, False, False, True, False, False],
                  index=pd.to_datetime(['2015-05-01', '2015-05-02', '2015-05-03',
                                        '2015-05-04', '2015-05-05', '2015-05-06',
                                        '2015-05-07', '2015-05-08', '2015-05-09']),
                  columns=['A'])
# We have to ensure that the index is sorted
df.sort_index(inplace=True)
# Resetting the index to create a column
df.reset_index(inplace=True)
# Grouping by the cumsum and counting the number of dates and getting their min and max
df = df.groupby(df['A'].cumsum()).agg(
{'index': ['count', 'min', 'max']})
# Removing useless column level
df.columns = df.columns.droplevel()
print(df)
# count min max
# A
# 1 1 2015-05-01 2015-05-01
# 2 5 2015-05-02 2015-05-06
# 3 3 2015-05-07 2015-05-09
# Getting the max
df[df['count']==df['count'].max()]
# count min max
# A
# 2 5 2015-05-02 2015-05-06
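Sorry to bring back an old post, but I noticed that Romain's answer is slightly off: the counts are incorrect, which makes the result inaccurate. The count column should contain 4 items, [2, 4, 1, 2], with a maximum of 4.
To demonstrate the problem, I have broken it down a bit (df is the same as in the accepted answer above). You can see that the resulting groups are incorrect: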
# sort (without inplace=True, which would return None)
dfS = df.sort_index()
# reset
dfSR = dfS.reset_index()
# group
dfG = dfSR.groupby(dfSR['A'].cumsum())
# show resulting groups
for group in dfG: print(group)
# (1, index A
# 0 2015-05-01 True)
# (2, index A
# 1 2015-05-02 True
# 2 2015-05-03 False
# 3 2015-05-04 False
# 4 2015-05-05 False
# 5 2015-05-06 False)
# (3, index A
# 6 2015-05-07 True
# 7 2015-05-08 False
# 8 2015-05-09 False)
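As an independent check of the expected counts, here is a minimal sketch that uses itertools.groupby from the standard library instead of pandas. It is not part of either answer, and the names values, run_lengths and longest_false are only for this example.

```python
from itertools import groupby

values = [True, True, False, False, False, False, True, False, False]

# itertools.groupby yields one group per maximal stretch of equal
# consecutive values, which is exactly a "run".
run_lengths = [(value, len(list(run))) for value, run in groupby(values)]
print(run_lengths)
# [(True, 2), (False, 4), (True, 1), (False, 2)]

# Longest run among the False runs only.
longest_false = max(length for value, length in run_lengths if not value)
print(longest_false)
# 4
```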
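Thanks to DSM's answer and, of course, Romain's answer, combining the techniques from the two posts gives the answer. They are already explained in the posts they come from, so I will leave it at the code below.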
import pandas as pd
df = pd.DataFrame([True, True, False, False, False, False, True, False, False],
index=pd.to_datetime(['2015-05-01', '2015-05-02', '2015-05-03',
'2015-05-04', '2015-05-05', '2015-05-06',
'2015-05-07', '2015-05-08', '2015-05-09']),
columns=['A'])
df.sort_index(inplace=True)
df.reset_index(inplace=True)
dfBool = df['A'] != df['A'].shift()
dfCumsum = dfBool.cumsum()
groups = df.groupby(dfCumsum)
for g in groups: print(g)
groupCounts = groups.agg({'index':['count', 'min', 'max']})
groupCounts.columns = groupCounts.columns.droplevel()
print('\n', groupCounts, '\n')
maxCount = groupCounts[groupCounts['count'] == groupCounts['count'].max()]
print(maxCount, '\n')
Output:
(1, index A
0 2015-05-01 True
1 2015-05-02 True)
(2, index A
2 2015-05-03 False
3 2015-05-04 False
4 2015-05-05 False
5 2015-05-06 False)
(3, index A
6 2015-05-07 True)
(4, index A
7 2015-05-08 False
8 2015-05-09 False)
count min max
A
1 2 2015-05-01 2015-05-02
2 4 2015-05-03 2015-05-06
3 1 2015-05-07 2015-05-07
4 2 2015-05-08 2015-05-09
count min max
A
2 4 2015-05-03 2015-05-06
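Since the question asks for the slice itself rather than a summary table, here is one possible sketch that returns the matching rows. The function name longest_false_run and its col parameter are hypothetical, it assumes the column holds plain booleans, and if several False runs tie for longest it returns the earliest one.

```python
import pandas as pd

def longest_false_run(df, col="A"):
    """Return the rows of the longest run of consecutive False values in col."""
    df = df.sort_index()
    is_false = ~df[col].astype(bool)
    if not is_false.any():
        return df.iloc[0:0]  # no False rows at all: empty slice
    # A new run starts wherever the value differs from the previous row.
    run_id = (df[col] != df[col].shift()).cumsum()
    # Summing is_false per run gives the run length for False runs
    # and 0 for True runs.
    false_counts = is_false.groupby(run_id).sum()
    # idxmax returns the first label with the largest count.
    best = false_counts.idxmax()
    return df[run_id == best]

df = pd.DataFrame(
    {"A": [True, True, False, False, False, False, True, False, False]},
    index=pd.date_range("2015-05-01", periods=9),
)
result = longest_false_run(df)
print(result)  # the four rows from 2015-05-03 through 2015-05-06
```

The result keeps the original date index, so it can be used directly as a slice of df.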