pandas - find first occurrence

Question

Suppose I have a structured dataframe as follows:

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

The A column has previously been sorted. I wish to find the index of the first row where df.A != 'a'. The end goal is to use this index to break the data frame into groups based on A.

Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A has been sorted already, it would be faster if I could just find the first index where df.A != 'a'. It is therefore important that, whatever method you use, the scanning stops once the first element is found.

Answer 1

idxmax and argmax return the position of the maximal value, or the first such position if the maximal value occurs more than once. In a boolean mask, True is the maximum, so both return the first True.

Use idxmax on df.A.ne('a'):

df.A.ne('a').idxmax()

3

or the numpy equivalent:

(df.A.values != 'a').argmax()

3
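One caveat with this approach: if no element differs from 'a', the mask is all False and idxmax still returns the first label instead of signalling "not found". A minimal sketch of a guard follows; the helper name first_not_equal is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})


def first_not_equal(series, value):
    """Return the index label of the first element != value, or None."""
    mask = series.ne(value)
    # An all-False mask would make idxmax return the first label,
    # so check explicitly that at least one element matches.
    if not mask.any():
        return None
    return mask.idxmax()


print(first_not_equal(df.A, "a"))
print(first_not_equal(pd.Series(["a", "a"]), "a"))
```

Note that the guard scans the mask a second time, so this trades a little speed for a clear "not found" result.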

However, if A has already been sorted, then we can use searchsorted, which does a binary search instead of scanning the whole column:

df.A.searchsorted('a', side='right')

array([3])

Or the numpy equivalent:

df.A.values.searchsorted('a', side='right')

3
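Since the stated goal is to break the data frame into groups, here is a minimal sketch of how the position from searchsorted could be used for that, assuming A is sorted and the frame has the toy layout from the question. The variable names (split, head, tail, boundaries, groups) are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})
values = df["A"].to_numpy()

# Position of the first row after the run of 'a' (binary search, A sorted)
split = int(values.searchsorted("a", side="right"))
head, tail = df.iloc[:split], df.iloc[split:]

# For more than two groups: positions where the value changes
boundaries = np.flatnonzero(values[1:] != values[:-1]) + 1
edges = [0, *boundaries.tolist(), len(df)]
groups = [df.iloc[start:stop] for start, stop in zip(edges[:-1], edges[1:])]

print(split)
print([g["A"].iloc[0] for g in groups])
```

The boundary computation compares each value with its neighbour, so unlike searchsorted it does read the whole column once.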

Answer 2

I found that pandas DataFrames have a first_valid_index function that will do the job. One could use it as follows:

df[df.A!='a'].first_valid_index()

3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]

Below I compare the total time (in seconds) of repeating the calculation 100 times for these two options and for all the code above:

                      total_time_sec    ratio wrt fastest algo
searchsorted numpy:        0.0007        1.00
argmax numpy:              0.0009        1.29
for loop:                  0.0045        6.43
searchsorted pandas:       0.0075       10.71
idxmax pandas:             0.0267       38.14
index[0]:                  0.0295       42.14
first_valid_index pandas:  0.1181      168.71

Note that numpy's searchsorted is the winner and first_valid_index shows the worst performance. Generally, the numpy algorithms are faster. The for loop does not do too badly here, but only because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are close to the end, the ranking changes, although searchsorted still delivers the best performance:

                     total_time_sec ratio wrt fastest algo
searchsorted numpy:        0.0007       1.00
searchsorted pandas:       0.0076      10.86
argmax numpy:              0.0117      16.71
index[0]:                  0.0815     116.43
idxmax pandas:             0.0904     129.14
first_valid_index pandas:  0.1691     241.57
for loop:                  9.6504   13786.29

The code to produce these results is below:

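import timeit

# needed outside the timed snippets for np.round and pd.DataFrame
import numpy as np
import pandas as pd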

# code snippet to be executed only once 
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured   
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append(  '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")

mycode_set.append( '''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
        ''')
message.append("for loop: ")

total_time_in_sec = []
for mycode in mycode_set:
    elapsed = timeit.timeit(setup=mysetup, stmt=mycode, number=100)
    total_time_in_sec.append(np.round(elapsed, 4))

output = pd.DataFrame(total_time_in_sec, index=message,
                      columns=['total_time_sec'])
output["ratio wrt fastest algo"] = np.round(
    output.total_time_sec / output["total_time_sec"].min(), 2)

output = output.sort_values(by="total_time_sec")
print(output)  # in a Jupyter notebook, display(output) also works

For the larger dataframe, only the setup changes:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''

Answer 3

Use pandas groupby() to group by a column or a list of columns, then first() to get the first value in each group.

import pandas as pd

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

#Group df by column and get the first value in each group                   
grouped_df = df.groupby("A").first()

#Reset indices to match format
first_values = grouped_df.reset_index()

print(first_values)
#    A  B
# 0  a  1
# 1  b  1
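The groupby().first() call above returns the first values of each group, not the row index the question asks for. A minimal sketch of one way to get the index label of the first row per distinct value of A is drop_duplicates with keep="first". This is an alternative I am swapping in, not the answer's method, and it reads the whole column rather than stopping early.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})

# keep="first" retains the first row of every distinct value in A,
# so the surviving index labels are the first occurrences
first_rows = df.drop_duplicates(subset="A", keep="first")
first_labels = first_rows.index.tolist()

print(first_labels)
```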

Answer 4

If you just want to find the first instance without going through the entire dataframe, you can use a for loop.

df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break

The printed index is the row number of the first row where df.A != 'a'.
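The same early-stopping idea can be written more compactly with next() over a generator. The sketch below iterates over the underlying numpy array rather than indexing the Series on every step, which I would expect to be cheaper, though I have not benchmarked it here. It returns a position, not an index label, and None when nothing matches.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})
values = df["A"].to_numpy()

# The generator is consumed lazily, so the scan stops at the first hit
first_pos = next((i for i, v in enumerate(values) if v != "a"), None)

print(first_pos)
```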

Answer 5

For multiple conditions:

Let's say we have:

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

To find the first item that is neither a nor c, we do:

n = np.logical_and(s.values != 'a', s.values != 'c').argmax()

Timings:

import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)

Output:

pandas_multi_condition():
4
0:00:01.144767

numpy_bitwise_and():
4
0:00:00.019013
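When the list of excluded values grows, chaining logical_and calls gets unwieldy. A minimal sketch of an alternative is Series.isin, assuming the same toy Series as above. The explicit any() check is my addition, because argmax returns 0 on an all-False mask.

```python
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
excluded = ['a', 'c']

# True where the element is not one of the excluded values
mask = (~s.isin(excluded)).to_numpy()

# argmax gives the position of the first True; guard the all-False case
n = int(mask.argmax()) if mask.any() else None

print(n)
```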

Answer 6

You can iterate over the dataframe rows (this is slow) and write your own logic to get the values you want:

def getMaxIndex(df, col):
    # Track the largest value seen so far and the index label of its row
    max_value = float('-inf')
    rtn_index = 0
    for index, row in df.iterrows():
        if row[col] > max_value:
            max_value = row[col]
            rtn_index = index
    return rtn_index
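For this particular logic (the index of the first maximum), the built-in Series.idxmax should give the same result without a Python-level loop. A small sketch on made-up data; the frame below is illustrative and not from the question.

```python
import pandas as pd

# Illustrative data: the maximum 5 occurs twice, first at label 1
df = pd.DataFrame({"B": [1, 5, 3, 5]})

# idxmax returns the index label of the first occurrence of the maximum
max_label = df["B"].idxmax()

print(max_label)
```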

Answer 7

Generalized form:

index = df.loc[df.column_name == 'value_you_looking_for'].index[0]

Example:

index_of_interest = df.loc[df.A == 'a'].index[0]
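Note that .index[0] raises an IndexError when no row matches. A minimal sketch of a wrapper that returns None instead; the helper name first_index_where is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})


def first_index_where(frame, column, value):
    """Index label of the first row where frame[column] == value, else None."""
    matches = frame.index[frame[column] == value]
    return matches[0] if len(matches) else None


print(first_index_where(df, "A", "b"))
print(first_index_where(df, "A", "z"))
```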

7 answers

Answer 1: 44, accepted, 2016-12-21 04:41:43

Answer 2: 27, 2019-01-16 19:47:22

Answer 3: 3, 2021-05-01 06:26:05

Answer 4: 1, 2016-12-21 05:20:15

Answer 5: 0, 2019-05-22 19:21:47

Answer 6: 0, 2019-10-08 16:35:06

Answer 7: 0, 2022-08-27 01:13:16
