
pandas - find first occurrence

Suppose I have a structured dataframe as follows:

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

The A column has previously been sorted. I wish to find the index of the first row where df.A != 'a'. The end goal is to use this index to break the dataframe into groups based on A.

Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A has been sorted already, it would be faster if I could just find the first index where df.A != 'a'. Therefore it is important that, whatever method you use, the scan stops once the first match is found.

idxmax and argmax return the position of the maximal value, or the first such position if the maximal value occurs more than once. On a boolean mask the maximal value is True, so they return the first True.

Use idxmax on df.A.ne('a'):

df.A.ne('a').idxmax()

3

or the numpy equivalent:

(df.A.values != 'a').argmax()

3
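Two details are easy to miss here. idxmax returns an index label while argmax returns an integer position, and both return the first element when the mask contains no True at all. The following is only a sketch: first_position is a hypothetical helper (not part of pandas or numpy), and the non-default index is an assumption made to show the label/position difference.

```python
import numpy as np
import pandas as pd


def first_position(mask):
    """Return the integer position of the first True in mask, or None.

    Hypothetical helper for illustration; not a pandas or numpy function.
    """
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return None
    pos = int(mask.argmax())
    # argmax returns 0 for an all-False mask, so confirm the hit is real.
    return pos if mask[pos] else None


# Assumed example data with a non-default index (labels 10..14).
df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5},
                  index=[10, 11, 12, 13, 14])

values = df["A"].to_numpy()
label = df["A"].ne("a").idxmax()         # index label of the first match
position = first_position(values != "a")  # integer position of the first match
missing = first_position(values == "z")   # no match at all
```

With the default RangeIndex, the label and the position coincide, which is why both snippets above print 3.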

However, if A has already been sorted, then we can use searchsorted, which does a binary search instead of scanning every element:

df.A.searchsorted('a', side='right')

array([3])

Or the numpy equivalent:

df.A.values.searchsorted('a', side='right')

3
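Since the stated goal is to break the dataframe into groups, the boundary returned by searchsorted can be used directly with positional slicing. This is a minimal sketch under the assumption that A is sorted; the names split, first_group and rest are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})

# A must already be sorted for the binary search to be valid.
values = df["A"].to_numpy()
split = int(values.searchsorted("a", side="right"))

# iloc slices by integer position, which matches what searchsorted returns.
first_group = df.iloc[:split]
rest = df.iloc[split:]
```

If 'a' does not occur at all, split is simply the insertion point, so one of the two slices may be empty.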

I found that pandas DataFrames have a first_valid_index function that does the job. It can be used as follows:

df[df.A!='a'].first_valid_index()

3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]

Below I compare the total time (in seconds) of repeating the calculation 100 times for these two options and for all the code above:

                      total_time_sec    ratio wrt fastest algo
searchsorted numpy:        0.0007        1.00
argmax numpy:              0.0009        1.29
for loop:                  0.0045        6.43
searchsorted pandas:       0.0075       10.71
idxmax pandas:             0.0267       38.14
index[0]:                  0.0295       42.14
first_valid_index pandas:  0.1181      168.71

Notice that numpy's searchsorted is the winner and first_valid_index shows the worst performance. Generally, the numpy algorithms are faster. The for loop does not do too badly, but only because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are close to the end, the ranking changes. searchsorted still delivers the best performance, while the for loop becomes by far the slowest:

                     total_time_sec ratio wrt fastest algo
searchsorted numpy:        0.0007       1.00
searchsorted pandas:       0.0076      10.86
argmax numpy:              0.0117      16.71
index[0]:                  0.0815     116.43
idxmax pandas:             0.0904     129.14
first_valid_index pandas:  0.1691     241.57
for loop:                  9.6504   13786.29

The code to produce these results is below:

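import timeit

# numpy and pandas are also needed outside the timed snippets (np.round, pd.DataFrame)
import numpy as np
import pandas as pd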

# code snippet to be executed only once 
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured   
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append(  '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")

mycode_set.append( '''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
        ''')
message.append("for loop: ")

total_time_in_sec = []
for i in range(len(mycode_set)):
    mycode = mycode_set[i]
    total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,\
         stmt = mycode, number = 100),4))

output = pd.DataFrame(total_time_in_sec, index = message, \
                      columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] = \
np.round(output.total_time_sec/output["total_time_sec"].min(),2)

output = output.sort_values(by = "total_time_sec")
print(output)  # the original used display(output), which is only defined in IPython/Jupyter

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''

Use pandas groupby() to group by a column or list of columns, then first() to get the first value in each group.

import pandas as pd

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

#Group df by column and get the first value in each group                   
grouped_df = df.groupby("A").first()

#Reset indices to match format
first_values = grouped_df.reset_index()

print(first_values)
# Output:
#    A  B
# 0  a  1
# 1  b  1
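Note that groupby().first() gives the first values in each group, not the row index where each group starts. If the row labels are what is needed, one possible sketch uses duplicated(), which marks every repeat of an earlier value. The variable names here are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})

# duplicated() is True for repeats of an earlier value,
# so its negation selects the first row of each distinct value in A.
first_rows = df[~df["A"].duplicated()]
first_labels = first_rows.index.tolist()
```

This still examines the whole column, so it does not stop early. It is an alternative for getting all group starts at once.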

If you just want to find the first instance without going through the entire dataframe, you can use a for loop.

df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break

The printed index is the row number of the first row where df.A != 'a'.
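Looking up df['A'][index] on every iteration goes through pandas indexing each time, which adds overhead per element. A possible variation, shown only as a sketch, is to iterate over the underlying numpy array and let next() stop at the first match. The None default for the no-match case is an assumption about what the caller wants.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})

# Iterate over the raw values instead of indexing the Series element by element.
values = df["A"].to_numpy()

# The generator is lazy, so next() stops at the first element that differs from 'a'.
first = next((i for i, v in enumerate(values) if v != "a"), None)

# With no match, the default (None) is returned instead of raising StopIteration.
no_match = next((i for i, v in enumerate(values) if v == "z"), None)
```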

For multiple conditions:

Let's say we have:

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

To find the first item that is neither a nor c, we do:

n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
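When the list of excluded values grows, chaining comparisons becomes unwieldy. One possible alternative, sketched here with an assumed excluded list, is Series.isin combined with the same argmax idea.

```python
import pandas as pd

s = pd.Series(["a", "a", "c", "c", "b", "d"])

# Values to skip; an assumed example list that can hold any number of entries.
excluded = ["a", "c"]

# True where the item is not one of the excluded values.
mask = ~s.isin(excluded).to_numpy()

# argmax gives the first True; the any() check guards against an all-False mask.
n = int(mask.argmax()) if mask.any() else None
```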

Timings:

import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)

Output:

pandas_multi_condition():
4
0:00:01.144767

numpy_bitwise_and():
4
0:00:00.019013

You can iterate over the dataframe rows (this is slow) and write your own logic to get the values you want:

def getMaxIndex(df, col):
    # Track the largest value seen so far and the index label of its row
    max_value = float("-inf")
    rtn_index = None
    for index, row in df.iterrows():
        if row[col] > max_value:
            max_value = row[col]
            rtn_index = index
    return rtn_index
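For this particular logic (the label of the row holding the largest value), the built-in idxmax gives the same answer without an explicit Python loop. A short sketch, with made-up values in column B:

```python
import pandas as pd

# Assumed example data: the maximum (7) occurs twice in column B.
df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [3, 7, 7, 2, 5]})

# idxmax returns the index label of the first occurrence of the maximum.
max_label = df["B"].idxmax()
```

Like the strict > comparison in the loop, idxmax keeps the first occurrence when the maximum is repeated.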

Generalized form:

index = df.loc[df.column_name == 'value_you_looking_for'].index[0]

Example:

index_of_interest = df.loc[df.A == 'a'].index[0]
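Be aware that .index[0] raises an IndexError when no row matches. A guarded version could look like the sketch below, where first_label is a hypothetical helper name and returning None for "not found" is an assumption.

```python
import pandas as pd


def first_label(df, column, value):
    """Return the index label of the first row where df[column] == value, or None.

    Hypothetical helper for illustration; not part of pandas.
    """
    matches = df.index[df[column] == value]
    return matches[0] if len(matches) else None


df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"], "B": [1] * 5})

found = first_label(df, "A", "b")
not_found = first_label(df, "A", "z")
```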

