
How to check for a sequence of string values in pandas dataframe and output the subsequent

I'm trying to check for the sequence of BBB in the dataframe.

import pandas as pd

d = {'A': ['A','B','C','D','B','B','B','A','A','E','F','B','B','B','F','A','A']}
testdf = pd.DataFrame(data=d)

array = []
seq = pd.Series(['B', 'B', 'B'])

for i in testdf.index:
    if testdf.A[i:len(seq)] == seq:
        array.append(testdf.A[i:len(seq)+1])

I get an error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I get it working? I don't understand what's "ambiguous" about this code.

My desired output here is:

A, F
  1. The ambiguity arises because testing two Series for equality (they should be the same size) compares them element by element and returns a Series of True/False values. You then have to decide whether you want all values to be true, at least one to be true, and so on, using .all(), .any(), ...

     s1 = pd.Series(['B', 'B', 'B'])
     s2 = pd.Series(['A', 'B', 'B'])
     print(s1 == s2)
     0    False
     1     True
     2     True
     dtype: bool
     print((s1 == s2).all())
     False
  2. To access a subsequence by position, prefer .iloc.

  3. You need to use [i:i + len(seq)] and not [i:len(seq)], because slicing uses [from:to] notation.

  4. You need Series.reset_index(drop=True) because two Series can only be compared when they have the same index. seq is always indexed 0,1,2, so the subsequence you compute needs the same index (testdf.A.iloc[1:4], for example, is indexed 1,2,3).

  5. Check the length before comparing the Series, to avoid an exception near the end of the column, where the subsequence is shorter than seq.
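As a rough illustration of point 1, the sketch below reproduces the error from the question on a tiny example. The names mask and raised are made up for this sketch and do not appear in the original code.

```python
import pandas as pd

# Element-wise comparison: one True/False per position
mask = pd.Series(['B', 'B', 'B']) == pd.Series(['A', 'B', 'B'])

try:
    if mask:          # asks for the truth value of a whole Series
        pass
    raised = False
except ValueError:    # "The truth value of a Series is ambiguous..."
    raised = True

print(raised)         # True
print(mask.all())     # False: not every position matches
print(mask.any())     # True: at least one position matches
```

Reducing the comparison with .all() or .any() gives a single boolean, which is what an if statement needs.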

You end up with:

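import pandas as pd

values = {'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A', 'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']}
testdf = pd.DataFrame(values)
array = []
seq = pd.Series(['B', 'B', 'B'])
for i in testdf.index:
    # Positional window starting at i, re-indexed to 0..n-1 so it lines up with seq
    test_seq = testdf.A.iloc[i:i + len(seq)].reset_index(drop=True)
    # Skip the shorter windows at the end, then require every position to match
    if len(test_seq) == len(seq) and (test_seq == seq).all():
        # Keep the value in the row right after the matched sequence
        array.append(testdf['A'].iloc[i + len(seq)])
print(array)  # ['A', 'F']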
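To see why the reset_index step matters, here is a minimal sketch on made-up toy data. The names s, window, aligned and raised are hypothetical. It assumes a pandas version that refuses to compare Series whose labels differ, which is the behaviour of current releases as far as I know.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'B', 'B', 'F'])   # toy data, not from the question
seq = pd.Series(['B', 'B', 'B'])

window = s.iloc[1:1 + len(seq)]            # keeps the original labels
print(list(window.index))                  # [1, 2, 3]

try:
    window == seq                          # labels 1,2,3 vs 0,1,2
    raised = False
except ValueError:                         # labels differ, so pandas refuses
    raised = True
print(raised)                              # True

aligned = window.reset_index(drop=True)    # labels become 0, 1, 2
print((aligned == seq).all())              # True
```

If the labels are irrelevant, comparing plain lists (window.tolist() == seq.tolist()) would probably work as well and skips the index alignment altogether.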

Instead of iterating over every row in the DataFrame, we can iterate over the much smaller sequence (much better when len(seq) << len(df)). Use shift + np.logical_and.reduce to locate the sequence in the DataFrame and where it ends. Then we'll roll the mask by one to get the row after each match, which holds the values you want. (Modified slightly from a related answer of mine.)

import numpy as np

def find_next_row(seq, df, col):
    seq = seq[::-1]  # to get last index
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])

    m = np.roll(m, 1)
    m[0] = False  # Don't wrap around
    
    return df.loc[m]
    # return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
#    A
#7   A
#14  F

If you just want the list and don't care for the DataFrame, change the return to what's currently commented out: return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
#['A', 'F']
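For readers who want to see what the masks inside find_next_row look like, the following sketch walks through the same steps on a five-row toy Series. The names s, masks, ends and nxt are invented for this illustration and are not part of the answer's function.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'B', 'B', 'F'])   # toy data, not from the question
seq = ['B', 'B', 'B'][::-1]                # reversed, as in find_next_row

# masks[i] is True at row r when the value i rows above r equals seq[i]
masks = [s.shift(i).eq(seq[i]) for i in range(len(seq))]

ends = np.logical_and.reduce(masks)        # True on the last row of each match
nxt = np.roll(ends, 1)                     # move every flag one row down
nxt[0] = False                             # discard a flag that wrapped around

print(ends.tolist())    # [False, False, False, True, False]
print(nxt.tolist())     # [False, False, False, False, True]
print(s[nxt].tolist())  # ['F']
```

The nxt[0] = False line matters when a match ends on the very last row: np.roll would otherwise carry that flag around to row 0 and report a "next" value that does not exist.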
