How to check for a sequence of string values in a pandas DataFrame and output the value that follows

Question

I'm trying to check for the sequence of BBB in the dataframe.

d = {'A': ['A','B','C','D','B','B','B','A','A','E','F','B','B','B','F','A','A']}
testdf = pd.DataFrame(data=d)

array = []
seq = pd.Series(['B', 'B', 'B'])

for i in testdf.index:
    
    if testdf.A[i:len(seq)] == seq:
        
        array.append(testdf.A[i:len(seq)+1])

I get an error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I get it working? I don't understand what's "ambiguous" about this code.

My desired output here is:

A, F

Answer 1

The ambiguity comes from the fact that when you test two Series for equality (they must be the same size), the comparison is done element by element and you obtain a Series of True/False values. You then have to decide whether you want all true, all false, at least one true, and so on, using .any(), .all(), ...
```
s1 = pd.Series(['B', 'B', 'B'])
s2 = pd.Series(['A', 'B', 'B'])

print(s1 == s2)
# 0    False
# 1     True
# 2     True
# dtype: bool

print((s1 == s2).all())
# False
```
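As a minimal sketch of where the error in the question comes from (the variable names `mask`, `raised`, `all_equal` and `any_equal` are only illustrative), putting the element-wise result directly in an `if` raises the `ValueError`, while reducing it first gives a single boolean:

```python
import pandas as pd

s1 = pd.Series(['B', 'B', 'B'])
s2 = pd.Series(['A', 'B', 'B'])

mask = s1 == s2  # element-wise result: one boolean per row

try:
    if mask:  # pandas refuses to collapse several booleans into one
        pass
    raised = False
except ValueError:
    raised = True

all_equal = bool(mask.all())  # True only if every pair matches
any_equal = bool(mask.any())  # True if at least one pair matches

print(raised, all_equal, any_equal)  # True False True
```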
To access a subsequence, prefer .iloc.
You need to use [i:i + len(seq)] and not [i:len(seq)], because this is [from:to] notation.
You need to use Series.reset_index(drop=True), because two Series must have the same index to be compared. Since seq is always indexed 0, 1, 2, the subsequence you compute needs the same index (testdf.A.iloc[1:4], for example, is indexed 1, 2, 3).
Verify the length before comparing the Series, to avoid an exception at the end of the column, where the subsequence is shorter than seq.
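The index point can be checked in isolation. The sketch below uses a small made-up column (`col`, `window` and `aligned` are illustrative names, not part of the question) and assumes a pandas version where comparing two Series with different labels raises a `ValueError`:

```python
import pandas as pd

col = pd.Series(['A', 'B', 'B', 'B', 'F'])
seq = pd.Series(['B', 'B', 'B'])

window = col.iloc[1:1 + len(seq)]  # positional slice, keeps labels 1, 2, 3

try:
    window == seq  # labels 1, 2, 3 versus 0, 1, 2
    label_error = False
except ValueError:
    label_error = True

aligned = window.reset_index(drop=True)  # labels are now 0, 1, 2
match = bool((aligned == seq).all())

print(label_error, match)  # True True
```

The values in `window` already equal `seq`; only the labels differ, which is why resetting the index is enough to make the comparison work.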

You end up with:

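import pandas as pd

values = {'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A', 'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']}
testdf = pd.DataFrame(values)
array = []
seq = pd.Series(['B', 'B', 'B'])
for i in range(len(testdf)):
    # Positional slice, re-indexed from 0 so that it lines up with seq
    test_seq = testdf.A.iloc[i:i + len(seq)].reset_index(drop=True)
    # The length check skips the shorter slices at the end of the column
    if len(test_seq) == len(seq) and (test_seq == seq).all():
        # Skip a match that ends on the last row: nothing follows it
        if i + len(seq) < len(testdf):
            array.append(testdf['A'].iloc[i + len(seq)])
print(array)  # ['A', 'F']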
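If the column is small, the index alignment issue can be avoided altogether by working on plain Python lists. This is only a sketch of an alternative, not part of the answer above; `values_after` is a hypothetical helper name:

```python
import pandas as pd

def values_after(values, pattern):
    """Hypothetical helper: values that directly follow each occurrence of pattern."""
    values = list(values)
    pattern = list(pattern)
    n = len(pattern)
    # Stop n items before the end so that values[i + n] always exists
    return [values[i + n]
            for i in range(len(values) - n)
            if values[i:i + n] == pattern]

testdf = pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A',
                             'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']})
result = values_after(testdf['A'], ['B', 'B', 'B'])
print(result)  # ['A', 'F']
```

List slices compare by value with a single `==`, so there is no element-wise result to reduce and no index to reset.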

Answer 2

Instead of iterating over every row in the DataFrame, we can iterate over the much smaller sequence (much better when len(seq) << len(df)). Use shift + np.logical_and.reduce to locate the sequence in the DataFrame and where it ends. Then we'll roll to get the next row after each match, which holds the values you want. (Modified slightly from my related answer.)

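import numpy as np
import pandas as pd

def find_next_row(seq, df, col):
    seq = seq[::-1]  # reversed, so each match is flagged on its last row
    # Row r is True when df[col] at rows r, r-1, ... equals seq read backwards
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])

    m = np.roll(m, 1)  # move each flag to the row after the match
    m[0] = False  # Don't wrap around
    
    return df.loc[m]
    # return df.loc[m, col].tolist()

# testdf is the DataFrame defined in the question
find_next_row(['B', 'B', 'B'], testdf, col='A')
#    A
#7   A
#14  F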

If you just want the list and don't care for the DataFrame, change the return to what's currently commented out: return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
# ['A', 'F']
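Two edge cases are worth checking with this approach: overlapping matches, and a match that ends on the last row. The sketch below repeats the function (list-returning variant) so that it runs on its own; the `overlap` and `trailing` frames are made-up test data, not from the question:

```python
import numpy as np
import pandas as pd

def find_next_row(seq, df, col):
    seq = seq[::-1]
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])
    m = np.roll(m, 1)
    m[0] = False  # a match ending on the last row must not wrap to row 0
    return df.loc[m, col].tolist()

overlap = pd.DataFrame({'A': ['B', 'B', 'B', 'B', 'C']})
trailing = pd.DataFrame({'A': ['A', 'B', 'B', 'B']})

r_overlap = find_next_row(['B', 'B', 'B'], overlap, col='A')
r_trailing = find_next_row(['B', 'B', 'B'], trailing, col='A')

print(r_overlap)   # ['B', 'C']
print(r_trailing)  # []
```

Overlapping runs are each reported, so four consecutive B values yield two results. Without the `m[0] = False` line, the trailing case would wrongly return the first row of the column.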


Answer 1
Score 4, accepted, 2020-07-17 17:59:40

Answer 2
Score 1, 2020-07-17 18:20:02
