How to check for a sequence of string values in pandas dataframe and output the subsequent
I'm trying to check for the sequence of BBB in the dataframe.
import pandas as pd

d = {'A': ['A','B','C','D','B','B','B','A','A','E','F','B','B','B','F','A','A']}
testdf = pd.DataFrame(data=d)
array = []
seq = pd.Series(['B', 'B', 'B'])
for i in testdf.index:
    if testdf.A[i:len(seq)] == seq:
        array.append(testdf.A[i:len(seq)+1])
I get an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I get it working? I don't understand what's "ambiguous" about this code.
My desired output here is:
A, F
The ambiguity comes from the fact that when you test two Series for equality (they should be the same size), the comparison is done element by element and you obtain a Series containing only True/False values. You then have to decide whether you want all of them true, at least one of them true, and so on, using .any(), .all(), ...
s1 = pd.Series(['B', 'B', 'B'])
s2 = pd.Series(['A', 'B', 'B'])
print(s1 == s2)
# 0    False
# 1     True
# 2     True
# dtype: bool
print((s1 == s2).all())
# False
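To see where the error in the question actually comes from, here is a minimal sketch that puts the comparison result directly into an `if`, the way the original loop does. The variable names `mask`, `raised`, `all_equal` and `any_equal` are only illustrative.

```python
import pandas as pd

s1 = pd.Series(['B', 'B', 'B'])
s2 = pd.Series(['A', 'B', 'B'])

# Element-wise comparison: the result is a boolean Series, not a single bool
mask = s1 == s2

try:
    if mask:  # Python asks the Series for one truth value here
        pass
    raised = False
except ValueError:
    # "The truth value of a Series is ambiguous..."
    raised = True

all_equal = bool(mask.all())  # True only if every pair matches
any_equal = bool(mask.any())  # True if at least one pair matches
print(raised, all_equal, any_equal)
```

Reducing the boolean Series with `.all()` or `.any()` gives the `if` a single value to test, which is what removes the ambiguity.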
To access a subsequence, prefer .iloc.
You need to use [i:i + len(seq)] and not [i:len(seq)], because this is [from:to] notation.
You need to use Series.reset_index(drop=True): to be compared, two Series must have the same index. Since seq is always indexed 0,1,2, the subsequence you compute needs the same index (testdf.A.iloc[1:4], for example, is indexed 1,2,3).
Verify the length before comparing the Series, to avoid an exception at the end of the column, where the subsequence is shorter than seq.
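The points above can be checked one at a time on a short Series. This is only a sketch: `col`, `window`, `aligned` and `tail` are names made up for the illustration, and the data is a shortened version of the column from the question.

```python
import pandas as pd

col = pd.Series(['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A'])
seq = pd.Series(['B', 'B', 'B'])
i = 4

# [i:len(seq)] means "from 4 up to 3", which selects nothing
wrong = col.iloc[i:len(seq)]
# [i:i + len(seq)] is the window of len(seq) rows starting at position i
window = col.iloc[i:i + len(seq)]

# The window keeps its original labels (4, 5, 6) while seq is labeled 0, 1, 2,
# so comparing them directly raises a ValueError
labels_before = list(window.index)
try:
    window == seq
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# After reset_index(drop=True) both are labeled 0, 1, 2 and can be compared
aligned = window.reset_index(drop=True)
matches = bool((aligned == seq).all())

# Near the end of the column the window is shorter than seq
tail = col.iloc[6:6 + len(seq)].reset_index(drop=True)
tail_is_short = len(tail) < len(seq)

print(len(wrong), labels_before, mismatch_raised, matches, tail_is_short)
```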
You end up with:
import pandas as pd

values = {'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A', 'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']}
testdf = pd.DataFrame(values)
array = []
seq = pd.Series(['B', 'B', 'B'])
for i in testdf.index:
    # Positional window, re-indexed to 0..len(seq)-1 so it lines up with seq
    test_seq = testdf.A.iloc[i:i + len(seq)].reset_index(drop=True)
    # The length check skips the short windows at the end of the column
    if len(test_seq) == len(seq) and (test_seq == seq).all():
        array.append(testdf['A'].iloc[i + len(seq)])
print(array)  # ['A', 'F']
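One case this loop does not cover is a match that ends on the very last row: there is no following value, and `.iloc[i + len(seq)]` would raise an IndexError. A possible variant, assuming you simply want to skip such a match, wraps the same logic in a function and stops the loop early. The name `values_after` is hypothetical, not part of pandas.

```python
import pandas as pd

def values_after(column, pattern):
    """Return the values that directly follow each occurrence of pattern."""
    seq = pd.Series(list(pattern))
    n = len(seq)
    found = []
    # Stop n rows before the end so that column.iloc[i + n] always exists
    for i in range(len(column) - n):
        window = column.iloc[i:i + n].reset_index(drop=True)
        if (window == seq).all():
            found.append(column.iloc[i + n])
    return found

testdf = pd.DataFrame({'A': list('ABCDBBBAAEFBBBFAA')})
print(values_after(testdf['A'], ['B', 'B', 'B']))  # ['A', 'F']
# A match on the last rows has nothing after it and is skipped
print(values_after(pd.Series(list('ABBB')), ['B', 'B', 'B']))  # []
```

Because the loop never starts a window that could run past the end, the explicit length check is no longer needed in this version.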
Instead of iterating over every row in the DataFrame, we can iterate over the much smaller sequence (much better when len(seq) << len(df)). Use shift + np.logical_and.reduce to locate the sequence in the DataFrame and the row where it ends. Then we'll roll the mask by one to get the row right after it, which holds the values you want. (Modified slightly from a related answer of mine.)
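import numpy as np
import pandas as pd

# Assumed to be the same data as testdf in the question
df = pd.DataFrame({'A': ['A','B','C','D','B','B','B','A','A','E','F','B','B','B','F','A','A']})

def find_next_row(seq, df, col):
    seq = seq[::-1]  # reversed, so the mask is True on the last row of a match
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])
    m = np.roll(m, 1)  # move each True down to the row after the match
    m[0] = False  # Don't wrap around
    return df.loc[m]
    # return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], df, col='A')
#     A
# 7   A
# 14  F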
If you just want the list and don't care for the DataFrame, change the return to what's currently commented out: return df.loc[m, col].tolist()
find_next_row(['B', 'B', 'B'], df, col='A')
# ['A', 'F']
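To make the shift-based approach easier to follow, the intermediate masks can be inspected step by step. This is a sketch on the question's data; `parts`, `ends` and `nxt` are illustrative names, not part of the answer's function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('ABCDBBBAAEFBBBFAA')})
seq = ['B', 'B', 'B']
rev = seq[::-1]

# One boolean column per pattern element: shift(i) looks i rows back
parts = [df['A'].shift(i).eq(rev[i]) for i in range(len(rev))]

# AND them together: True only on the row where the whole pattern ends
ends = np.logical_and.reduce(parts)
end_rows = np.flatnonzero(ends).tolist()

# Move every True one row down to mark the row after the pattern
nxt = np.roll(ends, 1)
nxt[0] = False  # np.roll wraps the last element around to the front
next_rows = np.flatnonzero(nxt).tolist()
next_values = df.loc[nxt, 'A'].tolist()

print(end_rows)     # [6, 13]
print(next_rows)    # [7, 14]
print(next_values)  # ['A', 'F']
```

Only `len(seq)` shifted comparisons are built, each of them vectorized over the whole column, which is why this scales better than a Python loop over the rows.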