Validation for repeated sub-string in a dataframe
Suppose I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'A': ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"], 'B': [1, 2, 3, 4, 5]})
df.head()
Output:
A         B
asdfg     1
abcdef    2
ababab    3
ghhzgghz  4
qwerty    5
How do I check whether any string in column A contains a repeated substring? Expected output:
A         B  C
asdfg     1  False
abcdef    2  False
ababab    3  True (matches for ab)
ghhzgghz  4  True (matches for gh)
qwerty    5  False
A general logic is return s in (s + s)[1:-1], but I want it to be streamlined for any general substring repetition within each of these rows.
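As a small illustration of why the s in (s + s)[1:-1] check is not enough here, the sketch below runs it on the sample strings. The function name is_whole_string_repetition is made up for this example; the check only succeeds when the entire string is one block repeated end to end, so it misses a string like "ghhzgghz" where "gh" recurs but the string as a whole is not periodic.

```python
def is_whole_string_repetition(s):
    # Hypothetical helper name. True only when s consists of a shorter
    # block repeated end to end (for example "ab" * 3).
    return s in (s + s)[1:-1]


samples = ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"]
results = {s: is_whole_string_repetition(s) for s in samples}
print(results)
```

Only "ababab" is flagged by this check, which is why a more general substring test is needed for rows such as "ghhzgghz".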
The idea is to create all possible substrings, count them with Counter, and check whether at least one count is greater than 1:
from collections import Counter

# modified https://stackoverflow.com/a/22470047/2901002
def test_substrings(input_string):
    length = len(input_string)
    # every contiguous substring, from single characters up to the full string
    s = [input_string[i:j+1] for i in range(length) for j in range(i, length)]
    # True if any substring occurs more than once
    return any(x > 1 for x in Counter(s).values())
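To see which substrings actually trigger the check, the same Counter idea can return the repeated substrings instead of a single boolean. This is only a sketch; repeated_substrings is a hypothetical helper name, not part of the answer's code.

```python
from collections import Counter


def repeated_substrings(input_string):
    """Map each contiguous substring that occurs more than once to its count."""
    length = len(input_string)
    counts = Counter(
        input_string[i:j + 1]
        for i in range(length)
        for j in range(i, length)
    )
    # keep only substrings seen at two or more start positions
    return {sub: n for sub, n in counts.items() if n > 1}


print(repeated_substrings("ababab"))
print(repeated_substrings("qwerty"))
```

Two details are worth noticing: overlapping occurrences are counted (so "aba" counts twice in "ababab"), and single characters count as substrings, so a string such as "asdfa" would be flagged because of the repeated "a".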
Another solution, which makes it easy to change the minimal length of the tested strings:
from collections import Counter
from itertools import chain, combinations

# changed first word asdfg to asdfa
df = pd.DataFrame({'A': ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"],
                   'B': [1, 2, 3, 4, 5]})

def test_substrings(input_string, N):
    # Note: combinations() yields character subsequences (not necessarily
    # contiguous) of every length from N up to len(input_string)
    s = chain(*[combinations(input_string, x) for x in range(N, len(input_string) + 1)])
    # True if any of them occurs more than once
    return any(x > 1 for x in Counter(s).values())

df['C'] = df['A'].apply(lambda x: test_substrings(x, 1))
df['D'] = df['A'].apply(lambda x: test_substrings(x, 2))
print(df)
          A  B      C      D
0     asdfa  1   True  False
1    abcdef  2  False  False
2    ababab  3   True   True
3  ghhzgghz  4   True   True
4    qwerty  5  False  False
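One caveat with the minimal-length version: itertools.combinations yields character subsequences, which need not be contiguous, and the number of candidates grows very quickly with string length. If only contiguous substrings should count, a possible alternative is sketched below. The name has_repeated_substring and its min_len parameter are assumptions made for this example. It relies on the observation that a repeated substring of min_len or more characters always has a repeated prefix of exactly min_len characters, so checking the windows of that one length is enough.

```python
def has_repeated_substring(text, min_len=1):
    """Return True if some contiguous substring of at least min_len
    characters starts at two different positions in text."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    seen = set()
    # slide a window of exactly min_len characters across the string
    for start in range(len(text) - min_len + 1):
        window = text[start:start + min_len]
        if window in seen:
            return True  # this window already appeared at an earlier start
        seen.add(window)
    return False


words = ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"]
flags_1 = [has_repeated_substring(w, 1) for w in words]
flags_2 = [has_repeated_substring(w, 2) for w in words]
print(flags_1)
print(flags_2)
```

On the sample data this gives the same C and D columns as the table above, and it could be applied to the dataframe with df['A'].apply(has_repeated_substring, min_len=2). The two approaches differ on a string such as "axbyab": the pair a, b appears twice as a subsequence, but no contiguous substring of length 2 repeats.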