Suppose I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'A': ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"], 'B': [1, 2, 3, 4, 5]})
df.head()
Output:
       A  B
   asdfg  1
  abcdef  2
  ababab  3
ghhzgghz  4
  qwerty  5
How do I check whether any substring is repeated within each value of column A? Expected output:
       A  B      C
   asdfg  1  False
  abcdef  2  False
  ababab  3   True  (matches for ab)
ghhzgghz  4   True  (matches for gh)
  qwerty  5  False
A general check such as return s in (s + s)[1:-1] works when the whole string is one repeated block, but I want something that handles any substring repetition within each of these rows.
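To make the limitation of that check concrete, here is a small sketch (the function name `is_whole_string_repeat` is hypothetical, chosen only for this illustration). The test reports True only when the entire string is built from a shorter block repeated end to end, so it catches "ababab" but misses "ghhzgghz", where "gh" repeats inside a string that is not itself periodic.

```python
def is_whole_string_repeat(s):
    # True only if s consists of some shorter block repeated end to end
    return s in (s + s)[1:-1]


for value in ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"]:
    # Only "ababab" is reported as True here
    print(value, is_whole_string_repeat(value))
```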
The idea is to create all possible substrings, count them with Counter, and check whether at least one count is greater than 1:
from collections import Counter

# modified from https://stackoverflow.com/a/22470047/2901002
def test_substrings(input_string):
    length = len(input_string)
    # every contiguous substring input_string[i:j+1]
    s = [input_string[i:j+1] for i in range(length) for j in range(i, length)]
    # True if any substring occurs more than once
    return any(x > 1 for x in Counter(s).values())
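A short usage sketch for the counting approach, run on the sample strings from the question. It also checks one observation: because substrings of length 1 are counted, any repeated substring implies a repeated character, so at this minimum length the result matches a plain duplicate-character test. The names `any_substring_repeats` and `has_repeated_char` are hypothetical and not part of the original answer.

```python
from collections import Counter


def any_substring_repeats(input_string):
    # Same counting logic as the answer: tally every contiguous substring
    length = len(input_string)
    subs = [input_string[i:j + 1] for i in range(length) for j in range(i, length)]
    return any(count > 1 for count in Counter(subs).values())


def has_repeated_char(input_string):
    # Hypothetical shortcut: a duplicated character is already a repeated substring
    return len(set(input_string)) < len(input_string)


values = ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"]
slow = [any_substring_repeats(v) for v in values]
fast = [has_repeated_char(v) for v in values]
print(slow)
print(fast)
```

On the dataframe, either function can be applied per row with df['A'].apply(has_repeated_char). The shortcut avoids building the quadratic list of substrings, which may matter for long strings.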
Another solution, which makes it easy to change the minimal length of the tested strings:
from itertools import chain, combinations

# changed the first word from asdfg to asdfa
df = pd.DataFrame({'A': ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"],
                   'B': [1, 2, 3, 4, 5]})

def test_substrings(input_string, N):
    # combinations() yields character subsequences of length N and longer;
    # the picked characters need not be adjacent in input_string
    s = chain.from_iterable(combinations(input_string, x)
                            for x in range(N, len(input_string) + 1))
    return any(x > 1 for x in Counter(s).values())

df['C'] = df['A'].apply(lambda x: test_substrings(x, 1))
df['D'] = df['A'].apply(lambda x: test_substrings(x, 2))
print(df)
          A  B      C      D
0     asdfa  1   True  False
1    abcdef  2  False  False
2    ababab  3   True   True
3  ghhzgghz  4   True   True
4    qwerty  5  False  False
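One caveat worth sketching: itertools.combinations picks characters that need not be adjacent, so the check above counts subsequences, not contiguous substrings. The two agree on the sample data, but they can differ, for example on "abcb" with a minimum length of 2. Below is a possible contiguous-only variant, under the assumption that min_length is at least 1. It relies on the fact that if a substring of length min_length or more repeats, its first min_length characters repeat too, so only substrings of exactly that length need to be checked. Both function names are hypothetical.

```python
from collections import Counter
from itertools import chain, combinations


def repeats_as_subsequence(input_string, n):
    # The combinations-based check from the answer: counts character subsequences
    s = chain.from_iterable(combinations(input_string, x)
                            for x in range(n, len(input_string) + 1))
    return any(count > 1 for count in Counter(s).values())


def has_repeated_substring(input_string, min_length=1):
    # Contiguous variant (assumes min_length >= 1): slide a window of
    # exactly min_length characters and stop at the first repeat
    seen = set()
    for start in range(len(input_string) - min_length + 1):
        piece = input_string[start:start + min_length]
        if piece in seen:
            return True
        seen.add(piece)
    return False


values = ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"]
print([has_repeated_substring(v, 1) for v in values])
print([has_repeated_substring(v, 2) for v in values])

# The two checks disagree here: ('a', 'b') can be picked at positions
# (0, 1) and (0, 3), but no adjacent pair of characters occurs twice
print(repeats_as_subsequence("abcb", 2))
print(has_repeated_substring("abcb", 2))
```

With the dataframe above, the contiguous variant could be applied as df['A'].apply(lambda x: has_repeated_substring(x, 2)).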