
Validation for repeated sub-string in a dataframe

Suppose I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'A': ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"], 'B': [1, 2, 3, 4, 5]})
df.head()

Output:

A         B
asdfg     1
abcdef    2
ababab    3
ghhzgghz  4 
qwerty    5

How can I check whether each string in column A contains a repeated substring? The expected result is:

A         B    C
asdfg     1    False
abcdef    2    False
ababab    3    True (matches for ab)
ghhzgghz  4    True (matches for gh)
qwerty    5    False

A common check is return s in (s + s)[1:-1], but it only detects strings that consist entirely of one repeated block. I want something that works for any repeated substring within each of these rows.
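For reference, a minimal sketch of what the s in (s + s)[1:-1] check does on the sample values. The function name is_whole_string_repetition is made up here for illustration. The check is true only when the whole string is built from one repeated block, which is why it catches "ababab" but not "ghhzgghz".

```python
def is_whole_string_repetition(s):
    # True only when s equals some shorter block repeated two or more times
    return s in (s + s)[1:-1]

words = ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"]
results = [is_whole_string_repetition(w) for w in words]

for word, flag in zip(words, results):
    print(word, flag)
```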

The idea is to generate all possible substrings, count them with Counter, and check whether at least one count is greater than 1:

from collections import Counter

# Adapted from https://stackoverflow.com/a/22470047/2901002
def test_substrings(input_string):
    length = len(input_string)
    # Every contiguous substring, including single characters
    s = [input_string[i:j + 1] for i in range(length) for j in range(i, length)]
    # True if any substring occurs more than once
    return any(x > 1 for x in Counter(s).values())

Another solution, which makes it easy to change the minimal length of the tested strings:

from collections import Counter
from itertools import chain, combinations

# The first word is changed from "asdfg" to "asdfa"
df = pd.DataFrame({'A': ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"],
                   'B': [1, 2, 3, 4, 5]})

def test_substrings(input_string, N):
    # combinations() yields ordered character subsequences (not necessarily
    # contiguous) of every length from N up to the full string length
    s = chain.from_iterable(combinations(input_string, x)
                            for x in range(N, len(input_string) + 1))
    return any(x > 1 for x in Counter(s).values())

df['C'] = df['A'].apply(lambda x: test_substrings(x, 1))
df['D'] = df['A'].apply(lambda x: test_substrings(x, 2))
print(df)
          A  B      C      D
0     asdfa  1   True  False
1    abcdef  2  False  False
2    ababab  3   True   True
3  ghhzgghz  4   True   True
4    qwerty  5  False  False
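Note that itertools.combinations produces character subsequences, which need not be contiguous, so the version above can report a match for a string such as "axbayb" (the pair a, b occurs three times) even though no two-character substring repeats. Below is a minimal sketch of a variant restricted to contiguous substrings. The name has_repeated_substring and the min_length parameter are assumptions for illustration. It relies on the fact that any repeated substring of length min_length or more begins with a repeated substring of exactly min_length, so only windows of that length need checking.

```python
def has_repeated_substring(text, min_length=1):
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    seen = set()
    # Slide a window of exactly min_length over the string;
    # overlapping occurrences count as repeats
    for start in range(len(text) - min_length + 1):
        window = text[start:start + min_length]
        if window in seen:
            return True  # stop at the first repeat
        seen.add(window)
    return False

words = ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"]
col_c = [has_repeated_substring(w, 1) for w in words]
col_d = [has_repeated_substring(w, 2) for w in words]
print(col_c)
print(col_d)
```

For the sample data this gives the same C and D columns as shown above. On the DataFrame it could be applied as df['D'] = df['A'].apply(lambda x: has_repeated_substring(x, 2)).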
