
Validation for repeated sub-string in a dataframe

Suppose I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'A': ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"], 'B': [1, 2, 3, 4, 5]})
df.head()

Output:

A         B
asdfg     1
abcdef    2
ababab    3
ghhzgghz  4 
qwerty    5

How do I check whether any string in column A contains a repeated substring? The expected result is:

A         B    C
asdfg     1    False
abcdef    2    False
ababab    3    True (matches for ab)
ghhzgghz  4    True (matches for gh)
qwerty    5    False

I know the general logic return s in (s + s)[1:-1], but I want something that handles any repeated substring within each of these rows.
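For context, the s in (s + s)[1:-1] check only reports strings that consist entirely of one unit repeated end to end, which is likely why it falls short here. A minimal sketch, where the function name is_whole_string_repetition is only an illustrative choice:

```python
def is_whole_string_repetition(s):
    # Doubling the string and trimming one character from each end removes
    # the two trivial matches (at offset 0 and at offset len(s)). Any match
    # that remains means s is a shorter unit repeated end to end.
    return s in (s + s)[1:-1]


for value in ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"]:
    print(value, is_whole_string_repetition(value))
```

Only "ababab" passes this check. "ghhzgghz" contains "gh" twice but is not built from a single repeated unit, so it is reported as False.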

The idea is to create all possible substrings, count them with Counter, and check whether at least one count is greater than 1:

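from collections import Counter

# modified from https://stackoverflow.com/a/22470047/2901002
def test_substrings(input_string):
    length = len(input_string)
    # every contiguous substring, from each start i to each end j (inclusive)
    s = [input_string[i:j+1] for i in range(length) for j in range(i, length)]
    # True if any substring occurs more than once (overlapping matches count)
    return any(x > 1 for x in Counter(s).values())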
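The expected output in the question also names the matching substring ("matches for ab"). As a sketch under that reading, the same counting idea can return the repeated substrings themselves. The name repeated_substrings and the min_length parameter are assumptions added for illustration, not part of the original answer:

```python
from collections import Counter


def repeated_substrings(input_string, min_length=1):
    """Return contiguous substrings that occur more than once, longest first."""
    n = len(input_string)
    counts = Counter(
        input_string[i:j]
        for i in range(n)
        for j in range(i + min_length, n + 1)  # slices of at least min_length
    )
    # Overlapping occurrences are counted, as in the function above.
    repeated = [sub for sub, count in counts.items() if count > 1]
    return sorted(repeated, key=lambda sub: (-len(sub), sub))


print(repeated_substrings("ghhzgghz", min_length=2))
print(repeated_substrings("qwerty"))
```

On a dataframe, either function can be applied per row, for example df['C'] = df['A'].apply(test_substrings), and bool(repeated_substrings(x)) gives the same True/False answer when min_length is 1.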

Another solution, which makes it easy to change the minimal length of the tested strings:

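from collections import Counter
from itertools import chain, combinations

import pandas as pd

# changed first word asdfg to asdfa
df = pd.DataFrame({'A': ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"],
                   'B': [1, 2, 3, 4, 5]})

def test_substrings(input_string, N):
    # note: combinations() picks characters in order but not necessarily
    # adjacent ones, so this counts subsequences of length N and longer
    s = chain.from_iterable(combinations(input_string, x)
                            for x in range(N, len(input_string) + 1))
    return any(x > 1 for x in Counter(s).values())

# C: minimal length 1, D: minimal length 2
df['C'] = df['A'].apply(lambda x: test_substrings(x, 1))
df['D'] = df['A'].apply(lambda x: test_substrings(x, 2))
print(df)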
          A  B      C      D
0     asdfa  1   True  False
1    abcdef  2  False  False
2    ababab  3   True   True
3  ghhzgghz  4   True   True
4    qwerty  5  False  False
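One caveat worth checking: itertools.combinations yields character picks that keep their order but need not be adjacent, so the function above can report a repeat where no contiguous substring repeats. The sketch below compares it with a slice-based variant. Both function names and the "aabb" test string are assumptions chosen for illustration:

```python
from collections import Counter
from itertools import chain, combinations


def has_repeated_pick(input_string, n):
    # Same logic as the combinations-based function above: counts ordered
    # character picks (subsequences) of length n and longer.
    picks = chain.from_iterable(
        combinations(input_string, size)
        for size in range(n, len(input_string) + 1)
    )
    return any(count > 1 for count in Counter(picks).values())


def has_repeated_slice(input_string, n):
    # Counts only contiguous substrings of length n and longer.
    length = len(input_string)
    slices = (
        input_string[i:j]
        for i in range(length)
        for j in range(i + n, length + 1)
    )
    return any(count > 1 for count in Counter(slices).values())


# "aabb" has no contiguous substring of length >= 2 that occurs twice,
# but the pick ('a', 'b') can be formed in four different ways.
print(has_repeated_pick("aabb", 2), has_repeated_slice("aabb", 2))
```

For the five sample strings, both variants give the same C and D columns as shown above. They differ on inputs such as "aabb", and the slice-based variant also avoids the exponential number of combinations on longer strings.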

