Finding separators/delimiters in lists of strings

I am trying to find the separators in a file that may or may not contain any. Which characters act as separators, if there are any, is also unknown.

So far I have written the following code in an attempt to "solve" this:

strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf'
]

# Process first line
line1 = strings[0]
separators = set()
for sep in set(line1):
    separators.add(( sep, line1.count(sep) ))

# Process all the others
for line in strings:
    for sep, sepcount in separators.copy():
        # Drop any candidate whose count differs on this line
        if line.count(sep) != sepcount:
            separators.remove((sep, sepcount))

print(separators)

It returns the set set([('_', 2), ('~', 1)]), which is good, but unfortunately it does not record the order in which the separators appear on each line. In fact, it is not even known whether these separators occur in a consistent order.

The rules for separators are simple:

  1. They must occur the same number of times on each line,
  2. They must occur in the same order on each line,
  3. No non-separator character may also be a separator character.

Note that in the example above, '4' was excluded as a separator because it comes up twice in the third string, which violates rules 1 and 3.

The Question
How can I modify this code to check rule 2 and correctly print the order of the separators?

I'd use a Counter instead of .count, take skrrgwasme's suggestion to use a list, and use itertools.combinations to help iterate over the subsets of possible separators:

from collections import Counter
from itertools import combinations

def subsets(elems):
    for width in range(1, len(elems)+1):
        for comb in combinations(elems, width):
            yield comb

def sep_order(string, chars):
    chars = set(chars)
    order = tuple(c for c in string if c in chars)
    return order

def find_viable_separators(strings):
    counts = [Counter(s) for s in strings]
    chars = {c for c in counts[0]
             if all(count[c]==counts[0][c] for count in counts)}
    for seps in subsets(chars):
        orders = {sep_order(s, seps) for s in strings}
        if len(orders) == 1:
            yield seps, next(iter(orders))

which gives me

>>> 
... strings = [
...     'cabhb2k4ack_sfdfd~ffrref_lk',
...     'iodja_24ed~092oi3jelk_fcjcad',
...     'lkn04432m_90osidjlknxc~o_pf'
... ]
... 
... for seps, order in find_viable_separators(strings):
...     print("possible separators:", seps, "with order:", order)
...             
possible separators: ('~',) with order: ('~',)
possible separators: ('_',) with order: ('_', '_')
possible separators: ('~', '_') with order: ('_', '~', '_')
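
Once a set of separators has been chosen from that output, the strings can be cut into fields. The following is only a minimal sketch, assuming the largest viable set ('~', '_') is the one wanted; split_on is a hypothetical helper name that is not part of the answer's code, and it relies on the standard re.split.

```python
import re

def split_on(strings, seps):
    """Split each string on any single character found in `seps` (hypothetical helper)."""
    # Build a character class such as [~_]; re.escape guards special characters.
    pattern = '[' + re.escape(''.join(seps)) + ']'
    return [re.split(pattern, s) for s in strings]

strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf'
]

# Assumption: ('~', '_') is the separator set picked from the results above.
fields = split_on(strings, ('~', '_'))
for row in fields:
    print(row)
```

With three separator occurrences per line, each string should yield four fields.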

Given rule 1, each separator occurs the same number of times on every line, from the first line of the list to the last.

I don't find rule 3 very well expressed. I think it must be understood as: "a character used as a separator cannot also appear among the characters of a line that are considered non-separators".

Thus, given rules 1 AND 3, any character whose number of occurrences per line changes even once between two successive lines can't be a separator.

So the principle of the code below is:
· first, to create a list sep_n of all the characters present in the first line, each paired with its number of occurrences in that line,
· then, to iterate over the list of lines S and eliminate from sep_n each character whose number of occurrences doesn't stay the same.

S = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf',
    'hgtr5v_8mgojnb5+87rt~lhiuhfj_n547'
    ]
# 1.They must occur the same number of times per line, 
line0 = S.pop(0)
sep_n = [ (c,line0.count(c)) for c in line0]
print(line0); print(sep_n,'\n')

for line in S:
    sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]
    print(line); print(sep_n,'\n')

S.insert(0, line0)

# 2.They must occur in the same order on each line,
separators_in_order = [x[0] for x in sep_n]
print('separators_in_order : ',separators_in_order)
separators          = ''.join(set(separators_in_order))

for i,line in enumerate(S):
    if [c for c in line if c in separators] != separators_in_order:
        print(i,line)

If the non-separator characters vary enough in their number of occurrences from line to line, the length of sep_n in my code decreases rapidly as the list is iterated.

The instruction sep_n = [ (c,line0.count(c)) for c in line0] is the reason why the final order obtained in separators_in_order is the order found in the first line of the list S.

But I can't imagine a way to test, during that iteration, that the order of the separators stays the same from one line to the next. In fact, it seems to me impossible to do such a test during the iteration, because the list of separators is fully known only after the iteration has completed.

That's why a secondary check must be done after the value of sep_n has been obtained. It needs to iterate through the list S again.
The problem is that, although "any character whose number of occurrences per line changes even once between two successive lines can't be a separator", a non-separator character may nevertheless appear strictly the same number of times on all the lines, and so cannot be detected as a non-separator on the basis of its number of occurrences.
But since there is a chance that such a non-separator character is not always located at the same position in the sequence of characters with steady occurrences, the secondary verification can still catch it.

Finally, an extreme case could exist: a non-separator character appears exactly the same number of times on all the lines and is placed among the separators in such a way that it can't be detected even by the secondary verification.
I don't know how to solve this case.
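
The secondary verification can be packaged as a small function that compares, line by line, the sequence of candidate characters. This is only a sketch under the assumption that the candidates have already been filtered by their number of occurrences; order_signature and consistent_order are hypothetical names, and the last pair of strings is made up to illustrate the extreme case.

```python
def order_signature(line, candidates):
    """Return the candidate characters of `line`, in order of appearance."""
    return tuple(c for c in line if c in candidates)

def consistent_order(lines, candidates):
    """True if every line yields the same sequence of candidate characters."""
    signatures = {order_signature(line, candidates) for line in lines}
    return len(signatures) <= 1

lines = [
    'cabhb2k4ack_sfd#fd~ffrref_lk',
    'iodja_24ed~092oi#3jelk_fcjcad',
]
ok_without_hash = consistent_order(lines, set('_~'))   # both lines give _ ~ _
ok_with_hash = consistent_order(lines, set('_~#'))     # '#' moves, so the order differs

# Made-up extreme case: '#' always sits between the first '_' and '~',
# so neither the count test nor the order test can rule it out.
undetectable = ['ab_cd#ef~gh_ij', 'kl_mn#op~qr_st']
extreme = consistent_order(undetectable, set('_~#'))

print(ok_without_hash, ok_with_hash, extreme)
```

In the extreme case the check passes, which is consistent with the remark above: from the data alone, such a character behaves exactly like a separator.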

The result is

cabhb2k4ack_sfdfd~ffrref_lk
[('c', 2), ('a', 2), ('b', 2), ('h', 1), ('b', 2), ('2', 1), ('k', 3), ('4', 1), ('a', 2), ('c', 2), ('k', 3), ('_', 2), ('s', 1), ('f', 5), ('d', 2), ('f', 5), ('d', 2), ('~', 1), ('f', 5), ('f', 5), ('r', 2), ('r', 2), ('e', 1), ('f', 5), ('_', 2), ('l', 1), ('k', 3)] 

iodja_24ed~092oi3jelk_fcjcad
[('c', 2), ('a', 2), ('4', 1), ('a', 2), ('c', 2), ('_', 2), ('~', 1), ('_', 2), ('l', 1)] 

lkn04432m_90osidjlknxc~o_pf
[('_', 2), ('~', 1), ('_', 2)] 

hgtr5v_8mgojnb5+87rt~lhiuhfj_n547
[('_', 2), ('~', 1), ('_', 2)] 

separators_in_order :  ['_', '~', '_']

And with

S = [
    'cabhb2k4ack_sfd#fd~ffrref_lk',
    'iodja_24ed~092oi#3jelk_fcjcad',
    'lkn04432m_90osi#djlknxc~o_pf',
    'h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547'
    ]

the result is

cabhb2k4ack_sfd#fd~ffrref_lk
[('c', 2), ('a', 2), ('b', 2), ('h', 1), ('b', 2), ('2', 1), ('k', 3), ('4', 1), ('a', 2), ('c', 2), ('k', 3), ('_', 2), ('s', 1), ('f', 5), ('d', 2), ('#', 1), ('f', 5), ('d', 2), ('~', 1), ('f', 5), ('f', 5), ('r', 2), ('r', 2), ('e', 1), ('f', 5), ('_', 2), ('l', 1), ('k', 3)] 

iodja_24ed~092oi#3jelk_fcjcad
[('c', 2), ('a', 2), ('4', 1), ('a', 2), ('c', 2), ('_', 2), ('#', 1), ('~', 1), ('_', 2), ('l', 1)] 

lkn04432m_90osi#djlknxc~o_pf
[('_', 2), ('#', 1), ('~', 1), ('_', 2)] 

h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547
[('_', 2), ('#', 1), ('~', 1), ('_', 2)] 

separators_in_order :  ['_', '#', '~', '_']
1 iodja_24ed~092oi#3jelk_fcjcad
3 h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547


NB 1
The instruction line0 = S.pop(0) is used
to avoid writing for line in S[1:]:,
because S[1:] creates a new list, which could be expensive for a long list.
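
Another way to skip the first line without copying or modifying S is to use an iterator. A minimal sketch, assuming the same four-line list S as above; the variable name lines is introduced here only for illustration.

```python
S = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf',
    'hgtr5v_8mgojnb5+87rt~lhiuhfj_n547'
]

lines = iter(S)       # an iterator over S; no copy of the list is made
line0 = next(lines)   # consume the first line
sep_n = [(c, line0.count(c)) for c in line0]

for line in lines:    # continues from the second line
    sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]

print(sep_n)
```

S is left untouched, so no S.insert(0, line0) is needed afterwards. itertools.islice(S, 1, None) would be another option for the same purpose.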


NB 2
In order to avoid creating a new sep_n list at each turn of the iteration over S,
it is better to write the iteration as follows:

for line in S:
    for x in sep_n:
        # Rebuild sep_n only if at least one candidate fails on this line
        if line.count(x[0]) != x[1]:
            sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]
            break
    print(line); print(sep_n,'\n')
