
Finding separators/delimiters in lists of strings

I am trying to find separators in a file that may or may not have separators, and what those separators are - if any - is also not known.

So far I have written the following code in an attempt to "solve" this:

strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf'
]

# Process first line
line1 = strings[0]
separators = set()
for sep in set(line1):
    separators.add(( sep, line1.count(sep) ))

# Process all the others
for line in strings:
    for sep,sepcount in separators.copy():
        if line.count(sep) != sepcount: separators.remove( (sep,sepcount) )

print separators

It returns the set set([('_', 2), ('~', 1)]), which is good, but unfortunately it does not contain the order of the separators in the file. In fact, it's not even known whether there is a consistent order for these separators.

The rules for separators are simple:

  1. They must occur the same number of times per line,
  2. They must occur in the same order on each line,
  3. None of the non-separator characters can be separator characters.

Note that in the example above, '4' was excluded as a separator because it appears twice in the third string (rules 1 and 3).

The Question
How can I modify this code to check rule 2 and correctly print the order of the separators?

I'd use a Counter instead of .count, take skrrgwasme's suggestion to use a list, and use itertools.combinations to help iterate over the subsets of possible separators:

from collections import Counter
from itertools import combinations

def subsets(elems):
    for width in range(1, len(elems)+1):
        for comb in combinations(elems, width):
            yield comb

def sep_order(string, chars):
    chars = set(chars)
    order = tuple(c for c in string if c in chars)
    return order

def find_viable_separators(strings):
    counts = [Counter(s) for s in strings]
    chars = {c for c in counts[0]
             if all(count[c]==counts[0][c] for count in counts)}
    for seps in subsets(chars):
        orders = {sep_order(s, seps) for s in strings}
        if len(orders) == 1:
            yield seps, next(iter(orders))

which gives me

>>> strings = [
...     'cabhb2k4ack_sfdfd~ffrref_lk',
...     'iodja_24ed~092oi3jelk_fcjcad',
...     'lkn04432m_90osidjlknxc~o_pf'
... ]
>>> for seps, order in find_viable_separators(strings):
...     print("possible separators:", seps, "with order:", order)
...
possible separators: ('~',) with order: ('~',)
possible separators: ('_',) with order: ('_', '_')
possible separators: ('~', '_') with order: ('_', '~', '_')
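
Since the generator yields every viable subset, a caller still has to choose one. The sketch below assumes that the largest consistent subset is the one wanted, which the question does not state. The names viable_separator_sets and largest_separator_set are hypothetical, and the candidates are sorted only to make the iteration order reproducible.

```python
from collections import Counter
from itertools import combinations

def viable_separator_sets(strings):
    """Yield (separator_subset, order) pairs that satisfy rules 1 and 2."""
    counts = [Counter(s) for s in strings]
    # Rule 1: keep characters whose count is identical on every line.
    steady = sorted(c for c in counts[0]
                    if all(cnt[c] == counts[0][c] for cnt in counts))
    for width in range(1, len(steady) + 1):
        for seps in combinations(steady, width):
            chosen = set(seps)
            # Rule 2: the sequence of separators must match on every line.
            orders = {tuple(c for c in s if c in chosen) for s in strings}
            if len(orders) == 1:
                yield seps, orders.pop()

def largest_separator_set(strings):
    """Hypothetical helper: return the viable subset with the most characters."""
    return max(viable_separator_sets(strings),
               key=lambda pair: len(pair[0]), default=None)

strings = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf',
]
best = largest_separator_set(strings)
print(best)  # (('_', '~'), ('_', '~', '_'))
```

When no character has a steady count, the helper returns None instead of raising. Note that the number of subsets grows exponentially with the number of steady characters, so this is only practical when few candidates survive rule 1.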

Given rule 1, each separator occurs the same number of times on every line, from the first line of the list to the last.

I don't find rule 3 very well expressed. I think it must be understood as: "a character used as a separator cannot also appear among the non-separator characters of a line".

Thus, given rules 1 AND 3, any character whose number of occurrences per line changes even once between two successive lines can't be a separator.

So, the principle of the code below is
· first, to create a list sep_n of all the characters present in the first line, each paired with its number of occurrences in that line,
· and then to iterate over the list of lines S and remove from sep_n each character whose number of occurrences doesn't remain the same.

S = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf',
    'hgtr5v_8mgojnb5+87rt~lhiuhfj_n547'
    ]
# 1.They must occur the same number of times per line, 
line0 = S.pop(0)
sep_n = [ (c,line0.count(c)) for c in line0]
print(line0); print(sep_n,'\n')

for line in S:
    sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]
    print(line); print(sep_n,'\n')

S.insert(0, line0)

# 2.They must occur in the same order on each line,
separators_in_order = [x[0] for x in sep_n]
print('separators_in_order : ',separators_in_order)
separators          = ''.join(set(separators_in_order))

for i,line in enumerate(S):
    if [c for c in line if c in separators] != separators_in_order:
        print(i,line)

If the non-separator characters vary enough in their number of occurrences from line to line, the length of sep_n in my code decreases rapidly as the list is iterated.

Because of the instruction sep_n = [ (c,line0.count(c)) for c in line0], the final order obtained in separators_in_order is the order found in the first line of the list S.

But I can't think of a way to test, during that iteration, that the order of the separators stays the same from one line to the next. It seems to me that such a test is impossible at that stage, because the list of separators is fully known only after the iteration has completed.

That's why a secondary check must be done once the value of sep_n has been obtained. It requires iterating through the list S again.
The problem is that, although "any character whose number of occurrences per line changes even once between two successive lines can't be a separator", a non-separator character may still appear exactly the same number of times in all the lines, in which case it can't be detected as a non-separator from the number of occurrences alone.
But since there is a chance that such a non-separator character is not always at the same position within the sequence of characters with steady occurrences, the secondary check can catch it.

Finally, an extreme case could exist: a non-separator character appears exactly the same number of times in all the lines and is positioned among the separators in such a way that even the secondary check can't detect it.
I don't know how to solve this case.
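
To make this extreme case concrete, here is a small sketch on made-up data. It assumes that 'X' is ordinary data rather than a separator; the input lines and the two helper names are hypothetical and only mirror the two checks described above.

```python
def steady_chars_in_order(lines):
    """Characters of the first line whose count is the same on every line."""
    first = lines[0]
    return [c for c in first
            if all(line.count(c) == first.count(c) for line in lines)]

def order_is_consistent(lines, kept):
    """True if every line shows the kept characters in the same sequence."""
    chosen = set(kept)
    return all([c for c in line if c in chosen] == kept for line in lines)

# 'X' is assumed to be data, yet it occurs once per line,
# always between '_' and '~'.
lines = ['ab_cdXef~gh', 'ij_klXmn~op', 'qr_stXuv~wx']

kept = steady_chars_in_order(lines)
print(kept)                              # ['_', 'X', '~']
print(order_is_consistent(lines, kept))  # True
```

Both checks pass, so from counts and order alone 'X' cannot be told apart from a real separator. Some extra knowledge about the data would be needed to exclude it.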

The result of the main code above is

cabhb2k4ack_sfdfd~ffrref_lk
[('c', 2), ('a', 2), ('b', 2), ('h', 1), ('b', 2), ('2', 1), ('k', 3), ('4', 1), ('a', 2), ('c', 2), ('k', 3), ('_', 2), ('s', 1), ('f', 5), ('d', 2), ('f', 5), ('d', 2), ('~', 1), ('f', 5), ('f', 5), ('r', 2), ('r', 2), ('e', 1), ('f', 5), ('_', 2), ('l', 1), ('k', 3)] 

iodja_24ed~092oi3jelk_fcjcad
[('c', 2), ('a', 2), ('4', 1), ('a', 2), ('c', 2), ('_', 2), ('~', 1), ('_', 2), ('l', 1)] 

lkn04432m_90osidjlknxc~o_pf
[('_', 2), ('~', 1), ('_', 2)] 

hgtr5v_8mgojnb5+87rt~lhiuhfj_n547
[('_', 2), ('~', 1), ('_', 2)] 

separators_in_order :  ['_', '~', '_']

And with

S = [
    'cabhb2k4ack_sfd#fd~ffrref_lk',
    'iodja_24ed~092oi#3jelk_fcjcad',
    'lkn04432m_90osi#djlknxc~o_pf',
    'h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547'
    ]

the result is

cabhb2k4ack_sfd#fd~ffrref_lk
[('c', 2), ('a', 2), ('b', 2), ('h', 1), ('b', 2), ('2', 1), ('k', 3), ('4', 1), ('a', 2), ('c', 2), ('k', 3), ('_', 2), ('s', 1), ('f', 5), ('d', 2), ('#', 1), ('f', 5), ('d', 2), ('~', 1), ('f', 5), ('f', 5), ('r', 2), ('r', 2), ('e', 1), ('f', 5), ('_', 2), ('l', 1), ('k', 3)] 

iodja_24ed~092oi#3jelk_fcjcad
[('c', 2), ('a', 2), ('4', 1), ('a', 2), ('c', 2), ('_', 2), ('#', 1), ('~', 1), ('_', 2), ('l', 1)] 

lkn04432m_90osi#djlknxc~o_pf
[('_', 2), ('#', 1), ('~', 1), ('_', 2)] 

h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547
[('_', 2), ('#', 1), ('~', 1), ('_', 2)] 

separators_in_order :  ['_', '#', '~', '_']
1 iodja_24ed~092oi#3jelk_fcjcad
3 h#gtr5v_8mgojnb5+87rt~lhiuhfj_n547


NB 1
The instruction line0 = S.pop(0) is there to avoid writing for line in S[1:]:,
because S[1:] creates a new list, which could be heavy for a long list.
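
As a possible alternative, itertools.islice can skip the first line without copying the list and without modifying S, so the later S.insert(0, line0) is not needed. This is only a sketch of that variant of the filtering loop, not the code used to produce the results above.

```python
from itertools import islice

S = [
    'cabhb2k4ack_sfdfd~ffrref_lk',
    'iodja_24ed~092oi3jelk_fcjcad',
    'lkn04432m_90osidjlknxc~o_pf',
    'hgtr5v_8mgojnb5+87rt~lhiuhfj_n547',
]

line0 = S[0]
sep_n = [(c, line0.count(c)) for c in line0]

# islice(S, 1, None) iterates from the second line on, lazily, with no copy.
for line in islice(S, 1, None):
    sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]

separators_in_order = [c for c, _ in sep_n]
print(separators_in_order)  # ['_', '~', '_']
```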


NB 2
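In order to avoid creating a new sep_n list at each turn of the iteration over S,
it is better to write the iteration as follows, so that the list is rebuilt only when at least one entry no longer matches:

for line in S:
    for x in sep_n:
        # Rebuild only if some character's count differs on this line
        if line.count(x[0]) != x[1]:
            sep_n = [x for x in sep_n if line.count(x[0]) == x[1]]
            break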
    print(line); print(sep_n,'\n')
