
Python compare partial string in a list with each other

I am trying to write code that compares each string in a list with the others and then generates a regex that captures what they have in common.

list = ["LONDON-UK-L16-N1",
        "LONDON-UK-L17-N1",
        "LONDON-UK-L16-N2",
        "LONDON-UK-L17-N2",
        "PARIS-France-L16-N2"]

I am trying to get an output as below

LONDON-UK-L(16|17)-N(1|2)

Is that possible? Thanks.

Update: to make it clear, this is what I am trying to do.

Input: a list of strings.
Action: compare the list items with each other and check for similarity. The first group of each string stays fixed, and any part that differs between items is written as a regex alternation, so that instead of four items we get a single output.
Output: a regex that covers the parts that differ.

input: tez15-3-s1-y2, tez15-3-s2-y2, bro40-55-s1-y2

output: tez15-3-s(1|2)-y2, bro40-55-s1-y2

It's not entirely clear from your question what the exact problem is. Since the example data is consistent and well ordered, the problem can be solved by splitting each item on "-" and grouping the pieces by their common prefix.

loc_list = ["LONDON-UK-L16-N1", "LONDON-UK-L17-N1", "LONDON-UK-L16-N2", 
            "LONDON-UK-L16-N2", "PARIS-France-L16-N2"]

split_loc_list = [location.split("-")  for location in loc_list]

locs = {}

for loc in split_loc_list:
    # Group by the first two fields, e.g. "LONDON-UK".
    tags = locs.setdefault("-".join(loc[0:2]), {})
    # Collect the numeric part of the "L" and "N" fields for that group.
    tags.setdefault("L", set()).add(loc[2].strip("L"))
    tags.setdefault("N", set()).add(loc[3].strip("N"))

for loc, vals in locs.items():
    # Sort numerically so that, for example, 9 comes before 16.
    L_vals_joined = "|".join(map(str, sorted(map(int, vals["L"]))))
    N_vals_joined = "|".join(map(str, sorted(map(int, vals["N"]))))

    print(f"{loc}-L({L_vals_joined})-N({N_vals_joined})")

will output:

LONDON-UK-L(16|17)-N(1|2)
PARIS-France-L(16)-N(2)

Since there were only two tags here ("L" and "N"), I just wrote them into the code. If there are many tags possible, then you can strip by any letter using:

import re
split = re.findall(r'\d+|\D+', loc[2])  # e.g. "L16" -> ["L", "16"]
key, val = split[0], split[1]
locs.setdefault("-".join(loc[0:2]), {}).\
                        setdefault(key, set()).add(val)

Then iterate through all the tags instead of just fetching "L" and "N" in the second loop.
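As a rough sketch of what that generalisation could look like: the function name summarise and its prefix_fields parameter are made up for this illustration, and the code assumes every field after the prefix is a letter tag followed by a number (like "L16").

```python
import re
from collections import defaultdict


def summarise(items, prefix_fields=2):
    """Group items by their first `prefix_fields` fields and merge tag values.

    Hypothetical helper: assumes every remaining field looks like "L16".
    """
    # prefix -> tag -> set of numeric strings (insertion order is preserved)
    groups = defaultdict(lambda: defaultdict(set))
    for item in items:
        fields = item.split("-")
        prefix = "-".join(fields[:prefix_fields])
        for field in fields[prefix_fields:]:
            match = re.fullmatch(r"(\D+)(\d+)", field)
            if match is None:
                raise ValueError(f"unexpected field {field!r} in {item!r}")
            tag, number = match.groups()
            groups[prefix][tag].add(number)

    patterns = []
    for prefix, tags in groups.items():
        parts = [prefix]
        # Iterate over every tag that was seen, not just "L" and "N".
        for tag, numbers in tags.items():
            joined = "|".join(sorted(numbers, key=int))
            parts.append(f"{tag}({joined})")
        patterns.append("-".join(parts))
    return patterns


loc_list = ["LONDON-UK-L16-N1", "LONDON-UK-L17-N1", "LONDON-UK-L16-N2",
            "LONDON-UK-L16-N2", "PARIS-France-L16-N2"]

print(summarise(loc_list))
# ['LONDON-UK-L(16|17)-N(1|2)', 'PARIS-France-L(16)-N(2)']
```

Note that, like the code above, this collects the values of each tag independently, so the resulting pattern can match combinations that never appeared in the input (here LONDON-UK-L17-N2).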

I've implemented the following solution:

import re 

data = [
  'LONDON-UK-L16-N1',
  'LONDON-UK-L17-N1',
  'LONDON-UK-L16-N2',
  'LONDON-UK-L16-N2',
  'PARIS-France-L16-N2'
]

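def deconstruct(data):
  # Split every item into its dash-separated fields.
  data = [x.split('-') for x in data]
  result = dict()

  for x in data:
    # Walk down a nested dict (a prefix tree), one level per field.
    pointer = result

    for y in x:
      # The non-digit part of the field names the next level of the tree.
      substr = re.findall(r'(\D+)', y)
      if substr:
        substr = substr[0]
        if substr not in pointer:
          # Key 0 holds the set of numbers seen for this field.
          pointer[substr] = {0: set()}
        pointer = pointer[substr]

      # The digit part, if any, is recorded at the current level.
      substr = re.findall(r'(\d+)', y)
      if substr:
        substr = substr[0]
        pointer[0].add(substr)

  return result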

def construct(data, level=0):
  result = []

  for key in data.keys():
    if key != 0:
      if len(data[key][0]) == 1:
        nums = list(data[key][0])[0]
      elif len(data[key][0]) > 1:
        nums = '(' + '|'.join(sorted(list(data[key][0]))) + ')'
      else:
        nums = ''

      deeper_result = construct(data[key], level + 1)
      if not deeper_result:
        result.append([key + nums])
      else:
        for d in deeper_result:
          result.append([key + nums] + d)

  return result if level > 0 else ['-'.join(x) for x in result]

print(construct(deconstruct(data)))
# ['LONDON-UK-L(16|17)-N(1|2)', 'PARIS-France-L16-N2']

Here is a second implementation, which I think is more accurate and hope is helpful. It only merges rows whose remaining fields are identical, which is why L18 stays separate in the output:

import re 

data = [
  'LONDON-UK-L16-N1',
  'LONDON-UK-L17-N1',
  'LONDON-UK-L16-N2',
  'LONDON-UK-L17-N2',
  'LONDON-UK-L18-N2',
  'PARIS-France-L16-N2',
]

def merge(data):
  # Sort a copy so that the caller's list is left untouched.
  data = [x.split('-') for x in sorted(data)]

  for col in range(len(data[0]) - 1, -1, -1):
    result = []

    def add_result():
      result.append([])
      if headstr:
        result[-1] += headstr.split('-')
      if len(findnum) > 1:
        result[-1] += [f'{findstr}({"|".join(sorted(findnum))})']
      elif len(findnum) == 1:
        result[-1] += [f'{findstr}{findnum[0]}']
      if tailstr:
        result[-1] += tailstr.split('-')

    _headstr = lambda x, y: '-'.join(x[:y])
    _tailstr = lambda x, y: '-'.join(x[y + 1:])
    _findstr = lambda x: re.findall(r'(\D+)', x)[0] if re.findall(r'(\D+)', x) else ''
    _findnum = lambda x: re.findall(r'(\d+)', x)[0] if re.findall(r'(\d+)', x) else ''

    headstr = _headstr(data[0], col)
    tailstr = _tailstr(data[0], col)
    findstr = _findstr(data[0][col])
    findnum = []

    for row in data:
      if headstr + findstr + tailstr != _headstr(row, col) + _findstr(row[col]) + _tailstr(row, col):
        add_result()
        headstr = _headstr(row, col)
        tailstr = _tailstr(row, col)
        findstr = _findstr(row[col])
        findnum = []
      if _findnum(row[col]) not in findnum:
        findnum.append(_findnum(row[col]))

    # Flush the last group once the loop over the rows has finished.
    add_result()

    data = result[:]

  return ['-'.join(x) for x in result]

print(merge(data))  # ['LONDON-UK-L(16|17)-N(1|2)', 'LONDON-UK-L18-N2', 'PARIS-France-L16-N2']
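One way to sanity-check a result like this is to confirm that every input string is fully matched by one of the generated patterns, and that strings absent from the input are not. The sketch below hard-codes the patterns from the output above; the helper name matching_patterns is made up for this illustration.

```python
import re

data = [
  'LONDON-UK-L16-N1',
  'LONDON-UK-L17-N1',
  'LONDON-UK-L16-N2',
  'LONDON-UK-L17-N2',
  'LONDON-UK-L18-N2',
  'PARIS-France-L16-N2',
]

# Patterns copied from the output shown above.
patterns = ['LONDON-UK-L(16|17)-N(1|2)', 'LONDON-UK-L18-N2', 'PARIS-France-L16-N2']


def matching_patterns(item, patterns):
  """Return the patterns that match the whole of `item` (hypothetical helper)."""
  return [p for p in patterns if re.fullmatch(p, item)]


coverage = {item: matching_patterns(item, patterns) for item in data}
for item, hits in coverage.items():
  print(item, '->', hits)

# Inputs that no pattern matches; an empty list means full coverage.
uncovered = [item for item, hits in coverage.items() if not hits]
print('uncovered:', uncovered)

# A string that was never in the input should not be matched.
print(matching_patterns('LONDON-UK-L18-N1', patterns))
```

This treats the generated strings as regexes directly, which is only safe as long as the fixed parts contain no regex metacharacters.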

Don't use 'list' as a variable name. It is not a reserved word, but it shadows Python's built-in list type.

import re

lst = ['LONDON-UK-L16-N1', 'LONDON-UK-L17-N1', 'LONDON-UK-L16-N2', 'LONDON-UK-L16-N2', 'PARIS-France-L16-N2']

def check_it(string):
    # (\d+) captures the whole number; (\d)* would keep only its last digit.
    return re.search(r'[a-zA-Z\-]*L(\d+)-N(\d+)', string)

[check_it(x).group(0) for x in lst]

will output:

['LONDON-UK-L16-N1',
 'LONDON-UK-L17-N1',
 'LONDON-UK-L16-N2',
 'LONDON-UK-L16-N2',
 'PARIS-France-L16-N2']

From there, look into groups and define a group to cover the pieces that you want to use for similarity.
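A minimal sketch of that idea, assuming every string has the shape prefix-L<number>-N<number>: the group names prefix, L and N and the helper alternation are chosen for this illustration only.

```python
import re
from collections import defaultdict

lst = ['LONDON-UK-L16-N1', 'LONDON-UK-L17-N1', 'LONDON-UK-L16-N2',
       'LONDON-UK-L16-N2', 'PARIS-France-L16-N2']

# Named groups: the fixed prefix plus the two numbers that may vary.
PATTERN = re.compile(r'(?P<prefix>[a-zA-Z-]+)-L(?P<L>\d+)-N(?P<N>\d+)')

# prefix -> {"L": set of numbers, "N": set of numbers}
groups = defaultdict(lambda: {"L": set(), "N": set()})
for s in lst:
    m = PATTERN.fullmatch(s)
    if m:  # strings that do not fit the assumed shape are skipped
        groups[m["prefix"]]["L"].add(m["L"])
        groups[m["prefix"]]["N"].add(m["N"])


def alternation(values):
    """Render one value as-is, several values as (a|b|...)."""
    ordered = sorted(values, key=int)
    return ordered[0] if len(ordered) == 1 else "(" + "|".join(ordered) + ")"


result = [f"{prefix}-L{alternation(v['L'])}-N{alternation(v['N'])}"
          for prefix, v in groups.items()]
print(result)
# ['LONDON-UK-L(16|17)-N(1|2)', 'PARIS-France-L16-N2']
```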
