
How to sort a list by substring pattern to dict of dicts

I am trying to group a list of values by the substrings they have in common. The result should be a dict of lists, where each key is the shared substring and each value is the list of strings that contain it.

For example (the actual list has 24k entries):

test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

to:

resultdict = { 
'Doghouse' : ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
'by KatSkill' : [ 'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill' ]
}

I tried the following, but it doesn't work at all:

from itertools import groupby 
test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
            'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']


res = [list(i) for j, i in groupby(test_list, 
                          lambda a: a.partition('_')[0])]

mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
            'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
# Substrings to group by
test = ['Doghouse', 'by KatSkill']

Use a dict comprehension with a nested list comprehension:

res = { i: [j for j in mylist if i in j] for i in test}

Or set up your dict and use a loop with a list comprehension:

resultdict = {}
for i in test:
    resultdict[i] = [j for j in mylist if i in j]
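
As a quick check, the sketch below wraps the comprehension in a function and runs it on the sample data. The function name `group_by_known_substrings` is made up for illustration. It assumes the grouping substrings are known in advance, and note that a string matching several keys ends up in every matching list.

```python
def group_by_known_substrings(strings, keys):
    """Group strings under every key that occurs in them as a substring."""
    return {key: [s for s in strings if key in s] for key in keys}

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
test = ['Doghouse', 'by KatSkill']

resultdict = group_by_known_substrings(mylist, test)
print(resultdict)

# A string that contains both keys is listed under both of them
overlap = group_by_known_substrings(['Doghouse by KatSkill'], test)
print(overlap)
```

Each key triggers one full scan of the list, so the cost grows with the number of keys times the number of strings.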

First, find all substrings (separated by " ") that also appear in another string of the input list. In the process, build a dictionary with those substrings as keys and the matching input strings as values. This yields a dictionary whose keys are single words only. For the example, it returns:

{'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
 'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
 'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}

To obtain the intended outcome, a compaction step is required. It exploits the fact that each dictionary key is also part of the strings in the dictionary's lists. Iterate over the dictionary values and split each string into substrings again. Then walk through the substrings in order and determine the ranges of consecutive substrings that are dictionary keys. Join each such range into a combined key and add the string under that key in a new dict. For 24k entries this may take a while. See the source code below:

mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

def findSimilarSubstrings(strings):
    """Map each space-separated substring to the other strings that contain it."""
    res_dict = {}
    for string in strings:
        substrings = string.split(" ")
        for otherstring in strings:
            # Skip comparing a string with itself
            if otherstring == string:
                continue
            for substring in substrings:
                if substring in otherstring:
                    matches = res_dict.setdefault(substring, [])
                    # Prevent duplicates
                    if otherstring not in matches:
                        matches.append(otherstring)
    return res_dict

def findOverlappingLists(word_dict):
    """Join runs of consecutive words that are keys of word_dict into combined keys."""
    res_dict = {}
    for strings in word_dict.values():
        for string in strings:
            substrings = string.split(" ")
            lastIndex = 0
            lastKeyInDict = False
            # The trailing None sentinel closes a run that ends at the last word
            for i, substring in enumerate(substrings + [None]):
                if substring in word_dict:
                    if not lastKeyInDict:
                        # A new run of key words starts here
                        lastIndex = i
                        lastKeyInDict = True
                elif lastKeyInDict:
                    # The run ended: join it into one key and record the string
                    commonstring = " ".join(substrings[lastIndex:i])
                    grouped = res_dict.setdefault(commonstring, [])
                    # Prevent duplicates
                    if string not in grouped:
                        grouped.append(string)
                    lastKeyInDict = False
    return res_dict

# First find all the substrings (separated by " "), returning:
# {'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
#  'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
#  'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}
similarStrings = findSimilarSubstrings(mylist)
# Compact similarStrings by joining adjacent words that are both dictionary keys
resultdict = findOverlappingLists(similarStrings)
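
The first step above compares every string with every other one, which is the part most likely to be slow for 24k entries. If whole-word matches are enough, it can probably be done in a single pass with an inverted index (word to strings). This is a sketch under that assumption, not the answer's original method: the code above uses substring containment via `in`, so it would also match a word that occurs inside a longer word. The name `build_word_index` is hypothetical.

```python
from collections import defaultdict

def build_word_index(strings):
    """Map each word to the strings containing it, keeping words shared by 2+ strings."""
    index = defaultdict(list)
    # dict.fromkeys drops duplicate input strings while keeping their order
    for s in dict.fromkeys(strings):
        # set() so a word repeated inside one string is recorded only once
        for word in set(s.split()):
            index[word].append(s)
    return {word: members for word, members in index.items() if len(members) > 1}

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
word_index = build_word_index(mylist)
print(word_index)
```

The result has the same shape as the dictionary returned by findSimilarSubstrings, so it should be usable as the input to findOverlappingLists.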

Here's a perhaps simpler/faster implementation:

from collections import Counter
from itertools import groupby
import pprint

# Strategy:
# 1.  Find common words in strings in list
# 2.  Group strings which have the same common words together

def find_common_words(lst):
  " Find the words that appear in more than one string "
  cnt = Counter()
  for s in lst:
    # set() so that a word repeated within one string is counted once
    cnt.update(set(s.split(" ")))

  # return words which appear in more than one string
  words = {k for k, v in cnt.items() if v > 1}
  return words
  
def grouping_key(s, words):
  " Key function for grouping strings with common words in the same sequence"
  k = []
  for i in s.split():
    if i in words:
      k.append(i)
  return ' '.join(k)

def calc_groupings(lst):
  " Generate the string groups based upon common words "
  common_words = find_common_words(lst)
  key = lambda x: grouping_key(x, common_words)

  # groupby only merges adjacent items, so sort by the same key first
  g = groupby(sorted(lst, key=key), key)

  # Result
  return {k: list(v) for k, v in g}

t = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(calc_groupings(t))

Output:

{   'Doghouse': ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
    'by KatSkill': [   'House by KatSkill',
                       'Garden by KatSkill',
                       'Meadow by KatSkill']}
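
One thing to watch for with this approach: a string that shares no word with any other string gets the empty key ''. The sketch below is a variant, not the answer's own code, that groups in one pass with a `defaultdict` (so no sorting is needed) and returns such leftovers separately. The names `group_by_shared_words` and `ungrouped`, and the extra entry 'Lonely Entry', are made up for illustration.

```python
from collections import Counter, defaultdict

def group_by_shared_words(strings):
    """Group strings by the sequence of words they share with other strings."""
    # Count in how many strings each word occurs (at most once per string)
    counts = Counter(word for s in strings for word in set(s.split()))
    groups = defaultdict(list)
    ungrouped = []
    for s in strings:
        # Keep only the shared words, in their original order, as the key
        key = " ".join(w for w in s.split() if counts[w] > 1)
        if key:
            groups[key].append(s)
        else:
            ungrouped.append(s)
    return dict(groups), ungrouped

t = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
     'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill',
     'Lonely Entry']
groups, ungrouped = group_by_shared_words(t)
print(groups)
print(ungrouped)
```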
