How to sort a list into dicts by substring pattern

Question

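I am trying to group a list of values based on similar substrings. I want to group it into a dictionary of lists, where each key is the shared substring and each value is the list of the strings grouped under it.

For example (the actual list has 24k entries):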

test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

into:

resultdict = { 
'Doghouse' : ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
'by KatSkill' : [ 'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill' ]
}

I tried the following, but it does not work at all.

from itertools import groupby 
test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
            'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']


res = [list(i) for j, i in groupby(test_list, 
                          lambda a: a.partition('_')[0])]

Answer 1

mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
            'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
test = ['Doghouse', 'by KatSkill']

Using a dict and a list comprehension:

res = { i: [j for j in mylist if i in j] for i in test}

Or set up your dict and use a loop with a list comprehension:

resultdict = {}
for i in test:
    resultdict[i] = [j for j in mylist if i in j]
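
Note that this approach assumes the substrings to group by (the list `test`) are already known. As a small sketch of how it behaves on the sample data, the snippet below also collects strings that match none of the keys. The entry 'Lonely Cabin' and the name `unmatched` are made up for illustration and are not part of the question.

```python
mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill',
          'Lonely Cabin']  # extra entry added for illustration only
test = ['Doghouse', 'by KatSkill']

# one list per known substring
res = {i: [j for j in mylist if i in j] for i in test}

# strings that contain none of the known substrings (hypothetical helper list)
unmatched = [j for j in mylist if not any(i in j for i in test)]

print(res)
print(unmatched)
```

A string that contains more than one of the keys would show up in several groups, which may or may not be what you want.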

Answer 2

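Initially, find all substrings delimited by " " that also occur in another string of the input list. In the process, build a dictionary with each such substring as a key and the matching input strings as its values. This returns a dictionary that has only single substrings as keys. For the example it returns: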

{'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'], 'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'], 'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}

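To get the intended result, a compaction is needed. For the compaction it helps that every dictionary key is also part of the strings in the dictionary's lists. So iterate over the dictionary values and split the strings into substrings again. Then traverse the substrings in the order of the substring list and determine the ranges of the substring list that contain dictionary keys. Add each such range to a new dictionary. For 24k entries this may take a while. See the source code below: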

mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

def findSimilarSubstrings(strings):
    res_dict = {}
    for string in strings:
        substrings = string.split(" ")
        for otherstring in strings:
            # Prevent check with the same string
            if otherstring == string:
                continue
            for substring in substrings:
                if substring in otherstring:
                    if substring not in res_dict:
                        res_dict[substring] = []
                    # Prevent duplicates
                    if otherstring not in res_dict[substring]:
                        res_dict[substring].append(otherstring)
    return res_dict

def findOverlappingLists(substring_dict):
    res_dict = {}
    for strings in substring_dict.values():
        for string in strings:
            substrings = string.split(" ")
            lastIndex = 0
            lastKeyInDict = False
            numsubstrings = len(substrings)
            for i, substring in enumerate(substrings):
                if substring in substring_dict:
                    if not lastKeyInDict:
                        lastIndex = i
                        lastKeyInDict = True
                elif lastKeyInDict:
                    commonstring = " ".join(substrings[lastIndex:i])
                    # Add key string to res_dict
                    if commonstring not in res_dict:
                        res_dict[commonstring] = []
                    # Prevent duplicates
                    if string not in res_dict[commonstring]:
                        res_dict[commonstring].append(string)
                    lastKeyInDict = False
            # Handle last substring
            if lastKeyInDict:
                commonstring = " ".join(substrings[lastIndex:numsubstrings])
                if commonstring not in res_dict:
                    res_dict[commonstring] = []
                if string not in res_dict[commonstring]:
                    res_dict[commonstring].append(string)
    return res_dict

# Initially find all the substrings (separated by " ") returning:
# {'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
#  'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
#  'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}
similiarStrings = findSimilarSubstrings(mylist)
# Perform a compaction on similiarStrings.values() by lookup in the dictionary's key set
resultdict = findOverlappingLists(similiarStrings)
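
Since the first pass compares every string with every other string, it grows quadratically with the list size, which is likely where most of the time goes for 24k entries. A possible sketch of a single-pass first stage is shown below. It assumes that matching whole space-separated words is enough (the code above uses `in`, which also matches inside longer words), and `build_word_index` is a hypothetical name, not part of the answer.

```python
from collections import defaultdict

def build_word_index(strings):
    """Map each space-separated word to the strings containing it (hypothetical helper)."""
    index = defaultdict(list)
    # dict.fromkeys drops exact duplicate strings while keeping their order
    for string in dict.fromkeys(strings):
        # set() so a word repeated inside one string is recorded only once
        for word in set(string.split(" ")):
            index[word].append(string)
    # keep only words shared by at least two different strings
    return {word: members for word, members in index.items() if len(members) > 1}

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

word_index = build_word_index(mylist)
print(sorted(word_index))
```

For the sample data the resulting keys are the same three words as in the dictionary shown above, and each list keeps the input order.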

Answer 3

Here is a possibly simpler/faster implementation:

from collections import Counter
from itertools import groupby
import pprint

# Strategy:
# 1.  Find common words in strings in list
# 2.  Group strings which have the same common words together

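def find_common_words(lst):
  " finds words that appear in more than one string "
  cnt = Counter()
  for s in lst:
    # count each word once per string, so a word repeated inside a
    # single string is not mistaken for a common word
    cnt.update(set(s.split()))

  # return words which appear in more than one string
  words = {k for k, v in cnt.items() if v > 1}
  return words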
  
def grouping_key(s, words):
  " Key function for grouping strings with common words in the same sequence"
  k = []
  for i in s.split():
    if i in words:
      k.append(i)
  return ' '.join(k)

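def calc_groupings(lst):
  " Generate the string groups based upon common words "
  common_words = find_common_words(lst)
  key = lambda x: grouping_key(x, common_words)

  # Group strings with common words. groupby only merges adjacent items,
  # so sort by the key first (sorted is stable, order inside a group is kept)
  g = groupby(sorted(lst, key=key), key)

  # Result
  return {k: list(v) for k, v in g}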

t = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(calc_groupings(t))

Output

{   'Doghouse': ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
    'by KatSkill': [   'House by KatSkill',
                       'Garden by KatSkill',
                       'Meadow by KatSkill']}
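
One thing to keep in mind with this kind of grouping: itertools.groupby only merges adjacent items that share a key, so the order of the input matters. A minimal sketch of a variant that accumulates into a dict instead, and therefore does not depend on input order, might look like the following. It assumes the same word-based key as above; `group_by_common_words` and the shuffled sample list are made up for illustration.

```python
from collections import Counter, defaultdict

def group_by_common_words(strings):
    """Group strings by the words they share with other strings (hypothetical helper)."""
    counts = Counter()
    for s in strings:
        counts.update(set(s.split()))  # count each word once per string
    common = {w for w, n in counts.items() if n > 1}

    groups = defaultdict(list)
    for s in strings:
        # key = the common words of s, in their original order
        key = " ".join(w for w in s.split() if w in common)
        groups[key].append(s)
    return dict(groups)

# same data as above, but with the two groups interleaved
shuffled = ['Doghouse Amsterdam', 'House by KatSkill', 'Doghouse Antwerp',
            'Garden by KatSkill', 'Doghouse Vienna', 'Meadow by KatSkill']

result = group_by_common_words(shuffled)
print(result)
```

Strings that share no word with any other string end up under the empty key '', so you may want to handle that key separately.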
