How to sort a list by substring pattern to dict of dicts
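I am trying to sort a list of values based on similar substrings. I want to group it into a dict of lists, where the key is the shared substring and the value is the list of strings that contain it.
For example (the actual list has 24k entries):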
test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
to:
resultdict = {
'Doghouse' : ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
'by KatSkill' : [ 'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill' ]
}
I tried the following, but it does not work at all.
from itertools import groupby
test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
res = [list(i) for j, i in groupby(test_list,
lambda a: a.partition('_')[0])]
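A quick way to see why this attempt does not group anything, sketched under the assumption that none of the strings contains an underscore (true for the sample data): `partition('_')` then returns the whole string as the key, so every string ends up in its own group.

```python
from itertools import groupby

test_list = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
             'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

# Without a '_' in the string, partition returns (whole_string, '', '')
keys = [s.partition('_')[0] for s in test_list]

# Every key is unique, so groupby yields one single-element group per string
res = [list(group) for _, group in groupby(test_list, lambda a: a.partition('_')[0])]

print(keys == test_list)  # True
print(res)
```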
mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
test = ['Doghouse', 'by KatSkill']
Using a dict and list comprehension:
res = { i: [j for j in mylist if i in j] for i in test}
Or set up your dict and use a loop with a list comprehension:
resultdict = {}
for i in test:
    resultdict[i] = [j for j in mylist if i in j]
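The same idea wrapped in a small function, as a sketch only; the name `group_by_patterns` is made up for illustration. It assumes the patterns are already known, and since `in` is a plain substring test, a string containing several patterns lands in several groups.

```python
def group_by_patterns(strings, patterns):
    """Return {pattern: [strings containing pattern]} using substring matching."""
    return {p: [s for s in strings if p in s] for p in patterns}

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
test = ['Doghouse', 'by KatSkill']

resultdict = group_by_patterns(mylist, test)
print(resultdict)
```

Each pattern triggers a full pass over the list, so the cost grows with the number of patterns times the number of strings.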
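Initially, find all substrings (separated by " ") that also occur in another string of the input list. While doing so, build a dictionary with each such substring as a key and the matching input strings as its value. This returns a dictionary whose keys are single substrings only. With the example it returns: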
{'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'], 'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'], 'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}
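To get the intended result, a compaction step is needed. For the compaction it helps to exploit the fact that every dictionary key is also part of the strings in the dictionary's lists. So iterate over the dictionary values and split the strings into substrings again. Then walk through the substrings in the order of the substring list and determine the ranges of that list that consist of dictionary keys. Add the corresponding ranges to a new dictionary. For 24k entries this may take a while. See the source code below: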
mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
def findSimilarSubstrings(strings):
    res_dict = {}
    for string in strings:
        substrings = string.split(" ")
        for otherstring in strings:
            # Prevent check with the same string
            if otherstring == string:
                continue
            for substring in substrings:
                if substring in otherstring:
                    if substring not in res_dict:
                        res_dict[substring] = []
                    # Prevent duplicates
                    if otherstring not in res_dict[substring]:
                        res_dict[substring].append(otherstring)
    return res_dict

def findOverlappingLists(substring_dict):
    res_dict = {}
    for strings in substring_dict.values():
        for string in strings:
            substrings = string.split(" ")
            lastIndex = 0
            lastKeyInDict = False
            numsubstrings = len(substrings)
            for i in range(numsubstrings):
                substring = substrings[i]
                if substring in substring_dict:
                    # Start of a run of consecutive key words
                    if not lastKeyInDict:
                        lastIndex = i
                        lastKeyInDict = True
                elif lastKeyInDict:
                    # The run ended just before index i
                    commonstring = " ".join(substrings[lastIndex:i])
                    # Add key string to res_dict
                    if commonstring not in res_dict:
                        res_dict[commonstring] = []
                    # Prevent duplicates
                    if string not in res_dict[commonstring]:
                        res_dict[commonstring].append(string)
                    lastKeyInDict = False
            # Handle a run that extends to the last substring
            if lastKeyInDict:
                commonstring = " ".join(substrings[lastIndex:numsubstrings])
                if commonstring not in res_dict:
                    res_dict[commonstring] = []
                if string not in res_dict[commonstring]:
                    res_dict[commonstring].append(string)
    return res_dict

# Initially find all the substrings (separated by " ") returning:
# {'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
# 'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
# 'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}
similiarStrings = findSimilarSubstrings(mylist)
# Perform a compaction on similiarStrings.values() by lookup in the dictionary's key set
resultdict = findOverlappingLists(similiarStrings)
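The first phase above compares every pair of strings, which is where most of the time goes for a large list. A rough sketch of how an inverted index (word to strings) could build a similar mapping in a single pass. The function name `index_shared_words` is hypothetical, and the sketch assumes the input strings are distinct and that matching whole words is acceptable, whereas the code above uses a substring test.

```python
from collections import defaultdict

def index_shared_words(strings):
    """Map each word to the strings containing it, keeping only shared words."""
    index = defaultdict(list)
    for s in strings:
        # dict.fromkeys de-duplicates the words of one string, keeping order
        for word in dict.fromkeys(s.split(" ")):
            index[word].append(s)
    # A word is "shared" if it occurs in more than one string
    return {word: group for word, group in index.items() if len(group) > 1}

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

shared = index_shared_words(mylist)
print(shared)
```

The result has the same single-word keys as the first phase, so it could in principle be fed into the compaction step, though the order of the strings inside each list differs.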
Here is a possibly simpler/faster implementation:
from collections import Counter
from itertools import groupby
import pprint
# Strategy:
# 1. Find common words in strings in list
# 2. Group strings which have the same common words together
def find_common_words(lst):
    " Find words that occur more than once across the strings "
    cnt = Counter()
    for s in lst:
        cnt.update(s.split(" "))
    # return words which occur more than once overall
    words = set([k for k, v in cnt.items() if v > 1])
    return words

def grouping_key(s, words):
    " Key function for grouping strings with common words in the same sequence"
    k = []
    for i in s.split():
        if i in words:
            k.append(i)
    return ' '.join(k)

def calc_groupings(lst):
    " Generate the string groups based upon common words "
    common_words = find_common_words(lst)
    key = lambda x: grouping_key(x, common_words)
    # Group strings with common words. groupby only merges adjacent items,
    # so sort by the grouping key first (sorted is stable)
    g = groupby(sorted(lst, key=key), key)
    # Result
    return {k: list(v) for k, v in g}

t = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(calc_groupings(t))
{ 'Doghouse': ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
'by KatSkill': [ 'House by KatSkill',
'Garden by KatSkill',
'Meadow by KatSkill']}
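One thing to keep in mind: `itertools.groupby` only merges adjacent items with equal keys, so a grouping like the one above depends on the input order (or on sorting by the key first). A minimal sketch that collects the groups in a dict instead and therefore works on interleaved input; `group_by_shared_words` is an illustrative name, and a word counts as common here if it occurs more than once overall.

```python
from collections import Counter, defaultdict

def group_by_shared_words(strings):
    """Group strings by the sequence of words they share with other strings."""
    # Count every word across all strings
    counts = Counter(word for s in strings for word in s.split())
    groups = defaultdict(list)
    for s in strings:
        # Key: the common words of this string, in their original order
        key = " ".join(w for w in s.split() if counts[w] > 1)
        groups[key].append(s)
    return dict(groups)

# Interleaved input: equal keys are not adjacent
mixed = ['Doghouse Amsterdam', 'House by KatSkill',
         'Doghouse Antwerp', 'Garden by KatSkill']

grouped = group_by_shared_words(mixed)
print(grouped)
```

Strings that share no word with any other string all end up under the empty-string key, which may need separate handling on real data.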