Merge Tuples in a list - Spacy Trainset related
I have a list of 'n' (10K or more) tuples in SpaCy's training format, as shown below -
[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
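The plan is to group identical sentences and merge their dictionaries. I went for the obvious brute-force loop, but that will be very slow with 10-25K items. Is there a better or more efficient way to do this?
Desired output -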
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
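This relies on the fact that a str in Python is hashable, so it can be used as a dictionary key.
Here I use a dictionary keyed by the sentence string, i.e. the first element of each tuple.
If you have memory constraints, you can process the data in batches or use a hosted platform such as Google Colab.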
temp = [
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
data = {}
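for sentence, annotations in temp:
    if sentence in data:
        # Sentence seen before: add all of this item's entities
        data[sentence]['entities'].extend(annotations['entities'])
    else:
        # First occurrence: copy the list so the input is not mutated
        data[sentence] = {'entities': list(annotations['entities'])}
temp = list(data.items())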
print(temp)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
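The batching idea mentioned above could look roughly like the sketch below. The helper names `iter_batches` and `merge_in_batches` and the default `batch_size` are assumptions made for illustration, not part of the original answer. Note that batching only saves memory when the input is streamed (for example from a generator that reads a file), because the merged dictionary still has to hold every unique sentence.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def merge_in_batches(items, batch_size=1000):
    """Merge entities per sentence, consuming the input one batch at a time."""
    merged = {}
    for batch in iter_batches(items, batch_size):
        for sentence, annotations in batch:
            # setdefault creates the entry on first sight of a sentence
            entry = merged.setdefault(sentence, {'entities': []})
            entry['entities'].extend(annotations['entities'])
    return list(merged.items())


examples = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
]
# A tiny batch size is used here only to exercise the batching path
result = merge_in_batches(examples, batch_size=1)
print(result)
```

Because `items` can be any iterable, a generator that yields tuples lazily can be passed in place of a fully loaded list.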
Since the sentence is your key, a dict is the natural way to do this. For each entry in the list, add its entity tuples to a running list for that sentence, like this:
itemlist = [
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
sentence_dict = {}
for sentence, annotations in itemlist:
    # Fetch the running dict for this sentence, or start a new one
    value_dict = sentence_dict.setdefault(sentence, {'entities': []})
    # extend() keeps every entity of the item, not only the first one
    value_dict['entities'].extend(annotations['entities'])
# dict preserves insertion order, so sentences stay in first-seen order
list_of_tuples = list(sentence_dict.items())
>>> print(list_of_tuples)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
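The same grouping can also be written with `collections.defaultdict`, which removes the explicit "have I seen this key" step. This is a minimal sketch; the function name `merge_entities` and the extra sample sentence are assumptions for illustration. Since dictionary lookups are O(1) on average, the whole merge is a single O(n) pass, which should be comfortable for 10-25K items.

```python
from collections import defaultdict


def merge_entities(items):
    """Group (sentence, {'entities': [...]}) tuples by sentence in one pass."""
    grouped = defaultdict(list)
    for sentence, annotations in items:
        # A missing 'entities' key is treated as an empty list
        grouped[sentence].extend(annotations.get('entities', []))
    # Rebuild the SpaCy-style (text, annotations) tuples
    return [(sentence, {'entities': entities}) for sentence, entities in grouped.items()]


sample = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Amazon ships worldwide', {'entities': [(0, 6, 'ECOM-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
]
merged = merge_entities(sample)
print(merged)
```

Sentences come out in the order they were first seen, because dictionaries preserve insertion order in Python 3.7 and later.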
Hope this helps, and happy coding!