
Merge Tuples in a list - Spacy Trainset related

I have a list of 'n' (10K or more) tuples in spaCy's training format, shown below:

[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

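The plan is to group identical sentences and merge their dictionaries. The obvious idea is a brute-force loop, but with 10-25K entries that would be very slow. Is there a better/optimal way to do this?

Desired output: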

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]

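Use the fact that a str in Python is hashable, so it can serve as a dictionary key.

Here I use a dictionary whose keys are the sentences, i.e. the first element of each tuple.

If you have memory constraints, you can process the data in batches or use a hosted platform such as Google Colab.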

temp = [
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
data = {}
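for text, annotations in temp:
    # setdefault creates a fresh dict per sentence, so the input dicts are not mutated;
    # extend keeps every entity of an entry, not only the first one
    data.setdefault(text, {'entities': []})['entities'].extend(annotations['entities'])
temp = list(data.items())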
print(temp)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
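
If the input is too large to load at once, the same dictionary can be filled one batch at a time. The following is only a minimal sketch: `merge_batches` and the `batches` argument are hypothetical names introduced for illustration, not part of spaCy, and how the batches are produced (files, database pages, and so on) is left open. Note that the merged dictionary still holds every unique sentence, so only the input side is streamed.

```python
def merge_batches(batches):
    """Merge (text, {'entities': [...]}) tuples that arrive in several batches.

    `batches` is assumed to be any iterable of lists of training tuples.
    """
    merged = {}
    for batch in batches:
        for text, annotations in batch:
            # One entity list per sentence; extend keeps every entity of the entry
            merged.setdefault(text, []).extend(annotations['entities'])
    # Rebuild the (text, {'entities': [...]}) shape used in the question
    return [(text, {'entities': ents}) for text, ents in merged.items()]


batches = [
    [('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]})],
    [('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})],
]
print(merge_batches(batches))
```

Because the entity lists are built fresh inside the function, the dictionaries in the input batches are left unchanged.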

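Since the sentence is your key, a dict is the natural way to do this. For each entry in the list, append the entity tuple to the running list for that sentence, like so: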

itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

sentence_dict = {}
for sentence, annotations in itemlist:
    # setdefault returns the running dict for this sentence, creating it on first sight
    value_dict = sentence_dict.setdefault(sentence, {'entities': []})
    # extend keeps every entity tuple of the entry, not only the first one
    value_dict['entities'].extend(annotations['entities'])

list_of_tuples = []
for sentence, entity_dict in sentence_dict.items():
    list_of_tuples.append((sentence, entity_dict))

>>> print(list_of_tuples)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
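
The running list per sentence can also be kept with collections.defaultdict, which removes the need to supply a default value by hand. The sketch below is one possible variant, not the only way to do it: `merge_entities` is a hypothetical helper name, and dropping exact duplicate entities and sorting them by start offset are extra assumptions that the question does not ask for.

```python
from collections import defaultdict


def merge_entities(items):
    """Group entries by sentence and merge their entity lists in one pass."""
    grouped = defaultdict(list)
    for sentence, annotations in items:
        for entity in annotations['entities']:
            # Assumption: exact duplicate annotations should be stored only once
            if entity not in grouped[sentence]:
                grouped[sentence].append(entity)
    # Assumption: sort the tuples (start offset first) so the result
    # does not depend on the order of the input entries
    return [(sentence, {'entities': sorted(ents)}) for sentence, ents in grouped.items()]


itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
]
print(merge_entities(itemlist))
```

The duplicate check scans the list for that sentence, which is fine when each sentence has only a handful of entities.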

Hope this helps, happy coding!

