Merging tuples in a list - spaCy training set related

Question

I have a list of 'n' (10K or more) tuples in spaCy's training format, like the one below -

[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

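The plan is to group identical sentences and merge their dictionaries. My first idea was a brute-force loop, but that would be very slow with 10-25K items. Is there a better or more optimal way to do this?

Desired output -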

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]

Answer 1

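Use the fact that a str in Python is hashable and can therefore be used as a dictionary key.

Here I use a dictionary keyed by the string, i.e. the first element of each tuple.

If you have memory limits, you can process the data in batches or use a hosted platform such as Google Colab.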

temp = [
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
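data = {}
for text, annotations in temp:
    # The sentence itself is the dict key; setdefault() creates a fresh entry on first sight,
    # so the input dicts are never mutated, and extend() keeps every entity of an entry.
    data.setdefault(text, {'entities': []})['entities'].extend(annotations['entities'])
temp = list(data.items())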
print(temp)

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
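The batching suggestion above could look roughly like the following. This is only a sketch: `merge_batch` and `batches` are hypothetical helper names, not part of spaCy, and batching only saves memory when the input is streamed (for example read from a file), since the merged dict itself still has to fit in memory.

```python
from itertools import islice


def merge_batch(merged, batch):
    """Fold one batch of (text, {'entities': [...]}) tuples into `merged` in place."""
    for text, annotations in batch:
        merged.setdefault(text, {'entities': []})['entities'].extend(annotations['entities'])
    return merged


def batches(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


examples = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
]

merged = {}
for chunk in batches(examples, size=1):  # size=1 only to show merging across batches
    merge_batch(merged, chunk)

result = list(merged.items())
print(result)
```

Because the same `merged` dict is reused for every batch, sentences that are split across batches still end up in a single entry.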

Answer 2

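Since the sentence is your key, a dict is the natural way to do this. For each entry in the list, append the entity tuple to the running list for that sentence, as follows: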

itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

sentence_dict = {}
for item in itemlist:
    sentence = item[0]
    entities = item[1]['entities']  # The list of entity tuples for this entry
    value_dict = sentence_dict.get(sentence, {'entities': []})
    value_dict['entities'].extend(entities)  # extend() also covers entries with several entities
    sentence_dict[sentence] = value_dict

list_of_tuples = []
for sentence, entity_dict in sentence_dict.items():
    list_of_tuples.append((sentence, entity_dict))

>>> print(list_of_tuples)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
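The same grouping can also be sketched with `collections.defaultdict`, which removes the explicit `get` and reassignment. The function name `merge_examples` is made up for illustration, and dropping exact duplicate spans and sorting by start offset are optional extras that the question did not ask for.

```python
from collections import defaultdict


def merge_examples(examples):
    """Group (text, {'entities': [...]}) tuples by text and combine their entity spans."""
    spans_by_text = defaultdict(list)  # unseen sentences start with an empty list
    for text, annotations in examples:
        for span in annotations['entities']:
            if span not in spans_by_text[text]:  # skip exact duplicate spans
                spans_by_text[text].append(span)
    # Sort spans by start offset so the result does not depend on input order
    return [(text, {'entities': sorted(spans)}) for text, spans in spans_by_text.items()]


sentence = 'Apple are selling their products on Amazon'
examples = [
    (sentence, {'entities': [(36, 42, 'ECOM-ORG')]}),
    (sentence, {'entities': [(0, 5, 'PROD-ORG')]}),
    (sentence, {'entities': [(0, 5, 'PROD-ORG')]}),  # exact duplicate, dropped
]
merged = merge_examples(examples)
print(merged)
# [('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
```

Each input tuple is visited once, so the work grows linearly with the number of entries rather than quadratically as in a nested-loop comparison.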

Hope this helps, and happy coding!

Solution 1
1 Accepted 2020-06-05 04:00:15

Solution 2
0 2020-06-05 03:57:27
