I have a list of tuples. Each tuple consists of a string and a dict, and each dict holds a list of tuples under the key 'entities'. The full list has around 8K entries.
Sample data:
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
The expected output is:
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
I have written code that removes tuples whose first or second number repeats an earlier one in a list of tuples. Example:
newinput = [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT'),(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]
# Pass 1: keep only the first tuple seen for each first number
visited = set()
Outputs = []
for a, b, c in newinput:
    if a not in visited:
        visited.add(a)
        Outputs.append((a, b, c))

# Pass 2: of the survivors, keep only the first tuple seen for each second number
agn = []
newv = set()
for a, b, c in Outputs:
    if b not in newv:
        newv.add(b)
        agn.append((a, b, c))
print(agn)
#Output:
#[(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]
The code works, and I am able to retain only the tuples with unique numbers within the list. What I want now is to keep each sentence together with its unique tuples, as shown in the expected output. My real dataset is a huge list, so I want to do the operation in place and keep each sentence (for example 'made of iron oxide') attached to its entities rather than separating them. How can I do this efficiently, without creating multiple intermediate lists, and get the result in the expected format?
I have rewritten the code to drop any entity whose first value has already been seen in the same sentence, and then collect each sentence with its cleaned entities in a new list.
# dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
# NEW DATA SET BASED ON COMMENT
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (17, 20, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
clean_data = []
# loop through each sentence and its dict of values
for sentence, values in dataset:
    # ADDED TO ADDRESS REQUEST IN THE COMMENTS: reset for every sentence,
    # so duplicates are only detected within one sentence
    seen_values = set()
    kept = []
    for value in values['entities']:
        if value[0] not in seen_values:
            # keep the tuple if we have not seen its first value before
            seen_values.add(value[0])
            kept.append(value)
    # replace the list contents in place; calling remove() while iterating
    # over the same list can skip the element after each removal
    values['entities'][:] = kept
    clean_data.append((sentence, values))
print(clean_data)
Output:
# clean_data = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
# NEW DATA SET OUTPUT
clean_data = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (17, 20, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
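The loop above looks only at the first value of each tuple, while the question's own code also filters on the second value. Below is a minimal sketch of that stricter rule applied per sentence and in place. The function name `dedupe_in_place` and its variable names are hypothetical, and it assumes every dict has an 'entities' list of 3-tuples as in the sample data.

```python
def dedupe_in_place(dataset):
    """Drop entities whose first or second number repeats within a sentence."""
    for _sentence, annotations in dataset:
        seen_first = set()
        seen_second = set()
        kept = []
        for entity in annotations['entities']:
            first, second = entity[0], entity[1]
            if first in seen_first:
                continue  # first number already seen in this sentence
            seen_first.add(first)
            if second in seen_second:
                continue  # second number already used by a kept entity
            seen_second.add(second)
            kept.append(entity)
        # slice assignment rewrites the existing list object,
        # so the tuples in dataset keep pointing at the same dicts
        annotations['entities'][:] = kept


# sample data from the question
dataset = [
    ('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),
    ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]}),
]
dedupe_in_place(dataset)
print(dataset)
```

Using sets makes each membership test constant time on average, and the only extra memory is one temporary list per sentence, so no second copy of the roughly 8K-entry dataset is built.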