简体   繁体   中英

Combining values inside a list of tuples in Python to retain the expected output format

I have a list of tuples. Each tuple consists of a string and a dict. Now each dict in that, consists of a list of tuples. The size of the list is around 8K entries.

Sample data:

dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]

From here output expected is:

dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]

I have written the code that removes all overlapping values inside a list of tuples: Example:

newinput = [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT'),(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]

# using set 
visited = set() 

# Output list initialization 
Outputs = [] 

# Iteration 

for a, b, c in newinput:
    if not a in visited:
        # print(a)
        visited.add(a)
        # print(visited)
        Outputs.append((a, b,c))
# print(Outputs)
    # elif not b in visited:
    #     visited.add(b) 
    #     Output.append((a, b,c))
    # else:
    #     pass

agn = []        
newv = set()    
for a, b, c in Outputs:
    # print(b)
    if not b in newv:
        newv.add(b)
        # print(newv)
        agn.append((a,b,c))

print(agn)
#Output:
#[(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]

The code works fine and I am able to retain tuples with only unique numbers within the list. What I want now is to retain the sentences associated with the unique tuples(as mentioned in the expected output format). Also, my sample dataset is a huge list and I want to do the operations inplace and retain the associated sentences(example: 'made of iron oxide') also with the entities and not separate them. How can I do this effectively so that I don't use multiple lists as well as get the result in the expected format?

I have rewritten the code to find duplicate values and then combine into a new tuple.

# dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (12, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]
# NEW DATA SET BASED ON COMMENT
dataset = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT'), (12, 19, 'PRODUCT')]}),('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (17, 20, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]


seen_values = []
clean_data = []

# loop through each sentence and dict of values
for sentence, values in dataset:
    for value in values['entities']:

        if value[0] in seen_values:
            # remove if we have seen this before
            values['entities'].remove(value)
        else:
            # add to list if we have not seen this before
            seen_values.append(value[0])
    clean_data.append((sentence, values))

    # ADDED TO ADDRESS REQUEST IN THE COMMENTS
    seen_values = []

print(clean_data)

Output:

# clean_data = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]

# NEW DATA SET OUTPUT
clean_data = [('made of iron oxide', {'entities': [(12, 16, 'PRODUCT'), (17, 20, 'PRODUCT'), (15, 24, 'PRODUCT')]}), ('made of ferric oxide', {'entities': [(10, 15, 'PRODUCT'), (17, 20, 'PRODUCT'), (624, 651, 'PRODUCT'), (1937, 1956, 'PRODUCT')]})]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM