
Efficiently counting frequencies of strings in a large text file when their pre-counts are given

I have a list of lists of the form:

[['about70-130 characters long string', '332'], ['someotherrandomstring','2'], ['about70-130 characters long string', 32], ['someotherrandomstring', '3333']]

TO DO: I eventually want to sum the sizes of all the repeated strings like so:

[['about70-130 characters long string',364], ['someotherrandomstring',3335]]

I wrote brute-force code to solve this, but it takes a long time because the outer list contains about 2 million inner lists. The inefficient code I wrote is:

final = {} 
for element in both_list:
    size = int(element[1])
    if element[0] not in final.keys():
       final[element[0]] = size
    else:
       final[element[0]] += size

I'm pretty sure there's a more time-efficient way to do this, but I can't come up with any ideas. Any help or pointers in the right direction would be much appreciated. Thank you.

If you are okay with using the third-party library pandas:

import pandas as pd

a = [['about70-130 characters long string', '332'],
     ['someotherrandomstring', '2'],
     ['about70-130 characters long string', 32],
     ['someotherrandomstring', '3333']]
df = pd.DataFrame(a, columns=['label', 'counts'])
# The counts arrive as a mix of str and int, so normalize them to int first
df.counts = df.counts.astype('int')
# Sum the counts per label and return the result as a dict
df.groupby('label')['counts'].sum().to_dict()

It might be a little faster than your solution.

a = [['about70-130 characters long string', '332'],
     ['someotherrandomstring', '2'],
     ['about70-130 characters long string', 32],
     ['someotherrandomstring', '3333']]
d = {}
for i in a:
    # dict.get returns 0 for a string not seen yet, so no if/else is needed
    d[i[0]] = d.get(i[0], 0) + int(i[1])
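
The same single-pass dictionary idea can also be written with collections.defaultdict from the standard library. The following is only a sketch: the names totals and result are made up for illustration, and the last step assumes the list-of-lists output shape shown in the question is what is wanted.

```python
from collections import defaultdict

# Sample data copied from the question; counts may be str or int.
both_list = [
    ['about70-130 characters long string', '332'],
    ['someotherrandomstring', '2'],
    ['about70-130 characters long string', 32],
    ['someotherrandomstring', '3333'],
]

# "totals" is a hypothetical name; missing keys start at int() == 0.
totals = defaultdict(int)
for label, count in both_list:
    totals[label] += int(count)

# Convert back to the list-of-lists shape requested in the question.
result = [[label, total] for label, total in totals.items()]
print(result)
# [['about70-130 characters long string', 364], ['someotherrandomstring', 3335]]
```

Each element is visited once and each dictionary update takes constant time on average, so the whole pass should scale roughly linearly with the number of inner lists.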

Using itertools.groupby with operator.itemgetter (or a lambda) as the key:

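from itertools import groupby
from operator import itemgetter

# Sample input from the question
lst = [['about70-130 characters long string', '332'], ['someotherrandomstring', '2'],
       ['about70-130 characters long string', 32], ['someotherrandomstring', '3333']]

# groupby only merges adjacent items, so sort by the string first
lst = sorted(lst, key=itemgetter(0))
res = []

for k, g in groupby(lst, key=itemgetter(0)):
    res.append([k, sum(int(i[1]) for i in g)])
print(res)
# [['about70-130 characters long string', 364], ['someotherrandomstring', 3335]]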
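
Note that itertools.groupby only groups consecutive items with equal keys, which is why the list is sorted first. A small sketch of the difference, using made-up data (pairs and the helper sum_groups are hypothetical names, not part of the answer above):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy input: the two 'a' entries are not adjacent.
pairs = [['a', '1'], ['b', '2'], ['a', '3']]

def sum_groups(items):
    """Sum the counts of each run of consecutive equal keys."""
    return [[k, sum(int(v) for _, v in g)]
            for k, g in groupby(items, key=itemgetter(0))]

# Without sorting, 'a' shows up as two separate groups.
print(sum_groups(pairs))
# [['a', 1], ['b', 2], ['a', 3]]

# After sorting by key, equal keys are adjacent and merge correctly.
print(sum_groups(sorted(pairs, key=itemgetter(0))))
# [['a', 4], ['b', 2]]
```

The sort costs O(n log n), so for about 2 million entries a single dictionary pass is likely to be cheaper than sort-then-group, although that is worth measuring on the real data.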


 