
Python finding most common pattern in list of strings

I have a large list of API calls stored as strings, which have been stripped of all common syntax ('http://', '.com', '.', etc.).

I would like to return a dictionary of the most common patterns that are at least 3 characters long, where the keys are the patterns found and the values are the number of occurrences of each pattern. I've tried this:

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']

>>> from itertools import takewhile, izip
>>> ''.join(c[0] for c in takewhile(lambda x: all(x[0] == y for y in x), izip(*calls)))

returns:

'admob'

I would like it to return:

{'obap': 2, 'dmob': 3, 'admo': 3, 'admobap': 2, 'bap': 2, 'dmobap': 2, 'admobapi': 2, 'moba': 2, 'bapi': 2, 'dmo': 3, 'obapi': 2, 'mobapi': 2, 'admob': 3, 'api': 2, 'dmobapi': 2, 'dmoba': 2, 'mobap': 2, 'mob': 3, 'adm': 3, 'admoba': 2, 'oba': 2}

My current method only identifies prefixes, but I need it to operate on all characters, regardless of their position in the string, and again I would like to store the number of occurrences of each pattern as dict values. (I've tried other methods to accomplish this, but they are quite ugly.)

Is this what you wanted? It gives the most common tokens after splitting each string on a dot.

calls = ['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']
from collections import Counter
from functools import reduce  # required on Python 3, also available on Python 2.6+
Counter(reduce(lambda x,y: x+y,map (lambda x : x.split("."),calls))).most_common(2)

O/P: [('admob', 3), ('api', 2)]

To keep only the tokens that occur more than once:

list(filter(lambda x: x[1]>1 ,Counter(reduce(lambda x,y: x+y,map (lambda x : x.split("."),calls))).most_common()))

Update: I don't know if this would work for you (Python 2, where map returns lists):
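The same token counting can be written without reduce. This is only a sketch of the idea above, assuming the dot-separated form of the input; the variable names (tokens, token_counts, repeated) are made up for illustration.

```python
from collections import Counter
from itertools import chain

calls = ['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']

# Flatten the per-call token lists into one stream of tokens.
tokens = chain.from_iterable(call.split('.') for call in calls)
token_counts = Counter(tokens)

# The two most frequent tokens.
top_two = token_counts.most_common(2)

# Every token that occurs more than once, most frequent first.
repeated = [(tok, n) for tok, n in token_counts.most_common() if n > 1]

print(top_two)
print(repeated)
```

chain.from_iterable avoids building the intermediate lists that repeated list concatenation with x+y creates, which may matter for a large list of calls.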

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercamp']
filter(lambda x : x[1]>1 and len(x[0])>2,Counter(reduce(lambda x,y:x + y,reduce(lambda x,y: x+y, map(lambda z :map(lambda x : map(lambda g: z[g:x+1],range(len(z[:x+1]))),range(len(z))),calls)))).most_common())

O/P:

[('admo', 3), ('admob', 3), ('adm', 3), ('mob', 3), ('dmob', 3), ('dmo', 3), ('bapi', 2), ('dmobapi', 2), ('dmoba', 2), ('api', 2), ('obapi', 2), ('admobap', 2), ('admoba', 2), ('mobap', 2), ('dmobap', 2), ('bap', 2), ('mobapi', 2), ('moba', 2), ('obap', 2), ('oba', 2), ('admobapi', 2)]
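For readability, the substring counting above can also be spelled out with plain loops. This is a rough sketch, not the original answer's code: the function name common_substrings and its min_len / min_count parameters are hypothetical, and it uses the calls list from the question.

```python
from collections import Counter

def common_substrings(strings, min_len=3, min_count=2):
    """Count every substring of at least min_len characters and keep
    those that occur at least min_count times in total."""
    counts = Counter()
    for s in strings:
        for start in range(len(s)):
            # end is exclusive, so the shortest slice has min_len characters.
            for end in range(start + min_len, len(s) + 1):
                counts[s[start:end]] += 1
    return {sub: n for sub, n in counts.items() if n >= min_count}

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']
patterns = common_substrings(calls)
print(patterns)
```

Note that this counts every occurrence, so a substring that appears twice inside a single string is counted twice. If each string should contribute at most once, collecting the substrings of one string into a set before updating the counter would do that. Either way the work grows quadratically with the length of each string, so it may be slow on very long inputs.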

Use collections.Counter: join the strings and split on the dot, then build the result with a dict comprehension.

>>> from collections import Counter
>>> calls = ['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']
>>> l = '.'.join(calls).split(".")
>>> d = Counter(l)
>>> {k:v for k,v in d.most_common(2) }
{'admob': 3, 'api': 2}
>>> {k:v for k,v in d.most_common(4) }
{'admob': 3, 'api': 2, 'oauthcert': 1, 'newsession': 1}

Or

>>> import re
>>> from collections import Counter
>>> d = re.findall(r'\w+',"['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']")
>>> {k:v for k,v in Counter(d).most_common(2)}
{'admob': 3, 'api': 2}

Or

>>> from collections import Counter
>>> import re
>>> s = "['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']"
>>> # Change (mob)|(api)|(admob) to the patterns you want
>>> w = [i for sb in re.findall(r'(?=(mob)|(api)|(admob))',s) for i in sb]
>>> {k:v for k,v in Counter(filter(bool, w)).most_common()}
{'admob': 3, 'mob': 3, 'api': 2}
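One caveat with putting all alternatives inside a single lookahead: at any given position only the first alternative that matches is captured, so two listed patterns that start at the same position (say 'adm' and 'admob') would not both be counted. A possible workaround, sketched below, is to count each pattern with its own lookahead. The wanted list is an assumed input, not something from the question.

```python
import re

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']
wanted = ['mob', 'api', 'admob']  # hypothetical list of patterns to look for

# Join with a space so that no match can span two different calls
# (safe as long as the patterns themselves contain no space).
text = ' '.join(calls)

# A zero-width lookahead matches at every position where the pattern
# starts, so overlapping occurrences are all counted.
counts = {p: len(re.findall('(?=' + re.escape(p) + ')', text)) for p in wanted}
print(counts)
```

This still requires knowing the patterns in advance, so it does not discover new ones the way counting all substrings does.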
