
How do I create an n-gram function from this function that I have?

I have the following function, which counts the characters in a string in the order they appear:

def count_char(s):
    result = {}
    for i in range(len(s)):
        result[s[i]] = s.count(s[i])
    return result

For example, we have:

count_char("practice")
{'p' : 1, 'r' : 1, 'a' : 1, 'c' : 2, 't' : 1, 'i' : 1, 'e' : 1}

From this function, how do I create a function that counts the number of times each n-gram occurs in a string? For example,

ngrams("tataki",n=2)
{'ta':2, 'at':1, 'ak':1, 'ki':1}

The function ngrams should be a modification of the function count_char, but I am not sure how to do it at this point...

You can add a length parameter to your function; then just extend your slices from 1 character to that length:

def count_char(s, l = 1):
    result = {}
    for i in range(len(s)-l+1):
        result[s[i:i+l]] = s.count(s[i:i+l])
    return result

print(count_char("practice"))
print(count_char('tataki', 2))

Output:

{'p': 1, 'r': 1, 'a': 1, 'c': 2, 't': 1, 'i': 1, 'e': 1}
{'ta': 2, 'at': 1, 'ak': 1, 'ki': 1}
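To see why the loop runs over range(len(s)-l+1), it can help to list the slices it visits. This is only a minimal sketch; the helper name windows is hypothetical and not part of the answer above:

```python
def windows(s, l=1):
    # Hypothetical helper: every length-l slice of s, left to right.
    # The last valid start index is len(s) - l, hence the "+ 1" in range().
    return [s[i:i+l] for i in range(len(s) - l + 1)]

print(windows("tataki", 2))  # ['ta', 'at', 'ta', 'ak', 'ki']
print(windows("ab", 3))      # [] because the range is empty when l > len(s)
```

Each distinct slice in that list becomes one key of the result dictionary.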

Note that str.count counts only non-overlapping occurrences, so count_char('ttt', 2) returns {'tt': 1}, not {'tt': 2}. To count overlapping occurrences, you need to count them manually. For example:

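def count_char(s, l = 1):
    result = {}
    for i in range(len(s)-l+1):
        sub = s[i:i+l]
        # Count each distinct substring only once
        if sub not in result:
            # Compare sub against every window, so overlapping matches are included
            result[sub] = sum(s[j:j+l] == sub for j in range(len(s)-l+1))
    return result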

With this version, count_char('ttt', 2) returns {'tt': 2}.
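As an alternative, the overlapping count can be done in a single pass with collections.Counter from the standard library. This is a sketch rather than the answer's own method: the name ngrams and the parameter n are taken from the question, and the key order shown assumes Python 3.7+, where dictionaries keep insertion order:

```python
from collections import Counter

def ngrams(s, n=1):
    # Tally every length-n window exactly once, so overlapping
    # occurrences are counted without rescanning the string per key.
    return dict(Counter(s[i:i+n] for i in range(len(s) - n + 1)))

print(ngrams("tataki", n=2))  # {'ta': 2, 'at': 1, 'ak': 1, 'ki': 1}
print(ngrams("ttt", n=2))     # {'tt': 2}
```

This visits each window once, whereas the manual version above rescans the whole string for every distinct substring.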


 