
Count the occurrences of bigrams in string and save them into a dictionary

I code in Python, and I have a string in which I want to count the occurrences of bigrams. For example, given the string "test string", I would like to iterate over it in sub-strings of size 2 and build a dictionary that maps each bigram to the number of times it occurs in the original string.
Thus, I would like to get output of the form {te: 1, es: 1, st: 2, ...}.

Could you help me get this started?
Best regards!

Given

s = "test string"

do

from collections import Counter
Counter(map(''.join, zip(s, s[1:])))

or

from collections import Counter
Counter(s[i:i+2] for i in range(len(s)-1))

The result of either is

Counter({'st': 2, 'te': 1, 'es': 1, 't ': 1, ' s': 1, 'tr': 1, 'ri': 1, 'in': 1, 'ng': 1})
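If you need a plain dict instead of a Counter, or n-grams of another size, the slicing version generalizes easily. This is only a sketch: the helper name ngram_counts and its parameter n are my own and not part of the standard library.

```python
from collections import Counter

def ngram_counts(s, n=2):
    """Count overlapping substrings of length n in s (hypothetical helper)."""
    # The range stops early so that every slice has exactly n characters.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

bigrams = dict(ngram_counts("test string"))  # plain dict with the same counts
trigrams = ngram_counts("test string", 3)    # same idea with n = 3
print(bigrams["st"])
print(trigrams.most_common(1))
```

A Counter is already a dict subclass, so the dict() call is only needed if you want the plain type, for example for printing.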

As a side note, what you're looking for are called bigrams. At a larger scale, there are robust implementations in various machine-learning/NLP toolkits.

As an ad-hoc solution, the problem can be decomposed into:

  1. Iterate over "current and next elements" in sequence
  2. Count unique pairs.

The solution to problem #1 is pairwise from the itertools recipes.

The solution to problem #2 is collections.Counter.

Putting it all together:

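from collections import Counter
from itertools import tee

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)

# Keys are tuples such as ('s', 't'); use map(''.join, ...) for string keys.
Counter(pairwise('test string'))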
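On Python 3.10 and later, pairwise is available directly from itertools, so the recipe does not have to be copied. A minimal sketch that falls back to the recipe on older versions and joins each pair so the keys are strings, as in the question:

```python
from collections import Counter

try:
    from itertools import pairwise  # available in Python 3.10 and later
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Fallback: the recipe from the itertools documentation.
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

s = "test string"
# pairwise yields tuples such as ('t', 'e'); joining them gives 'te'.
counts = Counter(map("".join, pairwise(s)))
print(counts["st"])
```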

I think something like this is simple and easy to do, and there is no need to import any library.

First, we remove all whitespace from the string using split() and join(). Note that this produces the bigram 'ts', which spans the two words and does not occur in the original string.
Then we build a list of all sub-strings of length 2, moving through the string one character at a time.
Finally, we construct and print() a dictionary that has each sub-string as a key and its number of occurrences as the value.

substr = []  # List that will hold all substrings.
step = 2  # Length of each substring (2 for bigrams).
s = ''.join('test string'.split())  # Remove all whitespace from the string.
for i in range(len(s) - step + 1):  # Stop early so every slice has `step` characters.
    substr.append(s[i:i + step])
# Construct and print a dictionary that counts all occurrences of the substrings.
occurrences = {k: substr.count(k) for k in substr}
print(occurrences)

When run, it outputs a dictionary, as you requested:

{'te': 1, 'es': 1, 'st': 2, 'ts': 1, 'tr': 1, 'ri': 1, 'in': 1, 'ng': 1}
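One thing to be aware of: substr.count(k) rescans the whole list for every key, which gets slow for long strings. A single-pass variant, still without imports, could look roughly like this. It is a sketch that keeps the same whitespace removal as above; the variable names are only illustrative.

```python
s = "".join("test string".split())  # same whitespace removal as above
step = 2  # length of each substring

occurrences = {}
for i in range(len(s) - step + 1):
    bigram = s[i:i + step]
    # dict.get returns 0 the first time a bigram is seen.
    occurrences[bigram] = occurrences.get(bigram, 0) + 1

print(occurrences)
```

Drop the split()/join() line and use the string directly if you want bigrams that contain spaces to be counted as well.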
