I am trying to create a program to calculate bigram probabilities. My first step is to work out the combinations of a sentence.
Each of these sentences starts with a <s> and ends with a </s>. So let's say my example sentence was <s> my name is python </s>; my result should be the following (I use the p notation because I will work out the probabilities afterwards):
p(my | <s>)
p(name | my)
p(is | name)
p(python | is)
p(</s> | python)
But instead I get a result like:
Counter({('<', 's'): 1, ('s', '>'): 1, ('>', 'my'): 1, ('my', 'name'): 1, ('name', 'is'): 1, ('is', 'python'): 1, ('python', '<'): 1, ('<', '/s'): 1, ('/s', '>'): 1})
How would I keep <s> and </s> as single tokens, so that they are treated as separate words and not split up?
My code is:
import nltk
from nltk.util import ngrams
from collections import Counter

text = "<s> my name is python </s>"
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)
print(Counter(bigrams))
Edit
Let's say I have a text file:
<s> a a b b c c </s> <s> a c b c </s> <s> b c c a b </s>
I then open this text file and perform the following operation on it and store it in a list.
# raw string, so that \s is passed to the regex engine unchanged
temp = re.split(r"\s+", line.rstrip('\n'))
bigramText.append(temp)
So now in my list I have:
[['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
From this stage I want to perform calculations to get the bigram probabilities. I don't know if my initial question will help get the result, but essentially I am trying to figure out how many times those combinations occur, i.e. how many times one token appears directly next to another.
If you can split on whitespace, you should probably write your own bigram function. (Being able to split on whitespace is usually a strong condition: for example, "I'm" would remain "I'm" and not become "I" and "m".)
def custom_bigrams(l):
    # pair each element with its successor
    return list(zip(l, l[1:]))

print(custom_bigrams(['<s>', 'my', 'name', 'is', 'python', '</s>']))
It prints:
[('<s>', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'python'), ('python', '</s>')]
To use it on your list, you have to calculate bigrams and then use the update method from Counter.
from collections import Counter

your_list = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
c = Counter()
for x in your_list:
    # add the bigrams of each sentence to the running counts
    c.update(custom_bigrams(x))
print(c)
Output
Counter({('b', 'c'): 3, ('<s>', 'a'): 2, ('a', 'b'): 2, ('c', 'c'): 2, ('c', '</s>'): 2, ('a', 'a'): 1, ('b', 'b'): 1, ('a', 'c'): 1, ('c', 'b'): 1, ('<s>', 'b'): 1, ('c', 'a'): 1, ('b', '</s>'): 1})
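Since the end goal in the question is the bigram probabilities, here is a minimal sketch of one way to get from counts to p(word | previous), assuming a plain maximum-likelihood estimate (the count of the pair divided by the number of times the first word starts any pair) with no smoothing. The names sentences, context_counts and bigram_probability are made up for this illustration; the data is the list from the question.

```python
from collections import Counter

# Sample data copied from the list in the question.
sentences = [
    ['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'],
    ['<s>', 'a', 'c', 'b', 'c', '</s>'],
    ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>'],
]

bigram_counts = Counter()   # how often each (previous, word) pair occurs
context_counts = Counter()  # how often each token occurs as the first element of a pair
for sentence in sentences:
    pairs = list(zip(sentence, sentence[1:]))
    bigram_counts.update(pairs)
    context_counts.update(first for first, _ in pairs)

def bigram_probability(word, previous):
    """Maximum-likelihood estimate of p(word | previous); 0.0 if previous was never seen."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]

print(bigram_probability('a', '<s>'))  # 2 of the 3 sentences start with 'a'
print(bigram_probability('c', 'b'))    # 'b' is followed by 'c' in 3 of its 5 occurrences
```

Counting contexts from the pairs themselves (rather than from all tokens) keeps </s> out of the denominators, so the probabilities for each context sum to 1. Unseen pairs get probability 0 here; a real language model would normally add some form of smoothing.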
It is normal for the NLTK tokenizer to make errors when segmenting '<s>' and '</s>'. You should remove them before calling the tokenizer and then add them back after tokenization.
text = "<s> my name is python </s>"
clean_text = text.replace('<s>','').replace('</s>','')
token = ['<s>'] + nltk.word_tokenize(clean_text) + ['</s>']
bigrams = ngrams(token,2)
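If you would rather not strip and re-add the markers, another option is to tokenize with a small regular expression that matches <s> and </s> before anything else. This is only a sketch and not NLTK's tokenizer: the pattern and the helper name tokenize_keep_markers are assumptions made for this example, and the result will differ from word_tokenize on things like contractions.

```python
import re
from collections import Counter

# Try a sentence marker first, then a run of word characters,
# then any single remaining non-space character (punctuation).
TOKEN_PATTERN = re.compile(r"</?s>|\w+|[^\w\s]")

def tokenize_keep_markers(text):
    """Split text into tokens while keeping <s> and </s> intact."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_keep_markers("<s> my name is python </s>")
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(tokens)
print(bigram_counts)
```

Because the marker alternative comes first in the pattern, the "<" is never seen as stand-alone punctuation, so the bigrams start with ('<s>', 'my') and end with ('python', '</s>').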