
Building a predictive model from scratch in Python

I have a bunch of texts that I am analyzing with Python in order to build a predictive model capable of generating human-like text.

For this task I build a dictionary that maps each word appearing in the input to another dictionary, which contains each word that follows it and its number of occurrences, so I can make a weighted choice.

In pseudo code:

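import random

model = {'foo': {'bar': 3, 'barbar': 1, 'baz': 4}}
followers = model['foo']  # counts of the words seen right after 'foo'
# random.choices returns a list, so take its single element
next_word = random.choices(list(followers), weights=list(followers.values()))[0]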
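A minimal runnable sketch of how such a table could be built from raw text and then sampled. The function names `build_model` and `next_word`, the whitespace tokenization, and the sample sentence are assumptions made for illustration; they are not taken from my actual code.

```python
import random
from collections import Counter, defaultdict

def build_model(text):
    """Map each word to a Counter of the words that directly follow it."""
    words = text.split()  # assumption: plain whitespace tokenization
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model

def next_word(model, word, rng=random):
    """Pick a follower of `word` at random, weighted by how often it was seen."""
    counts = model[word]
    # random.choices returns a list of k=1 items, so unwrap it
    return rng.choices(list(counts), weights=list(counts.values()))[0]

# Hypothetical sample input
model = build_model("foo bar foo baz foo bar foo baz foo baz")
print(dict(model["foo"]))
print(next_word(model, "foo"))
```

A word that never appears with a follower (for example the last word of the input) has an empty table, so a real generator would need to stop or fall back before calling `next_word` on it.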

It works pretty well despite how rudimentary the method is, so I tried to improve it by carrying over the predictions from the previous words so that they influence the prediction of the next one:

# model[word][n] holds the counts of the words seen n+1 positions after word
model['foo'] = [{'bar': 3, 'barbar': 1, 'baz': 4}]
remember = [{} for _ in model[word]]  # predictions carried over from the previous words
while word != '///ending///':
    # Each turn every carried-over value gets reduced by half
    semantics = [{w: c / 2 for w, c in slot.items()} for slot in remember]
    # and added to the predictions of the current word
    semantics = add_dict(semantics, model[word])

    word = predict(semantics, word)
    output.append(word)

    # Drop the slot that was just used and open an empty one for the next turn
    remember = semantics[1:] + [{}]
print(output)

#### So if I have the word cat, the next word can be jumps and the one after that can be to:
model['cat'] = [{'jumps': 5}, {'to': 4}]
#### and the words that follow jumps are to and then the:
model['jumps'] = [{'to': 3}, {'the': 6}]
#### then the weights used for the prediction after jumps would be (leaving the halving aside):
semantics = [{'to': 7}, {'the': 6}]
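To make the bookkeeping concrete, here is a small runnable sketch of one way the carry-over step could work. The helper names `step` and `shift` and the `decay` parameter are hypothetical, chosen for this illustration only; this is one reading of the idea, not the exact code I am running.

```python
from collections import Counter

def step(remember, predictions, decay=0.5):
    """Decay the carried-over weights, then add the current word's counts."""
    merged = []
    for carried, fresh in zip(remember, predictions):
        slot = Counter({w: c * decay for w, c in carried.items()})
        slot.update(fresh)  # Counter.update adds counts instead of replacing them
        merged.append(slot)
    return merged

def shift(semantics):
    """Drop the slot that was just used and open an empty one at the far end."""
    return semantics[1:] + [Counter()]

# The cat / jumps example: slot 0 is the next word, slot 1 the word after it
cat = [{'jumps': 5}, {'to': 4}]
jumps = [{'to': 3}, {'the': 6}]

remember = shift(step([{}, {}], cat))           # what is left after choosing 'jumps'
semantics = step(remember, jumps)               # carried-over weights halved
no_decay = step(remember, jumps, decay=1.0)     # carried-over weights kept in full
print([dict(slot) for slot in semantics])
print([dict(slot) for slot in no_decay])
```

With the halving applied, the weight for to is 4 / 2 + 3 = 5; the value 7 only comes out when the carried-over weight is added in full.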

But surprisingly this does not work as well as taking only the next word into account. In the last example the expected output would be

"cat jumps to the"

but it often produces

"cat jumps to at"

This didn't happen so often with the earlier, more rudimentary implementation. So is there something wrong with my new approach, or could it just be a bug in my code?

I mean, is taking more than the next word into account for prediction a bad approach?

Mostly solved: the problem was that I was including ALL the predictions carried over from the penultimate word, and that added noise. The solution was to count only the predictions that the penultimate word has in common with the last word, and then add all the remaining predictions of the last word.

next[1]['black']
{'jumps':3,'writes':2}
word
'cat'
next[0]['cat']
{'jumps':2, 'scratch':1}
add(next[1]['black'],next[0]['cat'])
{'jumps':5, 'scratch':1}
result
'black cat jumps'

Instead of:

add(next[1]['black'],next[0]['cat'])
{'jumps':5, 'scratch':1, 'writes':2}
result
'black cat writes' ###Which makes less sense, and could make no sense at all
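The filtered merge can be sketched in a few lines. The function name `add_common` is a hypothetical stand-in for the `add` used in the transcript above, and the two dictionaries are the ones from that example.

```python
def add_common(previous, last):
    """Add carried-over counts only for words the last word also predicts."""
    merged = dict(last)  # every prediction of the last word is kept
    for word, count in previous.items():
        if word in merged:  # skip words that only the earlier word predicts
            merged[word] += count
    return merged

black_next = {'jumps': 3, 'writes': 2}  # next[1]['black']
cat_next = {'jumps': 2, 'scratch': 1}   # next[0]['cat']
print(add_common(black_next, cat_next))
```

Because 'writes' never follows 'cat' directly, it is dropped instead of leaking into the candidates, while 'jumps' is reinforced by both words.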
