
What are suggested solutions for text prediction using python in Google App Engine?

I am working on a website using Google App Engine and Python. I want to add a feature where a user types in a word and the system suggests the closest matching words/sentences (based on usage). I have already implemented an algorithm based on Peter Norvig's spell-checking approach, but I don't think it will be a very scalable solution in the long run. What are the suggested ways of implementing such a feature on Google App Engine? Is the Prediction API the way to go, or is writing my own algorithm the better option? If writing my own is the way, can anyone give me some pointers on how to make the solution robust?

Code Snippet:

import re, collections
from bp_includes.models import User, SocialUser
from bp_includes.lib.basehandler import BaseHandler
from google.appengine.ext import ndb
import utils.ndb_json as ndb_json

class TextPredictionHandler(BaseHandler):
  alphabet_list = 'abcdefghijklmnopqrstuvwxyz'  # lowercase letters used to build candidate edits

  # Builds a frequency distribution over the corpus; unseen words default
  # to a count of 1 (simple smoothing, as in Norvig's spell checker).
  def trainForText(self, features):
    search_dict = collections.defaultdict(lambda: 1)
    for f in features:
      search_dict[f] += 1
    return search_dict

  # Heart of the code. Generates every string that is one edit
  # (delete, transpose, replace, or insert) away from the given word.
  def edit_dist_one(self, word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in self.alphabet_list if b]
    inserts    = [a + c + b for a, b in splits for c in self.alphabet_list]
    return set(deletes + transposes + replaces + inserts)

  # Returns the candidate words that actually occur in the corpus.
  def existing_words(self, words, trainSet):
    return set(w for w in words if w in trainSet)

  # Returns corpus words that contain the given word as a substring.
  def partial_words(self, word, trainSet):
    regex = re.compile(".*(" + re.escape(word) + ").*")  # escape user input
    return set(m.group(0) for l in trainSet for m in [regex.search(l)] if m)

  def found_words(self, word):
    word = word.lower()
    data = []
    q = models.SampleModel.query()  # masked: replace with the model you are actually querying
    # Really bad way of making a corpus: everything is rebuilt on every
    # request. Needs to be modified to be scalable; the corpus could be
    # precomputed and stored in Google Cloud Storage to reduce processing time.
    for upost in q.fetch():
      if upost.text != "":
        for t in re.sub(r"[^\w]", " ", upost.text).split():
          data.append(t.lower())
      if upost.definition != "":
        for t in re.sub(r"[^\w]", " ", upost.definition).split():
          data.append(t.lower())
      if upost.TextPhrases:
        for e in upost.TextPhrases:
          for p in e.get().phrases:
            data.append(p.lower())
      if upost.Tags:
        for h in upost.Tags:
          tag = h.get().text.replace("#", "")
          if tag != "":
            data.append(tag.lower())
    trainSet = self.trainForText(data)
    set_of_words = self.existing_words([word], trainSet)
    set_of_words |= self.existing_words(self.edit_dist_one(word), trainSet)
    set_of_words |= self.partial_words(word, trainSet)
    set_of_words.add(word)
    return set_of_words

  def get(self, search_text):
    outputData = self.found_words(search_text)
    data = {"texts": [{"text": dat} for dat in outputData]}
    self.response.out.write(ndb_json.dumps(data))
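For reference, the candidate-generation step can be exercised on its own, outside App Engine. Below is a minimal standalone sketch of the same edit-distance-1 logic; the `suggest` ranking helper is hypothetical (not part of the handler above) and simply keeps candidates seen in the corpus, most frequent first:

```python
import collections

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def edit_dist_one(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, counts, limit=5):
    # Hypothetical ranking helper: keep only candidates that occur in the
    # corpus, ordered by frequency (highest first).
    candidates = {word} | edit_dist_one(word)
    known = [w for w in candidates if w in counts]
    return sorted(known, key=lambda w: counts[w], reverse=True)[:limit]

counts = collections.Counter('the cat sat on the mat the cat'.split())
print(suggest('teh', counts))  # -> ['the']
```

Note that for a word of length n this generates on the order of 54n + 25 candidates, which is why Norvig's approach filters them against a known-word set immediately.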

Using the Prediction API is more reliable and scalable than rolling your own; there is no need to reinvent the wheel.
Coding it yourself would likely be a long, complex process with lots of bumps in the road, so unless you have an avid interest in learning and building that system, I'd suggest you use the existing tools.
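If you do go the roll-your-own route, the usual scalable pattern on App Engine is to precompute word counts offline and answer each request with a prefix range lookup instead of fetching the whole corpus. Here is a sketch of the range trick, simulated on a sorted in-memory list; on the datastore the same half-open range maps to two ndb inequality filters (`word >= prefix` and `word < prefix + u'\ufffd'`), which can be served straight from the index:

```python
import bisect

def prefix_matches(sorted_words, prefix, limit=10):
    """Up to `limit` corpus words starting with `prefix`.

    Uses the half-open range [prefix, prefix + U+FFFD): every string that
    starts with `prefix` sorts inside this range, so two binary searches
    (or two datastore inequality filters) find all matches.
    """
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + u'\ufffd')
    return sorted_words[lo:hi][:limit]

words = sorted(['cat', 'car', 'cart', 'dog', 'door'])
print(prefix_matches(words, 'ca'))  # -> ['car', 'cart', 'cat']
```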
Here's an example from Google themselves.
Here's the documentation for the Prediction API.
The Hello World program with the Prediction API.
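For orientation, a predict call in those samples sends a small JSON body; the sketch below builds it. The `service.trainedmodels().predict(...)` call shown in the comment assumes google-api-python-client and a model you have already trained on your corpus, so verify the exact call against the Prediction API docs:

```python
def prediction_request_body(query):
    # Prediction API predict requests wrap each input instance
    # in a csvInstance list.
    return {'input': {'csvInstance': [query]}}

# With google-api-python-client, the body above would be sent roughly as:
#   service = build('prediction', 'v1.6', http=authorized_http)
#   service.trainedmodels().predict(project=PROJECT_ID, id=MODEL_ID,
#                                   body=prediction_request_body('hel')).execute()
print(prediction_request_body('hel'))  # -> {'input': {'csvInstance': ['hel']}}
```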
