Replace words on the basis of bigram frequency, Python

I have a Series object to which I have to apply a function that uses bigrams to correct a word when it occurs together with another one. I created a list of bigrams, sorted it by frequency (highest first), and called it fdist.

import nltk

bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams)  # computes frequency of occurrence
# most_common() is sorted by frequency, highest first; in NLTK 3, keys() is not
fdist = [b for b, _ in freq.most_common()]

Next, I created a function that accepts each line (a sentence, i.e. one element of the list) and uses the bigrams to decide whether to correct it or not.

def bigram_corr(line): #function with input line(sentence)
    words = line.split() #split line into words
    for word1, word2 in zip(words[:-1], words[1:]): #generate 2 words at a time words 1,2 followed by 2,3 3,4 and so on
        for i,j in fdist: #iterate over bigrams
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 2 or 1, replace word with highest occurring bigram
                word1=i #replace
                return word1 #return word

The problem is that only a single word is returned for an entire sentence, e.g. "Lts go twards the east" is replaced by "lets". It looks like the further iterations aren't working.
The for loop over word1, word2 works this way: "Lts go" in the 1st iteration, where "Lts" will eventually be replaced by "lets", because "lets" occurs more frequently with "go".

"go twards" in the 2nd iteration.

"twards the" in the 3rd iteration, and so on.

There is a minor error which I can't figure out, please help.

Sounds like you're doing word1 = i with the expectation that this will modify the contents of words. But this won't happen: the assignment only rebinds the local name word1. If you want to modify words, you'll have to do so directly. Use enumerate to keep track of word1's index.
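To illustrate the point about rebinding, here is a minimal sketch. The sample list and the use of lower() are only placeholders for the real correction step.

```python
words = ["Lts", "go", "twards"]

# Rebinding the loop variable only changes the local name, not the list.
for word in words:
    word = word.lower()
unchanged = list(words)

# Assigning through the index changes the list itself.
for idx, word in enumerate(words):
    words[idx] = word.lower()
changed = list(words)
```

After the first loop the list still starts with "Lts"; only the second loop, which writes to words[idx], alters it.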

As 2rs2ts pointed out, you're also returning early. If you want the inner loop to terminate once you find the first good replacement, break instead of returning. Then return at the end of the function.

def bigram_corr(line): #function with input line(sentence)
    words = line.split() #split line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i,j in fdist: #iterate over bigrams
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 2 or 1, replace word with highest occurring bigram
                words[idx] = i
                break
    return " ".join(words)

The return statement halts the function entirely. I think what you want is:

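def bigram_corr(line):
    words = line.split()
    words_to_return = []
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words_to_return.append(i)  # keep the most frequent match
                break
        else:
            words_to_return.append(word1)  # no match: keep the original word
    words_to_return.extend(words[-1:])  # the last word is never word1
    return ' '.join(words_to_return)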

This puts each of the words which you have processed into a list, then rejoins them with spaces and returns that entire string, since you said something about returning "the entire sentence."

I am not sure if the semantics of your code are correct, since I don't have the jf library or whatever it is that you're using and therefore I can't test this code, so this may or may not solve your problem entirely. But this will help.
