Fixing words using a dictionary lookup in Python?

Question

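I have extracted a list of sentences from a document. I am preprocessing this list of sentences to make it more sensible, and I have run into the following problem.

I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "

I would like to correct such sentences using a lookup dictionary, removing the unwanted spaces.

The final output should be "more recently the development, which is a potent "

I would assume that this is a straightforward task in text preprocessing? I need some pointers to look for such approaches. Thanks.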

Answer 1

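Take a look at word segmentation (also called text segmentation). The problem is to find the most probable split of a string into a sequence of words. Example: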

 thequickbrownfoxjumpsoverthelazydog

The most probable segmentation should of course be:

 the quick brown fox jumps over the lazy dog

Here is an article that includes prototype source code for the problem, using the Google Ngram corpus:

http://jeremykun.com/2012/01/15/word-segmentation/

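The key to making this algorithm work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here: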

https://gist.github.com/miku/7279824

Example usage:

$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']

Using data, even this can be rearranged:

$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']

Note that the algorithm is quite slow; it is a prototype.

Another approach, using NLTK:

http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/

As for your problem, you could concatenate all the string fragments you have into a single string and run a segmentation algorithm on it.
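The idea can be sketched with a small dynamic-programming segmenter. This is only an illustration under stated assumptions, not the code from the linked article or gist: the `COUNTS` table is invented toy data, the names `word_logprob`, `segment` and `repair` are hypothetical, and punctuation is simply dropped before segmenting.

```python
import math
import re
from functools import lru_cache

# Toy unigram counts, made up for illustration only. A real system would
# load counts from a large corpus (the linked article uses Google Ngrams).
COUNTS = {
    "more": 50, "recently": 10, "recent": 12, "the": 500, "he": 80,
    "develop": 6, "development": 8, "which": 60, "is": 300, "a": 400,
    "potent": 2, "pot": 3, "ten": 20,
}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Unknown strings are penalised more heavily the longer they are.
    return -math.log(TOTAL) - len(word) * math.log(10)


def segment(text, max_word_len=20):
    """Split text into the word sequence with the highest total log-probability."""

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return 0.0, ()
        options = []
        for end in range(start + 1, min(len(text), start + max_word_len) + 1):
            rest_score, rest_words = best(end)
            word = text[start:end]
            options.append((word_logprob(word) + rest_score, (word,) + rest_words))
        return max(options)

    return list(best(0)[1])


def repair(broken):
    """Drop everything except letters, then re-segment the joined string."""
    joined = re.sub(r"[^a-z]", "", broken.lower())
    return " ".join(segment(joined))


print(repair("more recen t ly the develop ment, wh ich is a po ten t "))
```

With this toy table the sample sentence comes back as "more recently the development which is a potent". The recursion goes one level deeper per character, so very long inputs would need an iterative version or a sentence-by-sentence split.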

Answer 2

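Your goal is to improve the text, not necessarily to make it perfect, so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: start with the first fragment and stick pieces onto it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you will make a mistake with cases like "the me thod", so if you will be using this a lot, you could look for something more sophisticated. However, it is probably good enough.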

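Mainly, what you need is a large dictionary. If you will be using it a lot, I would encode it as a "prefix tree" (also known as a trie), so that you can quickly find out whether a fragment is the start of a real word. nltk provides a Trie implementation.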
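To make the prefix-tree idea concrete, here is a minimal sketch built on nested dicts. It is hand-rolled for illustration and does not use or mirror NLTK's Trie; the class `Trie`, its methods, and `greedy_join` are hypothetical names, the word list is a tiny stand-in, and punctuation is not handled.

```python
class Trie:
    """Minimal prefix tree built from nested dicts (illustrative only)."""

    _END = None  # marker key: a word ends at this node

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def _walk(self, text):
        """Follow text character by character; return the node or None."""
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def has_prefix(self, text):
        """True if at least one stored word starts with text."""
        return self._walk(text) is not None

    def is_word(self, text):
        node = self._walk(text)
        return node is not None and self._END in node


def greedy_join(fragments, trie):
    """Glue fragments while the result can still grow into a dictionary word."""
    words, current = [], ""
    for frag in fragments:
        if current and not trie.has_prefix(current + frag):
            words.append(current)  # dead end: emit what we have and restart
            current = frag
        else:
            current += frag
    if current:
        words.append(current)
    return words


# Stand-in dictionary; a real one would hold a full word list.
demo_trie = Trie(["more", "recently", "the", "development",
                  "which", "is", "a", "potent"])
print(greedy_join("more recen t ly the develop ment wh ich is a po ten t".split(),
                  demo_trie))
```

The same function also reproduces the failure mode mentioned above: with a dictionary containing "the", "theme" and "method", the fragments "the me thod" come out as "theme" and "thod".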

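Since this kind of spurious word break is inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it is broken up.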
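One possible way to collect such document-specific words is sketched below. The function name `extend_vocabulary` and the `min_length` cut-off are assumptions introduced here; the cut-off is only a crude guard against short fragments, and longer fragments can still slip in.

```python
import re


def extend_vocabulary(base_words, sentences, min_length=4):
    """Return base_words plus longer unbroken tokens seen in the document.

    min_length is a heuristic (an assumption, not part of the answer): it
    keeps very short tokens, which are often fragments, out of the result.
    """
    vocab = {word.lower() for word in base_words}
    for sentence in sentences:
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if len(token) >= min_length:
                vocab.add(token)
    return vocab


vocab = extend_vocabulary({"the"}, ["The tokenizer is fast",
                                    "the token izer is slow"])
print("tokenizer" in vocab)
```

Here "tokenizer" is learned from the first sentence, so the broken "token izer" in the second sentence can later be glued back together even if the base dictionary never contained the word.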

Answer 3

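- Solution 1:

Let's think of the chunks in your sentence as beads on an abacus, with each bead consisting of a partial string. The beads can be moved left or right to generate the permutations, and the position of each fragment is fixed between its two adjacent fragments. In the current case, the beads would be:

(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)

This solves two subproblems:

a) A bead is a single unit, so we do not care about permutations within the bead; for example, permutations of "more" are not possible.

b) The order of the beads is constant; only the spacing between them changes. For example, "more" will always come before "recen", and so on.

Now, generate all the permutations of these beads, which will give output like: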

morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent

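Then score these permutations by the number of words from your relevant dictionary that they contain; the most correct results can then be filtered out easily. "more recently the development, which is a potent" will score higher than "morerecentlythedevelop ment, wh ich is a po ten t".

Code that does the bead permutation part: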

import re

def gen_abacus_perms(frags):
    """Return every way to join adjacent fragments with or without a space."""
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]

    # The first two beads are either glued together or separated by a space.
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        return [prefix_1, prefix_2]

    # Combine both prefixes with every arrangement of the remaining beads,
    # again with and without a space at the junction.
    rem_perms = gen_abacus_perms(frags[2:])
    res = []
    for prefix in (prefix_1, prefix_2):
        res += ["{0}{1}".format(prefix, x) for x in rem_perms]
        res += ["{0} {1}".format(prefix, x) for x in rem_perms]
    return res


broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken.strip())
perms = gen_abacus_perms(frags)
print("\n".join(perms))

Demo: http://ideone.com/pt4PSt
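The scoring step described above is not shown in the answer; a minimal sketch of it could look like the following. The stand-in dictionary `WORDS` and the names `spacings`, `score` and `best_spacing` are assumptions made for this example, and the candidates are regenerated with `itertools.product` so the block runs on its own.

```python
import itertools

# Stand-in dictionary; in practice this would be a full word list.
WORDS = {"more", "recently", "the", "development", "which", "is", "a", "potent"}


def spacings(frags):
    """Yield every string made by joining adjacent fragments with '' or ' '."""
    if not frags:
        return
    for seps in itertools.product(("", " "), repeat=len(frags) - 1):
        parts = [frags[0]]
        for sep, frag in zip(seps, frags[1:]):
            parts.append(sep + frag)
        yield "".join(parts)


def score(candidate, words=WORDS):
    """Count tokens that are dictionary words, ignoring edge punctuation."""
    return sum(token.strip(",.!?;:").lower() in words
               for token in candidate.split())


def best_spacing(broken, words=WORDS):
    """Return the highest-scoring candidate, or '' for empty input."""
    return max(spacings(broken.split()),
               key=lambda cand: score(cand, words), default="")


print(best_spacing("more recen t ly the develop ment, wh ich is a po ten t"))
```

The number of candidates doubles with every extra fragment (8192 for the 14 fragments here), so this brute-force scoring is only practical for short sentences.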

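- Solution 2:

I suggest an alternative approach that makes use of the text-analysis intelligence already developed by people working on similar problems, who have worked with big databases that depend on dictionaries and grammar, e.g. search engines.

I am not well aware of public or paid APIs of this kind, so my example is based on Google results.

Let's try to use Google:

You can keep submitting your invalid terms to Google over multiple passes, and keep scoring the results against your lookup dictionary. Here are two relevant outputs from using two passes over your text:

[image not preserved: output of the first pass]

This output is used for a second pass:

[image not preserved: output of the second pass]

This gives you the conversion "more recently the development, which is a potent".

To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid or not-so-good results.

One crude technique could be a comparison of normalized strings using difflib.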

>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0

Answer 4

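I would recommend stripping away the spaces and looking for dictionary words to break the text into. There are a few things you can do to make this more accurate. To get the first word in text with no spaces, take the entire string and look it up against the dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), trying the longest candidates first and then taking letters off the end of the string you want to segment. If you want it to work on big strings, you can make it take letters off the back automatically, so that the string you search for the first word in is only as long as the longest dictionary word. This should result in finding the longest words, and makes it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that reads the text to correct from user input and uses a dictionary file called dictionary.txt: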

MAX_WORD_LEN = 45                     # longest word length to try

# Load the dictionary file (one word per line) into a set for fast lookups.
with open("dictionary.txt") as f:
    dictionary = {line.strip().lower() for line in f if line.strip()}

text = input("enter text to correct spaces on: ")
text = "".join(text.split())          # remove ALL whitespace, not just the ends
spaced = []                           # the list of newly separated words
while text:                           # stop once all of the text has been consumed
    # Try the longest candidate first, dropping one letter from the back each time.
    for length in range(min(MAX_WORD_LEN, len(text)), 0, -1):
        candidate = text[:length]
        if candidate.lower() in dictionary:
            break
    else:
        candidate = text[0]           # nothing matched: emit one character so the loop advances
    spaced.append(candidate)
    text = text[len(candidate):]      # remove the matched word from the front of the text
print(" ".join(spaced))               # prints the output

If you want it to be more accurate, you could try using a natural language parsing program; several are freely available online for Python.

Answer 5

Here is something really basic:

# my_str is the broken input sentence
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print(joined, end=' ')
        del chunks[:]

# deal with leftovers
if chunks:
    print(''.join(chunks))

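I assume you have a set of valid words available that can be used to implement is_word. You also have to make sure it deals with punctuation. Here is one way to do it: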

def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words

Answer 6

You can iterate through a dictionary of words to find the best fit, and join fragments together when no match is found.

def repair(sentence, dictionary):
    """Join adjacent fragments until the result is a dictionary word."""
    finished_sentence = []
    pending = ""                               # fragments glued together so far
    for fragment in sentence.split():
        pending += fragment
        if pending.strip(",.!?;:").lower() in dictionary:
            finished_sentence.append(pending)  # match found: emit the word
            pending = ""
    if pending:                                # leftovers that never matched
        finished_sentence.append(pending)
    return " ".join(finished_sentence)

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
# Stand-in word set so the example runs; use a full word list in practice.
dictionary = {"more", "recently", "the", "development", "which", "is", "a", "potent"}
print(repair(sentence, dictionary))

This should work. For the variable dictionary, download a txt file of every English word, then open it in your program.
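Opening such a word list could look like the sketch below. It assumes one word per line, and the function name `load_dictionary` and the file name `words.txt` are placeholders rather than anything specified in the answer.

```python
def load_dictionary(path):
    """Read a one-word-per-line text file into a set of lowercase words."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Example call; "words.txt" is a hypothetical file name.
# dictionary = load_dictionary("words.txt")
```

A set keeps each membership check at roughly constant time, which matters when every candidate join is looked up.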

Answer 7

My index.py file looks like this:

from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))

My index.php file looks like this:

<html>

<head>
  <title>py script</title>
</head>

<body>
  <h1>Hey There!Python Working Successfully In A PHP Page.</h1>
  <?php
    $python = `python index.py`;
    echo $python;
    ?>
</body>

</html>

Hope this works.

7 answers

Answer 1: score 6, accepted, 2013-10-30 06:20:54
Answer 2: score 4, 2013-11-01 23:28:45
Answer 3: score 3, 2013-10-30 06:41:21
Answer 4: score 3, 2013-11-20 02:11:10
Answer 5: score 2, 2013-10-30 06:37:39
Answer 6: score 2, 2013-11-20 18:23:55
Answer 7: score 0, 2019-03-20 08:29:21
