Remove whitespace from words and generate exact words

Question

I am using Python and am looking for a way to rearrange spaced-out text into meaningful, readable words. A sample input is

H o w  d o  s m a l l  h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d  p r o d u c t i o n

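Output

How do smallholder farmers fit into the big picture of world food production

This is one way to remove the whitespace once: where a line has two spaces, it keeps one.

Can anyone suggest other approaches?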

Edit

Look at this line of text

Inn ovative  b usines s  m odels  and  financi ng  m e chanisms  for  pv  de ploym ent  in  em ergi ng  regio ns

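This is my problem: I can't simply remove the whitespace. One idea is to match each group of characters against a dictionary and find the right word. Maybe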

Answer 1

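import re

# Letters are separated by one space; words are separated by more than one.
a = 'H o w   d o   sm a l l h o l d e r   f a r m e r s  f i t   i n t o   t h e   b i g   p i c t u r e   o f   w o r l d   f o o d p r o d u c t i o n'

# Drop the single space that follows each character; at a word break one space survives.
s = re.sub(r'(.) ', r'\1', a)

print(s)
# Output: How do smallholder farmers fit into the big picture of world foodproduction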
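
The regex works because words are separated by more spaces than letters are. A minimal sketch of an equivalent two-step approach, assuming words are always separated by two or more spaces and letters by exactly one (the function name collapse_spaced_letters is made up for illustration):

```python
import re

def collapse_spaced_letters(text):
    # Split into words on runs of two or more spaces (the assumed word separator).
    spaced_words = re.split(r" {2,}", text.strip())
    # Inside each word, drop the single spaces between letters.
    return " ".join(word.replace(" ", "") for word in spaced_words)

print(collapse_spaced_letters("H o w  d o  s m a l l"))
# How do small
```

Splitting first makes the word boundary explicit, which may be easier to adapt if the separator turns out to be something other than a double space.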

Answer 2

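You can take the characters two at a time, then either strip the space from the chunk or, when the chunk is all whitespace, put a single space in its place. Here string holds the spaced-out input.

>>> ''.join([string[i:i+2].strip() or ' ' for i in range(0, len(string), 2)])
'How do smallholder farmers fit into the big picture of world foodproduction'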
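
To see why this works, it helps to look at the chunks themselves. A small sketch, assuming every letter is followed by one space and words are separated by three spaces, so that each two-character chunk is either a letter plus a space or two spaces (sample and chunks are illustrative names):

```python
sample = "H o w   d o"

# Cut the input into consecutive two-character chunks.
chunks = [sample[i:i + 2] for i in range(0, len(sample), 2)]
print(chunks)
# ['H ', 'o ', 'w ', '  ', 'd ', 'o']

# A chunk that strips to nothing was a word break, so it becomes one space.
result = "".join(chunk.strip() or " " for chunk in chunks)
print(result)
# How do
```

The approach depends on that alignment: with only two spaces between words, a chunk such as ' d' would be stripped to 'd' and the word break would be lost.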

Answer 3

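Edit_2: The question has changed and is now trickier. I am leaving this answer, which solves the previous question, but that is no longer the actual problem.

Current problem

Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns

I suggest using a dictionary of real words. Here is an SO thread.

Then take your sentence (here, Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns) and split it on spaces (it seems this is the only separator the fragments have in common).

Here is a pseudocode solution: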

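iterating through the string list:
    keeping the currWord index
    while realWord not found:
        checking currWord in dictionary.
        if realWord is not found:
            join the nextWord to the currWord
        else:
            join currWord to the final sentence

By doing this and keeping track of the currWord index, you can log where you run into problems and add new rules for word splitting. If a certain threshold is reached (e.g. a word 30 characters long?), you probably know you have a problem.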
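
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration: join_fragments, toy_vocab and the max_len threshold are hypothetical names, and the tiny hand-written vocabulary stands in for a real dictionary.

```python
def join_fragments(fragments, vocabulary, max_len=30):
    """Glue neighbouring fragments together until they form a known word."""
    sentence = []
    problems = []  # indexes of fragments that never produced a known word
    i = 0
    while i < len(fragments):
        candidate = ""
        j = i
        found = False
        # Keep joining the next fragment until a word is found or the threshold is hit.
        while j < len(fragments) and len(candidate) + len(fragments[j]) <= max_len:
            candidate += fragments[j]
            j += 1
            if candidate.lower() in vocabulary:
                found = True
                break
        if found:
            sentence.append(candidate)
            i = j
        else:
            # Log the position and keep the fragment unchanged.
            problems.append(i)
            sentence.append(fragments[i])
            i += 1
    return " ".join(sentence), problems

toy_vocab = {"innovative", "business", "models", "and", "financing", "mechanisms",
             "for", "pv", "deployment", "in", "emerging", "regions"}
text = ("Inn ovative b usines s m odels and financi ng m e chanisms "
        "for pv de ploym ent in em ergi ng regio ns")
print(join_fragments(text.split(), toy_vocab))
# ('Innovative business models and financing mechanisms for pv deployment in emerging regions', [])
```

Like the pseudocode, this accepts the first match it finds. With a full dictionary, short real words such as "inn" or "me" would match on their own, which is exactly the kind of case where the logged positions and extra rules would be needed.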

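Previous question

Edit: you're right @Adelin, I should have posted this as a comment.

In case it helps, here is a simpler program for understanding what is going on, and/or for when you would rather not use a regex for the simple, uniform case: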

def raw_char_to_sentence(seq):
    """ Splits the "seq" parameter using 'space'. As words are separated with two spaces,
        the split yields an empty string at each word boundary, which lets
        "raw_char_to_sentence" rebuild the list of characters into a full string
        sentence.
    """
    char_list = seq.split(' ')

    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # If word is over, add it to sentence, and reset the current word.
            sentence += (word + ' ')
            word = ''

    # The last word is not followed by a double space, so add it explicitly.
    sentence += word
    # Strip the trailing space left over if the input ended with a word break.
    return sentence.rstrip()

temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs : How do smallholder farmersfit into the big picture of world foodproduction

Answer 4

First get a list of words (a.k.a. a vocabulary), e.g. nltk.corpus.words:

>>> from nltk.corpus import words
>>> vocab = words.words()

Or

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())

Convert the input into a string with no spaces

>>> text = "H o w d o sm a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'

Assumptions:

1. The longer a word is, the more it looks like a word
2. A word that is not in the vocabulary is not a word

Code:

from collections import Counter

from nltk.corpus import brown

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
# Second test input; uncomment to try it instead.
# text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

vocab_freq = Counter(brown.words())

max_word_len = 10

words = []
# i is a pointer that moves forward through s.
i = 0
while i < len(s):
    # Try the longest candidate first (assumption 1).
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No vocabulary word starts at i: keep the character and move on,
        # otherwise the while loop would never terminate.
        words.append(s[i])
        i += 1

print(' '.join(words))

[out]:

how do small holder farmers fit into the big picture of world food production
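
The same greedy longest-match loop can be tried without downloading a corpus. The sketch below wraps it in a function and runs it on a hand-written vocabulary; greedy_segment and toy_vocab are illustrative names, and the toy set only stands in for a real word list such as the Brown counts.

```python
def greedy_segment(s, vocab, max_word_len=10):
    """Split s by repeatedly taking the longest vocabulary word at the current position."""
    words = []
    i = 0
    while i < len(s):
        # Longest candidate first, never reaching past the end of s.
        for j in range(min(max_word_len, len(s) - i), 0, -1):
            if s[i:i + j] in vocab:
                words.append(s[i:i + j])
                i += j
                break
        else:
            # Nothing in the vocabulary starts here: keep one character and continue.
            words.append(s[i])
            i += 1
    return words

toy_vocab = {"how", "do", "small", "holder", "farmers", "fit", "into", "in", "to", "the"}
print(" ".join(greedy_segment("howdosmallholderfarmersfitintothe", toy_vocab)))
# how do small holder farmers fit into the
```

Note that "into" wins over "in" followed by "to" only because longer candidates are tried first, which is assumption 1 in action.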

Assumption 2 depends heavily on the corpus/vocabulary you have, so you can combine more corpora to get better results:

from collections import Counter

from nltk.corpus import brown, gutenberg, inaugural, treebank

# Merge word counts from several corpora into one larger vocabulary.
vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())

text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

max_word_len = 10

words = []
# i is a pointer that moves forward through s.
i = 0
while i < len(s):
    # Try the longest candidate first (assumption 1).
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No vocabulary word starts at i: keep the character and move on,
        # otherwise the while loop would never terminate.
        words.append(s[i])
        i += 1

print(' '.join(words))

[out]:

innovative business models and financing mechanisms for p v deployment in emerging regions
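
The greedy loop commits to the longest match at each position and cannot back up if that choice leaves an unmatchable remainder. One alternative, which is not what this answer uses, is dynamic programming over all split points, scoring each split by the summed log frequency of its words. A minimal sketch with a made-up toy frequency table (best_segmentation and toy_freq are hypothetical names):

```python
import math

def best_segmentation(s, vocab_freq, max_word_len=20):
    """Return the highest-scoring split of s into vocabulary words, or None if there is none."""
    total = sum(vocab_freq.values())
    # best[i] holds (score, words) for the best split of s[:i], or None if unreachable.
    best = [None] * (len(s) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(s) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = s[start:end]
            if best[start] is None or vocab_freq.get(word, 0) <= 0:
                continue
            score = best[start][0] + math.log(vocab_freq[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(s)][1] if best[len(s)] else None

toy_freq = {"the": 50, "them": 5, "mend": 2}
print(best_segmentation("themend", toy_freq))
# ['the', 'mend']
```

On this input a longest-match pass would take "them" first and then find no word for "end", whereas the dynamic-programming version considers "the" + "mend" as well. Since vocab_freq only needs .get and .values, a Counter built from corpus words could be passed in the same way.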

4 answers

Answer 1
7 2018-01-03 07:17:29

Answer 2
1 2018-01-03 07:24:47

Answer 3
0 2018-01-03 07:26:53

Answer 4
0 accepted 2018-01-03 08:18:35
