Remove spaces from words and generate exact words
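I am using Python and am looking for a way to join these characters back into meaningful words so that the text is readable. An example input is
H o w  d o  s m a l l h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n
Output
How do smallholder farmers fit into the big picture of world food production
One approach is to remove one level of whitespace: wherever the line has two spaces, keep a single space.
Can anyone suggest other approaches?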
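Edit
See this line of text
Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns
This is my actual problem, so I cannot simply remove the spaces. One idea is to match each group of characters against a dictionary and find the right words, maybe.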
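import re
# Letters are separated by one space, words by two.
a = 'H o w  d o  sm a l l h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n'
# Drop the space after each character; at a word gap only the first of the two spaces is consumed, so one remains.
s = re.sub(r'(.) ', r'\1', a)
print(s)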
How do smallholder farmers fit into the big picture of world foodproduction
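You can read the string two characters at a time, then either strip the padding space or, for a chunk that is entirely whitespace, emit a single space. This assumes every original character, including each real space, is followed by one padding space. With the input in string:
>>> ''.join([string[i:i+2].strip() or ' ' for i in range(0, len(string), 2)])
'How do smallholder farmers fit into the big picture of world foodproduction'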
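To see why the two-character slicing works, and when it does not, here is a small sketch. It assumes the input was produced by putting one padding space after every character, which ' '.join(...) reproduces; collapse_pairs is a hypothetical helper name, not part of the answer above.

```python
def collapse_pairs(string):
    """Read the string two characters at a time and undo the padding."""
    chunks = [string[i:i + 2] for i in range(0, len(string), 2)]
    # A chunk such as "H " strips down to "H"; a chunk of pure whitespace
    # strips down to "" and therefore stands for one real space.
    return ''.join(chunk.strip() or ' ' for chunk in chunks)


# Assumed input shape: every character, real spaces included, is followed by one space.
padded = ' '.join("How do small")
print(collapse_pairs(padded))          # How do small

# If word gaps are only two spaces wide, the chunks fall out of step after
# the first gap and the word boundary is lost.
two_space_gap = "H o w" + " " * 2 + "d o"
print(collapse_pairs(two_space_gap))   # Howdo
```

The second call shows the limitation: the slicing relies on strict alignment, so for two-space word gaps the regex or split-based answers are the safer choice.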
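Edit 2: the question has changed and is now trickier. I am leaving this answer in place for the previous version of the question, but it does not solve the current one.
Current question
Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns
I suggest you use a dictionary of real words. Here is an SO thread.
Then take your sentence (here, Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns) and split it on the space character (it seems to be the only separator the fragments have in common).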
Here is a pseudocode solution:
iterating through the string list:
    keeping the currWord index
    while realWord not found:
        checking currWord in dictionary.
        if realWord is not found:
            join the nextWord to the currWord
        else:
            join currWord to the final sentence
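By doing this and keeping track of the currWord index you are at, you can log where you ran into problems and add new rules for your word splitting. You can tell that something has gone wrong when a certain threshold is reached (for example, a word that is 30 characters long).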
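Previous question
Edit: you are right @Adelin, I should have posted this as a comment.
If it helps, here is a simpler program that shows what is happening, for those who prefer not to use a regex for such a simple, uniform case: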
def raw_char_to_sentence(seq):
    """ Splits the "seq" parameter using 'space'. As words are separated with two spaces,
    "raw_char_to_sentence" transforms this list of characters into a full string
    sentence.
    """
    char_list = seq.split(' ')
    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # If word is over, add it to sentence, and reset the current word.
            sentence += (word + ' ')
            word = ''
    # The last word is not followed by a double space, so add it here.
    sentence += word
    # Strip any trailing space left over from the loop.
    return sentence.rstrip()

temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs : How do smallholder farmersfit into the big picture of world foodproduction
First, get a list of words (i.e. a vocabulary), for example nltk.corpus.words:
>>> from nltk.corpus import words
>>> vocab = words.words()
or
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())
Convert the input into a string with no spaces:
>>> text = "H o w d o sm a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'
Assumptions:
Code:
from collections import Counter
from nltk.corpus import brown

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
# text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

vocab_freq = Counter(brown.words())

max_word_len = 10
words = []
# A i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in reversed(range(1, max_word_len+1)):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No known word starts here: keep one character so the loop always advances.
        words.append(s[i])
        i += 1
print(' '.join(words))
[out]:
how do small holder farmers fit into the big picture of world food production
Assumption 2 depends heavily on the corpus/vocabulary you have, so you can combine more corpora to get better results:
from collections import Counter
from nltk.corpus import brown, gutenberg, inaugural, treebank

vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())

text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

max_word_len = 10
words = []
# A i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in reversed(range(1, max_word_len+1)):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No known word starts here: keep one character so the loop always advances.
        words.append(s[i])
        i += 1
print(' '.join(words))
[out]:
innovative business models and financing mechanisms for p v deployment in emerging regions
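Because the results depend on whichever corpora are installed, the greedy longest-match idea can also be tried with a tiny hand-written vocabulary. The sketch below is an illustration rather than the code above: greedy_segment and toy_vocab are made-up names, and the vocabulary is an assumption chosen to show how max_word_len affects the split.

```python
def greedy_segment(text, vocabulary, max_word_len=10):
    """Split text into words by always taking the longest known word first."""
    s = ''.join(text.lower().split())
    words = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, down to a single character.
        for j in range(min(max_word_len, len(s) - i), 0, -1):
            if s[i:i + j] in vocabulary:
                words.append(s[i:i + j])
                i += j
                break
        else:
            # Nothing matched: emit one character so the loop always advances.
            words.append(s[i])
            i += 1
    return words


# Hypothetical vocabulary, standing in for a corpus-derived one.
toy_vocab = {"how", "do", "small", "holder", "smallholder", "farmers", "fit"}
text = "H o w d o s m a l l h o l d e r f a r m e r s f i t"

print(greedy_segment(text, toy_vocab))
# ['how', 'do', 'small', 'holder', 'farmers', 'fit']
print(greedy_segment(text, toy_vocab, max_word_len=11))
# ['how', 'do', 'smallholder', 'farmers', 'fit']
```

With max_word_len = 10, an 11-letter word such as smallholder can never be matched, whatever the vocabulary contains, so the limit is worth raising when long words are expected.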