I am using Python and I am looking for a way to arrange these characters into a meaningful sentence and improve readability. The sample input is
H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n
Output
How do small holder farmers fit into the big picture of world food production
One way is to remove every single space and, where the line has two spaces, keep one of them.
Can anyone suggest other ways?
Edit
See this text line
Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns
This is my actual problem, so I can't simply remove spaces. One idea might be to match every run of characters against a dictionary to find the right words.
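import re
# Single spaces separate letters; double spaces separate words.
a = 'H o w  d o  s m a l l h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n'
# Drop the space that follows each character; the second space of a pair survives.
s = re.sub(r'(.) ', r'\1', a)
print(s)
# How do smallholder farmers fit into the big picture of world foodproduction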
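You can take the string two characters at a time and either strip the padding space or, when the pair contains only spaces, emit a single space. This assumes every original character, including each real space, is followed by exactly one padding space.
>>> ''.join([string[i:i+2].strip() or ' ' for i in range(0, len(string), 2)])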
'How do smallholder farmers fit into the big picture of world foodproduction'
Edit_2: The question has changed and is now a bit trickier. I am leaving my answer to the previous problem below, but it does not solve the current one.
CURRENT PROBLEM
Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns
I advise you to use a dictionary of real words (there is an SO thread about this).
You would then take your sentence (here, Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns) and split it on spaces (seemingly the only separator the fragments have in common).
Here is the pseudo-code solution:
iterating through the string list:
    keeping the currWord index
    while realWord not found:
        checking currWord in dictionary.
        if realWord is not found:
            join the nextWord to the currWord
        else:
            join currWord to the final sentence
Doing this, and keeping track of the currWord index you are at, you can log where a problem occurs and add new rules for your word splitting. You can tell you have a problem when a certain threshold is reached (for instance, a word 30 characters long).
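The pseudo-code above could be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions: the function name merge_fragments, the max_len threshold of 30, and the tiny hand-written dictionary are all hypothetical, and the input is the version of the sentence from the question's edit, where the fragments happen to line up with word boundaries. A real dictionary would also contain short words such as "me", which would stop the joining too early.

```python
def merge_fragments(fragments, dictionary, max_len=30):
    """Join adjacent fragments until the joined string is a dictionary word.

    Returns the rebuilt sentence and the indices where no word was found.
    """
    sentence = []
    problems = []
    i = 0
    while i < len(fragments):
        curr = fragments[i]
        j = i + 1
        # Keep appending the next fragment while curr is not a known word.
        while (curr.lower() not in dictionary
               and j < len(fragments)
               and len(curr) <= max_len):
            curr += fragments[j]
            j += 1
        if curr.lower() in dictionary:
            sentence.append(curr)
            i = j
        else:
            # Threshold reached or input exhausted: log the index, keep the
            # fragment unchanged, and move on to the next one.
            problems.append(i)
            sentence.append(fragments[i])
            i += 1
    return ' '.join(sentence), problems


# Hypothetical toy dictionary, just large enough for this sentence.
toy_dictionary = {"innovative", "business", "models", "and", "financing",
                  "mechanisms", "for", "pv", "deployment", "in",
                  "emerging", "regions"}
text = ("Inn ovative b usines s m odels and financi ng m e chanisms "
        "for pv de ploym ent in em ergi ng regio ns")
sentence, problems = merge_fragments(text.split(), toy_dictionary)
print(sentence)
print(problems)
```

The problems list plays the role of the log mentioned above: any index in it marks a fragment where joining never produced a dictionary word.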
LAST PROBLEM
Edit : You're right @Adelin, I should have commented.
If I may, here is a simpler program where you can see what is going on, in case you dislike using a regex for simple uniform cases:
def raw_char_to_sentence(seq):
    """ Splits the "seq" parameter using 'space'. As words are separated with two spaces,
    "raw_char_to_sentence" transforms this list of characters into a full string
    sentence.
    """
    char_list = seq.split(' ')
    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # An empty item marks a double space: the word is over, so add it
            # to the sentence and reset the current word.
            sentence += (word + ' ')
            word = ''
    # The last word is not followed by a double space, so add it here.
    sentence += word
    # Strip any trailing space left over from the loop.
    return sentence.rstrip()

# Single spaces separate letters; double spaces separate words.
temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs: How do smallholder farmers fit into the big picture of world foodproduction
First, get a list of words (a vocabulary), e.g. nltk.corpus.words:
>>> from nltk.corpus import words
>>> vocab = words.words()
Or
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())
Convert the input into a space-less string:
>>> text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'
Assumptions:
Code:
from collections import Counter
from nltk.corpus import brown

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
# Second test sentence; uncomment to try it instead.
# text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())
vocab_freq = Counter(brown.words())
max_word_len = 10
words = []
# i is a pointer moving forward through s.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in reversed(range(1, max_word_len + 1)):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i + j
            break
    else:
        # No vocabulary word starts here: emit one character so the loop always advances.
        words.append(s[i])
        i += 1
print(' '.join(words))
[out]:
how do small holder farmers fit into the big picture of world food production
Assumption 2 depends heavily on the corpus/vocabulary you have, so you can combine more corpora to get better results:
from collections import Counter
from nltk.corpus import brown, gutenberg, inaugural, treebank

# Merge the word counts of several corpora into one larger vocabulary.
vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())
text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())
max_word_len = 10
words = []
# i is a pointer moving forward through s.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in reversed(range(1, max_word_len + 1)):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i + j
            break
    else:
        # No vocabulary word starts here: emit one character so the loop always advances.
        words.append(s[i])
        i += 1
print(' '.join(words))
[out]:
innovative business models and financing mechanisms for p v deployment in emerging regions
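To see how much the vocabulary matters for the greedy longest-match loop above, here is a minimal sketch that runs without NLTK. The function name greedy_segment and both toy vocabularies are assumptions made up for this illustration; only the matching strategy is taken from the code above.

```python
def greedy_segment(s, vocab, max_word_len=10):
    """Greedy longest-match segmentation of a space-less string.

    Characters that start no vocabulary word are emitted on their own.
    """
    words = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, never reading past the end of s.
        for j in range(min(max_word_len, len(s) - i), 0, -1):
            if s[i:i + j] in vocab:
                words.append(s[i:i + j])
                i += j
                break
        else:
            # Nothing matched at this position: emit a single character.
            words.append(s[i])
            i += 1
    return words


s = ''.join("de ploym ent in em ergi ng regio ns".lower().split())

# Hypothetical vocabularies: the second one simply knows more words.
small_vocab = {"in", "regions"}
larger_vocab = small_vocab | {"deployment", "emerging"}

print(' '.join(greedy_segment(s, small_vocab)))
print(' '.join(greedy_segment(s, larger_vocab)))
```

With the smaller vocabulary most of the string falls apart into single letters, while the larger one recovers whole words, which is why combining corpora helps.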