
Remove spaces from words and generate exact words

I am using Python and I am looking for a way to arrange spaced-out letters into a meaningful sentence and improve readability. A sample input is

H o w  d o  s m a l l  h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d  p r o d u c t i o n

Output
How do small holder farmers fit into the big picture of world food production

One way is to remove every single space and, wherever the line has two spaces, keep one of them.

Can anyone suggest more ways?

Edit

Consider this line of text:

Inn ovative  b usines s  m odels  and  financi ng  m e chanisms  for  pv  de ploym ent  in  em ergi ng  regio ns

This is my problem, so I can't simply remove spaces. One idea is to match every set of characters against a dictionary and find the right words. Maybe:

import re 

a = 'H o w   d o   sm a l l h o l d e r   f a r m e r s  f i t   i n t o   t h e   b i g   p i c t u r e   o f   w o r l d   f o o d p r o d u c t i o n'

s = re.sub(r'(.) ',r'\1',a)

print(s)

How do smallholder farmers fit into the big picture of world foodproduction

You can take the string two characters at a time and strip the spaces from each pair; a pair that consists only of spaces becomes a single space (string holds the spaced-out input):

>>> ''.join([string[i:i+2].strip() or ' ' for i in range(0, len(string), 2)])
'How do smallholder farmers fit into the big picture of world foodproduction'
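
A variant of the two ideas above, as a minimal sketch: split on runs of two or more spaces to get the words, then drop the single spaces left inside each word. Like both snippets above, it assumes that word boundaries are always marked by at least two spaces. The helper name collapse_spaced_letters is made up for illustration.

```python
import re

def collapse_spaced_letters(line):
    # Runs of two or more whitespace characters are treated as word boundaries.
    chunks = re.split(r"\s{2,}", line.strip())
    # Inside each chunk, the remaining single spaces only separate letters.
    return " ".join("".join(chunk.split()) for chunk in chunks)

a = 'H o w   d o   sm a l l h o l d e r   f a r m e r s  f i t   i n t o   t h e   b i g   p i c t u r e   o f   w o r l d   f o o d p r o d u c t i o n'
print(collapse_spaced_letters(a))
# How do smallholder farmers fit into the big picture of world foodproduction
```

Where the input has only one space between two words (as in "f o o d p r o d u c t i o n"), the words stay glued together, just as with the regex version.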

Edit 2: The question has changed and is now a bit trickier. I am leaving my answer to the previous problem below, but it does not address the current one.

CURRENT PROBLEM

Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns

I advise you to use a dictionary of real words. There is an SO thread on this.

You would then take your sentence (here Inn ovative b usines sm odels and financi ng me chanisms for pv de ploym ent in em ergi ng regio ns) and split it on spaces, since the space seems to be the only separator the fragments have in common.

Here is a pseudo-code solution:

iterate through the list of fragments:
    keep track of the currWord index
    while realWord is not found:
        look up currWord in the dictionary
        if realWord is not found:
            join the nextWord to currWord
        else:
            append currWord to the final sentence

Doing this, and keeping track of the currWord index you are at, you can log where a problem occurs and add new rules for your word splitting. You can tell there is a problem when a certain threshold is reached (for instance, a candidate word 30 characters long).
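
The pseudo-code could be turned into something like the following. This is only a sketch under assumptions: the function name merge_fragments, the tiny toy_vocab set, and the max_len parameter are all made up for illustration, and a real word list would replace the toy one. It uses the fragment list from the question's edit, where no fragment straddles two words.

```python
def merge_fragments(fragments, vocab, max_len=30):
    """Join adjacent fragments until the result is a known word.

    Returns the rebuilt sentence and a list of (index, fragment) pairs
    for fragments that never formed a word within max_len characters.
    """
    sentence = []
    problems = []
    i = 0
    while i < len(fragments):
        curr = ""
        j = i
        found = False
        # Keep appending the next fragment until a word is found
        # or the length threshold is reached.
        while j < len(fragments) and len(curr) + len(fragments[j]) <= max_len:
            curr += fragments[j]
            j += 1
            if curr.lower() in vocab:
                found = True
                break
        if found:
            sentence.append(curr)
            i = j
        else:
            # Log the problem, keep the fragment unchanged, and move on.
            problems.append((i, fragments[i]))
            sentence.append(fragments[i])
            i += 1
    return " ".join(sentence), problems

# Hypothetical toy vocabulary; a real dictionary would go here.
toy_vocab = {"innovative", "business", "models", "and", "financing", "mechanisms",
             "for", "pv", "deployment", "in", "emerging", "regions"}
line = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
result, problems = merge_fragments(line.split(), toy_vocab)
print(result)
# Innovative business models and financing mechanisms for pv deployment in emerging regions
print(problems)
# []
```

Because this stops at the first (shortest) match, a real dictionary that contains "inn" or "me" would accept those fragments too early. That is the kind of case the problem log is meant to surface, so that extra rules can be added.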


LAST PROBLEM

Edit : You're right @Adelin, I should have commented.

If I may, here is a simpler program, in case you want to see exactly what is going on or dislike using regex for simple, uniform cases:

def raw_char_to_sentence(seq):
    """ Splits the "seq" parameter using 'space'. As words are separated with two spaces,
        "raw_char_to_sentence" transforms this list of characters into a full string
        sentence.
    """
    char_list = seq.split(' ')

    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # If word is over, add it to sentence, and reset the current word.
            sentence += (word + ' ')
            word = ''

    # The last word has no separator after it, so append it explicitly.
    # rstrip() removes the extra space left when the input ends with a separator.
    return (sentence + word).rstrip()

temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs : How do smallholder farmersfit into the big picture of world foodproduction

First, get a list of words (a vocabulary), e.g. nltk.corpus.words:

>>> from nltk.corpus import words
>>> vocab = words.words()

Or

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())

Convert the input into a space-less string:

>>> text = "H o w d o sm a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'

Assumptions:

  • The longer a candidate string is, the more likely it is to be the intended word, so longer matches are tried first
  • A string that is not in the vocabulary is not a word

Code:

from collections import Counter

from nltk.corpus import brown

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
s = ''.join(text.lower().split())

vocab_freq = Counter(brown.words())

max_word_len = 10

words = []
# An i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # Nothing matched: keep the character and move on,
        # otherwise the while loop would never end.
        words.append(s[i])
        i += 1

print(' '.join(words))

[out]:

how do small holder farmers fit into the big picture of world food production

Assumption 2 depends heavily on the corpus/vocabulary you have, so you can combine more corpora to get better results:

from collections import Counter

from nltk.corpus import brown, gutenberg, inaugural, treebank

vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())

text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

max_word_len = 10

words = []
# An i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # Nothing matched: keep the character and move on,
        # otherwise the while loop would never end.
        words.append(s[i])
        i += 1

print(' '.join(words))

[out]:

innovative business models and financing mechanisms for p v deployment in emerging regions
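
To experiment without downloading the NLTK corpora, the same greedy longest-match loop can be wrapped in a function that accepts any vocabulary. The sketch below is an illustration only: the function name segment and the toy_vocab set are made up here, and a plain set stands in for the frequency counter.

```python
def segment(text, vocab, max_word_len=10):
    """Greedy longest-match segmentation of text after removing all whitespace."""
    s = "".join(text.lower().split())
    words = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, down to a single character.
        for j in range(min(max_word_len, len(s) - i), 0, -1):
            if s[i:i + j] in vocab:
                words.append(s[i:i + j])
                i += j
                break
        else:
            # No candidate matched: keep the character so the loop always advances.
            words.append(s[i])
            i += 1
    return words

# Hypothetical toy vocabulary standing in for a corpus-derived one.
toy_vocab = {"how", "do", "small", "holder", "farmers", "fit", "into", "in", "to",
             "the", "big", "picture", "of", "world", "food", "production"}
text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
print(" ".join(segment(text, toy_vocab)))
# how do small holder farmers fit into the big picture of world food production
```

Note that "into" wins over "in" followed by "to" because longer candidates are tried first, which is assumption 1 at work.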
