
Splitting words using nltk module in Python

I am trying to find a way to split words in Python using the nltk module. I am unsure how to reach my goal given the raw data I have, which is a list of tokenized words, e.g.:

['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']

As you can see, many words are stuck together (i.e. 'to' and 'produce' are stuck together in the single string 'toproduce'). This is an artifact of scraping data from a PDF file, and I would like to find a way, using the nltk module in Python, to split the stuck-together words (e.g. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').

I appreciate any help!

I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation features in NLTK that will deal with English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example: it would work for a modest number of relatively short strings, such as the strings in your example list, but would be highly inefficient for longer or more numerous strings. It would need modification, and it will not successfully segment every string in any case.

import enchant  # pip install pyenchant
eng_dict = enchant.Dict("en_US")

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through prefixes of the chars, starting with the longest possible
        # (note: the range stops at 2, so single-character words are never matched)
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words

As you can see, this approach (presumably) mis-segments only one of your strings: 'Updatesampletrackingsystemsandprocess' comes out as 'Updates', 'ample', ... rather than 'Update', 'sample', ...:

>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]

You can then use chain to flatten this list of lists:

>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
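
If you need to segment longer strings, or many of them, a memoized variant of the same longest-prefix idea avoids the exclude-and-restart recursion above by backtracking per position instead. This is only a sketch of mine, not part of the answer's code: it assumes the same eng_dict defined earlier, and segment_memo is a hypothetical name.

from functools import lru_cache

@lru_cache(maxsize=None)
def segment_memo(chars):
    """Return a tuple of dictionary words covering chars, or None
    if no full segmentation exists (assumes eng_dict from above)."""
    if not chars.isalpha():  # pass punctuation etc. through unchanged
        return (chars,)
    # try prefixes longest-first, as segment_str does (single chars skipped)
    for i in range(len(chars), 1, -1):
        prefix = chars[:i]
        if eng_dict.check(prefix):
            if i == len(chars):
                return (prefix,)
            rest = segment_memo(chars[i:])
            if rest is not None:  # otherwise backtrack to a shorter prefix
                return (prefix,) + rest
    return None

When segment_memo returns None you can fall back to (chars,), mirroring the "just keeping as one segment" behaviour above; e.g. segment_memo('standardoperatingprocedures') gives ('standard', 'operating', 'procedures').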

You can easily install the following library and use it for your purpose:

pip install wordsegment

Then, in Python:

import wordsegment
help(wordsegment)

from wordsegment import load, segment
load()
segment('usingvariousmolecularbiology')

The output will be like this:

Out[4]: ['using', 'various', 'molecular', 'biology']

Please refer to http://www.grantjenks.com/docs/wordsegment/ for more details.
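
If you want to apply wordsegment to your whole list, note that (as far as I know) segment() lowercases its output and drops punctuation, so it may be worth passing non-alphabetic tokens through unchanged. A minimal sketch, reusing the list t from the first answer:

from itertools import chain
from wordsegment import load, segment

load()
# segment() normalizes case and strips non-alphanumeric characters,
# so keep tokens like '.' as they are
flat = list(chain.from_iterable(
    segment(tok) if tok.isalpha() else [tok] for tok in t
))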
