
Splitting words using the nltk module in Python

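I am trying to find a way to split words in Python using the nltk module. I am not sure how to reach my goal, given that the raw data I have is a list of tokenized words, e.g.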

['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']

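As you can see, many words are stuck together (e.g. "to" and "produce" are stuck together in the single string "toproduce"). This is an artifact of scraping the data from a PDF file, and I would like to find a way, using the nltk module in Python, to split the stuck-together words (i.e. split "toproduce" into two words, "to" and "produce", and split "standardoperatingprocedures" into three words: "standard", "operating", "procedures").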

I appreciate any help!

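I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation feature in NLTK that handles English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It works for a modest number of relatively short strings, such as the strings in your example list, but it would be highly inefficient for longer or more numerous strings.) It would need modification, and in any case it will not successfully segment every string.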

import enchant  # pip install pyenchant
eng_dict = enchant.Dict("en_US")

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through prefixes of the remaining chars, longest first;
        # the range stops at 2, so one-character segments are never tried
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words

As you can see, this approach (presumably) mis-segments only one of your strings:

>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]

You can then use chain to flatten this list of lists:

>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
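The one mis-segmentation above ('Updates', 'ample' where 'Update', 'sample' was intended) comes from always taking the longest matching prefix. The sketch below is not part of the pyenchant code; it uses a tiny hand-written vocabulary (VOCAB, an assumption made purely for illustration) and a hypothetical helper, all_segmentations, to enumerate every dictionary-valid split. It shows that both readings are valid as far as a plain word list is concerned, so a dictionary alone cannot pick between them.

```python
from functools import lru_cache

# Toy vocabulary, assumed for illustration only (not the pyenchant dictionary).
VOCAB = frozenset({"update", "updates", "ample", "sample", "tracking"})


def all_segmentations(chars, vocab=VOCAB):
    """Return every way to split chars into a sequence of words from vocab."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string means the remainder is segmented.
        if start == len(chars):
            return [[]]
        results = []
        # Try every prefix of the remaining characters, shortest first.
        for end in range(start + 1, len(chars) + 1):
            word = chars[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


print(all_segmentations("updatesampletracking"))
# [['update', 'sample', 'tracking'], ['updates', 'ample', 'tracking']]
```

Both candidates have the same number of words, so choosing between them needs extra information, such as how common each word is.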

You can easily install the following library and use it for your purpose:

pip install wordsegment
import wordsegment
help(wordsegment)

from wordsegment import load, segment
load()
segment('usingvariousmolecularbiology')

The output will look like this:

Out[4]: ['using', 'various', 'molecular', 'biology']

For more details, see http://www.grantjenks.com/docs/wordsegment/
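To run a segmenter over the whole token list from the question rather than a single string, something like the following might work. segment_tokens is a hypothetical helper, not part of wordsegment; it takes the segmenter as a parameter, passes non-alphabetic tokens such as '.' through unchanged, and flattens the result. The toy_segmenter below is only a stand-in so that the sketch runs without any third-party package.

```python
from itertools import chain


def segment_tokens(tokens, segmenter):
    """Apply segmenter to alphabetic tokens and flatten the results.

    Tokens that are not purely alphabetic (e.g. '.') are kept as they are.
    """
    return list(chain.from_iterable(
        segmenter(token) if token.isalpha() else [token]
        for token in tokens
    ))


def toy_segmenter(token):
    # Stand-in for a real segmenter; the lookup table is made up for the demo.
    lookup = {"toproduce": ["to", "produce"], "andrapid": ["and", "rapid"]}
    return lookup.get(token, [token])


print(segment_tokens(["toproduce", ".", "andrapid"], toy_segmenter))
# ['to', 'produce', '.', 'and', 'rapid']
```

With wordsegment installed, calling load() once and then segment_tokens(t, segment) should give a flat list of words. As far as I can tell, segment() lowercases its input and drops non-alphanumeric characters, so capitalised words such as 'Operate' would likely come back in lowercase, which is also why punctuation tokens are passed through here instead of being handed to it.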

