
Splitting words using nltk module in Python

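I am trying to find a way to split words in Python using the nltk module. I am not sure how to reach my goal, given that the raw data I have is a list of tokenized words, e.g.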

['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']

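As you can see, many words are stuck together (i.e. 'to' and 'produce' are stuck in one string 'toproduce'). This is an artifact of scraping data from a PDF file, and I would like to find a way, using the nltk module in Python, to split the stuck-together words (i.e. split 'toproduce' into two words: 'to' and 'produce'; and split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').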

I appreciate any help!

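I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation feature in NLTK that deals with English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It works for a modest number of relatively short strings, such as the strings in your example list, but it is highly inefficient for longer strings or for a larger number of strings.) It would need modification, and it will not successfully segment every string in any case.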

import enchant  # pip install pyenchant
eng_dict = enchant.Dict("en_US")

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words

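As you can see, this approach (presumably) mis-segments only one of the strings ('Updatesampletrackingsystemsandprocess' comes out starting with 'Updates', 'ample' where 'Update', 'sample' was presumably intended):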

>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]

You can then use chain to flatten this list of lists:

>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
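The answer above notes that the greedy, restart-on-failure approach becomes very inefficient for longer strings. One common way around that is to memoize the search over suffixes, so each position in the string is solved only once. The following is only a minimal sketch of that idea, not a drop-in replacement: `VOCAB` is a hypothetical toy word set standing in for the pyenchant dictionary, `segment_memo` is a made-up name, and the function lowercases its input, so the original capitalization is not preserved.

```python
from functools import lru_cache

# Hypothetical toy vocabulary standing in for a real dictionary such as
# the pyenchant "en_US" dictionary used above.
VOCAB = {
    "to", "produce", "standard", "operating", "procedures",
    "and", "rapid", "of", "casework", "case", "work",
}


def segment_memo(chars, vocab=VOCAB):
    """Split chars into the fewest vocabulary words that cover it fully.

    Returns [chars] unchanged if no full segmentation exists.
    """
    lowered = chars.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation (a tuple of words) of lowered[start:],
        # or None if that suffix cannot be fully segmented.
        if start == len(lowered):
            return ()
        candidates = []
        for end in range(start + 1, len(lowered) + 1):
            word = lowered[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        # Fewer words means longer words, mirroring the
        # "keep the longest possible words" goal of segment_str.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else [chars]


print(segment_memo("standardoperatingprocedures"))
print(segment_memo("ofcasework"))
```

Because each suffix is cached, a dead end is never explored twice, which is what the `exclude` set in `segment_str` tries to approximate. To use a real dictionary, the `word in vocab` test could be swapped for a call such as `eng_dict.check(word)`.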

You can easily install the following library and use it for your purpose:

pip install wordsegment
import wordsegment
help(wordsegment)

from wordsegment import load, segment
load()
segment('usingvariousmolecularbiology')

The output will look like this:

Out[4]: ['using', 'various', 'molecular', 'biology']

For more details, see http://www.grantjenks.com/docs/wordsegment/
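To give a feel for why a frequency-based segmenter can do better than a plain dictionary lookup on a string like 'Updatesampletrackingsystemsandprocess', here is a small sketch that scores candidate splits by word frequency and keeps the most probable one. This is not wordsegment's actual implementation; the `COUNTS` table holds invented toy numbers, and `word_logprob` and `segment_by_frequency` are hypothetical names used only for illustration.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real segmenter would load word counts
# from a large corpus instead.
COUNTS = {
    "update": 500, "updates": 300, "sample": 400, "ample": 20,
    "tracking": 200, "systems": 250, "and": 5000, "process": 600,
}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a word under the toy unigram counts."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length, so long
    # unknown chunks are strongly discouraged.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment_by_frequency(text, max_word_len=20):
    """Return the split of text with the highest total log-probability."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # (score, words) for the best segmentation of text[start:].
        if start == len(text):
            return 0.0, ()
        options = []
        last_end = min(len(text), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            options.append((word_logprob(word) + rest_score,
                            (word,) + rest_words))
        return max(options)

    return list(best(0)[1])


print(segment_by_frequency("Updatesampletrackingsystemsandprocess"))
```

With these toy counts, 'update' + 'sample' scores higher than 'updates' + 'ample' because both words in the first pair are more frequent, whereas a longest-match dictionary approach has no way to prefer one valid split over the other.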
