Parsing words into (prefix, root, suffix) in Python

I'm trying to create a simple parser for some text data. (The text is in a language that NLTK doesn't have any parsers for.)

Basically, I have a limited number of prefixes, which can be either one or two letters; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever is in between them should be the "root" of the word. Many words will have more than one possible parse, so I want to input a word and get back a list of possible parses, each in the form of a tuple (prefix, root, suffix).

I can't figure out how to structure the code, though. I pasted an example of one way I tried (using some dummy English data to make it more understandable), but it's clearly not right. For one thing, it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work with words that have more than one prefix or suffix, or both prefix(es) and suffix(es).

Any thoughts?

prefixes = ['de','con']
suffixes = ['er','s']

def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses



>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...   parses = parser(w)
...   print(w)
...   for p in parses:
...     print(p)
... 
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')

Here is my solution:

prefixes = ['de','con']
suffixes = ['er','s']

def parse(word):
    prefix = ''
    suffix = ''

    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True

    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True

    return (prefix, word, suffix)

print(parse('construct'))
print(parse('destructer'))
print(parse('deconstructs'))
print(parse('deconstructers'))
print(parse('deconstructser'))
print(parse('condestructser'))

Result:

>>> 
('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')

The idea is to loop through all prefixes, accumulate the ones that match, and remove each one from the word as it is found. The tricky part is that the order in which the prefixes are defined may prevent some of them from being found in a single pass, so the loop must be repeated until a full pass finds no more prefixes.

The same goes for suffixes, except that the suffix string is built in reverse order: each newly found suffix is prepended rather than appended.
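To see why the pass has to be repeated, here is a minimal sketch comparing a single pass with the repeated version. The function names are made up for illustration and are not part of the solution above; only the prefix list is taken from it.

```python
PREFIXES = ['de', 'con']

def strip_prefixes_single_pass(word):
    """One pass over PREFIXES: can miss a prefix exposed by a later match."""
    found = ''
    for p in PREFIXES:
        if word.startswith(p):
            found += p
            word = word[len(p):]
    return found, word

def strip_prefixes_repeated(word):
    """Repeat the pass until a full pass finds no prefix."""
    found = ''
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p):
                found += p
                word = word[len(p):]
                changed = True
    return found, word

# 'de' is tested before 'con', so a single pass handles 'deconstruct' ...
print(strip_prefixes_single_pass('deconstruct'))  # ('decon', 'struct')
# ... but misses the 'de' that only appears after 'con' is removed.
print(strip_prefixes_single_pass('condestruct'))  # ('con', 'destruct')
print(strip_prefixes_repeated('condestruct'))     # ('conde', 'struct')
```

Note that both versions are greedy: they return one parse, not every possible parse.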

CodeChords man beat me to this one, but as mine gives the prefixes and suffixes as tuples (which may be more or less useful depending on the context) and uses recursion, I thought I'd post it anyway.

class Parser:
    PREFIXES = ['de', 'con']
    SUFFIXES = ['er', 's']
    MINIMUM_STEM_LENGTH = 3

    @classmethod
    def prefixes(cls, word, internal=False):
        stem = word
        prefix = None
        for potential_prefix in cls.PREFIXES:
            if word.startswith(potential_prefix):
                prefix = potential_prefix
                stem = word[len(prefix):]
                if len(stem) >= cls.MINIMUM_STEM_LENGTH:
                    break
                else:
                    # the remaining stem would be too short: reject this prefix
                    prefix = None
                    stem = word
        if prefix:
            # recurse on the remainder; prefixes are collected innermost-first
            others, stem = cls.prefixes(stem, True)
            others.append(prefix)
            # only the outermost call reverses the list into reading order
            return (others, stem) if internal else (reversed(others), stem)
        else:
            return [], stem

    @classmethod
    def suffixes(cls, word):
        suffix = None
        stem = word
        for potential_suffix in cls.SUFFIXES:
            if word.endswith(potential_suffix):
                suffix = potential_suffix
                stem = word[:-len(suffix)]
                if len(stem) >= cls.MINIMUM_STEM_LENGTH:
                    break
                else:
                    # the remaining stem would be too short: reject this suffix
                    suffix = None
                    stem = word
        if suffix:
            # recurse on the remainder; appending keeps suffixes in reading order
            others, stem = cls.suffixes(stem)
            others.append(suffix)
            return others, stem
        else:
            return [], stem

    @classmethod
    def parse(cls, word):
        # strip prefixes first, then suffixes from what is left
        prefixes, word = cls.prefixes(word)
        suffixes, word = cls.suffixes(word)
        return tuple(prefixes), word, tuple(suffixes)

words = ['con', 'deAAers', 'deAAAers', 'construct', 'destructer', 'constructs', 'deconstructs', 'deconstructers']

parser = Parser()
for word in words:
    print(parser.parse(word))

Which gives us:

((), 'con', ())
(('de',), 'AAer', ('s',))
(('de',), 'AAA', ('er', 's'))
(('con',), 'struct', ())
(('de',), 'struct', ('er',))
(('con',), 'struct', ('s',))
(('de', 'con'), 'struct', ('s',))
(('de', 'con'), 'struct', ('er', 's'))

This works by taking the word and using the str.startswith() method to find a prefix. It does this recursively until the word is reduced to one with no prefixes (or one whose stem would become too short), then passes back the list of prefixes.

It then does the same for suffixes, using str.endswith() instead.
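The code above returns one (greedy) parse per word, whereas the question asks for a list of all possible parses. The same recursive idea can be extended to enumerate them. What follows is only a rough sketch: the function names and MIN_ROOT_LENGTH are assumptions made for illustration, and it assumes no prefix or suffix is an empty string.

```python
PREFIXES = ['de', 'con']
SUFFIXES = ['er', 's']
MIN_ROOT_LENGTH = 3  # assumed threshold, mirrors the minimum stem length above

def prefix_splits(word):
    """Yield (prefixes, rest) for every way of stripping zero or more prefixes."""
    yield (), word
    for p in PREFIXES:
        if word.startswith(p):
            for more, rest in prefix_splits(word[len(p):]):
                yield (p,) + more, rest

def suffix_splits(word):
    """Yield (rest, suffixes) for every way of stripping zero or more suffixes."""
    yield word, ()
    for s in SUFFIXES:
        if word.endswith(s):
            for rest, more in suffix_splits(word[:-len(s)]):
                yield rest, more + (s,)

def all_parses(word):
    """Return every (prefixes, root, suffixes) split whose root is long enough."""
    parses = []
    for prefixes, rest in prefix_splits(word):
        for root, suffixes in suffix_splits(rest):
            if len(root) >= MIN_ROOT_LENGTH:
                parses.append((prefixes, root, suffixes))
    return parses

for parse in all_parses('deconstructs'):
    print(parse)
# ((), 'deconstructs', ())
# ((), 'deconstruct', ('s',))
# (('de',), 'constructs', ())
# (('de',), 'construct', ('s',))
# (('de', 'con'), 'structs', ())
# (('de', 'con'), 'struct', ('s',))
```

Deciding which of these candidates is the right one would need extra information, for example a list of known roots.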

Pyparsing wraps the string indexing and token extraction in its own parsing framework, and lets you use simple arithmetic syntax to build up your parsing definitions:

wordlist = ['construct','destructer','constructs','deconstructs']

from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo

endOfString = StringEnd()
prefix = oneOf("de con")
suffix = oneOf("er s") + FollowedBy(endOfString)

word = (ZeroOrMore(prefix)("prefixes") + 
        SkipTo(suffix | endOfString)("root") + 
        Optional(suffix)("suffix"))

for wd in wordlist:
    print(wd)
    res = word.parseString(wd)
    print(res.dump())
    print(res.prefixes)
    print(res.root)
    print(res.suffix)
    print()

The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output from this program is:

construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct


destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']

constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']

deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
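For comparison, roughly the same grammar (zero or more prefixes, a root, and an optional suffix at the end of the word) can be written with the standard library's re module. This is only a sketch and not part of the pyparsing solution; WORD_RE and split_word are hypothetical names, and splitting the prefix run back into a list assumes that no prefix is itself the start of another prefix.

```python
import re

PREFIXES = ["de", "con"]
SUFFIXES = ["er", "s"]

# Build alternations such as "de|con", escaping any regex metacharacters.
prefix_alt = "|".join(map(re.escape, PREFIXES))
suffix_alt = "|".join(map(re.escape, SUFFIXES))

# Greedy run of prefixes, lazy root, optional suffix anchored at the end.
WORD_RE = re.compile(
    r"^(?P<prefixes>(?:%s)*)(?P<root>.*?)(?P<suffix>%s)?$" % (prefix_alt, suffix_alt)
)

def split_word(word):
    """Return (list of prefixes, root, suffix or '') for a single word."""
    m = WORD_RE.match(word)
    prefixes = re.findall(prefix_alt, m.group("prefixes"))
    return prefixes, m.group("root"), m.group("suffix") or ""

for wd in ["construct", "destructer", "constructs", "deconstructs"]:
    print(wd, split_word(wd))
# construct (['con'], 'struct', '')
# destructer (['de'], 'struct', 'er')
# constructs (['con'], 'struct', 's')
# deconstructs (['de', 'con'], 'struct', 's')
```

Like the pyparsing grammar, this yields a single parse per word and recognises at most one suffix.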
