
Python regex: test whether a sentence is valid

ACTIVE_LIST = ACTOR | ACTIVE_LIST and ACTOR
ACTOR = NOUN | ARTICLE NOUN
ARTICLE = a | the
NOUN = tom | jerry | goofy | mickey | jimmy | dog | cat | mouse

By applying the above rules I can generate

a tom 
tom and a jerry 
the tom and a jerry 
the tom and a jerry and tom and dog

but not

Tom 
the Tom and me

Can I check whether a sentence is correct using only the Python re module? I know how to match certain characters with [abc], but I don't know how to match words. I am actually trying to solve this ACM problem. If someone can help me with part of it, I can do the rest. This is my first question here. Any suggestions or improvements are highly appreciated.

Use re.compile:

re.compile('tom', re.IGNORECASE)

The following topic shows another way to do this without re.compile (using search / match):

Case insensitive Python regular expression without re.compile
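A minimal sketch of both forms, assuming Python 3. The variable names are illustrative only. Note that ignoring case makes "Tom" match, which the question lists as invalid, so the flag only fits if upper-case input should be accepted.

```python
import re

# Form 1: compile once with the IGNORECASE flag, then reuse the pattern.
pattern = re.compile('tom', re.IGNORECASE)
compiled_hit = pattern.search('the Tom and me') is not None

# Form 2: pass the flag straight to re.search, without re.compile.
direct_hit = re.search('tom', 'the Tom and me', flags=re.IGNORECASE) is not None

# Without the flag the search is case sensitive, so 'Tom' is not found.
strict_hit = re.search('tom', 'the Tom and me') is not None

print(compiled_hit, direct_hit, strict_hit)
```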

This can be seen as an NLP (Natural Language Processing) problem. There is a Python module called NLTK (Natural Language Toolkit) that is well suited to this task and makes it easier than using regular expressions.

1) First, install NLTK ( http://www.nltk.org/install.html ).

2) Import NLTK:

import nltk

3) Create a small context-free grammar containing your four rules ( https://en.wikipedia.org/wiki/Context-free_grammar ). With the CFG class from NLTK, you can do that with a single call:

acm_grammar = nltk.CFG.fromstring("""
ACTIVE_LIST -> ACTOR | ACTIVE_LIST 'and' ACTOR
ACTOR -> NOUN | ARTICLE NOUN
ARTICLE -> 'a' | 'the'
NOUN -> 'tom' | 'jerry' | 'goofy' | 'mickey' | 'jimmy' | 'dog' | 'cat' | 'mouse' """)

4) Create a parser that will use the acm_grammar:

parser = nltk.ChartParser(acm_grammar)

5) Test it on some input. Each input sentence must be passed to the parser as a list of words (strings). The split() method can be used for this:

sentences = ["a tom", "tom and a jerry", "the tom and a jerry", "the tom and a jerry and tom and dog", "Tom", "the Tom and me"]

for sent in sentences:
    split_sent = sent.split()
    try:
        # parse() yields one tree per valid parse; no tree means no valid parse.
        is_valid = any(True for _ in parser.parse(split_sent))
    except ValueError:
        # Raised when the sentence contains a word the grammar does not cover.
        is_valid = False
    print(sent, "-- YES I WILL" if is_valid else "-- NO I WON'T")

In this last step, we check whether the parser can parse each sentence according to the acm_grammar. If the sentence contains a word that the grammar does not cover, the call to the parser raises a ValueError. If all words are known but they do not form a valid sentence, the parser yields no parse trees, so the code also checks that at least one tree was produced. Here is the output of this code:

a tom -- YES I WILL
tom and a jerry -- YES I WILL
the tom and a jerry -- YES I WILL
the tom and a jerry and tom and dog -- YES I WILL
Tom -- NO I WON'T
the Tom and me -- NO I WON'T

Yes, you can write that as a regex pattern, because the grammar is regular. The regular expression will be pretty long, but it can be generated in a fairly straightforward way; once you have the regex, you just compile it and apply it to each input.

The key is to turn regular rules into repetitions. For example,

STATEMENT = ACTION | STATEMENT , ACTION

can be turned into

ACTION (, ACTION)*

Of course, that's just a part of the problem, because you first have to transform ACTION into a regular expression in order to create the regex for STATEMENT.
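The rewrite above can be sketched in Python as follows. The helper name repeat_with_separator is illustrative, the ACTION pattern is only a placeholder (the real one has to be built from its own rules), and allowing optional spaces around the comma is an assumption.

```python
import re

def repeat_with_separator(item, separator):
    """Turn  X = ITEM | X SEP ITEM  into the regex  ITEM(SEP ITEM)*."""
    return '(?:{0})(?:{1}(?:{0}))*'.format(item, separator)

# Placeholder only: stands in for the real ACTION pattern.
ACTION = r'[a-z]+'

# A comma, optionally preceded and followed by spaces (an assumption).
STATEMENT = repeat_with_separator(ACTION, r' *, *')

statement_re = re.compile(STATEMENT)

# fullmatch requires the whole input to fit the pattern.
ok_list = statement_re.fullmatch('run, jump,hide') is not None
trailing_comma = statement_re.fullmatch('run,') is not None
print(ok_list, trailing_comma)
```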

The problem description glosses over an important issue, which is that the input does not just consist of lower-case alphabetic characters and commas. It also contains spaces, and the regular expression needs to insist on spaces at appropriate points. For example, the comma above probably must (and certainly might) be followed by one (or more) spaces. It might be OK if it were preceded by one or more spaces, too; the problem description isn't clear.

So the correct regular expression for ACTOR (an optional article followed by a noun) will actually turn out to be:

((a|the) +)?(tom|jerry|goofy|mickey|jimmy|dog|cat|mouse)
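The pattern above matches an optional article followed by a noun. Combining it with the repetition rewrite gives a pattern for the whole ACTIVE_LIST. The following is a sketch under two assumptions: words are separated by one or more spaces, and the function name is_active_list is made up for illustration.

```python
import re

ARTICLE = r'(?:a|the)'
NOUN = r'(?:tom|jerry|goofy|mickey|jimmy|dog|cat|mouse)'

# ACTOR = NOUN | ARTICLE NOUN  ->  optional article plus spaces, then a noun.
ACTOR = r'(?:{article} +)?{noun}'.format(article=ARTICLE, noun=NOUN)

# ACTIVE_LIST = ACTOR | ACTIVE_LIST and ACTOR  ->  ACTOR( and ACTOR)*
ACTIVE_LIST = r'{actor}(?: +and +{actor})*'.format(actor=ACTOR)

active_list_re = re.compile(ACTIVE_LIST)

def is_active_list(sentence):
    # fullmatch rejects input with anything left over before or after the list.
    return active_list_re.fullmatch(sentence) is not None

sentences = ["a tom", "tom and a jerry", "the tom and a jerry",
             "the tom and a jerry and tom and dog", "Tom", "the Tom and me"]
for sent in sentences:
    print(sent, "-- YES I WILL" if is_active_list(sent) else "-- NO I WON'T")
```

No IGNORECASE flag is used here, so "Tom" is rejected as the question requires.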

(I also found it interesting that the grammar as presented lets VERB match "hatesssssssss". I have no idea whether that was intentional.)

After thinking a lot, I solved it on my own:

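ARTICLE = ('a', 'the')
NOUN = ('tom', 'jerry', 'goofy', 'mickey', 'jimmy', 'dog', 'cat', 'mouse')

# Every valid ACTOR: a bare noun, or an article followed by a noun.
all_a = NOUN + tuple(' '.join([x, y]) for x in ARTICLE for y in NOUN)


def aseKi(actor):
    # True if the text is exactly one valid ACTOR.
    return actor in all_a

# Other sentences that can be tried in place of st:
st1 = 'the tom and jerry'
st2 = 'tom and a jerry'
st3 = 'tom and jerry and the mouse'

st = 'tom and goofy and goofy and the goofy and a dog and cat'

# Split on ' and ' (with spaces) so that the letters "and" inside a longer
# token are not treated as a separator.
val = st.split(' and ')

nice_val = [x.strip() for x in val]

s = [aseKi(x) for x in nice_val]

if all(s):
    print('YES I WILL')
else:
    print("NO I WON'T")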
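The same lookup idea can be wrapped in a function so that several sentences can be checked in one run. This is only a sketch: the names VALID_ACTORS and is_active_list are illustrative and not from the answer above, and it assumes words are separated by exactly one space.

```python
ARTICLE = ('a', 'the')
NOUN = ('tom', 'jerry', 'goofy', 'mickey', 'jimmy', 'dog', 'cat', 'mouse')

# A set gives fast membership tests for every valid ACTOR.
VALID_ACTORS = set(NOUN) | {a + ' ' + n for a in ARTICLE for n in NOUN}

def is_active_list(sentence):
    # Each piece between ' and ' separators must be exactly one valid ACTOR.
    return all(actor in VALID_ACTORS for actor in sentence.split(' and '))

for sent in ['tom and jerry and the mouse', 'Tom', 'the Tom and me']:
    print(sent, '-- YES I WILL' if is_active_list(sent) else "-- NO I WON'T")
```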
