Chunking sentences using the word 'but' with RegEx

I am attempting to chunk sentences using RegEx at the word 'but' (or any other coordinating conjunction). It's not working:

import nltk
from nltk import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
# Attempt: chunk everything around a coordinating conjunction (CC) tag
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK':
        print(subtree)

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

I also wish to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words, but my code isn't working. Please help.

I think you can simply split the raw sentence string with a regular expression:

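import re

# Note: sentence is the raw string here, not the POS-tagged list
sentence = "There are no large collections present but there is spinal canal stenosis."
result = re.split(r"\s+(?:but|and)\s+", sentence)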

where

`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:` Start a group that matches the regular expression below without capturing it
`but` Match the characters "but" literally
`|` Or match the next alternative (attempted only if the previous one fails)
`and` Match the characters "and" literally
`)` End of the non-capturing group
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
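To see why the surrounding `\s+` matters, here is a small sketch with a made-up test sentence (the sentence is an assumption for illustration, not from the question). The pattern only splits where 'but' or 'and' stands alone between whitespace, so words that merely contain those letters are left intact.

```python
import re

pattern = re.compile(r"\s+(?:but|and)\s+")

# "butter" and "sand" contain the conjunctions as substrings,
# but they are not surrounded by whitespace on both sides.
text = "The butter and sand melted but nothing happened"
parts = pattern.split(text)
print(parts)
```

Note that a conjunction at the very start or end of the string would not be matched either, because the pattern requires whitespace on both sides.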

You can add more conjunction words in there, separated by the pipe character |. Take care, though, that these words do not contain characters that have special meaning in regex. If in doubt, escape them first with re.escape(word).
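One way to follow this advice is to build the pattern from a list and escape each word. This is only a sketch: the list `CONJUNCTIONS` and the helper `build_splitter` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical list of conjunctions; extend it as needed.
CONJUNCTIONS = ["but", "and", "or", "yet"]

def build_splitter(words):
    """Compile a pattern that splits on any given word surrounded by whitespace."""
    # re.escape neutralises characters with special meaning in regex.
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"\s+(?:" + alternatives + r")\s+")

splitter = build_splitter(CONJUNCTIONS)
parts = splitter.split(
    "There are no large collections present but there is spinal canal stenosis."
)
print(parts)
```

Compiling the pattern once and reusing `splitter` avoids rebuilding the alternation for every sentence.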

If you want to avoid hardcoding conjunction words like 'but' and 'and', try chinking along with chunking:


import nltk

# Chunk every token, then chink (remove) runs of coordinating conjunctions (CC)
Digdug = nltk.RegexpParser(r"""
CHUNK_AND_CHINK:
    {<.*>+}    # Chunk everything
    }<CC>+{    # Chink sequences of CC
""")
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))

result = Digdug.parse(sentence)

for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)

Chinking basically excludes what we don't need from a chunk, 'but' in this case. For more details, see: http://www.nltk.org/book/ch07.html
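To make the effect of chinking at CC concrete, here is a plain-Python sketch that does not use NLTK at all: it groups consecutive tokens and drops runs tagged CC. The (word, tag) pairs are hand-written stand-ins for tagger output, and `split_at_cc` is a hypothetical helper, so treat both as assumptions rather than what NLTK does internally.

```python
from itertools import groupby

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output;
# the tags are assumptions for illustration.
tagged = [
    ("There", "EX"), ("are", "VBP"), ("no", "DT"), ("large", "JJ"),
    ("collections", "NNS"), ("present", "JJ"), ("but", "CC"),
    ("there", "EX"), ("is", "VBZ"), ("spinal", "JJ"), ("canal", "NN"),
    ("stenosis", "NN"), (".", "."),
]

def split_at_cc(tagged_tokens):
    """Keep runs of consecutive non-CC tokens; drop runs of CC tokens."""
    chunks = []
    for is_cc, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "CC"):
        if not is_cc:
            chunks.append([word for word, _ in group])
    return chunks

chunks = split_at_cc(tagged)
for chunk in chunks:
    print(" ".join(chunk))
```

With the NLTK version, the same strings can be recovered from each subtree by joining the words in `subtree.leaves()`.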
