How to tokenize text in Python using regex without ignoring parentheses

Question

How can I tokenize text with a regex without splitting off the parentheses ( and )?

For example, I want to tokenize this sentence:

I don't like to eat Cici's food (it is true).

I used this regex:

import nltk
sentence = "I don't like to eat Cici's food (it is true)."
pattern = r'''(?x)([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?|\.\.\.|[][.,;"'?():-_`]'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)

But the output is not what I want:

I
don
'
t
like
to
eat
Cici
'
s
food
(
it
is
true
)
.

The output I want should look like this: each parenthesis stays attached to its neighboring word, so ( is not split from the word after it and ) is not split from the word before it:

I
don't
like
to
eat
Cici's
food
(it
is
true)
.

Can anyone help me? Thank you.

Answer 1

You can use a regex like this:

(['()\w]+|\.)

Working demo

Match information

MATCH 1
1.  [0-1]   `I`
MATCH 2
1.  [2-7]   `don't`
MATCH 3
1.  [8-12]  `like`
MATCH 4
1.  [13-15] `to`
MATCH 5
1.  [16-19] `eat`
MATCH 6
1.  [20-26] `Cici's`
MATCH 7
1.  [27-31] `food`
MATCH 8
1.  [32-35] `(it`
MATCH 9
1.  [36-38] `is`
MATCH 10
1.  [39-44] `true)`
MATCH 11
1.  [44-45] `.`
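
The same result can be reproduced in Python with the standard `re` module. The sketch below is only one way to apply the regex above: the names `TOKEN_PATTERN` and `tokenize` are made up for illustration, and the outer capturing group is dropped because `re.findall` returns the whole match when the pattern has no groups.

```python
import re

# The answer's regex without the outer capturing group (assumption:
# we only need the full match, which re.findall returns in that case).
# A token is either a run of word characters, apostrophes and
# parentheses, or a single period.
TOKEN_PATTERN = re.compile(r"['()\w]+|\.")


def tokenize(sentence):
    """Return the tokens of `sentence` in order of appearance."""
    return TOKEN_PATTERN.findall(sentence)


sentence = "I don't like to eat Cici's food (it is true)."
tokens = tokenize(sentence)
for token in tokens:
    print(token)
```

Note that this pattern silently skips any character it does not list, such as commas or question marks, so the character class would need extending for text that contains them.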

Answer 1: score 2, accepted, 2015-10-07 15:05:13