How to tokenize text without ignoring parentheses using regex in Python

How can I tokenize text with a regex so that apostrophes and parentheses stay attached to their words?

For example, I want to tokenize this sentence:

I don't like to eat Cici's food (it is true).

I used this regex:

import nltk
sentence = "I don't like to eat Cici's food (it is true)."
pattern = r'''(?x)(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?|\.\.\.|[][.,;"'?():_`-]'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)

But the output is not what I want:

I
don
'
t
like
to
eat
Cici
'
s
food
(
it
is
true
)
.

The output I want should look like this: contractions stay whole, "(" stays attached to the word after it, and ")" stays attached to the word before it:

I
don't
like
to
eat
Cici's
food
(it
is
true)
.

Can anyone help me? Thank you.

You can use a regex like this:

(['()\w]+|\.)

Match information from a working demo:

MATCH 1
1.  [0-1]   `I`
MATCH 2
1.  [2-7]   `don't`
MATCH 3
1.  [8-12]  `like`
MATCH 4
1.  [13-15] `to`
MATCH 5
1.  [16-19] `eat`
MATCH 6
1.  [20-26] `Cici's`
MATCH 7
1.  [27-31] `food`
MATCH 8
1.  [32-35] `(it`
MATCH 9
1.  [36-38] `is`
MATCH 10
1.  [39-44] `true)`
MATCH 11
1.  [44-45] `.`
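
To try the same pattern from Python, here is a minimal sketch using only the standard `re` module. The names `TOKEN_PATTERN` and `tokenize` are illustrative and not part of the original answer, and the outer capturing group is dropped because `re.findall` does not need it.

```python
import re

# The answer's pattern without the outer capturing group:
# one or more word characters, apostrophes, or parentheses, or a single period.
TOKEN_PATTERN = re.compile(r"['()\w]+|\.")


def tokenize(text):
    """Return tokens, keeping apostrophes and parentheses attached to words."""
    return TOKEN_PATTERN.findall(text)


sentence = "I don't like to eat Cici's food (it is true)."
tokens = tokenize(sentence)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(it', 'is', 'true)', '.']
```

Because the pattern contains no capturing group, it should also be usable as the second argument to `nltk.regexp_tokenize`. Note that any character the pattern does not match, such as a comma or a question mark, is silently skipped, so extend the alternation if those tokens matter.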
