regex tokenizer period vs. ellipsis
I want to separate strings by sentence-ending punctuation marks (so ".", "?", "!"), excluding the ellipsis ("..."). Note that the ellipsis in the text I'm dealing with is three periods, not a dedicated Unicode character.
Currently what I do is
tokenizer = nltk.RegexpTokenizer(r"[?.!]+", gaps=True)
But this still splits the string at "...". However, I want to keep splitting at "!!!" or "??", just not at multiple consecutive instances of ".". What's the easiest way to distinguish between the ellipsis and the period, if I want to use RegexpTokenizer?
Something like this could work:
[?!]+|(?<!\.)\.{1,2}(?!\.)
We match either any non-zero number of ? and !, or 1 or 2 dots that are neither preceded nor followed by a dot.
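As a rough check of that idea, here is a minimal sketch using the standard library's re.split with a pattern of the form described above (runs of ? or !, or one or two dots with no dot on either side). The function name split_sentences and the sample text are made up for illustration, and stripping whitespace from the pieces is just a convenience. Assuming NLTK is installed, the same pattern string should also work as nltk.RegexpTokenizer(pattern, gaps=True).

```python
import re

# Runs of ? or !, or one or two dots that are neither preceded
# nor followed by another dot (so "..." is never a separator).
SPLIT_PATTERN = re.compile(r"[?!]+|(?<!\.)\.{1,2}(?!\.)")


def split_sentences(text):
    """Split on sentence-ending punctuation, leaving '...' inside the pieces.

    split_sentences is a hypothetical helper name, not part of NLTK.
    """
    pieces = SPLIT_PATTERN.split(text)
    # Drop empty pieces and trim surrounding whitespace (illustrative choice).
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    sample = "Wait... what?? No way!!! It works. Done."
    for sentence in split_sentences(sample):
        print(repr(sentence))
```

Note that with this pattern a run of four or more dots is also left alone, since every dot in such a run has another dot next to it.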
But lookbehind and lookahead can hurt performance.
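If the lookarounds turn out to be a concern, one possible alternative is to capture every run of punctuation and decide in Python whether a run of dots is an ellipsis. The sketch below is only an illustration: split_without_lookaround is a hypothetical helper, it treats any run of three or more dots as an ellipsis, and whether it is actually faster than the lookaround version would need to be measured on real data.

```python
import re

# Capture each run of dots, or each run of ? and !, with no lookarounds.
PUNCT_RUN = re.compile(r"(\.+|[?!]+)")


def split_without_lookaround(text):
    """Split on punctuation runs, but keep dot runs of length >= 3 in the text.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # With one capturing group, re.split returns
    # [text, separator, text, separator, ..., text].
    parts = PUNCT_RUN.split(text)
    sentences = []
    current = ""
    for index, part in enumerate(parts):
        is_separator = index % 2 == 1
        if is_separator and part.startswith(".") and len(part) >= 3:
            # Treat three or more dots as an ellipsis and keep it.
            current += part
        elif is_separator:
            # A real sentence boundary: flush what has been collected.
            if current.strip():
                sentences.append(current.strip())
            current = ""
        else:
            current += part
    if current.strip():
        sentences.append(current.strip())
    return sentences


if __name__ == "__main__":
    print(split_without_lookaround("Wait... what?? No way!!! It works. Done."))
```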
BTW, I found this site, https://pythex.org, for checking Python regexes.